Abdelrahman Almatrooshi

FocusGuard with L2CS-Net gaze estimation

7b53d75 5 days ago

Threshold Justification Report

Auto-generated by evaluation/justify_thresholds.py using leave-one-person-out (LOPO) cross-validation over 9 participants (~145k samples).

1. ML Model Decision Thresholds

Thresholds selected via Youden's J statistic (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions.

Model	LOPO AUC	Optimal Threshold (Youden's J)	F1 @ Optimal	F1 @ 0.50
MLP	0.8624	0.228	0.8578	0.8149
XGBoost	0.8804	0.377	0.8585	0.8424
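The selection rule above can be sketched in a few lines. This is a minimal standard-library illustration, not the code in evaluation/justify_thresholds.py; the function name `youden_threshold`, the `>=` decision convention and the toy data are assumptions made for the example.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Assumed convention: a sample is predicted positive when its
    probability is >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_threshold, best_j = None, float("-inf")
    # Every distinct predicted probability is a candidate threshold.
    for threshold in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy labels and probabilities (illustrative only, not project data).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.25, 0.40, 0.60, 0.80, 0.90]
threshold, j = youden_threshold(labels, probs)
print(threshold, j)
```

The loop is quadratic in the number of samples, which is fine for a sketch. On ~145k pooled predictions a ROC-curve routine from a library would be the more practical choice.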

2. Precision, Recall and Tradeoff

At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions:

Model	Threshold	Precision	Recall	F1	Accuracy
MLP	0.228	0.8187	0.9008	0.8578	0.8164
XGBoost	0.377	0.8426	0.8750	0.8585	0.8228

Raising the threshold yields fewer positive predictions, which lowers recall and typically raises precision. Youden's J picks the threshold that maximises sensitivity + specificity - 1, i.e. the sum of recall on the positive class and the true negative rate.

3. Confusion Matrix (Pooled LOPO)

At optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused).

MLP

	Pred 0	Pred 1
True 0	38065 (TN)	17750 (FP)
True 1	8831 (FN)	80147 (TP)

TN=38065, FP=17750, FN=8831, TP=80147.

XGBoost

	Pred 0	Pred 1
True 0	41271 (TN)	14544 (FP)
True 1	11118 (FN)	77860 (TP)

TN=41271, FP=14544, FN=11118, TP=77860.
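As a cross-check, the precision, recall, F1 and accuracy reported in section 2 can be recomputed from the pooled counts above. The helper name `metrics_from_counts` is an assumption for illustration; the counts are the ones in this report.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "youden_j": recall + specificity - 1.0,
    }


mlp = metrics_from_counts(tn=38065, fp=17750, fn=8831, tp=80147)
xgb = metrics_from_counts(tn=41271, fp=14544, fn=11118, tp=77860)
for name, m in (("MLP", mlp), ("XGBoost", xgb)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Both matrices sum to 144,793 samples, which matches the "~145k samples" stated at the top of the report.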

4. Per-Person Performance Variance (LOPO)

One fold per left-out person; metrics at optimal threshold.
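For readers unfamiliar with LOPO, the fold structure can be sketched as below. This is an assumed, standard-library illustration of "one fold per left-out person"; the function name `lopo_splits` and the toy `person_ids` list are hypothetical, and the project's script may well rely on a library splitter instead.

```python
from collections import defaultdict


def lopo_splits(person_ids):
    """Yield (held_out_person, train_indices, test_indices), one fold per person."""
    by_person = defaultdict(list)
    for index, person in enumerate(person_ids):
        by_person[person].append(index)

    for held_out in sorted(by_person):
        test_idx = by_person[held_out]
        # Train on every sample that does not belong to the held-out person.
        train_idx = sorted(
            i
            for person, indices in by_person.items()
            if person != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Toy sample-to-person assignment (illustrative only).
person_ids = ["saba", "Jarek", "saba", "Kexin", "Jarek", "Kexin", "saba"]
folds = list(lopo_splits(person_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Because no sample from the held-out person appears in training, each fold's metrics estimate performance on an unseen user, which is why the per-person rows can differ so much.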

MLP — per held-out person

Person	Accuracy	F1	Precision	Recall
Abdelrahman	0.8628	0.9029	0.8760	0.9314
Jarek	0.8400	0.8770	0.8909	0.8635
Junhao	0.8872	0.8986	0.8354	0.9723
Kexin	0.7941	0.8123	0.7965	0.8288
Langyuan	0.5877	0.6169	0.4972	0.8126
Mohamed	0.8432	0.8653	0.7931	0.9519
Yingtao	0.8794	0.9263	0.9217	0.9309
ayten	0.8307	0.8986	0.8558	0.9459
saba	0.9192	0.9243	0.9260	0.9226

XGBoost — per held-out person

Person	Accuracy	F1	Precision	Recall
Abdelrahman	0.8601	0.8959	0.9129	0.8795
Jarek	0.8680	0.8993	0.9070	0.8917
Junhao	0.9099	0.9180	0.8627	0.9810
Kexin	0.7363	0.7385	0.7906	0.6928
Langyuan	0.6738	0.6945	0.5625	0.9074
Mohamed	0.8868	0.8988	0.8529	0.9498
Yingtao	0.8711	0.9195	0.9347	0.9048
ayten	0.8451	0.9070	0.8654	0.9528
saba	0.9393	0.9421	0.9615	0.9235

Summary across persons

Model	Accuracy mean ± std	F1 mean ± std	Precision mean ± std	Recall mean ± std
MLP	0.8271 ± 0.0968	0.8580 ± 0.0968	0.8214 ± 0.1307	0.9067 ± 0.0572
XGBoost	0.8434 ± 0.0847	0.8682 ± 0.0879	0.8500 ± 0.1191	0.8981 ± 0.0836

5. Confidence Intervals (95%, LOPO over 9 persons)

Mean and 95% t-interval [lower, upper] (df = 8) for each metric across the 9 left-out persons.

Model	F1	Accuracy	Precision	Recall
MLP	0.8580 [0.7835, 0.9326]	0.8271 [0.7526, 0.9017]	0.8214 [0.7207, 0.9221]	0.9067 [0.8626, 0.9507]
XGBoost	0.8682 [0.8005, 0.9358]	0.8434 [0.7781, 0.9086]	0.8500 [0.7583, 0.9417]	0.8981 [0.8338, 0.9625]
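The summary statistics and intervals can be approximately reproduced from the per-person tables. The sketch below assumes the sample standard deviation (ddof = 1), which is consistent with the reported values, and hard-codes the two-sided 95% critical value of Student's t for df = 8 (2.306). The function name `mean_std_ci` is hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for df = 8 (i.e. n = 9 persons).
T_CRIT_95_DF8 = 2.306


def mean_std_ci(values, t_crit=T_CRIT_95_DF8):
    """Return (mean, sample std, (lower, upper)) for a t-interval.

    The default t_crit is only valid for 9 values; pass another
    critical value for a different sample size.
    """
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, ddof = 1
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# MLP per-person F1 values copied from section 4 (rounded to 4 decimals).
mlp_f1 = [0.9029, 0.8770, 0.8986, 0.8123, 0.6169, 0.8653, 0.9263, 0.8986, 0.9243]
mean, std, (lo, hi) = mean_std_ci(mlp_f1)
print(f"{mean:.4f} +/- {std:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Since the inputs are already rounded, the result agrees with the table only to about three decimal places. Note how much of the interval width comes from a single participant (Langyuan, F1 = 0.6169).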

6. Geometric Pipeline Weights (s_face vs s_eye)

Grid search over face weight alpha in {0.2, 0.3, ..., 0.8}, with eye weight = 1 - alpha. The decision threshold is chosen per fold via Youden's J.

Face Weight (alpha)	Mean LOPO F1
0.2	0.7926
0.3	0.8002
0.4	0.7719
0.5	0.7868
0.6	0.8184
0.7	0.8195 <-- selected
0.8	0.8126

Best: alpha = 0.7 (face 70%, eye 30%)
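A minimal sketch of the grid search, assuming the geometric score is the convex blend alpha * s_face + (1 - alpha) * s_eye. The names `blend` and `grid_search` are hypothetical. In a real run the scoring callback would perform the full LOPO evaluation; here the table above stands in for it.

```python
def blend(s_face, s_eye, alpha):
    """Assumed geometric score: convex combination of face and eye scores."""
    return alpha * s_face + (1.0 - alpha) * s_eye


def grid_search(candidates, mean_f1_for):
    """Evaluate every candidate weight and return (best_weight, all_results)."""
    results = {weight: mean_f1_for(weight) for weight in candidates}
    best = max(results, key=results.get)
    return best, results


# Stand-in for the LOPO evaluation: the mean F1 values reported in this section.
reported_f1 = {
    0.2: 0.7926, 0.3: 0.8002, 0.4: 0.7719, 0.5: 0.7868,
    0.6: 0.8184, 0.7: 0.8195, 0.8: 0.8126,
}
best_alpha, results = grid_search(sorted(reported_f1), reported_f1.get)
print(best_alpha, results[best_alpha])
```

The gap between alpha = 0.6 and alpha = 0.7 is about 0.001 in mean F1, so the selected value is best read as "face-dominant" rather than as a sharply determined optimum.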

7. Hybrid Pipeline: MLP vs Geometric

Grid search over w_mlp in {0.3, 0.4, ..., 0.8}, with w_geo = 1 - w_mlp. The geometric sub-score uses the same weights as the geometric pipeline (face = 0.7, eye = 0.3).

MLP Weight (w_mlp)	Mean LOPO F1
0.3	0.8409 <-- selected
0.4	0.8246
0.5	0.8164
0.6	0.8106
0.7	0.8039
0.8	0.8016

Best: w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409

8. Hybrid Pipeline: XGBoost vs Geometric

Same grid over w_xgb in {0.3 ... 0.8}. w_geo = 1 - w_xgb.

XGBoost Weight (w_xgb)	Mean LOPO F1
0.3	0.8639 <-- selected
0.4	0.8552
0.5	0.8451
0.6	0.8419
0.7	0.8382
0.8	0.8353

Best: w_xgb = 0.3 → mean LOPO F1 = 0.8639
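To make the two-level weighting concrete, the sketch below composes the hybrid score from the selected weights (model 0.3, geometric 0.7; face 0.7, eye 0.3 inside the geometric part). The function names and the example inputs are assumptions for illustration, not the app's actual code.

```python
FACE_WEIGHT = 0.7   # alpha selected in section 6
MODEL_WEIGHT = 0.3  # w_mlp / w_xgb selected in sections 7 and 8


def geometric_score(s_face, s_eye, face_weight=FACE_WEIGHT):
    """Assumed geometric sub-score: weighted blend of face and eye scores."""
    return face_weight * s_face + (1.0 - face_weight) * s_eye


def hybrid_score(model_prob, s_face, s_eye, model_weight=MODEL_WEIGHT):
    """Assumed hybrid score: weighted blend of model probability and geometric score."""
    geo = geometric_score(s_face, s_eye)
    return model_weight * model_prob + (1.0 - model_weight) * geo


# Illustrative inputs: confident model, decent head pose, partly closed eyes.
score = hybrid_score(model_prob=0.9, s_face=0.8, s_eye=0.4)
print(round(score, 3))
```

With these weights the effective contributions are 30% model, 49% face and 21% eye, so head pose carries the largest share of the final score.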

Which hybrid is used in the app?

The XGBoost hybrid scores higher (mean LOPO F1 = 0.8639 vs 0.8409 for the MLP hybrid).

Logistic regression combiner (replaces heuristic weights)

Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a logistic regression combines model probability and geometric score: meta-features = [model_prob, geo_score], trained on the same LOPO splits. Threshold from Youden's J on combiner output.

Method	Mean LOPO F1
Heuristic weight grid (best w)	0.8639
LR combiner	0.8241

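In the table above, the LR combiner's mean LOPO F1 (0.8241) is lower than that of the best heuristic blend (0.8639). The app uses the saved LR combiner when combiner_path is set in hybrid_focus_config.json.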
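The idea of a learned combiner over the meta-features [model_prob, geo_score] can be sketched as follows. This is a toy, standard-library logistic regression fitted by batch gradient descent on made-up data. The function names, hyperparameters and training data are all assumptions, and the project most likely uses a library implementation rather than anything like this.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_combiner(meta_features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on (model_prob, geo_score) pairs by gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (model_prob, geo_score), y in zip(meta_features, labels):
            err = sigmoid(w[0] * model_prob + w[1] * geo_score + b) - y
            grad_w[0] += err * model_prob
            grad_w[1] += err * geo_score
            grad_b += err
        # Average-gradient step on the log-loss.
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def combine(model_prob, geo_score, w, b):
    """Combined focus probability from the two meta-features."""
    return sigmoid(w[0] * model_prob + w[1] * geo_score + b)


# Toy meta-features and labels (illustrative only, not project data).
meta = [(0.9, 0.8), (0.8, 0.7), (0.7, 0.9), (0.2, 0.3), (0.3, 0.1), (0.1, 0.2)]
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_combiner(meta, labels)
focused = combine(0.9, 0.8, w, b)
unfocused = combine(0.1, 0.2, w, b)
print(round(focused, 3), round(unfocused, 3))
```

Unlike the fixed blend, the fitted weights are not constrained to sum to 1 and include a bias term, so the combiner output needs its own decision threshold, which the report takes from Youden's J.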

9. Eye and Mouth Aspect Ratio Thresholds

EAR (Eye Aspect Ratio)

Reference: Soukupova & Cech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold.

Our thresholds define a linear interpolation zone around this established value:

Constant	Value	Justification
`ear_closed`	0.16	Below this, eyes are fully shut. 16.3% of samples fall here.
`EAR_BLINK_THRESH`	0.21	Blink detection point; close to the 0.2 reference. 21.2% of samples below.
`ear_open`	0.30	Above this, eyes are fully open. 70.4% of samples here.

Between 0.16 and 0.30 the _ear_score function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff.
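The interpolation described above amounts to a clamped linear ramp. The sketch below is an assumed reconstruction from that description, not the project's `_ear_score` source; the function name `ear_score_sketch` is hypothetical.

```python
EAR_CLOSED = 0.16  # at or below this, score is 0 (eyes fully shut)
EAR_OPEN = 0.30    # at or above this, score is 1 (eyes fully open)


def ear_score_sketch(ear, closed=EAR_CLOSED, open_=EAR_OPEN):
    """Map an eye aspect ratio to [0, 1] with a linear ramp between the bounds."""
    t = (ear - closed) / (open_ - closed)
    return min(1.0, max(0.0, t))


for ear in (0.10, 0.16, 0.21, 0.23, 0.30, 0.35):
    print(ear, round(ear_score_sketch(ear), 3))
```

Under this reading, an EAR at the blink threshold of 0.21 maps to a score of about 0.36, so a blink lowers the eye score gradually instead of flipping it to zero.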

MAR (Mouth Aspect Ratio)

Constant	Value	Justification
`MAR_YAWN_THRESHOLD`	0.55	Only 1.7% of samples exceed this, which is consistent with yawns being rare events; the sample share alone does not measure the false-positive rate.

10. Other Constants

Constant	Value	Rationale
`gaze_max_offset`	0.28	Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge.
`max_angle`	22.0 deg	Head deviation beyond which face score = 0. Based on a typical monitor-viewing cone: at 60 cm distance from a 24" monitor, the half-angle from screen centre to the horizontal edge is ~20-25 degrees.
`roll_weight`	0.5	Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%.
`EMA alpha`	0.3	Smoothing factor for focus score. Gives ~3-4 frame effective window; balances responsiveness vs flicker.
`grace_frames`	15	~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score.
`PERCLOS_WINDOW`	60 frames	2 s at 30 fps. PERCLOS as introduced by Dinges & Grace (1998) is usually computed over windows on the order of minutes, so this is a short-window variant.
`BLINK_WINDOW_SEC`	30 s	Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997).
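Two of the constants above are easy to illustrate: the EMA smoothing factor and the frame-count windows. The sketch below assumes the usual recursion s_t = alpha * x_t + (1 - alpha) * s_{t-1}, seeded with the first sample, and a 30 fps camera as stated in the table. The function names are hypothetical.

```python
def ema(values, alpha=0.3):
    """Exponential moving average, seeded with the first sample (an assumption)."""
    smoothed = []
    state = None
    for x in values:
        state = x if state is None else alpha * x + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def frames_to_seconds(frames, fps=30):
    """Convert a frame count to seconds at the given frame rate."""
    return frames / fps


# Step response: the raw focus score jumps from 0 to 1 and stays there.
step = ema([0.0, 1.0, 1.0, 1.0, 1.0])
print([round(s, 4) for s in step])
print(frames_to_seconds(15), frames_to_seconds(60))  # grace_frames, PERCLOS_WINDOW
```

After three frames at the new level the smoothed score has covered roughly two thirds of the step, which matches the "~3-4 frame effective window" (about 1/alpha) quoted for alpha = 0.3.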