Abdelrahman Almatrooshi
FocusGuard with L2CS-Net gaze estimation

Threshold Justification Report

Auto-generated by evaluation/justify_thresholds.py using leave-one-person-out (LOPO) cross-validation over 9 participants (~145k samples).

1. ML Model Decision Thresholds

Thresholds selected via Youden's J statistic (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions.

| Model | LOPO AUC | Optimal Threshold (Youden's J) | F1 @ Optimal | F1 @ 0.50 |
|---|---|---|---|---|
| MLP | 0.8624 | 0.228 | 0.8578 | 0.8149 |
| XGBoost | 0.8804 | 0.377 | 0.8585 | 0.8424 |
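As a rough illustration of the selection rule, the sketch below scans candidate thresholds and keeps the one that maximises J. It is not the code in evaluation/justify_thresholds.py: the function name `youden_threshold` and the toy labels and scores are made up for this example, and a prediction counts as positive when its score is at or above the threshold (an assumption).

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_thr, best_j = None, -1.0
    # Every distinct score is a candidate threshold.
    for thr in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= thr)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < thr)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Toy data (hypothetical): 1 = focused, 0 = unfocused.
y_true = [0, 0, 1, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.10, 0.20, 0.25, 0.40, 0.60, 0.80, 0.30]

thr, j = youden_threshold(y_true, y_prob)
print(f"threshold={thr:.2f}  J={j:.2f}")
```

On this toy data the best threshold falls below 0.50, which mirrors the pattern in the table: the J-optimal cut-off need not coincide with the default 0.50.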

[Figure: MLP ROC]

[Figure: XGBoost ROC]

2. Precision, Recall and Tradeoff

At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions:

| Model | Threshold | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| MLP | 0.228 | 0.8187 | 0.9008 | 0.8578 | 0.8164 |
| XGBoost | 0.377 | 0.8426 | 0.8750 | 0.8585 | 0.8228 |

Raising the threshold yields fewer positive predictions, and therefore higher precision and lower recall. Youden's J selects the threshold that maximises sensitivity + specificity - 1, where sensitivity is recall for the positive class and specificity is the true negative rate.

3. Confusion Matrix (Pooled LOPO)

At optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused).

MLP

| | Pred 0 | Pred 1 |
|---|---|---|
| True 0 | 38065 (TN) | 17750 (FP) |
| True 1 | 8831 (FN) | 80147 (TP) |

TN=38065, FP=17750, FN=8831, TP=80147.

XGBoost

| | Pred 0 | Pred 1 |
|---|---|---|
| True 0 | 41271 (TN) | 14544 (FP) |
| True 1 | 11118 (FN) | 77860 (TP) |

TN=41271, FP=14544, FN=11118, TP=77860.
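The precision, recall, F1 and accuracy figures in section 2 can be recomputed from these counts. The following is a small worked check using the standard definitions; the helper name `metrics` is introduced only for this example.

```python
def metrics(tn, fp, fn, tp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # sensitivity
    specificity = tn / (tn + fp)      # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "youden_j": recall + specificity - 1,
    }


# Counts copied from the pooled LOPO confusion matrices above.
mlp = metrics(tn=38065, fp=17750, fn=8831, tp=80147)
xgb = metrics(tn=41271, fp=14544, fn=11118, tp=77860)

for name, m in (("MLP", mlp), ("XGBoost", xgb)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Both matrices sum to 144,793 samples, which agrees with the ~145k figure quoted at the top of the report.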

[Figure: Confusion MLP]

[Figure: Confusion XGBoost]

4. Per-Person Performance Variance (LOPO)

One fold per left-out person; metrics at optimal threshold.

MLP — per held-out person

| Person | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| Abdelrahman | 0.8628 | 0.9029 | 0.8760 | 0.9314 |
| Jarek | 0.8400 | 0.8770 | 0.8909 | 0.8635 |
| Junhao | 0.8872 | 0.8986 | 0.8354 | 0.9723 |
| Kexin | 0.7941 | 0.8123 | 0.7965 | 0.8288 |
| Langyuan | 0.5877 | 0.6169 | 0.4972 | 0.8126 |
| Mohamed | 0.8432 | 0.8653 | 0.7931 | 0.9519 |
| Yingtao | 0.8794 | 0.9263 | 0.9217 | 0.9309 |
| ayten | 0.8307 | 0.8986 | 0.8558 | 0.9459 |
| saba | 0.9192 | 0.9243 | 0.9260 | 0.9226 |

XGBoost — per held-out person

| Person | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| Abdelrahman | 0.8601 | 0.8959 | 0.9129 | 0.8795 |
| Jarek | 0.8680 | 0.8993 | 0.9070 | 0.8917 |
| Junhao | 0.9099 | 0.9180 | 0.8627 | 0.9810 |
| Kexin | 0.7363 | 0.7385 | 0.7906 | 0.6928 |
| Langyuan | 0.6738 | 0.6945 | 0.5625 | 0.9074 |
| Mohamed | 0.8868 | 0.8988 | 0.8529 | 0.9498 |
| Yingtao | 0.8711 | 0.9195 | 0.9347 | 0.9048 |
| ayten | 0.8451 | 0.9070 | 0.8654 | 0.9528 |
| saba | 0.9393 | 0.9421 | 0.9615 | 0.9235 |

Summary across persons

| Model | Accuracy (mean ± std) | F1 (mean ± std) | Precision (mean ± std) | Recall (mean ± std) |
|---|---|---|---|---|
| MLP | 0.8271 ± 0.0968 | 0.8580 ± 0.0968 | 0.8214 ± 0.1307 | 0.9067 ± 0.0572 |
| XGBoost | 0.8434 ± 0.0847 | 0.8682 ± 0.0879 | 0.8500 ± 0.1191 | 0.8981 ± 0.0836 |
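The summary row can be reproduced from the per-person tables. The sketch below does so for the MLP F1 column using Python's standard library; it assumes the reported spread is the sample standard deviation (n - 1 denominator), which is the convention that matches the 0.0968 shown for this column.

```python
import statistics

# MLP F1 per held-out person, copied from the table above.
mlp_f1 = [0.9029, 0.8770, 0.8986, 0.8123, 0.6169,
          0.8653, 0.9263, 0.8986, 0.9243]

mean_f1 = statistics.mean(mlp_f1)
std_f1 = statistics.stdev(mlp_f1)   # sample std, n - 1 denominator

print(f"F1 = {mean_f1:.4f} ± {std_f1:.4f}")
```

Most of the spread comes from a single fold (Langyuan, F1 = 0.6169); the other eight folds lie between 0.81 and 0.93.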

5. Confidence Intervals (95%, LOPO over 9 persons)

Mean and 95% t-interval [lower, upper] (df = 8) for each metric across the 9 left-out persons.

| Model | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|
| MLP | 0.8580 [0.7835, 0.9326] | 0.8271 [0.7526, 0.9017] | 0.8214 [0.7207, 0.9221] | 0.9067 [0.8626, 0.9507] |
| XGBoost | 0.8682 [0.8005, 0.9358] | 0.8434 [0.7781, 0.9086] | 0.8500 [0.7583, 0.9417] | 0.8981 [0.8338, 0.9625] |
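For reference, a t-interval of this kind is mean ± t · s / √n. The sketch below applies that formula to the MLP F1 summary statistics from section 4. The critical value 2.306 is the usual tabulated two-sided 95% value for df = 8; the generating script may use a slightly different constant or unrounded inputs, so agreement with the table is only expected to about three decimals.

```python
import math

n = 9            # left-out persons
mean_f1 = 0.8580 # MLP F1 mean from section 4
std_f1 = 0.0968  # MLP F1 sample std from section 4
t_crit = 2.306   # tabulated t(0.975, df=8); assumed, not read from the script

half_width = t_crit * std_f1 / math.sqrt(n)
lower, upper = mean_f1 - half_width, mean_f1 + half_width

print(f"95% CI: [{lower:.3f}, {upper:.3f}]  (half-width {half_width:.3f})")
```

With only 9 folds the interval is wide (roughly ±0.07 on F1), so the MLP and XGBoost intervals overlap heavily.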

6. Geometric Pipeline Weights (s_face vs s_eye)

Grid search over face weight alpha in {0.2 ... 0.8}. Eye weight = 1 - alpha. Threshold per fold via Youden's J.

| Face Weight (alpha) | Mean LOPO F1 |
|---|---|
| 0.2 | 0.7926 |
| 0.3 | 0.8002 |
| 0.4 | 0.7719 |
| 0.5 | 0.7868 |
| 0.6 | 0.8184 |
| 0.7 | 0.8195 (selected) |
| 0.8 | 0.8126 |

Best: alpha = 0.7 (face 70%, eye 30%)
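A minimal sketch of the blend and of the selection step, assuming the geometric score is the plain convex combination described above. The function name `geometric_score` is hypothetical; the grid values are the mean LOPO F1 scores from the table.

```python
def geometric_score(s_face, s_eye, alpha=0.7):
    """Blend face and eye sub-scores; the eye weight is 1 - alpha."""
    return alpha * s_face + (1.0 - alpha) * s_eye


# Mean LOPO F1 per face weight, copied from the grid search table.
grid_f1 = {0.2: 0.7926, 0.3: 0.8002, 0.4: 0.7719, 0.5: 0.7868,
           0.6: 0.8184, 0.7: 0.8195, 0.8: 0.8126}

# Selection rule: keep the alpha with the highest mean F1.
best_alpha = max(grid_f1, key=grid_f1.get)
print(best_alpha, grid_f1[best_alpha])

# Example: well-aligned head (0.9) but half-closed eyes (0.4).
print(round(geometric_score(0.9, 0.4, alpha=best_alpha), 3))
```

The margin between alpha = 0.6, 0.7 and 0.8 is under 0.01 F1, so the selected value is best read as "face-dominant" rather than as a sharply defined optimum.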

[Figure: Geometric weight search]

7. Hybrid Pipeline: MLP vs Geometric

Grid search over w_mlp in {0.3 ... 0.8}. w_geo = 1 - w_mlp. Geometric sub-score uses same weights as geometric pipeline (face=0.7, eye=0.3).

| MLP Weight (w_mlp) | Mean LOPO F1 |
|---|---|
| 0.3 | 0.8409 (selected) |
| 0.4 | 0.8246 |
| 0.5 | 0.8164 |
| 0.6 | 0.8106 |
| 0.7 | 0.8039 |
| 0.8 | 0.8016 |

Best: w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409
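The hybrid score nests two blends: the geometric sub-score (face 0.7, eye 0.3) is itself blended with the model probability. The sketch below spells that out under the assumption that both blends are plain weighted sums; `hybrid_score` and its argument names are illustrative, not the app's API.

```python
def hybrid_score(model_prob, s_face, s_eye, w_model=0.3, alpha=0.7):
    """Blend a model probability with the geometric sub-score.

    w_model is the ML weight (w_geo = 1 - w_model);
    alpha is the face weight inside the geometric sub-score.
    """
    geo = alpha * s_face + (1.0 - alpha) * s_eye
    return w_model * model_prob + (1.0 - w_model) * geo


# Example inputs (made up): confident model, decent head pose, weak eye score.
score = hybrid_score(model_prob=0.9, s_face=0.8, s_eye=0.5)
print(round(score, 3))
```

With w_model = 0.3 and alpha = 0.7, the effective weights are 0.30 on the model, 0.49 on the face score and 0.21 on the eye score.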

[Figure: Hybrid MLP weight search]

8. Hybrid Pipeline: XGBoost vs Geometric

Same grid over w_xgb in {0.3 ... 0.8}. w_geo = 1 - w_xgb.

| XGBoost Weight (w_xgb) | Mean LOPO F1 |
|---|---|
| 0.3 | 0.8639 (selected) |
| 0.4 | 0.8552 |
| 0.5 | 0.8451 |
| 0.6 | 0.8419 |
| 0.7 | 0.8382 |
| 0.8 | 0.8353 |

Best: w_xgb = 0.3 → mean LOPO F1 = 0.8639

[Figure: Hybrid XGBoost weight search]

Which hybrid is used in the app?

XGBoost hybrid is better (F1 = 0.8639 vs MLP hybrid F1 = 0.8409).

Logistic regression combiner (replaces heuristic weights)

Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a logistic regression combines model probability and geometric score: meta-features = [model_prob, geo_score], trained on the same LOPO splits. Threshold from Youden's J on combiner output.
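To make the difference from a fixed blend concrete, here is a sketch of what inference with such a combiner looks like: a learned intercept and two learned coefficients pass through a sigmoid. The coefficient and intercept values below are placeholders chosen for illustration only; they are not the trained values, and `lr_combine` is a hypothetical name.

```python
import math


def lr_combine(model_prob, geo_score, coef=(2.0, 3.0), intercept=-2.5):
    """Logistic-regression combiner over [model_prob, geo_score].

    coef and intercept are PLACEHOLDERS, not the fitted parameters.
    """
    z = intercept + coef[0] * model_prob + coef[1] * geo_score
    return 1.0 / (1.0 + math.exp(-z))


for prob, geo in [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]:
    print(prob, geo, round(lr_combine(prob, geo), 3))
```

Unlike the heuristic blend, the weights are not constrained to sum to 1 and the output is a non-linear function of the inputs, which is why the combiner needs its own Youden's J threshold.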

| Method | Mean LOPO F1 |
|---|---|
| Heuristic weight grid (best w) | 0.8639 |
| LR combiner | 0.8241 |

The app uses the saved LR combiner when combiner_path is set in hybrid_focus_config.json.

9. Eye and Mouth Aspect Ratio Thresholds

EAR (Eye Aspect Ratio)

Reference: Soukupova & Cech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold.

Our thresholds define a linear interpolation zone around this established value:

| Constant | Value | Justification |
|---|---|---|
| ear_closed | 0.16 | Below this, eyes are fully shut. 16.3% of samples fall here. |
| EAR_BLINK_THRESH | 0.21 | Blink detection point; close to the 0.2 reference. 21.2% of samples below. |
| ear_open | 0.30 | Above this, eyes are fully open. 70.4% of samples here. |

Between 0.16 and 0.30 the _ear_score function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff.
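The interpolation can be sketched as a clamped linear ramp. This is an illustration of the behaviour described above, not the project's `_ear_score` source; the function and constant names here are chosen for the example.

```python
EAR_CLOSED = 0.16  # at or below: score 0
EAR_OPEN = 0.30    # at or above: score 1


def ear_score_sketch(ear):
    """Linear ramp from 0 at EAR_CLOSED to 1 at EAR_OPEN, clamped to [0, 1]."""
    t = (ear - EAR_CLOSED) / (EAR_OPEN - EAR_CLOSED)
    return min(1.0, max(0.0, t))


for ear in (0.10, 0.16, 0.21, 0.23, 0.30, 0.35):
    print(ear, round(ear_score_sketch(ear), 3))
```

Under this ramp the blink threshold of 0.21 corresponds to a score of about 0.36, and the midpoint 0.23 to 0.5.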

[Figure: EAR distribution]

MAR (Mouth Aspect Ratio)

| Constant | Value | Justification |
|---|---|---|
| MAR_YAWN_THRESHOLD | 0.55 | Only 1.7% of samples exceed this, consistent with the threshold firing only on rare, wide mouth openings such as yawns. |

[Figure: MAR distribution]

10. Other Constants

| Constant | Value | Rationale |
|---|---|---|
| gaze_max_offset | 0.28 | Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge. |
| max_angle | 22.0 deg | Head deviation beyond which face score = 0. Based on typical monitor-viewing cone: at 60 cm distance and a 24" monitor, the viewing angle is ~20-25 degrees. |
| roll_weight | 0.5 | Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%. |
| EMA alpha | 0.3 | Smoothing factor for focus score. Gives ~3-4 frame effective window; balances responsiveness vs flicker. |
| grace_frames | 15 | ~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score. |
| PERCLOS_WINDOW | 60 frames | 2 s at 30 fps; standard PERCLOS measurement window (Dinges & Grace, 1998). |
| BLINK_WINDOW_SEC | 30 s | Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997). |
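Two of these rationales can be checked with a few lines of arithmetic. The sketch below shows (a) the step response of an exponential moving average with alpha = 0.3 and (b) the half-angle from screen centre to screen edge for a 24" monitor at 60 cm. The 16:9 aspect ratio and the centre-to-edge reading of "viewing angle" are assumptions made for this example; the function names are illustrative.

```python
import math


def ema(values, alpha=0.3):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed, prev = [], None
    for x in values:
        prev = x if prev is None else alpha * x + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# Step response: the raw score jumps from 0 to 1 and stays there.
step = ema([0.0] + [1.0] * 5)
print([round(s, 4) for s in step])


def half_view_angle_deg(diagonal_in=24.0, distance_cm=60.0, aspect=(16, 9)):
    """Horizontal angle from screen centre to screen edge (assumes 16:9)."""
    w, h = aspect
    width_cm = diagonal_in * 2.54 * w / math.hypot(w, h)
    return math.degrees(math.atan((width_cm / 2.0) / distance_cm))


angle = half_view_angle_deg()
print(round(angle, 1))
```

After 3 to 4 frames the smoothed score has covered roughly 66% to 76% of the step, in line with the stated effective window, and the computed half-angle falls inside the quoted 20-25 degree range around max_angle = 22.0 deg.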