Threshold Justification Report
Auto-generated by evaluation/justify_thresholds.py using LOPO cross-validation over 9 participants (~145k samples).
1. ML Model Decision Thresholds
Thresholds selected via Youden's J statistic (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions.
| Model | LOPO AUC | Optimal Threshold (Youden's J) | F1 @ Optimal | F1 @ 0.50 |
|---|---|---|---|---|
| MLP | 0.8624 | 0.228 | 0.8578 | 0.8149 |
| XGBoost | 0.8804 | 0.377 | 0.8585 | 0.8424 |
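The threshold selection described above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in evaluation/justify_thresholds.py: the function name `youden_threshold` and the toy labels and scores are assumptions made for the example.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A sample is predicted positive when its score is >= threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best_t, best_j = None, float("-inf")
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        # sensitivity - (1 - specificity) is the same as J
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy pooled held-out predictions (hypothetical values)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.7, 0.8]
threshold, j_stat = youden_threshold(y_true, scores)
```

On real data the same search is usually done over the points of the ROC curve rather than by rescanning the samples for each candidate.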
2. Precision, Recall and Tradeoff
At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions:
| Model | Threshold | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| MLP | 0.228 | 0.8187 | 0.9008 | 0.8578 | 0.8164 |
| XGBoost | 0.377 | 0.8426 | 0.8750 | 0.8585 | 0.8228 |
Raising the threshold yields fewer positive predictions, which generally raises precision and lowers recall. Youden's J selects the threshold that maximises sensitivity + specificity - 1, i.e. the sum of recall on the positive class and the true negative rate.
3. Confusion Matrix (Pooled LOPO)
At optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused).
MLP
| | Pred 0 | Pred 1 |
|---|---|---|
| True 0 | 38065 (TN) | 17750 (FP) |
| True 1 | 8831 (FN) | 80147 (TP) |
TN=38065, FP=17750, FN=8831, TP=80147.
XGBoost
| | Pred 0 | Pred 1 |
|---|---|---|
| True 0 | 41271 (TN) | 14544 (FP) |
| True 1 | 11118 (FN) | 77860 (TP) |
TN=41271, FP=14544, FN=11118, TP=77860.
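As a cross-check, the precision, recall, F1 and accuracy reported in Section 2 can be recomputed from these four counts. The helper name `metrics_from_confusion` is an assumption made for this sketch.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the standard binary metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "precision": tp / (tp + fp),
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "youden_j": recall + specificity - 1,
    }


# Counts copied from the pooled LOPO confusion matrices above
mlp = metrics_from_confusion(tn=38065, fp=17750, fn=8831, tp=80147)
xgb = metrics_from_confusion(tn=41271, fp=14544, fn=11118, tp=77860)
```

Both matrices sum to 144,793 samples, which matches the "~145k samples" stated at the top of the report.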
4. Per-Person Performance Variance (LOPO)
One fold per left-out person; metrics at optimal threshold.
MLP — per held-out person
| Person | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| Abdelrahman | 0.8628 | 0.9029 | 0.8760 | 0.9314 |
| Jarek | 0.8400 | 0.8770 | 0.8909 | 0.8635 |
| Junhao | 0.8872 | 0.8986 | 0.8354 | 0.9723 |
| Kexin | 0.7941 | 0.8123 | 0.7965 | 0.8288 |
| Langyuan | 0.5877 | 0.6169 | 0.4972 | 0.8126 |
| Mohamed | 0.8432 | 0.8653 | 0.7931 | 0.9519 |
| Yingtao | 0.8794 | 0.9263 | 0.9217 | 0.9309 |
| ayten | 0.8307 | 0.8986 | 0.8558 | 0.9459 |
| saba | 0.9192 | 0.9243 | 0.9260 | 0.9226 |
XGBoost — per held-out person
| Person | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| Abdelrahman | 0.8601 | 0.8959 | 0.9129 | 0.8795 |
| Jarek | 0.8680 | 0.8993 | 0.9070 | 0.8917 |
| Junhao | 0.9099 | 0.9180 | 0.8627 | 0.9810 |
| Kexin | 0.7363 | 0.7385 | 0.7906 | 0.6928 |
| Langyuan | 0.6738 | 0.6945 | 0.5625 | 0.9074 |
| Mohamed | 0.8868 | 0.8988 | 0.8529 | 0.9498 |
| Yingtao | 0.8711 | 0.9195 | 0.9347 | 0.9048 |
| ayten | 0.8451 | 0.9070 | 0.8654 | 0.9528 |
| saba | 0.9393 | 0.9421 | 0.9615 | 0.9235 |
Summary across persons
| Model | Accuracy mean ± std | F1 mean ± std | Precision mean ± std | Recall mean ± std |
|---|---|---|---|---|
| MLP | 0.8271 ± 0.0968 | 0.8580 ± 0.0968 | 0.8214 ± 0.1307 | 0.9067 ± 0.0572 |
| XGBoost | 0.8434 ± 0.0847 | 0.8682 ± 0.0879 | 0.8500 ± 0.1191 | 0.8981 ± 0.0836 |
5. Confidence Intervals (95%, LOPO over 9 persons)
Mean with its 95% t-interval [lower, upper] (df = 8) for each metric across the 9 left-out persons.
| Model | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|
| MLP | 0.8580 [0.7835, 0.9326] | 0.8271 [0.7526, 0.9017] | 0.8214 [0.7207, 0.9221] | 0.9067 [0.8626, 0.9507] |
| XGBoost | 0.8682 [0.8005, 0.9358] | 0.8434 [0.7781, 0.9086] | 0.8500 [0.7583, 0.9417] | 0.8981 [0.8338, 0.9625] |
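The intervals above follow from the per-person values in Section 4. A minimal sketch, assuming the sample standard deviation (n - 1 denominator) and the two-sided 95% Student's t critical value for df = 8 (2.306); the function name `mean_std_ci` is hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for df = 8 (valid for n = 9 only)
T_CRIT_DF8 = 2.306


def mean_std_ci(values, t_crit=T_CRIT_DF8):
    """Return (mean, sample std, (lower, upper)) for a t-interval."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)            # n - 1 denominator
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# MLP per-person F1 values from the Section 4 table
mlp_f1 = [0.9029, 0.8770, 0.8986, 0.8123, 0.6169,
          0.8653, 0.9263, 0.8986, 0.9243]
mean, std, (lower, upper) = mean_std_ci(mlp_f1)
```

Because the inputs here are already rounded to four decimals, the recomputed bounds can differ from the table in the last digit.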
6. Geometric Pipeline Weights (s_face vs s_eye)
Grid search over face weight alpha in {0.2 ... 0.8}. Eye weight = 1 - alpha. Threshold per fold via Youden's J.
| Face Weight (alpha) | Mean LOPO F1 |
|---|---|
| 0.2 | 0.7926 |
| 0.3 | 0.8002 |
| 0.4 | 0.7719 |
| 0.5 | 0.7868 |
| 0.6 | 0.8184 |
| 0.7 | 0.8195 <-- selected |
| 0.8 | 0.8126 |
Best: alpha = 0.7 (face 70%, eye 30%)
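The blend and the selection step can be sketched as below. The function name `geometric_score` is an assumption; the F1 values are copied from the table above rather than recomputed.

```python
def geometric_score(s_face, s_eye, alpha=0.7):
    """Weighted blend of face and eye sub-scores; eye weight is 1 - alpha."""
    return alpha * s_face + (1.0 - alpha) * s_eye


# Mean LOPO F1 per face weight, copied from the grid-search table
mean_lopo_f1 = {
    0.2: 0.7926, 0.3: 0.8002, 0.4: 0.7719, 0.5: 0.7868,
    0.6: 0.8184, 0.7: 0.8195, 0.8: 0.8126,
}

# Pick the alpha with the highest mean LOPO F1
best_alpha = max(mean_lopo_f1, key=mean_lopo_f1.get)
```

The margin between alpha = 0.6 and alpha = 0.7 is about 0.001 in F1, so the two settings are close to equivalent on this data.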
7. Hybrid Pipeline: MLP vs Geometric
Grid search over w_mlp in {0.3 ... 0.8}. w_geo = 1 - w_mlp. Geometric sub-score uses same weights as geometric pipeline (face=0.7, eye=0.3).
| MLP Weight (w_mlp) | Mean LOPO F1 |
|---|---|
| 0.3 | 0.8409 <-- selected |
| 0.4 | 0.8246 |
| 0.5 | 0.8164 |
| 0.6 | 0.8106 |
| 0.7 | 0.8039 |
| 0.8 | 0.8016 |
Best: w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409
8. Hybrid Pipeline: XGBoost vs Geometric
Same grid over w_xgb in {0.3 ... 0.8}. w_geo = 1 - w_xgb.
| XGBoost Weight (w_xgb) | Mean LOPO F1 |
|---|---|
| 0.3 | 0.8639 <-- selected |
| 0.4 | 0.8552 |
| 0.5 | 0.8451 |
| 0.6 | 0.8419 |
| 0.7 | 0.8382 |
| 0.8 | 0.8353 |
Best: w_xgb = 0.3 → mean LOPO F1 = 0.8639
Which hybrid is used in the app?
The XGBoost hybrid, which scores higher (mean LOPO F1 = 0.8639 vs 0.8409 for the MLP hybrid).
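A minimal sketch of the fixed-weight hybrid score, assuming the model output is a probability in [0, 1] and both geometric sub-scores are also in [0, 1]. The function name `hybrid_score` and the example inputs are hypothetical.

```python
def hybrid_score(model_prob, s_face, s_eye, w_model=0.3, alpha=0.7):
    """Blend a model probability with the geometric score.

    w_model is the ML weight (w_mlp or w_xgb); w_geo = 1 - w_model.
    alpha is the face weight inside the geometric sub-score.
    """
    geo = alpha * s_face + (1.0 - alpha) * s_eye
    return w_model * model_prob + (1.0 - w_model) * geo


# Example: confident model, fairly frontal face, partly open eyes
example = hybrid_score(model_prob=0.9, s_face=0.8, s_eye=0.6)
```

With the selected weights the geometric score contributes 70% of the result, so a confident model cannot by itself push the hybrid score above 0.3.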
Logistic regression combiner (replaces heuristic weights)
Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a logistic regression combines model probability and geometric score: meta-features = [model_prob, geo_score], trained on the same LOPO splits. Threshold from Youden's J on combiner output.
| Method | Mean LOPO F1 |
|---|---|
| Heuristic weight grid (best w) | 0.8639 |
| LR combiner | 0.8241 |
The app uses the saved LR combiner when combiner_path is set in hybrid_focus_config.json.
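To illustrate what such a combiner does, the sketch below fits a two-feature logistic regression on [model_prob, geo_score] with plain batch gradient descent. It is not the saved combiner or its training code: the function names, learning rate, epoch count and toy data are all assumptions, and a real pipeline would more likely use a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_combiner(features, labels, lr=0.5, epochs=2000):
    """Fit a bias and two weights by batch gradient descent on log-loss."""
    b, w = 0.0, [0.0, 0.0]
    n = len(labels)
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0, 0.0]
        for (prob, geo), y in zip(features, labels):
            err = sigmoid(b + w[0] * prob + w[1] * geo) - y
            grad_b += err
            grad_w[0] += err * prob
            grad_w[1] += err * geo
        b -= lr * grad_b / n
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
    return b, w


def combine(b, w, model_prob, geo_score):
    """Combined focus probability from the two meta-features."""
    return sigmoid(b + w[0] * model_prob + w[1] * geo_score)


# Toy meta-features (hypothetical): (model_prob, geo_score) -> focused?
X = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.75), (0.2, 0.3), (0.1, 0.2), (0.3, 0.1)]
y = [1, 1, 1, 0, 0, 0]
bias, weights = fit_combiner(X, y)
```

Unlike the fixed blend, the learned weights and bias are not constrained to sum to one, which is why the combiner output needs its own Youden's J threshold.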
9. Eye and Mouth Aspect Ratio Thresholds
EAR (Eye Aspect Ratio)
Reference: Soukupova & Cech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold.
Our thresholds define a linear interpolation zone around this established value:
| Constant | Value | Justification |
|---|---|---|
| ear_closed | 0.16 | Below this, eyes are fully shut. 16.3% of samples fall here. |
| EAR_BLINK_THRESH | 0.21 | Blink detection point; close to the 0.2 reference. 21.2% of samples below. |
| ear_open | 0.30 | Above this, eyes are fully open. 70.4% of samples here. |
Between 0.16 and 0.30 the _ear_score function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff.
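The interpolation can be written in a few lines. This is a sketch of the behaviour described above, not the project's _ear_score source; the name `ear_score` and the clamping outside the zone are assumptions consistent with the table.

```python
EAR_CLOSED = 0.16  # ear_closed from the table above
EAR_OPEN = 0.30    # ear_open from the table above


def ear_score(ear, closed=EAR_CLOSED, opened=EAR_OPEN):
    """Map EAR to [0, 1]: 0 at or below `closed`, 1 at or above `opened`."""
    t = (ear - closed) / (opened - closed)
    return min(1.0, max(0.0, t))
```

For example, an EAR of 0.23 sits at the midpoint of the zone and scores about 0.5, while the blink threshold of 0.21 scores about 0.36.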
MAR (Mouth Aspect Ratio)
| Constant | Value | Justification |
|---|---|---|
| MAR_YAWN_THRESHOLD | 0.55 | Only 1.7% of samples exceed this, consistent with the threshold firing on rare, yawn-like mouth openings rather than on normal speech. |
10. Other Constants
| Constant | Value | Rationale |
|---|---|---|
| gaze_max_offset | 0.28 | Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge. |
| max_angle | 22.0 deg | Head deviation beyond which face score = 0. Based on typical monitor-viewing cone: at 60 cm distance and a 24" monitor, the viewing angle is ~20-25 degrees. |
| roll_weight | 0.5 | Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%. |
| EMA alpha | 0.3 | Smoothing factor for focus score. Gives ~3-4 frame effective window; balances responsiveness vs flicker. |
| grace_frames | 15 | ~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score. |
| PERCLOS_WINDOW | 60 frames | 2 s at 30 fps; standard PERCLOS measurement window (Dinges & Grace, 1998). |
| BLINK_WINDOW_SEC | 30 s | Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997). |
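Two of these constants are easiest to read in code. The sketch below shows exponential smoothing with alpha = 0.3 and a rolling PERCLOS estimate over a 60-frame window. The names `ema_smooth` and `Perclos` are hypothetical, and using ear_closed (0.16) as the "eye closed" criterion is an assumption, since the report does not state which EAR threshold PERCLOS uses.

```python
from collections import deque

EMA_ALPHA = 0.3        # smoothing factor from the table
PERCLOS_WINDOW = 60    # frames (2 s at 30 fps)
EAR_CLOSED = 0.16      # assumed closed-eye criterion for PERCLOS


def ema_smooth(values, alpha=EMA_ALPHA):
    """Exponential moving average, seeded with the first value."""
    smoothed, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


class Perclos:
    """Fraction of recent frames in which the eyes count as closed."""

    def __init__(self, window=PERCLOS_WINDOW, closed_thresh=EAR_CLOSED):
        self.flags = deque(maxlen=window)   # oldest frames drop off
        self.closed_thresh = closed_thresh

    def update(self, ear):
        self.flags.append(ear < self.closed_thresh)
        return sum(self.flags) / len(self.flags)
```

With alpha = 0.3 a step change from 0 to 1 reaches 0.3 after one frame and 0.51 after two, which matches the few-frame response described in the table.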








