| # Threshold Justification Report | |
| Auto-generated by `evaluation/justify_thresholds.py` using leave-one-person-out (LOPO) cross-validation over 9 participants (~145k samples). | |
| ## 1. ML Model Decision Thresholds | |
| Thresholds selected via **Youden's J statistic** (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions. | |
| | Model | LOPO AUC | Optimal Threshold (Youden's J) | F1 @ Optimal | F1 @ 0.50 | | |
| |-------|----------|-------------------------------|--------------|-----------| | |
| | MLP | 0.8624 | **0.228** | 0.8578 | 0.8149 | | |
| | XGBoost | 0.8804 | **0.377** | 0.8585 | 0.8424 | | |
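As an illustration of how a Youden's J threshold can be selected, here is a minimal pure-Python sketch. The function name `youden_threshold` and the toy labels and probabilities are hypothetical; the actual logic in `evaluation/justify_thresholds.py` may differ (for example, it may use a ROC-curve routine from a library).

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A sample is predicted positive when its probability is >= threshold.
    Ties are resolved in favour of the lowest threshold.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")

    best_t, best_j = None, -1.0
    # Every distinct predicted probability is a candidate threshold.
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy data (not taken from the report).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.60, 0.70, 0.90]
threshold, j_stat = youden_threshold(labels, probs)
print(threshold, j_stat)
```

On this toy data the selected threshold falls below 0.50, the same pattern as in the table above.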
|  | |
|  | |
| ## 2. Precision, Recall and Tradeoff | |
| At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions: | |
| | Model | Threshold | Precision | Recall | F1 | Accuracy | | |
| |-------|----------:|----------:|-------:|---:|---------:| | |
| | MLP | 0.228 | 0.8187 | 0.9008 | 0.8578 | 0.8164 | | |
| | XGBoost | 0.377 | 0.8426 | 0.8750 | 0.8585 | 0.8228 | | |
| Raising the threshold yields fewer positive predictions, which lowers recall and typically raises precision. Youden's J picks the threshold that maximises sensitivity + specificity - 1, where sensitivity is recall for the positive class and specificity is the true negative rate. | |
| ## 3. Confusion Matrix (Pooled LOPO) | |
| At optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused). | |
| ### MLP | |
| | | Pred 0 | Pred 1 | | |
| |--|-------:|-------:| | |
| | **True 0** | 38065 (TN) | 17750 (FP) | | |
| | **True 1** | 8831 (FN) | 80147 (TP) | | |
| TN=38065, FP=17750, FN=8831, TP=80147. | |
| ### XGBoost | |
| | | Pred 0 | Pred 1 | | |
| |--|-------:|-------:| | |
| | **True 0** | 41271 (TN) | 14544 (FP) | | |
| | **True 1** | 11118 (FN) | 77860 (TP) | | |
| TN=41271, FP=14544, FN=11118, TP=77860. | |
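The pooled metrics in section 2 follow directly from these counts. The sketch below recomputes them; the helper name `metrics_from_counts` is hypothetical and is not claimed to be part of the project code.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive standard binary-classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)          # recall for the positive class
    specificity = tn / (tn + fp)          # true negative rate
    return {
        "precision": tp / (tp + fp),
        "recall": sensitivity,
        "specificity": specificity,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "youden_j": sensitivity + specificity - 1.0,
    }


# Counts copied from the pooled LOPO confusion matrices above.
mlp = metrics_from_counts(tn=38065, fp=17750, fn=8831, tp=80147)
xgb = metrics_from_counts(tn=41271, fp=14544, fn=11118, tp=77860)

for name, m in (("MLP", mlp), ("XGBoost", xgb)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Rounded to four decimals, precision, recall, F1 and accuracy agree with the table in section 2.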
|  | |
|  | |
| ## 4. Per-Person Performance Variance (LOPO) | |
| One fold per left-out person; metrics at optimal threshold. | |
| ### MLP — per held-out person | |
| | Person | Accuracy | F1 | Precision | Recall | | |
| |--------|---------:|---:|----------:|-------:| | |
| | Abdelrahman | 0.8628 | 0.9029 | 0.8760 | 0.9314 | | |
| | Jarek | 0.8400 | 0.8770 | 0.8909 | 0.8635 | | |
| | Junhao | 0.8872 | 0.8986 | 0.8354 | 0.9723 | | |
| | Kexin | 0.7941 | 0.8123 | 0.7965 | 0.8288 | | |
| | Langyuan | 0.5877 | 0.6169 | 0.4972 | 0.8126 | | |
| | Mohamed | 0.8432 | 0.8653 | 0.7931 | 0.9519 | | |
| | Yingtao | 0.8794 | 0.9263 | 0.9217 | 0.9309 | | |
| | ayten | 0.8307 | 0.8986 | 0.8558 | 0.9459 | | |
| | saba | 0.9192 | 0.9243 | 0.9260 | 0.9226 | | |
| ### XGBoost — per held-out person | |
| | Person | Accuracy | F1 | Precision | Recall | | |
| |--------|---------:|---:|----------:|-------:| | |
| | Abdelrahman | 0.8601 | 0.8959 | 0.9129 | 0.8795 | | |
| | Jarek | 0.8680 | 0.8993 | 0.9070 | 0.8917 | | |
| | Junhao | 0.9099 | 0.9180 | 0.8627 | 0.9810 | | |
| | Kexin | 0.7363 | 0.7385 | 0.7906 | 0.6928 | | |
| | Langyuan | 0.6738 | 0.6945 | 0.5625 | 0.9074 | | |
| | Mohamed | 0.8868 | 0.8988 | 0.8529 | 0.9498 | | |
| | Yingtao | 0.8711 | 0.9195 | 0.9347 | 0.9048 | | |
| | ayten | 0.8451 | 0.9070 | 0.8654 | 0.9528 | | |
| | saba | 0.9393 | 0.9421 | 0.9615 | 0.9235 | | |
| ### Summary across persons | |
| | Model | Accuracy mean ± std | F1 mean ± std | Precision mean ± std | Recall mean ± std | | |
| |-------|---------------------|---------------|----------------------|-------------------| | |
| | MLP | 0.8271 ± 0.0968 | 0.8580 ± 0.0968 | 0.8214 ± 0.1307 | 0.9067 ± 0.0572 | | |
| | XGBoost | 0.8434 ± 0.0847 | 0.8682 ± 0.0879 | 0.8500 ± 0.1191 | 0.8981 ± 0.0836 | | |
| ## 5. Confidence Intervals (95%, LOPO over 9 persons) | |
| Mean and 95% t-interval [lower, upper] (df = 8) for each metric across the 9 left-out persons. | |
| | Model | F1 | Accuracy | Precision | Recall | | |
| |-------|---:|--------:|----------:|-------:| | |
| | MLP | 0.8580 [0.7835, 0.9326] | 0.8271 [0.7526, 0.9017] | 0.8214 [0.7207, 0.9221] | 0.9067 [0.8626, 0.9507] | | |
| | XGBoost | 0.8682 [0.8005, 0.9358] | 0.8434 [0.7781, 0.9086] | 0.8500 [0.7583, 0.9417] | 0.8981 [0.8338, 0.9625] | | |
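For reference, an interval of this kind can be reproduced from the per-person values in section 4. The sketch below uses only the standard library and hard-codes the two-sided 95% critical value of Student's t for 8 degrees of freedom (about 2.306). The helper name `t_interval` is hypothetical, and because the inputs are the rounded table values, the result matches the table only to about three decimals.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with df = 8.
T_CRIT_DF8 = 2.306


def t_interval(values, t_crit):
    """Return (mean, lower, upper) of a t-based confidence interval."""
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


# Per-person MLP F1 scores, copied from the table in section 4.
mlp_f1 = [0.9029, 0.8770, 0.8986, 0.8123, 0.6169, 0.8653, 0.9263, 0.8986, 0.9243]
mean, lower, upper = t_interval(mlp_f1, T_CRIT_DF8)
print(f"{mean:.4f} [{lower:.4f}, {upper:.4f}]")
```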
| ## 6. Geometric Pipeline Weights (s_face vs s_eye) | |
| Grid search over face weight alpha in {0.2, 0.3, ..., 0.8}. Eye weight = 1 - alpha. Threshold per fold via Youden's J. | |
| | Face Weight (alpha) | Mean LOPO F1 | | |
| |--------------------:|-------------:| | |
| | 0.2 | 0.7926 | | |
| | 0.3 | 0.8002 | | |
| | 0.4 | 0.7719 | | |
| | 0.5 | 0.7868 | | |
| | 0.6 | 0.8184 | | |
| | 0.7 | 0.8195 **<-- selected** | | |
| | 0.8 | 0.8126 | | |
| **Best:** alpha = 0.7 (face 70%, eye 30%) | |
|  | |
| ## 7. Hybrid Pipeline: MLP vs Geometric | |
| Grid search over w_mlp in {0.3, 0.4, ..., 0.8}. w_geo = 1 - w_mlp. Geometric sub-score uses the same weights as the geometric pipeline (face = 0.7, eye = 0.3). | |
| | MLP Weight (w_mlp) | Mean LOPO F1 | | |
| |-------------------:|-------------:| | |
| | 0.3 | 0.8409 **<-- selected** | | |
| | 0.4 | 0.8246 | | |
| | 0.5 | 0.8164 | | |
| | 0.6 | 0.8106 | | |
| | 0.7 | 0.8039 | | |
| | 0.8 | 0.8016 | | |
| **Best:** w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409 | |
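The two-level weighting can be sketched as follows: the geometric score blends the face and eye scores, and the hybrid score blends the model probability with that geometric score. The function names `blend` and `hybrid_score` are hypothetical, and the sketch assumes all inputs are already scaled to [0, 1].

```python
def blend(a, b, weight_a):
    """Convex combination: weight_a * a + (1 - weight_a) * b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight_a * a + (1.0 - weight_a) * b


def hybrid_score(model_prob, s_face, s_eye, alpha=0.7, w_model=0.3):
    """Blend a model probability with the geometric score.

    alpha   -- face weight inside the geometric score (eye weight = 1 - alpha)
    w_model -- model weight in the hybrid (geometric weight = 1 - w_model)
    """
    geo = blend(s_face, s_eye, alpha)
    return blend(model_prob, geo, w_model)


# Example inputs are illustrative only.
print(hybrid_score(model_prob=0.9, s_face=1.0, s_eye=0.5))
```

The defaults correspond to the selected values in the tables (face 0.7, model 0.3); a grid search would call `hybrid_score` with each candidate weight and compare mean LOPO F1.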
|  | |
| ## 8. Hybrid Pipeline: XGBoost vs Geometric | |
| Same grid over w_xgb in {0.3, 0.4, ..., 0.8}. w_geo = 1 - w_xgb. | |
| | XGBoost Weight (w_xgb) | Mean LOPO F1 | | |
| |-----------------------:|-------------:| | |
| | 0.3 | 0.8639 **<-- selected** | | |
| | 0.4 | 0.8552 | | |
| | 0.5 | 0.8451 | | |
| | 0.6 | 0.8419 | | |
| | 0.7 | 0.8382 | | |
| | 0.8 | 0.8353 | | |
| **Best:** w_xgb = 0.3 → mean LOPO F1 = 0.8639 | |
|  | |
| ### Which hybrid is used in the app? | |
| **The XGBoost hybrid scores higher**: mean LOPO F1 = 0.8639, versus 0.8409 for the MLP hybrid. | |
| ### Logistic regression combiner (replaces heuristic weights) | |
| Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a **logistic regression** combines the model probability and the geometric score. Its meta-features are [model_prob, geo_score], it is trained on the same LOPO splits, and its decision threshold comes from Youden's J on the combiner output. | |
| | Method | Mean LOPO F1 | | |
| |--------|-------------:| | |
| | Heuristic weight grid (best w) | 0.8639 | | |
| | **LR combiner** | **0.8241** | | |
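| The app uses the saved LR combiner when `combiner_path` is set in `hybrid_focus_config.json`. Note that in the table above the LR combiner's mean LOPO F1 (0.8241) is lower than that of the best heuristic weight (0.8639). | |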
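To show the functional form of such a combiner, the sketch below applies a logistic model to the two meta-features. The coefficients are placeholders chosen for illustration only; they are not the fitted values stored at `combiner_path`, and the name `lr_combine` is hypothetical.

```python
import math


def lr_combine(model_prob, geo_score, intercept, w_model, w_geo):
    """Logistic combination of [model_prob, geo_score] into one probability."""
    z = intercept + w_model * model_prob + w_geo * geo_score
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder coefficients (assumed, NOT the trained combiner's values).
PLACEHOLDER_COEFS = {"intercept": -2.0, "w_model": 2.0, "w_geo": 2.0}

focused = lr_combine(0.9, 0.9, **PLACEHOLDER_COEFS)
unfocused = lr_combine(0.1, 0.1, **PLACEHOLDER_COEFS)
print(focused, unfocused)
```

Unlike the fixed blend, the fitted intercept and weights need not sum to one, so the combiner output is thresholded separately.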
| ## 9. Eye and Mouth Aspect Ratio Thresholds | |
| ### EAR (Eye Aspect Ratio) | |
| Reference: Soukupova & Cech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold. | |
| Our thresholds define a linear interpolation zone around this established value: | |
| | Constant | Value | Justification | | |
| |----------|------:|---------------| | |
| | `ear_closed` | 0.16 | Below this, eyes are treated as fully shut. 16.3% of samples fall below it. | |
| | `EAR_BLINK_THRESH` | 0.21 | Blink detection point; close to the 0.2 reference. 21.2% of samples fall below it. | |
| | `ear_open` | 0.30 | Above this, eyes are treated as fully open. 70.4% of samples fall above it. | |
| Between 0.16 and 0.30 the `_ear_score` function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff. | |
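A minimal sketch of this interpolation, assuming the score is clamped to 0 below `ear_closed` and to 1 above `ear_open`. The function here is named `ear_score` to make clear it is an illustration and not the project's actual `_ear_score` implementation.

```python
def ear_score(ear, ear_closed=0.16, ear_open=0.30):
    """Map an eye aspect ratio to a score in [0, 1] by linear interpolation."""
    if ear <= ear_closed:
        return 0.0  # treated as fully shut
    if ear >= ear_open:
        return 1.0  # treated as fully open
    return (ear - ear_closed) / (ear_open - ear_closed)


for value in (0.10, 0.16, 0.23, 0.30, 0.35):
    print(value, round(ear_score(value), 3))
```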
|  | |
| ### MAR (Mouth Aspect Ratio) | |
| | Constant | Value | Justification | | |
| |----------|------:|---------------| | |
| | `MAR_YAWN_THRESHOLD` | 0.55 | Only 1.7% of samples exceed this, which is consistent with the threshold capturing rare events such as genuine yawns; rarity alone does not rule out false positives. | |
|  | |
| ## 10. Other Constants | |
| | Constant | Value | Rationale | | |
| |----------|------:|-----------| | |
| | `gaze_max_offset` | 0.28 | Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge. | | |
| | `max_angle` | 22.0 deg | Head deviation beyond which face score = 0. Based on typical monitor-viewing cone: at 60 cm distance and a 24" monitor, the viewing angle is ~20-25 degrees. | | |
| | `roll_weight` | 0.5 | Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%. | | |
| | `EMA alpha` | 0.3 | Smoothing factor for focus score. Gives ~3-4 frame effective window; balances responsiveness vs flicker. | | |
| | `grace_frames` | 15 | ~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score. | | |
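| | `PERCLOS_WINDOW` | 60 frames | 2 s at 30 fps; a short rolling window for the PERCLOS measure (Dinges & Grace, 1998). PERCLOS is more commonly reported over windows on the order of a minute, so values here reflect short-term eye closure. | |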
| | `BLINK_WINDOW_SEC` | 30 s | Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997). | | |
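As an illustration of the `EMA alpha` smoothing listed above, the sketch below applies an exponential moving average to a sequence of per-frame focus scores. Seeding the average with the first sample is an assumption, and `ema_smooth` is a hypothetical name.

```python
def ema_smooth(scores, alpha=0.3):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    prev = None
    for x in scores:
        # Assumption: the first sample seeds the average unchanged.
        prev = x if prev is None else alpha * x + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# A step from unfocused (0) to focused (1): the smoothed score rises gradually.
print([round(s, 3) for s in ema_smooth([0.0, 1.0, 1.0, 1.0, 1.0])])
```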