# Threshold Justification Report

Auto-generated by `evaluation/justify_thresholds.py` using LOPO cross-validation over 9 participants (~145k samples).

## 1. ML Model Decision Thresholds

Thresholds selected via **Youden's J statistic** (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions.

| Model | LOPO AUC | Optimal Threshold (Youden's J) | F1 @ Optimal | F1 @ 0.50 |
|-------|----------|-------------------------------|--------------|-----------|
| MLP | 0.8624 | **0.228** | 0.8578 | 0.8149 |
| XGBoost | 0.8804 | **0.377** | 0.8585 | 0.8424 |

![MLP ROC](plots/roc_mlp.png)
![XGBoost ROC](plots/roc_xgboost.png)

## 2. Precision, Recall and Tradeoff

At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions:

| Model | Threshold | Precision | Recall | F1 | Accuracy |
|-------|----------:|----------:|-------:|---:|---------:|
| MLP | 0.228 | 0.8187 | 0.9008 | 0.8578 | 0.8164 |
| XGBoost | 0.377 | 0.8426 | 0.8750 | 0.8585 | 0.8228 |

Higher threshold → fewer positive predictions → higher precision, lower recall. Youden's J picks the threshold that balances sensitivity and specificity (recall for the positive class and true negative rate).

## 3. Confusion Matrix (Pooled LOPO)

At optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused).

### MLP

| | Pred 0 | Pred 1 |
|--|-------:|-------:|
| **True 0** | 38065 (TN) | 17750 (FP) |
| **True 1** | 8831 (FN) | 80147 (TP) |

TN=38065, FP=17750, FN=8831, TP=80147.

### XGBoost

| | Pred 0 | Pred 1 |
|--|-------:|-------:|
| **True 0** | 41271 (TN) | 14544 (FP) |
| **True 1** | 11118 (FN) | 77860 (TP) |

TN=41271, FP=14544, FN=11118, TP=77860.

![Confusion MLP](plots/confusion_matrix_mlp.png)
![Confusion XGBoost](plots/confusion_matrix_xgb.png)

## 4. Per-Person Performance Variance (LOPO)

One fold per left-out person; metrics at optimal threshold.
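
The threshold selection from Section 1 and the per-person evaluation described here can be sketched as follows. This is a minimal, dependency-free illustration and not the code in `evaluation/justify_thresholds.py`; the function names (`youden_threshold`, `metrics_at`, `per_person_metrics`) and the toy data are assumptions made for the example, and it assumes a sample is predicted positive when its probability is at or above the threshold.

```python
from collections import defaultdict


def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Assumes a prediction is positive when prob >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Sweep the threshold downwards over the sorted scores.
    pairs = sorted(zip(y_prob, y_true), reverse=True)
    best_j, best_t = -1.0, None
    tp = fp = 0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Absorb every sample tied at this score before evaluating J.
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        j = tp / n_pos - fp / n_neg  # TPR - FPR == sensitivity + specificity - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


def metrics_at(y_true, y_prob, threshold):
    """Accuracy, precision, recall and F1 at a fixed threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}


def per_person_metrics(person_ids, y_true, y_prob, threshold):
    """Group held-out predictions by person and score each group separately."""
    groups = defaultdict(lambda: ([], []))
    for pid, y, p in zip(person_ids, y_true, y_prob):
        groups[pid][0].append(y)
        groups[pid][1].append(p)
    return {pid: metrics_at(ys, ps, threshold) for pid, (ys, ps) in groups.items()}


# Toy held-out predictions for two hypothetical participants "a" and "b".
persons = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
probs = [0.9, 0.6, 0.4, 0.1, 0.8, 0.3, 0.35, 0.2]

threshold, j_stat = youden_threshold(labels, probs)
by_person = per_person_metrics(persons, labels, probs, threshold)
```

On the toy data the pooled threshold lands at 0.6, and participant "b" ends up with lower recall than "a" at that same threshold, which is the kind of per-person spread the tables below report.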
### MLP: per held-out person

| Person | Accuracy | F1 | Precision | Recall |
|--------|---------:|---:|----------:|-------:|
| Abdelrahman | 0.8628 | 0.9029 | 0.8760 | 0.9314 |
| Jarek | 0.8400 | 0.8770 | 0.8909 | 0.8635 |
| Junhao | 0.8872 | 0.8986 | 0.8354 | 0.9723 |
| Kexin | 0.7941 | 0.8123 | 0.7965 | 0.8288 |
| Langyuan | 0.5877 | 0.6169 | 0.4972 | 0.8126 |
| Mohamed | 0.8432 | 0.8653 | 0.7931 | 0.9519 |
| Yingtao | 0.8794 | 0.9263 | 0.9217 | 0.9309 |
| ayten | 0.8307 | 0.8986 | 0.8558 | 0.9459 |
| saba | 0.9192 | 0.9243 | 0.9260 | 0.9226 |

### XGBoost: per held-out person

| Person | Accuracy | F1 | Precision | Recall |
|--------|---------:|---:|----------:|-------:|
| Abdelrahman | 0.8601 | 0.8959 | 0.9129 | 0.8795 |
| Jarek | 0.8680 | 0.8993 | 0.9070 | 0.8917 |
| Junhao | 0.9099 | 0.9180 | 0.8627 | 0.9810 |
| Kexin | 0.7363 | 0.7385 | 0.7906 | 0.6928 |
| Langyuan | 0.6738 | 0.6945 | 0.5625 | 0.9074 |
| Mohamed | 0.8868 | 0.8988 | 0.8529 | 0.9498 |
| Yingtao | 0.8711 | 0.9195 | 0.9347 | 0.9048 |
| ayten | 0.8451 | 0.9070 | 0.8654 | 0.9528 |
| saba | 0.9393 | 0.9421 | 0.9615 | 0.9235 |

### Summary across persons

| Model | Accuracy mean ± std | F1 mean ± std | Precision mean ± std | Recall mean ± std |
|-------|---------------------|---------------|----------------------|-------------------|
| MLP | 0.8271 ± 0.0968 | 0.8580 ± 0.0968 | 0.8214 ± 0.1307 | 0.9067 ± 0.0572 |
| XGBoost | 0.8434 ± 0.0847 | 0.8682 ± 0.0879 | 0.8500 ± 0.1191 | 0.8981 ± 0.0836 |

## 5. Confidence Intervals (95%, LOPO over 9 persons)

Each cell gives the mean across the 9 left-out persons followed by the 95% t-interval (df=8) as [lower, upper].

| Model | F1 | Accuracy | Precision | Recall |
|-------|---:|--------:|----------:|-------:|
| MLP | 0.8580 [0.7835, 0.9326] | 0.8271 [0.7526, 0.9017] | 0.8214 [0.7207, 0.9221] | 0.9067 [0.8626, 0.9507] |
| XGBoost | 0.8682 [0.8005, 0.9358] | 0.8434 [0.7781, 0.9086] | 0.8500 [0.7583, 0.9417] | 0.8981 [0.8338, 0.9625] |

## 6. Geometric Pipeline Weights (s_face vs s_eye)

Grid search over face weight alpha in {0.2 ... 0.8}. Eye weight = 1 - alpha. Threshold per fold via Youden's J.

| Face Weight (alpha) | Mean LOPO F1 |
|--------------------:|-------------:|
| 0.2 | 0.7926 |
| 0.3 | 0.8002 |
| 0.4 | 0.7719 |
| 0.5 | 0.7868 |
| 0.6 | 0.8184 |
| 0.7 | 0.8195 **<-- selected** |
| 0.8 | 0.8126 |

**Best:** alpha = 0.7 (face 70%, eye 30%)

![Geometric weight search](plots/geo_weight_search.png)

## 7. Hybrid Pipeline: MLP vs Geometric

Grid search over w_mlp in {0.3 ... 0.8}. w_geo = 1 - w_mlp. The geometric sub-score uses the same weights as the geometric pipeline (face=0.7, eye=0.3).

| MLP Weight (w_mlp) | Mean LOPO F1 |
|-------------------:|-------------:|
| 0.3 | 0.8409 **<-- selected** |
| 0.4 | 0.8246 |
| 0.5 | 0.8164 |
| 0.6 | 0.8106 |
| 0.7 | 0.8039 |
| 0.8 | 0.8016 |

**Best:** w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409

![Hybrid MLP weight search](plots/hybrid_weight_search.png)

## 8. Hybrid Pipeline: XGBoost vs Geometric

Same grid over w_xgb in {0.3 ... 0.8}. w_geo = 1 - w_xgb.

| XGBoost Weight (w_xgb) | Mean LOPO F1 |
|-----------------------:|-------------:|
| 0.3 | 0.8639 **<-- selected** |
| 0.4 | 0.8552 |
| 0.5 | 0.8451 |
| 0.6 | 0.8419 |
| 0.7 | 0.8382 |
| 0.8 | 0.8353 |

**Best:** w_xgb = 0.3 (XGBoost 30%, geometric 70%) → mean LOPO F1 = 0.8639

![Hybrid XGBoost weight search](plots/hybrid_xgb_weight_search.png)

### Which hybrid is used in the app?

**XGBoost hybrid is better** (F1 = 0.8639 vs MLP hybrid F1 = 0.8409).

### Logistic regression combiner (replaces heuristic weights)

Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a **logistic regression** combines model probability and geometric score: meta-features = [model_prob, geo_score], trained on the same LOPO splits. Threshold from Youden's J on combiner output.
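
The difference between the fixed blend and the learned combiner can be illustrated with a short sketch. The function names and the coefficient values below are hypothetical placeholders chosen for the example; the report does not state the fitted coefficients, and the saved combiner in the app may be structured differently.

```python
import math


def heuristic_blend(model_prob, geo_score, w_model=0.3):
    """Fixed linear blend: w_model * model_prob + (1 - w_model) * geo_score."""
    return w_model * model_prob + (1.0 - w_model) * geo_score


def lr_combine(model_prob, geo_score, coef, intercept):
    """Logistic-regression style combiner over [model_prob, geo_score].

    coef and intercept would come from fitting on LOPO training folds;
    here they are passed in explicitly.
    """
    z = intercept + coef[0] * model_prob + coef[1] * geo_score
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only (not fitted values).
example_coef = (2.0, 3.0)
example_intercept = -2.5

blend_score = heuristic_blend(0.8, 0.6)  # 0.3 * 0.8 + 0.7 * 0.6
lr_score = lr_combine(0.8, 0.6, example_coef, example_intercept)
```

Both functions map the two inputs to a single score in [0, 1] that is then thresholded. The blend has one free parameter chosen by grid search, while the combiner learns two weights and an offset from the training folds.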
Mean LOPO F1 for the two combination methods:

| Method | Mean LOPO F1 |
|--------|-------------:|
| Heuristic weight grid (best w) | 0.8639 |
| LR combiner | 0.8241 |

On these LOPO splits the LR combiner scores below the best heuristic blend (0.8241 vs 0.8639). The app uses the saved LR combiner when `combiner_path` is set in `hybrid_focus_config.json`.

## 9. Eye and Mouth Aspect Ratio Thresholds

### EAR (Eye Aspect Ratio)

Reference: Soukupova & Cech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold.

Our thresholds define a linear interpolation zone around this established value:

| Constant | Value | Justification |
|----------|------:|---------------|
| `ear_closed` | 0.16 | Below this, eyes are treated as fully shut. 16.3% of samples fall below this value. |
| `EAR_BLINK_THRESH` | 0.21 | Blink detection point; close to the 0.2 reference. 21.2% of samples fall below this value. |
| `ear_open` | 0.30 | Above this, eyes are treated as fully open. 70.4% of samples fall above this value. |

Between 0.16 and 0.30 the `_ear_score` function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff.

![EAR distribution](plots/ear_distribution.png)

### MAR (Mouth Aspect Ratio)

| Constant | Value | Justification |
|----------|------:|---------------|
| `MAR_YAWN_THRESHOLD` | 0.55 | Only 1.7% of samples exceed this, so the threshold fires only on rare, wide-open-mouth frames, consistent with genuine yawns. |

![MAR distribution](plots/mar_distribution.png)

## 10. Other Constants

| Constant | Value | Rationale |
|----------|------:|-----------|
| `gaze_max_offset` | 0.28 | Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge. |
| `max_angle` | 22.0 deg | Head deviation beyond which face score = 0. Based on typical monitor-viewing cone: at 60 cm distance and a 24" monitor, the viewing angle is ~20-25 degrees. |
| `roll_weight` | 0.5 | Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%. |
| `EMA alpha` | 0.3 | Smoothing factor for focus score. Gives ~3-4 frame effective window; balances responsiveness vs flicker. |
| `grace_frames` | 15 | ~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score. |
| `PERCLOS_WINDOW` | 60 frames | 2 s at 30 fps; standard PERCLOS measurement window (Dinges & Grace, 1998). |
| `BLINK_WINDOW_SEC` | 30 s | Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997). |
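
Several of the constants above reduce to small formulas, sketched below. This is an illustrative re-implementation and not the project's `_ear_score` or smoothing code: the function names are made up for the example, and the EMA form `alpha * new + (1 - alpha) * previous`, seeded with the first value, is an assumed convention.

```python
def ear_score(ear, ear_closed=0.16, ear_open=0.30):
    """Map EAR to [0, 1]: 0 at or below ear_closed, 1 at or above ear_open,
    linear in between."""
    if ear <= ear_closed:
        return 0.0
    if ear >= ear_open:
        return 1.0
    return (ear - ear_closed) / (ear_open - ear_closed)


def ema(values, alpha=0.3):
    """Exponential moving average (assumed form), seeded with the first value."""
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def frames_to_seconds(frames, fps=30.0):
    """Convert a frame count to seconds at the given frame rate."""
    return frames / fps


mid_ear = ear_score(0.23)          # halfway between 0.16 and 0.30
step_response = ema([1.0, 0.0, 0.0])  # how fast a drop to 0 is tracked
grace_seconds = frames_to_seconds(15)
perclos_seconds = frames_to_seconds(60)
```

With alpha = 0.3, a score that drops from 1 to 0 is still at 0.7 after one frame and 0.49 after two, which is the responsiveness versus flicker tradeoff the table describes.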