# Threshold Justification Report
Auto-generated by `evaluation/justify_thresholds.py` using leave-one-person-out (LOPO) cross-validation over 9 participants (~145k samples).
## 1. ML Model Decision Thresholds
Thresholds selected via **Youden's J statistic** (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions.
| Model | LOPO AUC | Optimal Threshold (Youden's J) | F1 @ Optimal | F1 @ 0.50 |
|-------|----------|-------------------------------|--------------|-----------|
| MLP | 0.8624 | **0.228** | 0.8578 | 0.8149 |
| XGBoost | 0.8804 | **0.377** | 0.8585 | 0.8424 |
![MLP ROC](plots/roc_mlp.png)
![XGBoost ROC](plots/roc_xgboost.png)
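The threshold selection described in this section can be sketched in a few lines. This is a minimal illustration, assuming a sample is predicted positive when its score is greater than or equal to the threshold; the function name `youden_threshold` and the toy data are hypothetical and are not taken from `justify_thresholds.py`.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A sample is predicted positive when its score >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_threshold, best_j = None, float("-inf")
    # Every distinct score is a candidate threshold (ascending order).
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        # Strict comparison: ties keep the lowest threshold seen so far.
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy held-out predictions (hypothetical, not the report's data).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.25, 0.40, 0.60, 0.80, 0.90]
threshold, j = youden_threshold(labels, scores)
print(f"threshold={threshold}, J={j}")
```

The loop is quadratic in the number of samples, which is fine for illustration; on ~145k pooled predictions a sorted cumulative-count approach would be the more practical choice.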
## 2. Precision, Recall and Tradeoff
At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions:
| Model | Threshold | Precision | Recall | F1 | Accuracy |
|-------|----------:|----------:|-------:|---:|---------:|
| MLP | 0.228 | 0.8187 | 0.9008 | 0.8578 | 0.8164 |
| XGBoost | 0.377 | 0.8426 | 0.8750 | 0.8585 | 0.8228 |
Raising the threshold yields fewer positive predictions, which generally increases precision and lowers recall. Youden's J selects the threshold that maximises sensitivity + specificity - 1, where sensitivity is the recall for the positive class and specificity is the true negative rate.
## 3. Confusion Matrix (Pooled LOPO)
At optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused).
### MLP
| | Pred 0 | Pred 1 |
|--|-------:|-------:|
| **True 0** | 38065 (TN) | 17750 (FP) |
| **True 1** | 8831 (FN) | 80147 (TP) |
TN=38065, FP=17750, FN=8831, TP=80147.
### XGBoost
| | Pred 0 | Pred 1 |
|--|-------:|-------:|
| **True 0** | 41271 (TN) | 14544 (FP) |
| **True 1** | 11118 (FN) | 77860 (TP) |
TN=41271, FP=14544, FN=11118, TP=77860.
![Confusion MLP](plots/confusion_matrix_mlp.png)
![Confusion XGBoost](plots/confusion_matrix_xgb.png)
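As a consistency check, the precision, recall, F1 and accuracy figures in Section 2 can be recomputed from the pooled confusion-matrix counts above. The helper name `metrics_from_confusion` is illustrative only.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)       # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
        "youden_j": recall + specificity - 1.0,
    }


# Pooled LOPO counts copied from the tables above.
mlp = metrics_from_confusion(tn=38065, fp=17750, fn=8831, tp=80147)
xgb = metrics_from_confusion(tn=41271, fp=14544, fn=11118, tp=77860)

for name, m in (("MLP", mlp), ("XGBoost", xgb)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```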
## 4. Per-Person Performance Variance (LOPO)
One fold per left-out person; metrics at optimal threshold.
### MLP — per held-out person
| Person | Accuracy | F1 | Precision | Recall |
|--------|---------:|---:|----------:|-------:|
| Abdelrahman | 0.8628 | 0.9029 | 0.8760 | 0.9314 |
| Jarek | 0.8400 | 0.8770 | 0.8909 | 0.8635 |
| Junhao | 0.8872 | 0.8986 | 0.8354 | 0.9723 |
| Kexin | 0.7941 | 0.8123 | 0.7965 | 0.8288 |
| Langyuan | 0.5877 | 0.6169 | 0.4972 | 0.8126 |
| Mohamed | 0.8432 | 0.8653 | 0.7931 | 0.9519 |
| Yingtao | 0.8794 | 0.9263 | 0.9217 | 0.9309 |
| ayten | 0.8307 | 0.8986 | 0.8558 | 0.9459 |
| saba | 0.9192 | 0.9243 | 0.9260 | 0.9226 |
### XGBoost — per held-out person
| Person | Accuracy | F1 | Precision | Recall |
|--------|---------:|---:|----------:|-------:|
| Abdelrahman | 0.8601 | 0.8959 | 0.9129 | 0.8795 |
| Jarek | 0.8680 | 0.8993 | 0.9070 | 0.8917 |
| Junhao | 0.9099 | 0.9180 | 0.8627 | 0.9810 |
| Kexin | 0.7363 | 0.7385 | 0.7906 | 0.6928 |
| Langyuan | 0.6738 | 0.6945 | 0.5625 | 0.9074 |
| Mohamed | 0.8868 | 0.8988 | 0.8529 | 0.9498 |
| Yingtao | 0.8711 | 0.9195 | 0.9347 | 0.9048 |
| ayten | 0.8451 | 0.9070 | 0.8654 | 0.9528 |
| saba | 0.9393 | 0.9421 | 0.9615 | 0.9235 |
### Summary across persons
| Model | Accuracy mean ± std | F1 mean ± std | Precision mean ± std | Recall mean ± std |
|-------|---------------------|---------------|----------------------|-------------------|
| MLP | 0.8271 ± 0.0968 | 0.8580 ± 0.0968 | 0.8214 ± 0.1307 | 0.9067 ± 0.0572 |
| XGBoost | 0.8434 ± 0.0847 | 0.8682 ± 0.0879 | 0.8500 ± 0.1191 | 0.8981 ± 0.0836 |
## 5. Confidence Intervals (95%, LOPO over 9 persons)
Mean and 95% t-interval [lower, upper] (df = 8) for each metric across the 9 left-out persons.
| Model | F1 | Accuracy | Precision | Recall |
|-------|---:|--------:|----------:|-------:|
| MLP | 0.8580 [0.7835, 0.9326] | 0.8271 [0.7526, 0.9017] | 0.8214 [0.7207, 0.9221] | 0.9067 [0.8626, 0.9507] |
| XGBoost | 0.8682 [0.8005, 0.9358] | 0.8434 [0.7781, 0.9086] | 0.8500 [0.7583, 0.9417] | 0.8981 [0.8338, 0.9625] |
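
The per-person summary and the intervals above can be reproduced approximately from the rounded per-person values in Section 4. The sketch below assumes the sample standard deviation (n - 1 denominator) and a hardcoded critical value of 2.306 for a two-sided 95% Student's t interval with 8 degrees of freedom; small differences in the last digit are expected because the inputs are rounded to four decimals.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 8 degrees of freedom.
T_CRIT_95_DF8 = 2.306


def mean_std_ci(values, t_crit):
    """Return mean, sample std, and the t-interval (lower, upper)."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)          # sample std, n - 1 denominator
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# MLP accuracy per held-out person, copied from the Section 4 table.
mlp_accuracy = [0.8628, 0.8400, 0.8872, 0.7941, 0.5877,
                0.8432, 0.8794, 0.8307, 0.9192]

mean, std, (lower, upper) = mean_std_ci(mlp_accuracy, T_CRIT_95_DF8)
print(f"{mean:.4f} ± {std:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```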
## 6. Geometric Pipeline Weights (s_face vs s_eye)
Grid search over face weight alpha in {0.2, 0.3, ..., 0.8}. Eye weight = 1 - alpha. Threshold selected per fold via Youden's J.
| Face Weight (alpha) | Mean LOPO F1 |
|--------------------:|-------------:|
| 0.2 | 0.7926 |
| 0.3 | 0.8002 |
| 0.4 | 0.7719 |
| 0.5 | 0.7868 |
| 0.6 | 0.8184 |
| 0.7 | 0.8195 **<-- selected** |
| 0.8 | 0.8126 |
**Best:** alpha = 0.7 (face 70%, eye 30%)
![Geometric weight search](plots/geo_weight_search.png)
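The blend being tuned here can be written as a one-line convex combination. The sketch below assumes the geometric score is `alpha * s_face + (1 - alpha) * s_eye` with both sub-scores in [0, 1]; the function name `geometric_score` is hypothetical. The selection step simply takes the grid point with the highest mean LOPO F1 from the table above.

```python
def geometric_score(s_face, s_eye, alpha):
    """Convex blend of face and eye sub-scores (assumed form of the geometric score)."""
    return alpha * s_face + (1.0 - alpha) * s_eye


# Mean LOPO F1 per face weight, copied from the table above.
reported_f1 = {
    0.2: 0.7926,
    0.3: 0.8002,
    0.4: 0.7719,
    0.5: 0.7868,
    0.6: 0.8184,
    0.7: 0.8195,
    0.8: 0.8126,
}

# Pick the grid point with the highest mean LOPO F1.
best_alpha = max(reported_f1, key=reported_f1.get)
print(f"best alpha = {best_alpha} (F1 = {reported_f1[best_alpha]})")

# Example: a well-aligned head (0.9) with half-closed eyes (0.4).
print(round(geometric_score(s_face=0.9, s_eye=0.4, alpha=best_alpha), 3))
```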
## 7. Hybrid Pipeline: MLP vs Geometric
Grid search over w_mlp in {0.3, 0.4, ..., 0.8}. w_geo = 1 - w_mlp. The geometric sub-score uses the same weights as the geometric pipeline (face = 0.7, eye = 0.3).
| MLP Weight (w_mlp) | Mean LOPO F1 |
|-------------------:|-------------:|
| 0.3 | 0.8409 **<-- selected** |
| 0.4 | 0.8246 |
| 0.5 | 0.8164 |
| 0.6 | 0.8106 |
| 0.7 | 0.8039 |
| 0.8 | 0.8016 |
**Best:** w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409
![Hybrid MLP weight search](plots/hybrid_weight_search.png)
## 8. Hybrid Pipeline: XGBoost vs Geometric
Same grid over w_xgb in {0.3, 0.4, ..., 0.8}. w_geo = 1 - w_xgb.
| XGBoost Weight (w_xgb) | Mean LOPO F1 |
|-----------------------:|-------------:|
| 0.3 | 0.8639 **<-- selected** |
| 0.4 | 0.8552 |
| 0.5 | 0.8451 |
| 0.6 | 0.8419 |
| 0.7 | 0.8382 |
| 0.8 | 0.8353 |
**Best:** w_xgb = 0.3 (XGBoost 30%, geometric 70%) → mean LOPO F1 = 0.8639
![Hybrid XGBoost weight search](plots/hybrid_xgb_weight_search.png)
### Which hybrid is used in the app?
**XGBoost hybrid is better** (F1 = 0.8639 vs MLP hybrid F1 = 0.8409).
### Logistic regression combiner (replaces heuristic weights)
Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a **logistic regression** combines model probability and geometric score: meta-features = [model_prob, geo_score], trained on the same LOPO splits. Threshold from Youden's J on combiner output.
| Method | Mean LOPO F1 |
|--------|-------------:|
| Heuristic weight grid (best w) | 0.8639 |
| **LR combiner** | **0.8241** |
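The app uses the saved LR combiner when `combiner_path` is set in `hybrid_focus_config.json`. In this evaluation, the LR combiner's mean LOPO F1 (0.8241) is lower than that of the best heuristic blend (0.8639).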
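
To make the combiner idea concrete, here is a from-scratch sketch of a logistic regression over the two meta-features [model_prob, geo_score]. It is not the project's implementation: the training data is a hypothetical toy set, the optimiser is plain batch gradient descent, and the names `fit_lr_combiner` and `combine` are invented for illustration. The report does not state which library or solver the saved combiner uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_lr_combiner(features, labels, lr=0.5, epochs=2000):
    """Fit weights [w_model, w_geo] and a bias by batch gradient descent on log loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (model_prob, geo_score), y in zip(features, labels):
            # Derivative of log loss w.r.t. the logit is (prediction - label).
            err = sigmoid(w[0] * model_prob + w[1] * geo_score + b) - y
            grad_w[0] += err * model_prob
            grad_w[1] += err * geo_score
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def combine(model_prob, geo_score, w, b):
    """Combined focus probability from the two meta-features."""
    return sigmoid(w[0] * model_prob + w[1] * geo_score + b)


# Hypothetical toy meta-features: (model_prob, geo_score) with focus labels.
features = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6), (0.2, 0.3), (0.1, 0.2), (0.3, 0.1)]
labels = [1, 1, 1, 0, 0, 0]

w, b = fit_lr_combiner(features, labels)
print("weights:", w, "bias:", b)
print("focused-looking sample:", combine(0.9, 0.9, w, b))
print("unfocused-looking sample:", combine(0.1, 0.1, w, b))
```

Unlike the fixed blend, the learned weights and bias are not constrained to sum to 1, so the decision threshold on the combiner output has to be chosen separately, for example with Youden's J as described above.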
## 9. Eye and Mouth Aspect Ratio Thresholds
### EAR (Eye Aspect Ratio)
Reference: Soukupová & Čech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold.
Our thresholds define a linear interpolation zone around this established value:
| Constant | Value | Justification |
|----------|------:|---------------|
| `ear_closed` | 0.16 | Below this, eyes are treated as fully shut. 16.3% of samples fall below this value. |
| `EAR_BLINK_THRESH` | 0.21 | Blink detection point; close to the 0.2 reference. 21.2% of samples fall below this value. |
| `ear_open` | 0.30 | Above this, eyes are treated as fully open. 70.4% of samples fall above this value. |
Between 0.16 and 0.30 the `_ear_score` function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff.
![EAR distribution](plots/ear_distribution.png)
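The interpolation zone described above can be sketched as a clamped linear ramp. This is an illustration of the stated behaviour under the thresholds in the table, not the actual `_ear_score` implementation, whose signature and details may differ.

```python
EAR_CLOSED = 0.16  # at or below this, score is 0
EAR_OPEN = 0.30    # at or above this, score is 1


def ear_score(ear, ear_closed=EAR_CLOSED, ear_open=EAR_OPEN):
    """Map an eye aspect ratio to [0, 1] by linear interpolation, clamped at both ends."""
    t = (ear - ear_closed) / (ear_open - ear_closed)
    return min(1.0, max(0.0, t))


for ear in (0.10, 0.16, 0.23, 0.30, 0.35):
    print(f"EAR={ear:.2f} -> score={ear_score(ear):.2f}")
```

Under these thresholds, an EAR at the midpoint of the zone (0.23) maps to a score of about 0.5, and the blink threshold of 0.21 falls inside the ramp rather than at either end.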
### MAR (Mouth Aspect Ratio)
| Constant | Value | Justification |
|----------|------:|---------------|
| `MAR_YAWN_THRESHOLD` | 0.55 | Only 1.7% of samples exceed this, which is consistent with the threshold flagging rare wide mouth openings such as yawns rather than ordinary mouth movement. |
![MAR distribution](plots/mar_distribution.png)
## 10. Other Constants
| Constant | Value | Rationale |
|----------|------:|-----------|
| `gaze_max_offset` | 0.28 | Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge. |
| `max_angle` | 22.0 deg | Head deviation beyond which face score = 0. Based on a typical monitor-viewing cone: at 60 cm distance from a 24" monitor, the angle from the screen centre to its edge is ~20-25 degrees. |
| `roll_weight` | 0.5 | Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%. |
| `EMA alpha` | 0.3 | Smoothing factor for focus score. Gives ~3-4 frame effective window; balances responsiveness vs flicker. |
| `grace_frames` | 15 | ~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score. |
| `PERCLOS_WINDOW` | 60 frames | 2 s at 30 fps; standard PERCLOS measurement window (Dinges & Grace, 1998). |
| `BLINK_WINDOW_SEC` | 30 s | Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997). |
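
Several of the rationales above can be checked with short calculations. The sketch below makes a few assumptions that the table does not state: the EMA update is `alpha * new + (1 - alpha) * previous`, the monitor has a 16:9 aspect ratio, and the angle of interest is the horizontal half-angle from screen centre to edge. All function names are hypothetical.

```python
import math


def ema(values, alpha, initial=0.0):
    """Exponential moving average, assuming alpha weights the newest sample."""
    smoothed, state = [], initial
    for x in values:
        state = alpha * x + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def half_angle_deg(diagonal_in, distance_cm, aspect=(16, 9)):
    """Horizontal angle from screen centre to edge, seen from distance_cm."""
    w, h = aspect
    width_cm = diagonal_in * 2.54 * w / math.hypot(w, h)
    return math.degrees(math.atan((width_cm / 2.0) / distance_cm))


# Step response of the EMA: after n frames the output is 1 - (1 - alpha)**n,
# so with alpha = 0.3 it passes roughly two thirds of the step by frame 3.
step = ema([1.0] * 5, alpha=0.3)
print([round(s, 3) for s in step])

# Viewing half-angle for a 24" monitor at 60 cm (16:9 assumed).
angle = half_angle_deg(24, 60)
print(f"half-angle = {angle:.1f} deg")

# Frame counts converted to seconds at 30 fps.
FPS = 30
grace_s = 15 / FPS      # grace_frames
perclos_s = 60 / FPS    # PERCLOS_WINDOW
print(grace_s, perclos_s)
```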