
integration_test2 / evaluation /THRESHOLD_JUSTIFICATION.md

Abdelrahman Almatrooshi

FocusGuard with L2CS-Net gaze estimation

7b53d75 6 days ago


	# Threshold Justification Report

	Auto-generated by `evaluation/justify_thresholds.py` using leave-one-person-out (LOPO) cross-validation over 9 participants (~145k samples).

	## 1. ML Model Decision Thresholds

	Thresholds selected via Youden's J statistic (J = sensitivity + specificity - 1) on pooled LOPO held-out predictions.

	\| Model \| LOPO AUC \| Optimal Threshold (Youden's J) \| F1 @ Optimal \| F1 @ 0.50 \|
	\|-------\|----------\|-------------------------------\|--------------\|-----------\|
	\| MLP \| 0.8624 \| 0.228 \| 0.8578 \| 0.8149 \|
	\| XGBoost \| 0.8804 \| 0.377 \| 0.8585 \| 0.8424 \|

	![MLP ROC](plots/roc_mlp.png)

	![XGBoost ROC](plots/roc_xgboost.png)
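The selection rule used in this section can be sketched in a few lines. This is a minimal pure-Python illustration, not the code in `evaluation/justify_thresholds.py`; the function name `youden_threshold` and the toy labels and probabilities are assumptions made for the example, and both classes are assumed to be present.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A sample is predicted positive (focused) when its probability >= threshold.
    Assumes y_true contains at least one 0 and one 1.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    best_thr, best_j = 0.5, -1.0
    # Every distinct predicted probability is a candidate threshold.
    for thr in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= thr)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < thr)
        j = tp / pos + tn / neg - 1.0
        # Strict comparison: on ties the lowest threshold is kept.
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Toy held-out predictions (hypothetical, not project data).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.70, 0.90]
thr, j = youden_threshold(y_true, y_prob)
print(f"threshold={thr:.2f}, J={j:.2f}")
```

On this toy data the selected threshold falls below 0.50, the same direction as the thresholds reported in the table above.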

	## 2. Precision, Recall and Tradeoff

	At the optimal threshold (Youden's J), pooled over all LOPO held-out predictions:

	\| Model \| Threshold \| Precision \| Recall \| F1 \| Accuracy \|
	\|-------\|----------:\|----------:\|-------:\|---:\|---------:\|
	\| MLP \| 0.228 \| 0.8187 \| 0.9008 \| 0.8578 \| 0.8164 \|
	\| XGBoost \| 0.377 \| 0.8426 \| 0.8750 \| 0.8585 \| 0.8228 \|

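	Raising the threshold yields fewer positive predictions, which generally raises precision and lowers recall. Youden's J selects the threshold that maximises sensitivity + specificity - 1, i.e. recall on the focused class plus the true negative rate on the unfocused class, minus one. Both selected thresholds sit below 0.50, so relative to the default cutoff they favour recall over precision.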

	## 3. Confusion Matrix (Pooled LOPO)

	Counts at each model's optimal threshold. Rows = true label, columns = predicted label (0 = unfocused, 1 = focused).

	### MLP

	\| \| Pred 0 \| Pred 1 \|
	\|--\|-------:\|-------:\|
	\| True 0 \| 38065 (TN) \| 17750 (FP) \|
	\| True 1 \| 8831 (FN) \| 80147 (TP) \|

	TN=38065, FP=17750, FN=8831, TP=80147.

	### XGBoost

	\| \| Pred 0 \| Pred 1 \|
	\|--\|-------:\|-------:\|
	\| True 0 \| 41271 (TN) \| 14544 (FP) \|
	\| True 1 \| 11118 (FN) \| 77860 (TP) \|

	TN=41271, FP=14544, FN=11118, TP=77860.

	![Confusion MLP](plots/confusion_matrix_mlp.png)

	![Confusion XGBoost](plots/confusion_matrix_xgb.png)
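The precision, recall, F1 and accuracy figures in Section 2 follow directly from these counts. The sketch below recomputes them; the helper name `metrics_from_counts` is an assumption for illustration, while the counts are the ones listed above.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)       # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "youden_j": recall + specificity - 1.0,
    }


mlp = metrics_from_counts(tn=38065, fp=17750, fn=8831, tp=80147)
xgb = metrics_from_counts(tn=41271, fp=14544, fn=11118, tp=77860)

for name, m in (("MLP", mlp), ("XGBoost", xgb)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Each matrix sums to 144,793 samples, which matches the "~145k samples" stated at the top of the report.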

	## 4. Per-Person Performance Variance (LOPO)

	One fold per left-out person; metrics at optimal threshold.

	### MLP — per held-out person

	\| Person \| Accuracy \| F1 \| Precision \| Recall \|
	\|--------\|---------:\|---:\|----------:\|-------:\|
	\| Abdelrahman \| 0.8628 \| 0.9029 \| 0.8760 \| 0.9314 \|
	\| Jarek \| 0.8400 \| 0.8770 \| 0.8909 \| 0.8635 \|
	\| Junhao \| 0.8872 \| 0.8986 \| 0.8354 \| 0.9723 \|
	\| Kexin \| 0.7941 \| 0.8123 \| 0.7965 \| 0.8288 \|
	\| Langyuan \| 0.5877 \| 0.6169 \| 0.4972 \| 0.8126 \|
	\| Mohamed \| 0.8432 \| 0.8653 \| 0.7931 \| 0.9519 \|
	\| Yingtao \| 0.8794 \| 0.9263 \| 0.9217 \| 0.9309 \|
	\| ayten \| 0.8307 \| 0.8986 \| 0.8558 \| 0.9459 \|
	\| saba \| 0.9192 \| 0.9243 \| 0.9260 \| 0.9226 \|

	### XGBoost — per held-out person

	\| Person \| Accuracy \| F1 \| Precision \| Recall \|
	\|--------\|---------:\|---:\|----------:\|-------:\|
	\| Abdelrahman \| 0.8601 \| 0.8959 \| 0.9129 \| 0.8795 \|
	\| Jarek \| 0.8680 \| 0.8993 \| 0.9070 \| 0.8917 \|
	\| Junhao \| 0.9099 \| 0.9180 \| 0.8627 \| 0.9810 \|
	\| Kexin \| 0.7363 \| 0.7385 \| 0.7906 \| 0.6928 \|
	\| Langyuan \| 0.6738 \| 0.6945 \| 0.5625 \| 0.9074 \|
	\| Mohamed \| 0.8868 \| 0.8988 \| 0.8529 \| 0.9498 \|
	\| Yingtao \| 0.8711 \| 0.9195 \| 0.9347 \| 0.9048 \|
	\| ayten \| 0.8451 \| 0.9070 \| 0.8654 \| 0.9528 \|
	\| saba \| 0.9393 \| 0.9421 \| 0.9615 \| 0.9235 \|

	### Summary across persons

	\| Model \| Accuracy mean ± std \| F1 mean ± std \| Precision mean ± std \| Recall mean ± std \|
	\|-------\|---------------------\|---------------\|----------------------\|-------------------\|
	\| MLP \| 0.8271 ± 0.0968 \| 0.8580 ± 0.0968 \| 0.8214 ± 0.1307 \| 0.9067 ± 0.0572 \|
	\| XGBoost \| 0.8434 ± 0.0847 \| 0.8682 ± 0.0879 \| 0.8500 ± 0.1191 \| 0.8981 ± 0.0836 \|

	## 5. Confidence Intervals (95%, LOPO over 9 persons)

	Mean ± half-width of 95% t-interval (df=8) for each metric across the 9 left-out persons.

	\| Model \| F1 \| Accuracy \| Precision \| Recall \|
	\|-------\|---:\|--------:\|----------:\|-------:\|
	\| MLP \| 0.8580 [0.7835, 0.9326] \| 0.8271 [0.7526, 0.9017] \| 0.8214 [0.7207, 0.9221] \| 0.9067 [0.8626, 0.9507] \|
	\| XGBoost \| 0.8682 [0.8005, 0.9358] \| 0.8434 [0.7781, 0.9086] \| 0.8500 [0.7583, 0.9417] \| 0.8981 [0.8338, 0.9625] \|
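As a worked check, the interval for MLP F1 can be reproduced from the per-person values in Section 4. The sketch assumes the sample standard deviation (n - 1 denominator) and hardcodes the two-sided 95% critical value of Student's t for df = 8 (about 2.306) rather than calling a statistics library; the function name `t_interval` is hypothetical.

```python
import math
import statistics

T_CRIT_DF8 = 2.306  # two-sided 95% critical value, Student's t, df = 8


def t_interval(values, t_crit):
    """Return (mean, sample std, (lower, upper)) for a t-based interval."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample std, n - 1 denominator
    half_width = t_crit * sd / math.sqrt(n)
    return mean, sd, (mean - half_width, mean + half_width)


# MLP F1 per held-out person, copied from the Section 4 table.
mlp_f1 = [0.9029, 0.8770, 0.8986, 0.8123, 0.6169,
          0.8653, 0.9263, 0.8986, 0.9243]

mean, sd, (lo, hi) = t_interval(mlp_f1, T_CRIT_DF8)
print(f"mean={mean:.4f} std={sd:.4f} CI=[{lo:.4f}, {hi:.4f}]")
```

The result agrees with the reported 0.8580 ± 0.0968 and [0.7835, 0.9326] up to rounding of the tabulated inputs. With only 9 folds the interval is wide, and it is driven largely by the single low-scoring fold (Langyuan).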

	## 6. Geometric Pipeline Weights (s_face vs s_eye)

	Grid search over face weight alpha in {0.2, 0.3, ..., 0.8} (step 0.1). Eye weight = 1 - alpha. The decision threshold is chosen per fold via Youden's J.

	\| Face Weight (alpha) \| Mean LOPO F1 \|
	\|--------------------:\|-------------:\|
	\| 0.2 \| 0.7926 \|
	\| 0.3 \| 0.8002 \|
	\| 0.4 \| 0.7719 \|
	\| 0.5 \| 0.7868 \|
	\| 0.6 \| 0.8184 \|
	\| 0.7 \| 0.8195 <-- selected \|
	\| 0.8 \| 0.8126 \|

	Best: alpha = 0.7 (face 70%, eye 30%)

	![Geometric weight search](plots/geo_weight_search.png)
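The blend being searched can be written as a one-line function. The sketch below assumes the geometric score is the plain convex combination described above; `geometric_score` is a hypothetical name, and the selection step simply takes the arg-max over the mean LOPO F1 values reported in the table.

```python
def geometric_score(s_face, s_eye, alpha):
    """Convex blend of face and eye sub-scores; alpha is the face weight."""
    return alpha * s_face + (1.0 - alpha) * s_eye


# Mean LOPO F1 per face weight, copied from the table above.
reported_f1 = {
    0.2: 0.7926, 0.3: 0.8002, 0.4: 0.7719, 0.5: 0.7868,
    0.6: 0.8184, 0.7: 0.8195, 0.8: 0.8126,
}

best_alpha = max(reported_f1, key=reported_f1.get)
print(best_alpha, reported_f1[best_alpha])

# Example: head well aligned (0.9) but eyes half closed (0.4).
print(round(geometric_score(0.9, 0.4, best_alpha), 3))
```

The margin between alpha = 0.6 and alpha = 0.7 is small (0.8184 vs 0.8195), so the selected value is best read as "around 0.6 to 0.7" rather than a sharp optimum.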

	## 7. Hybrid Pipeline: MLP vs Geometric

	Grid search over w_mlp in {0.3, 0.4, ..., 0.8} (step 0.1). w_geo = 1 - w_mlp. The geometric sub-score uses the same weights as the geometric pipeline (face=0.7, eye=0.3).

	\| MLP Weight (w_mlp) \| Mean LOPO F1 \|
	\|-------------------:\|-------------:\|
	\| 0.3 \| 0.8409 <-- selected \|
	\| 0.4 \| 0.8246 \|
	\| 0.5 \| 0.8164 \|
	\| 0.6 \| 0.8106 \|
	\| 0.7 \| 0.8039 \|
	\| 0.8 \| 0.8016 \|

	Best: w_mlp = 0.3 (MLP 30%, geometric 70%) → mean LOPO F1 = 0.8409

	![Hybrid MLP weight search](plots/hybrid_weight_search.png)

	## 8. Hybrid Pipeline: XGBoost vs Geometric

	Same grid over w_xgb in {0.3 ... 0.8}. w_geo = 1 - w_xgb.

	\| XGBoost Weight (w_xgb) \| Mean LOPO F1 \|
	\|-----------------------:\|-------------:\|
	\| 0.3 \| 0.8639 <-- selected \|
	\| 0.4 \| 0.8552 \|
	\| 0.5 \| 0.8451 \|
	\| 0.6 \| 0.8419 \|
	\| 0.7 \| 0.8382 \|
	\| 0.8 \| 0.8353 \|

	Best: w_xgb = 0.3 → mean LOPO F1 = 0.8639

	![Hybrid XGBoost weight search](plots/hybrid_xgb_weight_search.png)
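Because the geometric sub-score is itself a blend, the selected hybrid weight implies an effective weight for each of the three inputs. The sketch below assumes the hybrid is the plain linear blend described in Sections 7 and 8; the function names are hypothetical.

```python
def hybrid_score(model_prob, s_face, s_eye, w_model=0.3, alpha=0.7):
    """Blend a model probability with the geometric score (face/eye blend)."""
    geo = alpha * s_face + (1.0 - alpha) * s_eye
    return w_model * model_prob + (1.0 - w_model) * geo


def effective_weights(w_model=0.3, alpha=0.7):
    """Expand the nested blend into one weight per input."""
    w_geo = 1.0 - w_model
    return {
        "model": w_model,
        "face": w_geo * alpha,
        "eye": w_geo * (1.0 - alpha),
    }


weights = effective_weights()
print({k: round(v, 2) for k, v in weights.items()})
print(round(hybrid_score(model_prob=0.6, s_face=0.9, s_eye=0.4), 3))
```

With w = 0.3 and alpha = 0.7, the head-pose (face) score carries about 49% of the final score, the model probability 30%, and the eye score about 21%.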

	### Which hybrid is used in the app?

	The XGBoost hybrid scores higher (mean LOPO F1 = 0.8639 vs 0.8409 for the MLP hybrid).

	### Logistic regression combiner (replaces heuristic weights)

	Instead of a fixed linear blend (e.g. 0.3·ML + 0.7·geo), a logistic regression combines model probability and geometric score: meta-features = [model_prob, geo_score], trained on the same LOPO splits. Threshold from Youden's J on combiner output.

	\| Method \| Mean LOPO F1 \|
	\|--------\|-------------:\|
	\| Heuristic weight grid (best w) \| 0.8639 \|
	\| LR combiner \| 0.8241 \|

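	The app uses the saved LR combiner when `combiner_path` is set in `hybrid_focus_config.json`. Note that on these LOPO splits the LR combiner scores below the best heuristic blend (0.8241 vs 0.8639).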
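To make the combiner concrete, the sketch below fits a two-feature logistic regression on `[model_prob, geo_score]` with plain batch gradient descent. It is a dependency-free stand-in for whatever library the project actually uses; the function names, learning rate, epoch count and toy data are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_combiner(rows, labels, lr=0.5, epochs=2000):
    """Fit p = sigmoid(b + w1*model_prob + w2*geo_score) by gradient descent."""
    b = w1 = w2 = 0.0
    n = len(rows)
    for _ in range(epochs):
        gb = g1 = g2 = 0.0
        for (model_prob, geo_score), y in zip(rows, labels):
            err = sigmoid(b + w1 * model_prob + w2 * geo_score) - y
            gb += err
            g1 += err * model_prob
            g2 += err * geo_score
        # Average log-loss gradient step.
        b -= lr * gb / n
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    return b, w1, w2


def combine(params, model_prob, geo_score):
    b, w1, w2 = params
    return sigmoid(b + w1 * model_prob + w2 * geo_score)


# Toy meta-features (hypothetical): (model_prob, geo_score) -> focused?
rows = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6), (0.6, 0.8),
        (0.3, 0.4), (0.2, 0.1), (0.4, 0.2), (0.1, 0.3)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

params = fit_combiner(rows, labels)
print(round(combine(params, 0.9, 0.9), 3), round(combine(params, 0.1, 0.1), 3))
```

Unlike the fixed blend, the learned combiner also fits an intercept and is free to weight the two inputs unequally, after which its output is thresholded with Youden's J as described above.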

	## 9. Eye and Mouth Aspect Ratio Thresholds

	### EAR (Eye Aspect Ratio)

	Reference: Soukupova & Cech, "Real-Time Eye Blink Detection Using Facial Landmarks" (2016) established EAR ~ 0.2 as a blink threshold.

	Our thresholds define a linear interpolation zone around this established value:

	\| Constant \| Value \| Justification \|
	\|----------\|------:\|---------------\|
	\| `ear_closed` \| 0.16 \| Below this, eyes are treated as fully shut. 16.3% of samples fall below this value. \|
	\| `EAR_BLINK_THRESH` \| 0.21 \| Blink detection point; close to the 0.2 reference. 21.2% of samples fall below this value. \|
	\| `ear_open` \| 0.30 \| Above this, eyes are treated as fully open. 70.4% of samples fall above this value. \|

	Between 0.16 and 0.30 the `_ear_score` function linearly interpolates from 0 to 1, providing a smooth transition rather than a hard binary cutoff.

	![EAR distribution](plots/ear_distribution.png)
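The interpolation between `ear_closed` and `ear_open` can be sketched as a clamped linear ramp. This is an illustration of the behaviour described above, not the project's `_ear_score` source; the public name `ear_score` and its signature are assumptions.

```python
EAR_CLOSED = 0.16  # at or below: score 0 (eyes shut)
EAR_OPEN = 0.30    # at or above: score 1 (eyes open)


def ear_score(ear, closed=EAR_CLOSED, open_=EAR_OPEN):
    """Map an eye aspect ratio to [0, 1] with a linear ramp between bounds."""
    t = (ear - closed) / (open_ - closed)
    return min(1.0, max(0.0, t))


for ear in (0.10, 0.16, 0.21, 0.23, 0.30, 0.35):
    print(ear, round(ear_score(ear), 3))
```

Under this mapping the blink threshold `EAR_BLINK_THRESH` = 0.21 corresponds to a score of about 0.36, so a blink lowers the eye score substantially without forcing it to zero.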

	### MAR (Mouth Aspect Ratio)

	\| Constant \| Value \| Justification \|
	\|----------\|------:\|---------------\|
	\| `MAR_YAWN_THRESHOLD` \| 0.55 \| Only 1.7% of samples exceed this, so the threshold fires rarely, consistent with genuine yawns being infrequent. \|

	![MAR distribution](plots/mar_distribution.png)

	## 10. Other Constants

	\| Constant \| Value \| Rationale \|
	\|----------\|------:\|-----------\|
	\| `gaze_max_offset` \| 0.28 \| Max iris displacement (normalised) before gaze score drops to zero. Corresponds to ~56% of the eye width; beyond this the iris is at the extreme edge. \|
	\| `max_angle` \| 22.0 deg \| Head deviation beyond which face score = 0. Based on a typical monitor-viewing cone: at 60 cm distance from a 24" monitor, the angle from screen centre to the left or right edge is ~20-25 degrees. \|
	\| `roll_weight` \| 0.5 \| Roll is less indicative of inattention than yaw/pitch (tilting head doesn't mean looking away), so it's down-weighted by 50%. \|
	\| `EMA alpha` \| 0.3 \| Smoothing factor for focus score. Time constant of about 1/alpha ≈ 3.3 frames (a ~3-4 frame effective window); balances responsiveness vs flicker. \|
	\| `grace_frames` \| 15 \| ~0.5 s at 30 fps before penalising no-face. Allows brief occlusions (e.g. hand gesture) without dropping score. \|
	\| `PERCLOS_WINDOW` \| 60 frames \| 2 s at 30 fps; standard PERCLOS measurement window (Dinges & Grace, 1998). \|
	\| `BLINK_WINDOW_SEC` \| 30 s \| Blink rate measured over 30 s; typical spontaneous blink rate is 15-20/min (Bentivoglio et al., 1997). \|
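Several of the rationales above rest on small pieces of arithmetic that are easy to check. The sketch below is illustrative only: the function names are hypothetical, the EMA form `s = alpha*x + (1 - alpha)*s` is the standard definition rather than a confirmed copy of the app's code, and the 16:9 aspect ratio for the 24" monitor is an assumption.

```python
import math


def ema(values, alpha=0.3):
    """Exponential moving average, seeded with the first value."""
    smoothed, s = [], values[0]
    for v in values:
        s = alpha * v + (1.0 - alpha) * s
        smoothed.append(s)
    return smoothed


def frames_to_seconds(frames, fps=30):
    return frames / fps


def half_view_angle_deg(diagonal_in=24, aspect=(16, 9), distance_cm=60):
    """Angle from screen centre to the left/right edge, seen from distance_cm."""
    w, h = aspect
    width_cm = diagonal_in * 2.54 * w / math.hypot(w, h)
    return math.degrees(math.atan((width_cm / 2) / distance_cm))


# Step response: score jumps from 0 to 1 and stays there.
step = ema([0, 1, 1, 1, 1], alpha=0.3)
print([round(s, 4) for s in step])

print(frames_to_seconds(15), frames_to_seconds(60))   # grace_frames, PERCLOS_WINDOW
print(round(half_view_angle_deg(), 1))                # compare with max_angle = 22 deg
```

After three to four frames the smoothed score has covered roughly two thirds to three quarters of a step change, which is the sense in which alpha = 0.3 gives a 3-4 frame window.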