Abdelrahman Almatrooshi
# evaluation
Systematic evaluation scripts and generated reports. All evaluation uses Leave-One-Person-Out (LOPO) cross-validation over 9 participants (~145k samples) as the primary generalisation protocol: each fold trains on 8 participants and tests on the held-out one.
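The LOPO protocol maps directly onto scikit-learn's `LeaveOneGroupOut`, as in this minimal sketch with toy data (the classifier and array shapes here are illustrative, not the repo's actual models):

```python
# LOPO sketch: each sample carries a participant id ("group"), and
# LeaveOneGroupOut holds out exactly one participant per fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))                       # toy features
y = (X[:, 0] + rng.normal(scale=0.5, size=90) > 0).astype(int)
groups = np.repeat(np.arange(9), 10)               # 9 "participants", 10 samples each

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(len(scores))  # one fold per held-out participant
```

Because no participant appears in both train and test within a fold, per-fold accuracy reflects cross-person generalisation rather than memorised per-person idiosyncrasies.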
## Scripts
| Script | What it does | Runtime |
|--------|-------------|---------|
| `justify_thresholds.py` | LOPO threshold search (Youden's J) for MLP and XGBoost; geometric alpha grid search; hybrid w_mlp grid search | ~10-15 min |
| `feature_importance.py` | XGBoost gain importance + leave-one-feature-out LOPO ablation | ~20 min (full) |
| `grouped_split_benchmark.py` | Compares pooled random split vs LOPO on the same XGBoost config | ~5 min |
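The leave-one-feature-out ablation in `feature_importance.py` follows the usual pattern: retrain with one feature removed and record the score delta against the full-feature baseline. A hedged sketch (feature names and the stand-in classifier are illustrative; the repo uses XGBoost under LOPO):

```python
# Leave-one-feature-out ablation sketch: a positive delta means the
# dropped feature was helping the model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
FEATURES = [f"f{i}" for i in range(X.shape[1])]          # placeholder names

baseline = cross_val_score(GradientBoostingClassifier(n_estimators=20), X, y, cv=3).mean()
ablation = {}
for i, name in enumerate(FEATURES):
    X_drop = np.delete(X, i, axis=1)                     # drop one feature
    score = cross_val_score(GradientBoostingClassifier(n_estimators=20), X_drop, y, cv=3).mean()
    ablation[name] = baseline - score

print({k: round(v, 3) for k, v in ablation.items()})
```

Ranking features by this delta complements gain importance, since gain reflects split usage inside one trained model while ablation measures the end-to-end cost of removal.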
### Quick mode
Add `--quick` to reduce tree count for faster iteration:
```bash
python -m evaluation.grouped_split_benchmark --quick
python -m evaluation.feature_importance --quick --skip-lofo
```
### ClearML support
```bash
USE_CLEARML=1 python -m evaluation.justify_thresholds --clearml
```
Logs threshold search results, weight grid searches, and generated reports as ClearML artifacts.
## Generated reports
| Report | Contents |
|--------|----------|
| `THRESHOLD_JUSTIFICATION.md` | ML thresholds (MLP t*=0.228, XGBoost t*=0.280), geometric weights (alpha=0.7), hybrid weights (w_mlp=0.3), EAR/MAR physiological constants |
| `GROUPED_SPLIT_BENCHMARK.md` | Pooled (95.1% acc) vs LOPO (83.0% acc) comparison |
| `feature_selection_justification.md` | Domain rationale, XGBoost gain ranking, channel ablation results |
## Generated plots
All plots are in `plots/` and referenced by the generated reports.
### ROC curves (LOPO, 9 folds, 144k samples)
| Plot | Model | AUC | Optimal threshold |
|------|-------|-----|-------------------|
| ![MLP ROC](plots/roc_mlp.png) | MLP | 0.862 | 0.228 |
| ![XGBoost ROC](plots/roc_xgb.png) | XGBoost | 0.870 | 0.280 |
Red dots mark the Youden's J optimal operating points. Both thresholds fall well below 0.50 due to cross-person probability compression under LOPO.
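The Youden's J pick marked on these curves maximises J = TPR − FPR over the ROC candidate thresholds, which can be sketched with `sklearn.metrics.roc_curve` (the scores below are a toy example, not the repo's predictions):

```python
# Youden's J threshold selection: J = TPR - FPR, maximised over the
# candidate thresholds that roc_curve returns (one per ROC point).
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.25, 0.5, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                      # Youden's J at every candidate threshold
t_star = thresholds[np.argmax(j)]  # operating point marked by the red dot
print(t_star)
```

Under LOPO the held-out person's probabilities are compressed toward the middle of the range, which is why the selected t* (0.228 for the MLP, 0.280 for XGBoost) lands well below the naive 0.50 cut.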
### Confusion matrices
| MLP | XGBoost |
|-----|---------|
| ![MLP CM](plots/confusion_matrix_mlp.png) | ![XGBoost CM](plots/confusion_matrix_xgb.png) |
### Weight grid searches
| Geometric alpha search | Hybrid w_mlp search |
|----------------------|-------------------|
| ![Geo weights](plots/geo_weight_search.png) | ![Hybrid weights](plots/hybrid_weight_search.png) |
Geometric pipeline: face-dominant weighting (alpha=0.7) generalises best across participants.
Hybrid pipeline: low MLP weight (0.3) with strong geometric anchor gives the best LOPO F1 (0.841).
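The hybrid w_mlp search reduces to a one-dimensional grid over a convex blend of the two probability channels. A toy sketch, assuming the hybrid score is `p = w * p_mlp + (1 - w) * p_geo` (variable names and synthetic data are illustrative):

```python
# Grid search over the MLP mixing weight: score each blend by F1 at a
# fixed decision threshold and keep the best weight.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=200)
# Synthetic channels: the geometric score separates classes more cleanly.
p_geo = np.clip(y * 0.6 + rng.normal(0.2, 0.15, size=200), 0, 1)
p_mlp = np.clip(y * 0.3 + rng.normal(0.35, 0.25, size=200), 0, 1)

best_w, best_f1 = 0.0, -1.0
for w in np.linspace(0.0, 1.0, 11):
    p = w * p_mlp + (1 - w) * p_geo
    f1 = f1_score(y, (p >= 0.5).astype(int))
    if f1 > best_f1:
        best_w, best_f1 = w, f1

print(best_w, round(best_f1, 3))
```

The same structure applies to the geometric alpha search: one scalar weight, one scalar objective, evaluated per LOPO fold.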
### Physiological distributions
| EAR distribution | MAR distribution |
|-----------------|-----------------|
| ![EAR](plots/ear_distribution.png) | ![MAR](plots/mar_distribution.png) |
EAR thresholds (closed=0.16, blink=0.21, open=0.30) and MAR yawn threshold (0.55) are validated against these distributions.
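EAR here follows the standard six-landmark eye aspect ratio (vertical lid distances over horizontal eye width); a minimal sketch, with the landmark coordinates below invented for illustration:

```python
# Eye aspect ratio from six landmarks p1..p6 in the standard ordering:
# p1/p4 are the horizontal corners, p2/p6 and p3/p5 the vertical pairs.
import numpy as np

def eye_aspect_ratio(pts):
    """pts: (6, 2) array of eye landmarks p1..p6."""
    p1, p2, p3, p4, p5, p6 = pts
    vert = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horiz = np.linalg.norm(p1 - p4)
    return vert / (2.0 * horiz)

# A wide-open toy eye: tall relative to its width, so EAR sits well
# above the 0.30 "open" threshold quoted in the report.
open_eye = np.array([[0, 0], [1, 1], [2, 1], [3, 0], [2, -1], [1, -1]], float)
print(eye_aspect_ratio(open_eye))
```

As the lids close, the two vertical distances shrink while the horizontal width stays roughly constant, so EAR falls monotonically through the blink (0.21) and closed (0.16) thresholds.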
## Key findings
1. LOPO drops ~12 pp vs pooled split, confirming the importance of person-independent evaluation
2. Threshold optimisation alone yields +2-4 pp F1 without retraining
3. All three feature channels contribute (removing any one drops F1 by 2-10 pp)
4. `s_face` and `ear_right` are the highest-gain features, confirming that head pose and eye state are the strongest focus indicators
5. The geometric anchor (70% weight) stabilises the hybrid model against per-person variance
## Evaluation logs
Training logs (per-epoch CSVs and JSON summaries) are written to `logs/` by the MLP and XGBoost training scripts.