Abdelrahman Almatrooshi

Deploy snapshot from main b7a59b11809483dfc959f196f1930240f2662c49

evaluation

Systematic evaluation scripts and generated reports. All evaluations use Leave-One-Person-Out (LOPO) cross-validation over 9 participants (~145k samples) as the primary test of generalisation: each fold trains on 8 participants and tests on the one held out.
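As a rough illustration of the LOPO protocol, the sketch below builds one fold per participant from a list of per-sample person IDs. The function name `lopo_splits` and the toy IDs are hypothetical and not taken from the evaluation scripts, which may well use a library splitter instead.

```python
from collections import defaultdict


def lopo_splits(person_ids):
    """Yield (held_out_person, train_idx, test_idx), one fold per person.

    Hypothetical helper: every sample of the held-out person goes to the
    test set, and all samples of the other persons go to the training set.
    """
    by_person = defaultdict(list)
    for idx, pid in enumerate(person_ids):
        by_person[pid].append(idx)

    for held_out in sorted(by_person):
        test_idx = by_person[held_out]
        train_idx = sorted(
            i for pid, idxs in by_person.items() if pid != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy example with 3 persons (the real data has 9 participants).
person_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(lopo_splits(person_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because no person appears on both sides of a fold, the score reflects performance on people the model has never seen.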

Scripts

| Script | What it does | Runtime |
| --- | --- | --- |
| `justify_thresholds.py` | LOPO threshold search (Youden's J) for MLP and XGBoost; geometric alpha grid search; hybrid w_mlp grid search | ~10-15 min |
| `feature_importance.py` | XGBoost gain importance + leave-one-feature-out LOPO ablation | ~20 min (full) |
| `grouped_split_benchmark.py` | Compares pooled random split vs LOPO on the same XGBoost config | ~5 min |

Quick mode

Add `--quick` to reduce the tree count for faster iteration:

python -m evaluation.grouped_split_benchmark --quick
python -m evaluation.feature_importance --quick --skip-lofo
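For orientation, flags like these are commonly wired up with `argparse`. The sketch below is an assumption about how such a switch could work, not the scripts' actual code: the helper names and both tree counts (600 full, 100 quick) are hypothetical, and `--skip-lofo` is presumed to skip the leave-one-feature-out ablation.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown above."""
    parser = argparse.ArgumentParser(prog="evaluation.feature_importance")
    parser.add_argument("--quick", action="store_true",
                        help="use fewer trees for faster iteration")
    parser.add_argument("--skip-lofo", action="store_true",
                        help="skip the leave-one-feature-out ablation")
    return parser


def resolve_n_estimators(quick, full=600, reduced=100):
    """Pick the tree count; both defaults are placeholder values."""
    return reduced if quick else full


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--quick", "--skip-lofo"])
n_estimators = resolve_n_estimators(args.quick)
print(args.quick, args.skip_lofo, n_estimators)
```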

ClearML support

USE_CLEARML=1 python -m evaluation.justify_thresholds --clearml

Logs threshold search results, weight grid searches, and generated reports as ClearML artifacts.

Generated reports

| Report | Contents |
| --- | --- |
| `THRESHOLD_JUSTIFICATION.md` | ML thresholds (MLP t=0.228, XGBoost t=0.280), geometric weights (alpha=0.7), hybrid weights (w_mlp=0.3), EAR/MAR physiological constants |
| `GROUPED_SPLIT_BENCHMARK.md` | Pooled (95.1% acc) vs LOPO (83.0% acc) comparison |
| `feature_selection_justification.md` | Domain rationale, XGBoost gain ranking, channel ablation results |

Generated plots

All plots are in `plots/` and referenced by the generated reports.

ROC curves (LOPO, 9 folds, ~145k samples)

| Model | AUC | Optimal threshold |
| --- | --- | --- |
| MLP | 0.862 | 0.228 |
| XGBoost | 0.870 | 0.280 |

Red dots mark the Youden's J optimal operating points. Both thresholds fall well below 0.50 due to cross-person probability compression under LOPO.
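Youden's J is defined as J = TPR - FPR, and the optimal operating point is the threshold that maximises it. The following is a minimal sketch of that search on toy data; the function name `youden_threshold` and the scores are illustrative assumptions, and the real script presumably runs the search on pooled LOPO predictions.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR.

    Candidate thresholds are the observed scores; a sample is predicted
    positive when its score is >= the threshold.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy scores that are compressed toward low values, so the best cut-off
# lands below 0.5.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.30, 0.15, 0.25, 0.40, 0.60, 0.20]
threshold, j = youden_threshold(y_true, scores)
print(threshold, j)
```

On this toy input the search selects a threshold of 0.20 with J = 0.75, below the default cut-off of 0.50.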

Confusion matrices

Panels: MLP, XGBoost.

Weight grid searches

Panels: geometric alpha search, hybrid w_mlp search.

Geometric pipeline: face-dominant weighting (alpha=0.7) generalises best across participants. Hybrid pipeline: a low MLP weight (w_mlp=0.3) combined with a strong geometric anchor gives the best LOPO F1 (0.841).
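A weight grid search of this kind can be sketched as follows. The sketch assumes the hybrid score is a convex combination, `w_mlp * p_mlp + (1 - w_mlp) * s_geo`, which the README implies but does not spell out. The function names, the fixed 0.5 decision threshold, and the toy data are all hypothetical, and the real search scores each weight under LOPO, not on a single pooled set.

```python
def f1_score(y_true, y_pred):
    """Binary F1 computed from true/false positives and false negatives."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hybrid_score(p_mlp, s_geo, w_mlp):
    """Assumed combination rule: convex blend of MLP and geometric scores."""
    return w_mlp * p_mlp + (1.0 - w_mlp) * s_geo


def grid_search_w_mlp(y_true, p_mlp, s_geo, grid, threshold=0.5):
    """Return (best_w, {w: F1}) over the candidate MLP weights in `grid`."""
    results = {}
    for w in grid:
        y_pred = [
            int(hybrid_score(p, s, w) >= threshold) for p, s in zip(p_mlp, s_geo)
        ]
        results[w] = f1_score(y_true, y_pred)
    best_w = max(results, key=results.get)  # first maximum wins on ties
    return best_w, results


# Toy data: labels, MLP probabilities, and geometric scores per sample.
y_true = [1, 1, 1, 0, 0, 0]
p_mlp = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
s_geo = [0.8, 0.7, 0.4, 0.3, 0.2, 0.6]
grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
best_w, results = grid_search_w_mlp(y_true, p_mlp, s_geo, grid)
print(best_w, results[best_w])
```

The same loop applies to the geometric alpha search if the two blended inputs are swapped for the face and eye scores.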

Physiological distributions

Panels: EAR distribution, MAR distribution.

The eye aspect ratio (EAR) thresholds (closed=0.16, blink=0.21, open=0.30) and the mouth aspect ratio (MAR) yawn threshold (0.55) are validated against these distributions.
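For context, EAR is commonly computed from six eye landmarks as (|p2-p6| + |p3-p5|) / (2|p1-p4|). The sketch below uses that standard formula together with the threshold constants listed above. How the project maps EAR values to states is not stated here, so the banding in `eye_state` (including the "partially open" label) is an assumption, as are the function names and the toy landmarks.

```python
import math

# Threshold constants taken from the README.
EAR_CLOSED, EAR_BLINK, EAR_OPEN = 0.16, 0.21, 0.30


def eye_aspect_ratio(landmarks):
    """Standard EAR from six (x, y) eye landmarks p1..p6.

    p1 and p4 are the horizontal eye corners; (p2, p6) and (p3, p5) are
    the two vertical landmark pairs.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners coincide; EAR is undefined")
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    return vertical / (2.0 * horizontal)


def eye_state(ear):
    """Hypothetical banding of an EAR value using the constants above."""
    if ear < EAR_CLOSED:
        return "closed"
    if ear < EAR_BLINK:
        return "blink"
    if ear < EAR_OPEN:
        return "partially open"
    return "open"


# Toy landmarks: an eye 6 units wide and 2 units tall.
eye = [(0, 0), (2, 1), (4, 1), (6, 0), (4, -1), (2, -1)]
ear = eye_aspect_ratio(eye)
print(round(ear, 3), eye_state(ear))
```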

Key findings

- LOPO accuracy is ~12 pp lower than the pooled split (83.0% vs 95.1%), confirming the importance of person-independent evaluation
- Threshold optimisation alone yields +2-4 pp F1 without retraining
- All three feature channels contribute (removing any one drops F1 by 2-10 pp)
- `s_face` and `ear_right` are the highest-gain features, confirming that head pose and eye state are the strongest focus indicators
- The geometric anchor (70% weight) stabilises the hybrid model against per-person variance

Evaluation logs

Training logs (per-epoch CSVs and JSON summaries) are written to `logs/` by the MLP and XGBoost training scripts.