# evaluation

Systematic evaluation scripts and generated reports. All evaluation uses Leave-One-Person-Out (LOPO) cross-validation over 9 participants (~145k samples) as the primary generalisation metric.
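LOPO cross-validation holds out all samples from one participant per fold, so the model is always scored on a person it never saw in training. A minimal sketch using scikit-learn's `LeaveOneGroupOut` (the toy data and the assumption that each sample carries a participant group label are illustrative, not this repo's actual loader):

```python
# Sketch of Leave-One-Person-Out (LOPO) splitting with scikit-learn.
# The feature matrix, labels, and group assignment here are toy data.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.random((90, 4))                # toy feature matrix
y = rng.integers(0, 2, size=90)        # toy binary labels
groups = np.repeat(np.arange(9), 10)   # 9 "participants", 10 samples each

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    held_out = groups[test_idx][0]     # the single held-out participant
    # Train on 8 participants, evaluate on the held-out one.
    assert held_out not in groups[train_idx]
```

With 9 participants this yields exactly 9 folds, one per held-out person.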
## Scripts

| Script | What it does | Runtime |
|--------|--------------|---------|
| `justify_thresholds.py` | LOPO threshold search (Youden's J) for MLP and XGBoost; geometric alpha grid search; hybrid w_mlp grid search | ~10-15 min |
| `feature_importance.py` | XGBoost gain importance + leave-one-feature-out LOPO ablation | ~20 min (full) |
| `grouped_split_benchmark.py` | Compares pooled random split vs LOPO on the same XGBoost config | ~5 min |
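The leave-one-feature-out ablation run by `feature_importance.py` re-evaluates with each feature channel dropped and records the score delta against the full-feature baseline. A minimal sketch of that loop (the feature names mirror the report's terminology, but the classifier and `lopo_f1` stand-in are assumptions, not the project's actual API):

```python
# Sketch of a leave-one-feature-out (LOFO) ablation loop.
# LogisticRegression and the toy data stand in for the real
# XGBoost model and LOPO evaluation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
features = ["s_face", "ear_right", "mar"]
X = rng.random((200, len(features)))
y = (X[:, 0] > 0.5).astype(int)        # only the first feature is informative

def lopo_f1(X, y):
    # Stand-in for the real LOPO evaluation; fits and scores in-sample.
    clf = LogisticRegression().fit(X, y)
    return f1_score(y, clf.predict(X))

baseline = lopo_f1(X, y)
deltas = {}
for i, name in enumerate(features):
    X_drop = np.delete(X, i, axis=1)   # retrain without this channel
    deltas[name] = baseline - lopo_f1(X_drop, y)  # positive = channel helps
```

A large positive delta after dropping a channel indicates that channel carries signal the remaining features cannot replace.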
### Quick mode

Add `--quick` to reduce tree count for faster iteration:

```bash
python -m evaluation.grouped_split_benchmark --quick
python -m evaluation.feature_importance --quick --skip-lofo
```
### ClearML support

```bash
USE_CLEARML=1 python -m evaluation.justify_thresholds --clearml
```

Logs threshold search results, weight grid searches, and generated reports as ClearML artifacts.
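A hedged sketch of the `USE_CLEARML=1` opt-in pattern shown above: the artifact upload only runs when the environment variable is set, and the ClearML import stays lazy so the dependency remains optional. The project/task names and report filename are placeholders, not the repo's actual values:

```python
# Sketch of env-var-gated ClearML artifact logging.
# Project/task names below are illustrative placeholders.
import os

def clearml_enabled() -> bool:
    """Mirror the USE_CLEARML=1 opt-in shown in the command above."""
    return os.environ.get("USE_CLEARML") == "1"

def upload_reports() -> None:
    if not clearml_enabled():
        return  # no-op when ClearML is not requested
    from clearml import Task  # lazy import keeps the dependency optional
    task = Task.init(project_name="evaluation", task_name="justify_thresholds")
    task.upload_artifact("threshold_report",
                         artifact_object="THRESHOLD_JUSTIFICATION.md")
```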
## Generated reports

| Report | Contents |
|--------|----------|
| `THRESHOLD_JUSTIFICATION.md` | ML thresholds (MLP t*=0.228, XGBoost t*=0.280), geometric weights (alpha=0.7), hybrid weights (w_mlp=0.3), EAR/MAR physiological constants |
| `GROUPED_SPLIT_BENCHMARK.md` | Pooled (95.1% acc) vs LOPO (83.0% acc) comparison |
| `feature_selection_justification.md` | Domain rationale, XGBoost gain ranking, channel ablation results |
## Generated plots

All plots are in `plots/` and referenced by the generated reports.

### ROC curves (LOPO, 9 folds, ~144k samples)

| Plot | Model | AUC | Optimal threshold |
|------|-------|-----|-------------------|
|  | MLP | 0.862 | 0.228 |
|  | XGBoost | 0.870 | 0.280 |

Red dots mark the Youden's J optimal operating points. Both thresholds fall well below 0.50 due to cross-person probability compression under LOPO.
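Youden's J picks the ROC point that maximises `TPR - FPR`; the threshold at that point is the operating point marked in the plots. A minimal sketch with scikit-learn's `roc_curve` on toy scores (the score construction loosely mimics compressed cross-person probabilities and is not the project's data):

```python
# Sketch of the Youden's J threshold search over a ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Toy "compressed" scores: positives ~0.45-0.55, negatives ~0.15-0.25.
y_score = 0.3 * y_true + 0.15 + 0.1 * rng.random(1000)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                        # Youden's J statistic at each threshold
t_star = thresholds[np.argmax(j)]    # optimal operating point
```

Because the toy classes are fully separable, the best `t_star` here lands well below 0.5, illustrating how compressed score distributions push the optimal threshold away from the naive default.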
### Confusion matrices

| MLP | XGBoost |
|-----|---------|
|  |  |
### Weight grid searches

| Geometric alpha search | Hybrid w_mlp search |
|------------------------|---------------------|
|  |  |

Geometric pipeline: face-dominant weighting (alpha=0.7) generalises best across participants.
Hybrid pipeline: a low MLP weight (0.3) with a strong geometric anchor gives the best LOPO F1 (0.841).
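The two weighted fusions described above can be sketched as convex combinations. The variable names (`s_face`, `p_mlp`, `s_geo`) mirror the reports' terminology, but the exact combination formulas are assumptions about how the weights are applied:

```python
# Sketch of the two weighted fusions; formulas are assumed, not
# taken from the repo's source.
def geometric_score(s_face: float, s_eye: float, alpha: float = 0.7) -> float:
    """Face-dominant convex combination; alpha weights the face channel."""
    return alpha * s_face + (1 - alpha) * s_eye

def hybrid_score(p_mlp: float, s_geo: float, w_mlp: float = 0.3) -> float:
    """Low MLP weight anchored by the geometric score."""
    return w_mlp * p_mlp + (1 - w_mlp) * s_geo
```

Both grid searches then simply sweep `alpha` (or `w_mlp`) over a grid and keep the value with the best LOPO F1.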
### Physiological distributions

| EAR distribution | MAR distribution |
|------------------|------------------|
|  |  |

EAR thresholds (closed=0.16, blink=0.21, open=0.30) and the MAR yawn threshold (0.55) are validated against these distributions.
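A sketch of how those constants could gate per-frame eye and mouth states. The threshold values come from the report above, but the band semantics (in particular the `"transition"` label for the gap between the blink and open cut-offs) are an illustrative assumption:

```python
# Sketch of threshold-based eye/mouth state classification using the
# constants above; band names are illustrative.
EAR_CLOSED, EAR_BLINK, EAR_OPEN = 0.16, 0.21, 0.30
MAR_YAWN = 0.55

def eye_state(ear: float) -> str:
    if ear < EAR_CLOSED:
        return "closed"
    if ear < EAR_BLINK:
        return "blink"
    if ear < EAR_OPEN:
        return "transition"  # assumed label for the intermediate band
    return "open"

def is_yawn(mar: float) -> bool:
    return mar >= MAR_YAWN
```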
## Key findings

1. LOPO accuracy drops ~12 pp vs the pooled split, confirming the importance of person-independent evaluation
2. Threshold optimisation alone yields +2-4 pp F1 without retraining
3. All three feature channels contribute (removing any one drops F1 by 2-10 pp)
4. `s_face` and `ear_right` are the highest-gain features, confirming that head pose and eye state are the strongest focus indicators
5. The geometric anchor (70% weight) stabilises the hybrid model against per-person variance
## Evaluation logs

Training logs (per-epoch CSVs and JSON summaries) are written to `logs/` by the MLP and XGBoost training scripts.