# data/

## Layout

One directory per contributor: `collected_/` with one or more `.npz` files per session. `collect_features.py` appends timestamped files when someone records again (e.g. `collected_Kexin/` has two sessions).

Each `.npz` holds:

- `features`: N×17 (training uses **10** of these for the `face_orientation` set; see `data_preparation/`)
- `labels`: 0 = unfocused, 1 = focused (live key presses while recording)
- `feature_names`: names for all 17 columns

## What we have (pooled)

Roughly **144.8k** samples from **10** `.npz` sessions across **9** people. Session sizes vary a lot (~8.7k–17.6k samples), so the pool isn’t one uniform block: it covers different setups, days, and recording lengths.

| Aspect | Snapshot |
|--------|----------|
| **Labels** | ~55.8k unfocused / ~89.0k focused (~39% / ~61%) |
| **Temporal mix** | Hundreds of focus ↔ unfocus **transitions** in the pooled timeline (not one long stuck label) |
| **Signals** | Same 10 inference features as in production: head deviation, face/eye scores, horizontal gaze, pitch, EAR (left/avg/right), gaze offset, PERCLOS (pose + eyes + short-window drowsiness) |

Run **`data_preparation/data_exploration.ipynb`** for histograms, label-over-time plots, feature–label correlations, correlation matrix, and the small quality checklist (sample count, class balance band, transition count).

## Collect more

```bash
python -m models.collect_features --name yourname
```

Webcam + overlay: **1** = focused, **0** = unfocused, **p** = pause, **q** = save and quit.
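
After recording, a quick sanity check of a session file might look like the sketch below. It is not part of the repo: the function names `summarize_session` and `summarize_pool` are hypothetical, and it assumes only what this README states, namely that each `.npz` has `features`, `labels`, and `feature_names` arrays and that session files live under `collected_*/` directories. It reports the same numbers as the notebook's quality checklist (sample count, class balance, transition count).

```python
from pathlib import Path

import numpy as np


def summarize_session(npz_path):
    """Return basic quality numbers for one recorded session file."""
    # allow_pickle=True is an assumption in case feature_names was saved as an
    # object array; only load files you trust this way.
    with np.load(npz_path, allow_pickle=True) as data:
        features = data["features"]                  # expected shape: (N, 17)
        labels = data["labels"].astype(int).ravel()  # 0 = unfocused, 1 = focused
        feature_names = [str(name) for name in data["feature_names"]]

    n_samples = int(labels.shape[0])
    n_focused = int((labels == 1).sum())
    # A transition is any position where the label differs from the previous one.
    n_transitions = int(np.count_nonzero(np.diff(labels)))

    return {
        "samples": n_samples,
        "columns": int(features.shape[1]),
        "feature_names": feature_names,
        "focused_frac": n_focused / n_samples if n_samples else 0.0,
        "transitions": n_transitions,
    }


def summarize_pool(data_root="data"):
    """Summarize every .npz found under collected_*/ directories in data_root."""
    return {
        str(path): summarize_session(path)
        for path in sorted(Path(data_root).glob("collected_*/*.npz"))
    }
```

Calling `summarize_pool("data")` from the repo root (assuming that is where this directory lives) returns one entry per session file, which makes it easy to spot a session with very few transitions or a heavily skewed label split before it is pooled.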