# data/

## Layout

One directory per contributor: `collected_<name>/` with one or more `.npz` files per session.  
`collect_features.py` appends timestamped files when someone records again (e.g. `collected_Kexin/` has two sessions).
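
The layout above can be walked with a few lines of `pathlib`. This is a minimal sketch, not part of the repo: the helper name `sessions_by_contributor` is hypothetical, and it assumes only the `collected_<name>/*.npz` pattern described here.

```python
from collections import defaultdict
from pathlib import Path


def sessions_by_contributor(data_dir):
    """Map contributor name -> sorted list of .npz session paths.

    Hypothetical helper; assumes the collected_<name>/*.npz layout.
    """
    sessions = defaultdict(list)
    for path in sorted(Path(data_dir).glob("collected_*/*.npz")):
        # The directory is named collected_<name>; strip the prefix to get <name>.
        name = path.parent.name[len("collected_"):]
        sessions[name].append(path)
    return dict(sessions)
```

Calling `sessions_by_contributor("data")` from the repo root should then give one entry per contributor, with more than one path wherever someone recorded again.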

Each `.npz` holds:

- `features` — N×17 (training uses **10** of these for the `face_orientation` set; see `data_preparation/`)
- `labels` — 0 = unfocused, 1 = focused (live key presses while recording)
- `feature_names` — names for all 17 columns
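
A session file can be opened with `numpy.load` and checked against the three keys listed above. The sketch below is illustrative: `load_session` and `select_columns` are hypothetical helper names, and it assumes `feature_names` is stored as a plain string array (an object array would need `allow_pickle=True`). The actual 10-feature subset is defined in `data_preparation/`, so the names passed to `select_columns` should come from there.

```python
import numpy as np


def load_session(path):
    """Load one session .npz and sanity-check its arrays (hypothetical helper)."""
    # Assumption: feature_names is a string array, so the default
    # allow_pickle=False is enough.
    with np.load(path) as data:
        features = data["features"]
        labels = data["labels"]
        feature_names = [str(n) for n in data["feature_names"]]

    if features.ndim != 2 or features.shape[1] != len(feature_names):
        raise ValueError(
            f"{path}: features {features.shape} vs {len(feature_names)} names"
        )
    if labels.shape[0] != features.shape[0]:
        raise ValueError(
            f"{path}: {labels.shape[0]} labels vs {features.shape[0]} rows"
        )
    return features, labels, feature_names


def select_columns(features, feature_names, wanted):
    """Return the columns named in `wanted`, in that order."""
    index = {name: i for i, name in enumerate(feature_names)}
    return features[:, [index[name] for name in wanted]]
```

Selecting by name rather than by position keeps the subset stable if the column order in `feature_names` ever changes.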

## What we have (pooled)

Roughly **144.8k** samples from **10** `.npz` sessions across **9** people. Session sizes vary widely (~8.7k–17.6k samples), so the pool is not one uniform block: it mixes different setups, days, and recording lengths.

| Aspect | Snapshot |
|--------|----------|
| **Labels** | ~55.8k unfocused / ~89.0k focused (~39% / ~61%) |
| **Temporal mix** | Hundreds of focus ↔ unfocus **transitions** in the pooled timeline (not one long stuck label) |
| **Signals** | Same 10 inference features as in production: head deviation, face/eye scores, horizontal gaze, pitch, EAR (left/avg/right), gaze offset, PERCLOS — pose + eyes + short-window drowsiness |

Run **`data_preparation/data_exploration.ipynb`** for histograms, label-over-time plots, feature–label correlations, correlation matrix, and the small quality checklist (sample count, class balance band, transition count).
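
For a quick look without opening the notebook, the checklist quantities (sample count, class balance, transition count) can be approximated from a label array alone. This is a rough sketch under the 0/1 label convention above; `label_summary` is a hypothetical name and the notebook's own implementation and thresholds may differ.

```python
import numpy as np


def label_summary(labels):
    """Sample count, class counts, and focus/unfocus transitions (hypothetical helper)."""
    labels = np.asarray(labels).ravel()
    n = int(labels.size)
    focused = int(np.count_nonzero(labels == 1))
    unfocused = int(np.count_nonzero(labels == 0))
    # A transition is any position where consecutive labels differ.
    transitions = int(np.count_nonzero(np.diff(labels) != 0))
    return {
        "samples": n,
        "unfocused": unfocused,
        "focused": focused,
        "focused_fraction": focused / n if n else float("nan"),
        "transitions": transitions,
    }
```

When pooling, it is probably safer to call this per session and sum the transition counts, since concatenating sessions first can add a spurious transition at each file boundary.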

## Collect more

```bash
python -m models.collect_features --name yourname
```

Recording uses the webcam with an on-screen overlay. Keys: **1** = focused, **0** = unfocused, **p** = pause, **q** = save and quit.