# data_preparation/
Shared data loading, cleaning, and exploratory analysis.
## 1. Files
| File | Description |
|------|-------------|
| `prepare_dataset.py` | Central data loading module used by all training scripts and notebooks |
| `data_exploration.ipynb` | EDA notebook: feature distributions, class balance, correlations |
## 2. prepare_dataset.py
Provides a consistent pipeline for loading raw `.npz` data from `data/`:
| Function | Purpose |
|----------|---------|
| `load_all_pooled(model_name)` | Load all participants, clean, select features, concatenate |
| `load_per_person(model_name)` | Load grouped by person (for LOPO cross-validation) |
| `get_numpy_splits(model_name)` | Load + stratified 70/15/15 split + StandardScaler |
| `get_dataloaders(model_name)` | Same as above, wrapped in PyTorch DataLoaders |
| `_split_and_scale(features, labels, ...)` | Reusable split + optional scaling |
### Cleaning rules
- `yaw` clipped to [-45, 45], `pitch`/`roll` to [-30, 30]
- `ear_left`, `ear_right`, `ear_avg` clipped to [0, 0.85]
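The rules above amount to per-column clipping; a minimal NumPy sketch (column names and layout here are illustrative, not the module's actual data structures):

```python
import numpy as np

# Allowed range per feature, mirroring the cleaning rules above.
CLIP_RANGES = {
    "yaw": (-45, 45),
    "pitch": (-30, 30),
    "roll": (-30, 30),
    "ear_left": (0, 0.85),
    "ear_right": (0, 0.85),
    "ear_avg": (0, 0.85),
}

def clean(features):
    """Clip each named feature array to its allowed range; pass others through."""
    return {
        name: np.clip(col, *CLIP_RANGES[name]) if name in CLIP_RANGES else col
        for name, col in features.items()
    }

raw = {"yaw": np.array([-90.0, 10.0, 60.0]), "perclos": np.array([0.1, 0.2, 0.3])}
cleaned = clean(raw)
print(cleaned["yaw"])  # [-45.  10.  45.]
```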
### Selected features (face_orientation)
`head_deviation`, `s_face`, `s_eye`, `h_gaze`, `pitch`, `ear_left`, `ear_avg`, `ear_right`, `gaze_offset`, `perclos`
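The per-person grouping returned by `load_per_person` is what enables LOPO (leave-one-person-out) cross-validation. A minimal sketch using scikit-learn's `LeaveOneGroupOut` on synthetic data (participant IDs and shapes are illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))          # placeholder features
y = rng.integers(0, 2, size=12)       # placeholder labels
groups = np.repeat([0, 1, 2], 4)      # 3 participants, 4 samples each

# Each fold holds out all samples from exactly one participant.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    held_out = np.unique(groups[test_idx])
    print(f"held-out participant: {held_out}, train size: {len(train_idx)}")
```

Grouping by participant prevents samples from the same person appearing in both train and test folds, which would inflate evaluation scores.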
## 3. data_exploration.ipynb
Run from this folder or from the project root. Covers:
1. Per-feature statistics (mean, std, min, max)
2. Class distribution (focused vs unfocused)
3. Feature histograms and box plots
4. Correlation matrix
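Steps 1 and 2 can be sketched with pandas on synthetic data (the notebook runs on the real features loaded via `prepare_dataset`; column names here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)),
                  columns=["pitch", "ear_avg", "perclos"])
df["label"] = rng.integers(0, 2, size=500)  # 0 = focused, 1 = unfocused

# Step 1: per-feature statistics (mean, std, min, max), one row per feature.
stats = df.drop(columns="label").agg(["mean", "std", "min", "max"]).T
print(stats)

# Step 2: class distribution as fractions.
print(df["label"].value_counts(normalize=True))
```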
## 4. How to run
`prepare_dataset.py` is a **library module**, not a standalone script. You don’t run it directly; you import it from code that needs data.
**From repo root:**
```bash
# Optional: quick test that loading works
python -c "
from data_preparation.prepare_dataset import load_all_pooled
X, y, names = load_all_pooled('face_orientation')
print(f'Loaded {X.shape[0]} samples, {X.shape[1]} features: {names}')
"
```
**Used by:**
- `python -m models.mlp.train`
- `python -m models.xgboost.train`
- `notebooks/mlp.ipynb`, `notebooks/xgboost.ipynb`
- `data_preparation/data_exploration.ipynb`
## 5. Usage (in code)
```python
from data_preparation.prepare_dataset import load_all_pooled, get_numpy_splits
# pooled data
X, y, names = load_all_pooled("face_orientation")
# ready-to-train splits
splits, n_features, n_classes, scaler = get_numpy_splits("face_orientation")
X_train, y_train = splits["X_train"], splits["y_train"]
```