# data_preparation/

Shared data loading, cleaning, and exploratory analysis.

## 1. Files

| File | Description |
|------|-------------|
| `prepare_dataset.py` | Central data loading module used by all training scripts and notebooks |
| `data_exploration.ipynb` | EDA notebook: feature distributions, class balance, correlations |

## 2. prepare_dataset.py

Provides a consistent pipeline for loading raw `.npz` data from `data/`:

| Function | Purpose |
|----------|---------|
| `load_all_pooled(model_name)` | Load all participants, clean, select features, concatenate |
| `load_per_person(model_name)` | Load data grouped by person (for LOPO cross-validation) |
| `get_numpy_splits(model_name)` | Load + stratified 70/15/15 split + StandardScaler |
| `get_dataloaders(model_name)` | Same as `get_numpy_splits`, wrapped in PyTorch DataLoaders |
| `_split_and_scale(features, labels, ...)` | Reusable split + optional scaling |

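
The "stratified 70/15/15 split + StandardScaler" step in the table above can be sketched roughly as follows. This is not the module's actual code: `split_and_scale_sketch`, its parameters, the `X_val`/`X_test` key names, and the reading of 70/15/15 as train/validation/test are assumptions made for illustration, and the example uses scikit-learn on synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale_sketch(features, labels, seed=42, scale=True):
    """Hypothetical stand-in: stratified 70/15/15 split with optional scaling."""
    # First cut: 70% train, 30% held out, preserving class proportions.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # Second cut: halve the held-out 30% into 15% validation and 15% test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed
    )
    scaler = None
    if scale:
        # Fit on the training split only, so val/test statistics do not leak.
        scaler = StandardScaler().fit(X_train)
        X_train, X_val, X_test = (
            scaler.transform(a) for a in (X_train, X_val, X_test)
        )
    splits = {
        "X_train": X_train, "y_train": y_train,
        "X_val": X_val, "y_val": y_val,      # key names are assumed
        "X_test": X_test, "y_test": y_test,  # key names are assumed
    }
    return splits, scaler


# Synthetic demo data: 200 samples, 10 features, balanced binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = np.array([0, 1] * 100)
splits_demo, scaler_demo = split_and_scale_sketch(X_demo, y_demo)
print({name: arr.shape for name, arr in splits_demo.items()})
```

Fitting the scaler on the training split alone is the usual way to keep validation and test data out of the normalisation statistics; whether `_split_and_scale` does exactly this is not confirmed here.
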
### Cleaning rules

- `yaw` clipped to [-45, 45], `pitch`/`roll` to [-30, 30]
- `ear_left`, `ear_right`, `ear_avg` clipped to [0, 0.85]

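
The clipping rules above can be expressed with `np.clip`. The sketch below is only an illustration under assumptions: the `CLIP_RANGES` dict, the `clip_features` helper, and the column layout of the demo array are hypothetical, and only the ranges themselves come from the rules listed here.

```python
import numpy as np

# Ranges copied from the cleaning rules; the dict layout is an assumption.
CLIP_RANGES = {
    "yaw": (-45.0, 45.0),
    "pitch": (-30.0, 30.0),
    "roll": (-30.0, 30.0),
    "ear_left": (0.0, 0.85),
    "ear_right": (0.0, 0.85),
    "ear_avg": (0.0, 0.85),
}


def clip_features(X, feature_names):
    """Return a copy of X with every known column clipped to its range."""
    X = np.array(X, dtype=float, copy=True)  # leave the caller's array untouched
    for col, name in enumerate(feature_names):
        if name in CLIP_RANGES:
            lo, hi = CLIP_RANGES[name]
            X[:, col] = np.clip(X[:, col], lo, hi)
    return X


# Two synthetic rows: the first is out of range in every column.
demo_names = ["yaw", "pitch", "ear_left"]
demo = np.array([[60.0, -40.0, 1.2],
                 [-10.0, 5.0, 0.3]])
cleaned = clip_features(demo, demo_names)
print(cleaned)
```
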
### Selected features (face_orientation)

`head_deviation`, `s_face`, `s_eye`, `h_gaze`, `pitch`, `ear_left`, `ear_avg`, `ear_right`, `gaze_offset`, `perclos`

## 3. data_exploration.ipynb

Run from this folder or from the project root. Covers:

1. Per-feature statistics (mean, std, min, max)
2. Class distribution (focused vs unfocused)
3. Feature histograms and box plots
4. Correlation matrix

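
For readers without the notebook open, the numeric parts of that list (statistics, class balance, correlations) might look roughly like the sketch below. It runs on synthetic data; the helper names and the 0/1 label encoding are assumptions, not the notebook's actual code.

```python
import numpy as np


def feature_stats(X, names):
    """Per-feature mean, std, min, and max, keyed by feature name."""
    return {
        name: {
            "mean": float(X[:, i].mean()),
            "std": float(X[:, i].std()),
            "min": float(X[:, i].min()),
            "max": float(X[:, i].max()),
        }
        for i, name in enumerate(names)
    }


def class_distribution(y):
    """Map each label to its sample count."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): int(count) for label, count in zip(labels, counts)}


# Synthetic stand-in data; real arrays would come from load_all_pooled.
X_demo = np.array([[0.1, 10.0], [0.3, 20.0], [0.5, 30.0], [0.7, 40.0]])
y_demo = np.array([0, 1, 1, 1])  # assumed encoding: one integer per class

stats = feature_stats(X_demo, ["ear_avg", "pitch"])
balance = class_distribution(y_demo)
corr = np.corrcoef(X_demo, rowvar=False)  # feature-by-feature correlation matrix
print(stats["pitch"], balance, corr.shape)
```
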
## 4. How to run

`prepare_dataset.py` is a **library module**, not a standalone script. You don’t run it directly; you import it from code that needs data.

**From repo root:**

```bash
# Optional: quick test that loading works
python -c "
from data_preparation.prepare_dataset import load_all_pooled
X, y, names = load_all_pooled('face_orientation')
print(f'Loaded {X.shape[0]} samples, {X.shape[1]} features: {names}')
"
```

**Used by:**

- `python -m models.mlp.train`
- `python -m models.xgboost.train`
- `notebooks/mlp.ipynb`, `notebooks/xgboost.ipynb`
- `data_preparation/data_exploration.ipynb`

## 5. Usage (in code)

```python
from data_preparation.prepare_dataset import load_all_pooled, get_numpy_splits

# pooled data
X, y, names = load_all_pooled("face_orientation")

# ready-to-train splits
splits, n_features, n_classes, scaler = get_numpy_splits("face_orientation")
X_train, y_train = splits["X_train"], splits["y_train"]
```
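
`load_per_person` is listed as the entry point for LOPO cross-validation, where each participant is held out in turn. Its return type is not documented here, so the sketch below assumes a dict mapping a participant id to an `(X, y)` pair and uses synthetic data; `lopo_folds` is a hypothetical helper, not part of `prepare_dataset.py`.

```python
import numpy as np


def lopo_folds(per_person):
    """Yield (held_out, X_train, y_train, X_test, y_test) per participant.

    `per_person` is assumed to map a participant id to an (X, y) pair;
    the real return type of load_per_person may differ.
    """
    for held_out in per_person:
        X_test, y_test = per_person[held_out]
        # Train on everyone except the held-out participant.
        X_train = np.concatenate(
            [X for person, (X, _) in per_person.items() if person != held_out]
        )
        y_train = np.concatenate(
            [y for person, (_, y) in per_person.items() if person != held_out]
        )
        yield held_out, X_train, y_train, X_test, y_test


# Synthetic stand-in for three participants with 4, 5, and 6 samples.
rng = np.random.default_rng(0)
demo = {
    name: (rng.normal(size=(n, 10)), rng.integers(0, 2, size=n))
    for name, n in [("p1", 4), ("p2", 5), ("p3", 6)]
}
fold_sizes = {
    held: (len(X_tr), len(X_te)) for held, X_tr, _, X_te, _ in lopo_folds(demo)
}
print(fold_sizes)
```

Any scaling would normally be refit inside each fold on the training participants only, so the held-out person stays unseen.
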