# data_preparation/

Shared data loading, cleaning, and exploratory analysis.
## 1. Files

| File | Description |
|---|---|
| `prepare_dataset.py` | Central data loading module used by all training scripts and notebooks |
| `data_exploration.ipynb` | EDA notebook: feature distributions, class balance, correlations |
## 2. prepare_dataset.py

Provides a consistent pipeline for loading raw `.npz` data from `data/`:

| Function | Purpose |
|---|---|
| `load_all_pooled(model_name)` | Load all participants, clean, select features, concatenate |
| `load_per_person(model_name)` | Load grouped by person (for LOPO cross-validation) |
| `get_numpy_splits(model_name)` | Load + stratified 70/15/15 split + StandardScaler |
| `get_dataloaders(model_name)` | Same as above, wrapped in PyTorch DataLoaders |
| `_split_and_scale(features, labels, ...)` | Reusable split + optional scaling |
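The stratified 70/15/15 split behind `get_numpy_splits` / `_split_and_scale` can be sketched as follows. This is a hypothetical re-implementation using scikit-learn; the function name, `seed` parameter, and return layout are assumptions, not the module's actual internals:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, seed=42):
    """Stratified 70/15/15 split; StandardScaler fit on the train set only."""
    # First split off 30% for validation + test, keeping class ratios.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    # Split the 30% in half: 15% validation, 15% test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    # Fit scaling statistics on train only to avoid leakage.
    scaler = StandardScaler().fit(X_train)
    return (scaler.transform(X_train), y_train,
            scaler.transform(X_val), y_val,
            scaler.transform(X_test), y_test, scaler)
```

Fitting the scaler on the training split alone is the standard way to keep validation/test statistics out of the model.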
### Cleaning rules

- `yaw` clipped to [-45, 45], `pitch`/`roll` to [-30, 30]
- `ear_left`, `ear_right`, `ear_avg` clipped to [0, 0.85]
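A minimal sketch of those clipping rules, assuming a pandas DataFrame whose columns use the feature names above (the real cleaning code may be structured differently):

```python
import pandas as pd

# Clip ranges taken from the cleaning rules above.
CLIP_RANGES = {
    "yaw": (-45, 45),
    "pitch": (-30, 30),
    "roll": (-30, 30),
    "ear_left": (0, 0.85),
    "ear_right": (0, 0.85),
    "ear_avg": (0, 0.85),
}

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clip each known column to its valid range; leave other columns untouched."""
    df = df.copy()
    for col, (lo, hi) in CLIP_RANGES.items():
        if col in df.columns:
            df[col] = df[col].clip(lo, hi)
    return df
```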
### Selected features (`face_orientation`)

`head_deviation`, `s_face`, `s_eye`, `h_gaze`, `pitch`, `ear_left`, `ear_avg`, `ear_right`, `gaze_offset`, `perclos`
## 3. data_exploration.ipynb
Run from this folder or from the project root. Covers:
- Per-feature statistics (mean, std, min, max)
- Class distribution (focused vs unfocused)
- Feature histograms and box plots
- Correlation matrix
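Those summary steps can be approximated in a few lines of pandas. This is a sketch on synthetic stand-in data, not the notebook's actual cells:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for (X, y, names) as returned by load_all_pooled.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.integers(0, 2, 200)
names = ["pitch", "ear_avg", "perclos"]

df = pd.DataFrame(X, columns=names)
stats = df.agg(["mean", "std", "min", "max"])        # per-feature statistics
balance = pd.Series(y).value_counts(normalize=True)  # class distribution
corr = df.corr()                                     # correlation matrix
```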
## 4. How to run

`prepare_dataset.py` is a library module, not a standalone script. You don't run it directly; you import it from code that needs data.

From the repo root:

```bash
# Optional: quick test that loading works
python -c "
from data_preparation.prepare_dataset import load_all_pooled
X, y, names = load_all_pooled('face_orientation')
print(f'Loaded {X.shape[0]} samples, {X.shape[1]} features: {names}')
"
```
Used by:

- `python -m models.mlp.train`
- `python -m models.xgboost.train`
- `notebooks/mlp.ipynb`, `notebooks/xgboost.ipynb`
- `data_preparation/data_exploration.ipynb`
## 5. Usage (in code)

```python
from data_preparation.prepare_dataset import load_all_pooled, get_numpy_splits

# pooled data
X, y, names = load_all_pooled("face_orientation")

# ready-to-train splits
splits, n_features, n_classes, scaler = get_numpy_splits("face_orientation")
X_train, y_train = splits["X_train"], splits["y_train"]
```
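For the PyTorch path, `get_dataloaders` wraps splits like these in DataLoaders. A minimal sketch of that wrapping (the helper name `make_loader` and its defaults are assumptions for illustration):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(X: np.ndarray, y: np.ndarray,
                batch_size: int = 64, shuffle: bool = True) -> DataLoader:
    """Wrap one (features, labels) split in a PyTorch DataLoader."""
    dataset = TensorDataset(torch.as_tensor(X, dtype=torch.float32),
                            torch.as_tensor(y, dtype=torch.long))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
```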