
data_preparation/

Shared data loading, cleaning, and exploratory analysis.

1. Files

File                     Description
prepare_dataset.py       Central data loading module used by all training scripts and notebooks
data_exploration.ipynb   EDA notebook: feature distributions, class balance, correlations

2. prepare_dataset.py

Provides a consistent pipeline for loading raw .npz data from data/:

Function                                 Purpose
load_all_pooled(model_name)              Load all participants, clean, select features, concatenate
load_per_person(model_name)              Load grouped by person (for LOPO cross-validation)
get_numpy_splits(model_name)             Load + stratified 70/15/15 split + StandardScaler
get_dataloaders(model_name)              Same as above, wrapped in PyTorch DataLoaders
_split_and_scale(features, labels, ...)  Reusable split + optional scaling
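The split-and-scale step can be sketched independently of the repo's data. This is a minimal illustration of a stratified 70/15/15 split with train-only scaler fitting, using synthetic arrays in place of the real pooled features; the variable names and random seed are assumptions, not the module's actual internals:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pooled features/labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# 70% train, then split the remaining 30% evenly into 15% val / 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Fit the scaler on the training split only, then transform all three.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```

Fitting the scaler on the training split only avoids leaking validation/test statistics into training.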

Cleaning rules

  • yaw clipped to [-45, 45], pitch/roll to [-30, 30] (degrees)
  • ear_left, ear_right, ear_avg (eye aspect ratio) clipped to [0, 0.85]
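The clipping rules above amount to element-wise `np.clip` calls. A minimal sketch on hypothetical raw arrays (the real code reads these from the `.npz` files):

```python
import numpy as np

# Hypothetical raw values; in the pipeline these come from data/*.npz.
yaw = np.array([-90.0, 10.0, 60.0])
pitch = np.array([-50.0, 5.0, 40.0])
ear_avg = np.array([-0.1, 0.3, 1.2])

yaw = np.clip(yaw, -45, 45)             # yaw -> [-45, 45]
pitch = np.clip(pitch, -30, 30)         # pitch/roll -> [-30, 30]
ear_avg = np.clip(ear_avg, 0.0, 0.85)   # EAR features -> [0, 0.85]
```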

Selected features (face_orientation)

head_deviation, s_face, s_eye, h_gaze, pitch, ear_left, ear_avg, ear_right, gaze_offset, perclos

3. data_exploration.ipynb

Run from this folder or from the project root. Covers:

  1. Per-feature statistics (mean, std, min, max)
  2. Class distribution (focused vs unfocused)
  3. Feature histograms and box plots
  4. Correlation matrix
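The four steps above map to a few pandas one-liners. A self-contained sketch on synthetic data (column names are illustrative, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the pooled feature table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pitch": rng.normal(0, 10, 200),
    "ear_avg": rng.uniform(0, 0.85, 200),
    "label": rng.integers(0, 2, 200),       # 0 = focused, 1 = unfocused
})

stats = df[["pitch", "ear_avg"]].agg(["mean", "std", "min", "max"])  # per-feature statistics
balance = df["label"].value_counts(normalize=True)                   # class distribution
corr = df[["pitch", "ear_avg"]].corr()                               # correlation matrix
```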

4. How to run

prepare_dataset.py is a library module, not a standalone script. You don’t run it directly; you import it from code that needs data.

From repo root:

# Optional: quick test that loading works
python -c "
from data_preparation.prepare_dataset import load_all_pooled
X, y, names = load_all_pooled('face_orientation')
print(f'Loaded {X.shape[0]} samples, {X.shape[1]} features: {names}')
"

Used by:

  • python -m models.mlp.train
  • python -m models.xgboost.train
  • notebooks/mlp.ipynb, notebooks/xgboost.ipynb
  • data_preparation/data_exploration.ipynb

5. Usage (in code)

from data_preparation.prepare_dataset import load_all_pooled, get_numpy_splits

# pooled data
X, y, names = load_all_pooled("face_orientation")

# ready-to-train splits
splits, n_features, n_classes, scaler = get_numpy_splits("face_orientation")
X_train, y_train = splits["X_train"], splits["y_train"]
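For the PyTorch path, get_dataloaders wraps splits like these in DataLoaders. A minimal sketch of that wrapping with synthetic tensors (the batch size and return shape here are assumptions, not the function's actual signature):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a scaled training split.
X_train = torch.randn(64, 10)
y_train = torch.randint(0, 2, (64,))

# Pair features with labels and batch them for training.
train_loader = DataLoader(
    TensorDataset(X_train, y_train), batch_size=16, shuffle=True)

for xb, yb in train_loader:
    pass  # each iteration yields one (features, labels) mini-batch
```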