
data_preparation/

Shared data loading, cleaning, and exploratory analysis.

1. Files

File                     Description
prepare_dataset.py       Central data loading module used by all training scripts and notebooks
data_exploration.ipynb   EDA notebook: feature distributions, class balance, correlations

2. prepare_dataset.py

Provides a consistent pipeline for loading raw .npz data from data/:

Function                                 Purpose
load_all_pooled(model_name)              Load all participants, clean, select features, concatenate
load_per_person(model_name)              Load grouped by person (for LOPO cross-validation)
get_numpy_splits(model_name)             Load + stratified 70/15/15 split + StandardScaler
get_dataloaders(model_name)              Same as above, wrapped in PyTorch DataLoaders
_split_and_scale(features, labels, ...)  Reusable split + optional scaling
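The split-and-scale step can be sketched independently of the repo's data. This is a minimal illustration of a stratified 70/15/15 split with train-only scaler fitting, using synthetic arrays in place of the real pooled features; the variable names and random seed are assumptions, not the module's actual internals:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pooled features/labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# 70% train, then split the remaining 30% evenly into 15% val / 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Fit the scaler on the training split only, then transform all three.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```

Fitting the scaler on the training split only avoids leaking validation/test statistics into training.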

Cleaning rules

  • yaw clipped to [-45, 45], pitch/roll to [-30, 30] (degrees)
  • ear_left, ear_right, ear_avg (eye aspect ratio) clipped to [0, 0.85]
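The clipping rules above amount to element-wise `np.clip` calls. A minimal sketch on hypothetical raw arrays (the real code reads these from the `.npz` files):

```python
import numpy as np

# Hypothetical raw values; in the pipeline these come from data/*.npz.
yaw = np.array([-90.0, 10.0, 60.0])
pitch = np.array([-50.0, 5.0, 40.0])
ear_avg = np.array([-0.1, 0.3, 1.2])

yaw = np.clip(yaw, -45, 45)             # yaw -> [-45, 45]
pitch = np.clip(pitch, -30, 30)         # pitch/roll -> [-30, 30]
ear_avg = np.clip(ear_avg, 0.0, 0.85)   # EAR features -> [0, 0.85]
```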

Selected features (face_orientation)

head_deviation, s_face, s_eye, h_gaze, pitch, ear_left, ear_avg, ear_right, gaze_offset, perclos

3. data_exploration.ipynb

Run from this folder or from the project root. Covers:

  1. Per-feature statistics (mean, std, min, max)
  2. Class distribution (focused vs unfocused)
  3. Feature histograms and box plots
  4. Correlation matrix
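The four steps above map to a few pandas one-liners. A self-contained sketch on synthetic data (column names are illustrative, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the pooled feature table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pitch": rng.normal(0, 10, 200),
    "ear_avg": rng.uniform(0, 0.85, 200),
    "label": rng.integers(0, 2, 200),       # 0 = focused, 1 = unfocused
})

stats = df[["pitch", "ear_avg"]].agg(["mean", "std", "min", "max"])  # per-feature statistics
balance = df["label"].value_counts(normalize=True)                   # class distribution
corr = df[["pitch", "ear_avg"]].corr()                               # correlation matrix
```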

4. How to run

prepare_dataset.py is a library module, not a standalone script. You don’t run it directly; you import it from code that needs data.

From repo root:

# Optional: quick test that loading works
python -c "
from data_preparation.prepare_dataset import load_all_pooled
X, y, names = load_all_pooled('face_orientation')
print(f'Loaded {X.shape[0]} samples, {X.shape[1]} features: {names}')
"

Used by:

  • python -m models.mlp.train
  • python -m models.xgboost.train
  • notebooks/mlp.ipynb, notebooks/xgboost.ipynb
  • data_preparation/data_exploration.ipynb

5. Usage (in code)

from data_preparation.prepare_dataset import load_all_pooled, get_numpy_splits

# pooled data
X, y, names = load_all_pooled("face_orientation")

# ready-to-train splits
splits, n_features, n_classes, scaler = get_numpy_splits("face_orientation")
X_train, y_train = splits["X_train"], splits["y_train"]
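For the PyTorch path, get_dataloaders wraps splits like these in DataLoaders. A minimal sketch of that wrapping with synthetic tensors (the batch size and return shape here are assumptions, not the function's actual signature):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a scaled training split.
X_train = torch.randn(64, 10)
y_train = torch.randint(0, 2, (64,))

# Pair features with labels and batch them for training.
train_loader = DataLoader(
    TensorDataset(X_train, y_train), batch_size=16, shuffle=True)

for xb, yb in train_loader:
    pass  # each iteration yields one (features, labels) mini-batch
```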