IntegrationTest / data_preparation /README.md

Yingtao-Zheng

Add other files and folders, including data related, notebook, test and evaluation

24a5e7e 9 days ago

	# data_preparation/

	Shared data loading, cleaning, and exploratory analysis.

	## 1. Files

	| File | Description |
	|------|-------------|
	| `prepare_dataset.py` | Central data loading module used by all training scripts and notebooks |
	| `data_exploration.ipynb` | EDA notebook: feature distributions, class balance, correlations |

	## 2. prepare_dataset.py

	Provides a consistent pipeline for loading raw `.npz` data from `data/`:

	| Function | Purpose |
	|----------|---------|
	| `load_all_pooled(model_name)` | Load all participants, clean, select features, concatenate |
	| `load_per_person(model_name)` | Load grouped by person (for LOPO cross-validation) |
	| `get_numpy_splits(model_name)` | Load + stratified 70/15/15 split + StandardScaler |
	| `get_dataloaders(model_name)` | Same as above, wrapped in PyTorch DataLoaders |
	| `_split_and_scale(features, labels, ...)` | Reusable split + optional scaling |
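
To make the "stratified 70/15/15 split + StandardScaler" step concrete, here is a minimal sketch built on scikit-learn. It is not the module's actual implementation: the function name `split_and_scale_sketch`, the `seed` parameter, the `X_val`/`X_test` keys, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale_sketch(features, labels, seed=42):
    """Hypothetical stand-in for a stratified 70/15/15 split with scaling."""
    # First cut: 70% train, 30% held out, preserving class proportions.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # Second cut: halve the held-out 30% into two 15% parts.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed
    )
    # Fit the scaler on the training part only, so no statistics leak
    # from the held-out parts into training.
    scaler = StandardScaler().fit(X_train)
    splits = {
        "X_train": scaler.transform(X_train), "y_train": y_train,
        "X_val": scaler.transform(X_val), "y_val": y_val,
        "X_test": scaler.transform(X_test), "y_test": y_test,
    }
    return splits, scaler


# Synthetic, balanced two-class data (not the project's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = np.repeat([0, 1], 100)
splits, scaler = split_and_scale_sketch(X_demo, y_demo)
```

Splitting twice is one common way to get three stratified parts from `train_test_split`, which only produces two at a time.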

	### Cleaning rules

	- `yaw` clipped to [-45, 45], `pitch`/`roll` to [-30, 30]
	- `ear_left`, `ear_right`, `ear_avg` clipped to [0, 0.85]
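
The clipping rules above could be applied roughly as follows. This is a sketch only: the `CLIP_RANGES` dictionary, the `clip_features` helper, and the demo array are hypothetical, and the real module may organise the step differently.

```python
import numpy as np

# Ranges copied from the cleaning rules; the dict layout is an assumption.
CLIP_RANGES = {
    "yaw": (-45.0, 45.0),
    "pitch": (-30.0, 30.0),
    "roll": (-30.0, 30.0),
    "ear_left": (0.0, 0.85),
    "ear_right": (0.0, 0.85),
    "ear_avg": (0.0, 0.85),
}


def clip_features(X, names):
    """Return a copy of X with each listed column clipped to its range."""
    X = np.array(X, dtype=float)  # copy, so the caller's array is untouched
    for col, name in enumerate(names):
        if name in CLIP_RANGES:
            lo, hi = CLIP_RANGES[name]
            X[:, col] = np.clip(X[:, col], lo, hi)
    return X


# Made-up rows with out-of-range values in every column.
demo = np.array([[60.0, -40.0, 1.2], [-50.0, 10.0, 0.3]])
cleaned = clip_features(demo, ["yaw", "pitch", "ear_avg"])
```

Columns whose names are not in the dictionary pass through unchanged.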

	### Selected features (face_orientation)

	`head_deviation`, `s_face`, `s_eye`, `h_gaze`, `pitch`, `ear_left`, `ear_avg`, `ear_right`, `gaze_offset`, `perclos`

	## 3. data_exploration.ipynb

	Run from this folder or from the project root. Covers:

	1. Per-feature statistics (mean, std, min, max)
	2. Class distribution (focused vs unfocused)
	3. Feature histograms and box plots
	4. Correlation matrix
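
For readers without the notebook open, items 1, 2, and 4 can be approximated with plain NumPy. The sketch below runs on synthetic data with an assumed 0/1 label encoding. It is not taken from the notebook, and plotting (item 3) is left out.

```python
import numpy as np

# Synthetic stand-ins for the feature matrix and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
names = ["pitch", "ear_avg", "perclos"]

# 1. Per-feature statistics
stats = {
    name: {
        "mean": X[:, i].mean(),
        "std": X[:, i].std(),
        "min": X[:, i].min(),
        "max": X[:, i].max(),
    }
    for i, name in enumerate(names)
}

# 2. Class distribution: number of samples per label value
class_counts = np.bincount(y, minlength=2)

# 4. Correlation matrix, treating columns as variables
corr = np.corrcoef(X, rowvar=False)
```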

	## 4. How to run

	`prepare_dataset.py` is a library module, not a standalone script. You don’t run it directly; you import it from code that needs data.

	From repo root:

	```bash
	# Optional: quick test that loading works
	python -c "
	from data_preparation.prepare_dataset import load_all_pooled
	X, y, names = load_all_pooled('face_orientation')
	print(f'Loaded {X.shape[0]} samples, {X.shape[1]} features: {names}')
	"
	```

	Used by:

	- `python -m models.mlp.train`
	- `python -m models.xgboost.train`
	- `notebooks/mlp.ipynb`, `notebooks/xgboost.ipynb`
	- `data_preparation/data_exploration.ipynb`

	## 5. Usage (in code)

	```python
	from data_preparation.prepare_dataset import load_all_pooled, get_numpy_splits

	# pooled data
	X, y, names = load_all_pooled("face_orientation")

	# ready-to-train splits
	splits, n_features, n_classes, scaler = get_numpy_splits("face_orientation")
	X_train, y_train = splits["X_train"], splits["y_train"]
	```
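
`load_per_person` is described above as the entry point for LOPO (leave-one-person-out) cross-validation. Its return type is not documented here, so the sketch below assumes a hypothetical dictionary mapping a person id to a `(features, labels)` pair and uses synthetic data. Adapt it to whatever the function actually returns.

```python
import numpy as np

# Hypothetical structure: person id -> (features, labels).
rng = np.random.default_rng(0)
per_person = {
    pid: (rng.normal(size=(20, 4)), rng.integers(0, 2, size=20))
    for pid in ["p01", "p02", "p03"]
}


def lopo_folds(per_person):
    """Yield (held_out_id, X_train, y_train, X_test, y_test), one fold per person."""
    for held_out in per_person:
        X_test, y_test = per_person[held_out]
        # Train on everyone except the held-out person.
        train = [per_person[p] for p in per_person if p != held_out]
        X_train = np.concatenate([X for X, _ in train])
        y_train = np.concatenate([y for _, y in train])
        yield held_out, X_train, y_train, X_test, y_test


folds = list(lopo_folds(per_person))
```

Each fold tests on one person's samples and trains on the rest, so no individual appears on both sides of the split.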