Tabular_AutoML / SYSTEM_STRUCTURE.md
# AutoML System — Complete Working Structure
---
## OVERVIEW: How the System Flows
```
CSV File
[STEP 1] CLI Parsing → reads --data --target --task --time_budget etc.
[STEP 2] Dataset Analyzer → inspects rows, cols, types, missing, imbalance
[STEP 3] Train/Val/Test Split → 70% / 15% / 15%
[STEP 4] Preprocessing → impute → scale / encode → tensor
[STEP 5] Feature Engineering → original + log + sqrt + polynomial → SelectKBest
[STEP 6] Model Training Loop
├── sklearn models → Optuna HPO → final fit → evaluate
└── pytorch models → Optuna HPO → full train loop → evaluate
[STEP 7] Leaderboard → rank all models by primary score
[STEP 8] Best Model Selected → highest primary score wins
[STEP 9] Explainability → SHAP or feature_importances_
[STEP 10] Save Artifacts → model + pipeline + leaderboard.csv
```
---
## CASE A: CLASSIFICATION TASK
```bash
python main.py --data data.csv --target churn --task classification
```
### Models that run (in this order):
```
1. RandomForest (RandomForestClassifier)
2. GradientBoosting (GradientBoostingClassifier)
3. LogisticRegression (LogisticRegression)
4. XGBoost (XGBClassifier) ← only if xgboost installed
5. LightGBM (LGBMClassifier) ← only if lightgbm installed
6. FeedforwardNN (PyTorch) ← only if torch installed
7. ResidualMLP (PyTorch) ← only if torch installed
```
### Metrics computed per model:
```
accuracy = correct predictions / total
precision = TP / (TP + FP) → binary: direct | multiclass: macro average
recall = TP / (TP + FN) → binary: direct | multiclass: macro average
f1 = 2 * (P * R) / (P + R)
roc_auc = area under ROC curve → binary: normal | multiclass: OvR macro
```
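These formulas map one-to-one onto `sklearn.metrics`. A minimal sketch (the function name and `average` handling are illustrative, not the project's exact code):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def classification_metrics(y_true, y_pred, y_proba=None, average="binary"):
    """Per-model metrics; pass average="macro" for multiclass."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average),
    }
    if y_proba is not None:
        # binary: positive-class probabilities; multiclass would pass the full
        # probability matrix with multi_class="ovr", average="macro"
        metrics["roc_auc"] = roc_auc_score(y_true, y_proba)
    return metrics

m = classification_metrics([0, 1, 1, 0], [0, 1, 0, 0], y_proba=[0.1, 0.9, 0.4, 0.2])
```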
### How best model is selected (classification):
```
primary_score = roc_auc (if available)
= f1 (fallback, when predict_proba not available)
→ Model with HIGHEST primary_score wins
```
Example leaderboard:
```
Rank Model accuracy f1 roc_auc → primary_score
1 LogisticRegression 0.907 0.909 0.972 0.972 ← WINNER
2 GradientBoosting 0.880 0.868 0.958 0.958
3 RandomForest 0.853 0.836 0.945 0.945
4 FeedforwardNN 0.871 0.855 0.941 0.941
```
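The selection rule and ranking above reduce to a few lines (dict keys mirror the leaderboard columns; the row values are illustrative):

```python
def primary_score(metrics):
    # roc_auc when predict_proba was available, otherwise fall back to f1
    return metrics["roc_auc"] if metrics.get("roc_auc") is not None else metrics["f1"]

leaderboard = [
    {"model": "GradientBoosting", "f1": 0.868, "roc_auc": 0.958},
    {"model": "LogisticRegression", "f1": 0.909, "roc_auc": 0.972},
    {"model": "RandomForest", "f1": 0.836, "roc_auc": 0.945},
]
for row in leaderboard:
    row["primary_score"] = primary_score(row)

# highest primary_score wins
leaderboard.sort(key=lambda row: row["primary_score"], reverse=True)
winner = leaderboard[0]["model"]
```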
---
## CASE B: REGRESSION TASK
```bash
python main.py --data data.csv --target price --task regression
```
### Models that run (in this order):
```
1. RandomForest (RandomForestRegressor)
2. GradientBoosting (GradientBoostingRegressor)
3. Ridge (Ridge regression)
4. LinearRegression (LinearRegression)
5. XGBoost (XGBRegressor) ← only if xgboost installed
6. LightGBM (LGBMRegressor) ← only if lightgbm installed
7. FeedforwardNN (PyTorch) ← only if torch installed
8. ResidualMLP (PyTorch) ← only if torch installed
```
### Metrics computed per model:
```
RMSE = sqrt( mean( (y_pred - y_true)² ) ) → lower is better
MAE = mean( |y_pred - y_true| ) → lower is better
R² = 1 - SS_residual / SS_total → higher is better (1.0 = perfect)
```
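All three metrics can be computed directly with NumPy; a sketch, not the project's exact code:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),  # lower is better
        "mae": float(np.mean(np.abs(err))),         # lower is better
        "r2": 1.0 - ss_res / ss_tot,                # higher is better, 1.0 = perfect
    }

m = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```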
### How best model is selected (regression):
```
primary_score = -RMSE (negated so that higher = better, like classification)
→ Model with LOWEST RMSE = HIGHEST primary_score = WINS
```
Example leaderboard:
```
Rank Model rmse mae r2 primary_score
1 Ridge 34.44 26.92 0.954 -34.44 ← WINNER
2 LinearReg 35.36 27.69 0.951 -35.36
3 FeedforwardNN 61.22 48.11 0.857 -61.22
4 RandomForest 100.81 77.93 0.607 -100.81
```
---
## FEATURE ENGINEERING — Full Breakdown
This is what happens step by step when your preprocessed matrix hits the
FeatureEngineer. Example: you have 15 preprocessed features.
```
Input: X shape = (N rows, 15 features)
```
### Step 1 — Original features kept as-is:
```
[f0, f1, f2, ..., f14] → 15 columns
```
### Step 2 — Log Transform (use_log=True):
```
For each column:
X_shifted = X - X.min(axis=0) + 0.000001 ← shift all values to positive
log_feat = log1p(X_shifted) ← log(1 + x)
Result: 15 new log-transformed columns
[log(f0), log(f1), ..., log(f14)] → +15 columns = 30 total
```
### Step 3 — Sqrt Transform (use_sqrt=True):
```
For each column:
X_shifted = X - X.min(axis=0) ← shift to non-negative
sqrt_feat = sqrt(X_shifted)
Result: 15 new sqrt-transformed columns
[√f0, √f1, ..., √f14] → +15 columns = 45 total
```
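Steps 2–3 reduce to two NumPy one-liners. A sketch (the `1e-6` shift matches the constant above; the helper name is illustrative):

```python
import numpy as np

def add_log_sqrt(X):
    """Append log- and sqrt-transformed copies of every column."""
    X = np.asarray(X, dtype=float)
    log_feat = np.log1p(X - X.min(axis=0) + 1e-6)  # shift strictly positive, then log(1 + x)
    sqrt_feat = np.sqrt(X - X.min(axis=0))         # shift non-negative, then sqrt
    return np.hstack([X, log_feat, sqrt_feat])

X = np.array([[-1.0, 4.0],
              [ 3.0, 0.0]])
out = add_log_sqrt(X)  # 2 columns in, 6 columns out
```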
### Step 4 — Polynomial Features (use_polynomial=True, degree=2):
```
Runs on first 20 columns only (cap to avoid memory explosion)
With 15 features, runs on all 15.
PolynomialFeatures(degree=2, include_bias=False) generates:
- Original terms: f0, f1, ..., f14 (15 cols — REMOVED, already have them)
- Squared terms: f0², f1², ..., f14² (15 cols) ← NEW
- Cross terms: f0·f1, f0·f2, ..., f13·f14 (105 cols) ← NEW
Formula: new_cols = N*(N+1)/2 = 15*16/2 = 120
Example with just 3 features A, B, C:
Input: [A, B, C]
After poly:
A² = A*A
A·B = A*B
A·C = A*C
B² = B*B
B·C = B*C
C² = C*C
→ 6 new columns (originals removed since already in base set)
With 15 features → +120 new polynomial columns = 165 total
```
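Step 4 with scikit-learn, slicing off the repeated originals (a sketch with 3 toy features):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # 3 features: A, B, C

poly = PolynomialFeatures(degree=2, include_bias=False)
full = poly.fit_transform(X)              # [A, B, C, A², AB, AC, B², BC, C²]
new_terms = full[:, X.shape[1]:]          # drop originals — already in the base set

n = 15
expected_new = n * (n + 1) // 2           # the N*(N+1)/2 formula above
```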
### Step 5 — SelectKBest (select_k=60):
```
All 165 features are scored against target y using:
- f_regression (for regression tasks) → F-statistic
- f_classif (for classification) → ANOVA F-value
Top 60 features with highest scores are KEPT.
All others are DROPPED.
Output: X shape = (N rows, 60 features)
```
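Step 5 as a standalone sketch on synthetic data — only column 3 carries signal, so it must survive selection:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 3] > 0).astype(int)        # target depends on column 3 only

# f_classif for classification; swap in f_regression for regression tasks
selector = SelectKBest(score_func=f_classif, k=4)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```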
### Summary Table:
```
Stage Columns Cumulative
─────────────────────────────────────────────
Original preprocessed 15 15
+ Log transforms 15 30
+ Sqrt transforms 15 45
+ Polynomial (degree-2) 120 165
─────────────────────────────────────────────
After SelectKBest(k=60) ——→ 60 ← final feature matrix
```
### Why you "didn't see" polynomial features:
The polynomial features ARE generated, but then SelectKBest filters the
full 165-feature matrix down to 60. The output is indexed as:
feature_0, feature_1, ..., feature_59
These are not labeled as "poly" or "log" — they're just the top 60
regardless of type. If a polynomial feature ranked high, it's in there.
If it ranked low, SelectKBest dropped it. The system doesn't show which
type each surviving feature came from.
---
## MODEL SELECTION — Detailed Decision Tree
```
For each model in the candidate pool:
├── Does it have a search space? (len(search_space) > 0)
│ ├── YES → Run Optuna HPO
│ │ │
│ │ ├── Is Optuna installed?
│ │ │ ├── YES → Bayesian optimization (TPE sampler)
│ │ │ │ Each trial: suggest params → fit → score val → report
│ │ │ └── NO → Random search fallback
│ │ │ Each trial: random sample → fit → score val → track best
│ │ │
│ │ └── Returns best_params dict
│ │
│ └── NO → Skip HPO (e.g. LinearRegression has no hyperparams)
│ Use default params directly
├── Final fit on FULL train set with best_params
├── Predict on TEST set (never seen during HPO)
├── Compute metrics (accuracy/f1/auc OR rmse/mae/r2)
└── Compute primary_score → add to Leaderboard
After ALL models finish:
├── Sort leaderboard by primary_score DESCENDING
├── Rank 1 = BEST MODEL
└── best_model = leaderboard[0]["_model"]
```
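The random-search fallback in the tree above can be sketched with the standard library (the search-space encoding here is illustrative, not the project's exact format):

```python
import math
import random

SEARCH_SPACE = {
    "n_estimators": ("int", 50, 500),
    "learning_rate": ("log_float", 0.001, 0.3),
}

def sample_params(space, rng):
    """One random-search trial: draw each hyperparameter from its range."""
    params = {}
    for name, (kind, lo, hi) in space.items():
        if kind == "int":
            params[name] = rng.randint(lo, hi)
        else:  # log_float: uniform in log space, matching the "log scale" notes
            params[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return params

p = sample_params(SEARCH_SPACE, random.Random(42))
```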
---
## HPO SEARCH SPACES — Every Parameter
### RandomForest (classification AND regression):
```
n_estimators int [50, 500] → number of trees
max_depth int [3, 20] → max tree depth
min_samples_split int [2, 20] → min samples to split a node
min_samples_leaf int [1, 10] → min samples in leaf node
```
### GradientBoosting (classification AND regression):
```
n_estimators int [50, 300] → boosting rounds
max_depth int [2, 8] → tree depth (shallower than RF)
learning_rate float [0.001, 0.3] → log scale (shrinkage)
subsample float [0.5, 1.0] → fraction of samples per tree
```
### LogisticRegression (classification only):
```
C float [0.0001, 10.0] → log scale, inverse regularization
```
### Ridge (regression only):
```
alpha float [0.001, 100.0] → log scale, regularization strength
```
### LinearRegression: NO hyperparams → skips HPO entirely
### XGBoost (if installed):
```
n_estimators int [50, 500]
max_depth int [2, 10]
learning_rate float [0.001, 0.3] → log scale
subsample float [0.5, 1.0]
colsample_bytree float [0.5, 1.0] → fraction of features per tree
```
### LightGBM (if installed):
```
n_estimators int [50, 500]
max_depth int [2, 10]
learning_rate float [0.001, 0.3] → log scale
num_leaves int [16, 256] → controls model complexity
subsample float [0.5, 1.0]
```
### FeedforwardNN (PyTorch):
```
lr float [0.0001, 0.01] → log scale (Adam learning rate)
hidden_dim choice [64, 128, 256] → units per hidden layer
n_layers int [2, 4] → number of hidden layers
dropout float [0.1, 0.5] → dropout probability
batch_size choice [32, 64, 128] → mini-batch size
weight_decay float [1e-6, 0.001] → L2 regularization
```
### ResidualMLP (PyTorch):
```
lr float [0.0001, 0.01] → log scale
hidden_dim choice [64, 128, 256] → ALL residual blocks share this width
n_blocks int [2, 5] → number of residual blocks
dropout float [0.1, 0.5]
batch_size choice [32, 64, 128]
weight_decay float [1e-6, 0.001]
```
---
## PYTORCH ARCHITECTURE DETAILS
### FeedforwardNN:
```
Input(n_features)
[Linear(n, hidden_dim) → BatchNorm1d → ReLU → Dropout] ← repeated n_layers times
Linear(hidden_dim, output_dim)
Output: logits (classification) OR scalar (regression)
```
### ResidualMLP:
```
Input(n_features)
Linear(n, hidden_dim) → BatchNorm1d → ReLU ← input projection
┌── ResidualBlock ──────────────────────┐
│ x_in ──────────────────────────────┐ │
│ │ │ │
│ Linear → BN → ReLU → Dropout │ │ ← repeated n_blocks times
│ │ │ │
│ Linear → BN │ │
│ │ │ │
│ ReLU( x_block + x_in ) ←──────────┘ │ ← skip connection
└───────────────────────────────────────┘
Linear(hidden_dim, output_dim)
```
The skip connection (x_in + x_block) is what makes the network "residual":
gradients flow directly backward through the addition, which mitigates
vanishing gradients in deeper networks.
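A minimal NumPy forward pass of one block (BatchNorm and Dropout omitted for brevity; the actual PyTorch module has the same shape):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x_in, W1, b1, W2, b2):
    """ReLU(Linear2(ReLU(Linear1(x))) + x) — the skip connection is the '+ x'."""
    h = relu(x_in @ W1 + b1)       # Linear -> ReLU   (BN/Dropout omitted)
    x_block = h @ W2 + b2          # Linear           (BN omitted)
    return relu(x_block + x_in)    # skip connection, then ReLU

d = 4
x = np.array([[1.0, -2.0, 3.0, -4.0]])
# with zero weights the block reduces to ReLU(x): the identity path survives
out = residual_block(x, np.zeros((d, d)), np.zeros(d), np.zeros((d, d)), np.zeros(d))
```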
---
## TRAINING ENGINE — Each Epoch
```
For each epoch (max 80):
├── TRAIN PHASE
│ For each mini-batch:
│ ├── Forward: output = model(X_batch)
│ ├── Loss:
│ │ classification → CrossEntropyLoss(output, y_batch)
│ │ regression → MSELoss(output.squeeze(), y_batch)
│ ├── Backward: loss.backward()
│ ├── Gradient clip: clip_grad_norm_(max=1.0) ← safety guard
│ └── Step: Adam optimizer updates weights
├── VALIDATION PHASE
│ ├── model.eval() + torch.no_grad()
│ ├── Compute val_loss on full val set
│ └── model.train()
├── LR SCHEDULER
│ └── ReduceLROnPlateau: if val_loss doesn't improve for 5 epochs
│ → halve the learning rate
└── EARLY STOPPING
└── If val_loss doesn't improve for 10 epochs:
├── Restore best weights seen so far
└── Stop training
```
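The early-stopping rule can be isolated as a pure function (a sketch; the trainer also restores the best weights, which is omitted here):

```python
def should_stop(val_losses, patience=10):
    """True once val_loss hasn't improved for `patience` consecutive epochs."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (len(val_losses) - 1 - best_epoch) >= patience

history = [0.50, 0.42, 0.40] + [0.41] * 10   # no improvement after epoch 2
```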
---
## PREPROCESSING DECISION TREE
```
For each column:
├── Is it numeric dtype? (int, float)
│ └── YES → Impute(median) → StandardScaler
├── Is it object/categorical dtype?
│ ├── nunique ≤ 15 → Impute(most_frequent) → OneHotEncoder
│ │ Creates binary 0/1 columns for each category
│ │
│ └── nunique > 15 → Impute(most_frequent) → OrdinalEncoder (if no category_encoders)
│ OR TargetEncoder (if category_encoders installed)
│ Maps categories to integers or target-mean values
└── Is it text? (avg_string_len > 30 AND nunique > 50)
└── TfidfVectorizer(max_features=50, ngram_range=(1,2))
Creates 50 TF-IDF score columns from text
```
---
## WHAT HAPPENS WITH TIME BUDGET
```
--time_budget=300 (5 minutes)
Time allocation per model (approximate):
Each sklearn model: ~20% of remaining budget, capped at 60s for HPO
Each pytorch model: ~30% of remaining budget, capped at 90s for HPO
If budget expires mid-loop:
├── Current model finishes its current HPO trial
├── No new models are started
└── Whatever finished goes to leaderboard → best is selected
No time budget (default):
└── Every model runs to completion with all 15 HPO trials
```
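The allocation rule above reduces to one function (a sketch of the approximate policy, not the project's exact code):

```python
def hpo_budget(remaining_seconds, is_pytorch):
    """Per-model HPO time: ~20% of remaining budget (cap 60s) for sklearn
    models, ~30% (cap 90s) for PyTorch models."""
    frac, cap = (0.30, 90.0) if is_pytorch else (0.20, 60.0)
    return min(frac * remaining_seconds, cap)
```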
---
## EXPLAINABILITY — How It Decides
```
After best model is selected:
Is best model a tree-based sklearn model?
(RandomForest, GradientBoosting, XGBoost, LightGBM)
└── YES → use model.feature_importances_ directly (fast, no SHAP needed)
OR shap.TreeExplainer if SHAP installed (exact, tree-native)
Is best model a linear sklearn model?
(LogisticRegression, Ridge, LinearRegression)
└── YES → use abs(model.coef_) directly
OR shap.LinearExplainer if SHAP installed
Is best model a PyTorch model?
└── YES → requires SHAP installed
shap.KernelExplainer with 50 background samples (approximate)
→ if SHAP not installed: no importance available for neural nets
Output: dict of {feature_name: importance_score}
sorted descending → top 15 printed as bar chart
```
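The non-SHAP branches reduce to attribute checks (a sketch; the dummy class below is purely illustrative):

```python
import numpy as np

def importances(model, feature_names):
    """Tree models expose feature_importances_; linear models expose coef_."""
    if hasattr(model, "feature_importances_"):
        scores = np.asarray(model.feature_importances_)
    elif hasattr(model, "coef_"):
        coef = np.atleast_2d(model.coef_)
        scores = np.abs(coef).mean(axis=0)   # average over classes if multiclass
    else:
        return {}                            # neural net without SHAP: nothing available
    ranked = sorted(zip(feature_names, scores), key=lambda kv: -kv[1])
    return dict(ranked)

class DummyLinear:                           # stand-in for a fitted linear model
    coef_ = np.array([0.2, -1.5, 0.7])

imp = importances(DummyLinear(), ["a", "b", "c"])
```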
---
## OUTPUT FILES
```
./automl_output/
├── best_model.joblib ← sklearn model (if sklearn won)
├── best_model.pt ← PyTorch model weights (if neural net won)
├── preprocessing.joblib ← fitted ColumnTransformer (imputers + scalers + encoders)
├── feature_engineering.joblib ← fitted FeatureEngineer (poly + log + sqrt + selector)
└── leaderboard.csv ← all models with all metrics, ranked
```
To reload and use the best model later:
```python
import joblib
import torch
# For sklearn winner:
model = joblib.load("automl_output/best_model.joblib")
prep = joblib.load("automl_output/preprocessing.joblib")
feat = joblib.load("automl_output/feature_engineering.joblib")
X_new = prep.transform(new_df)
X_new = feat.transform(X_new)
preds = model.predict(X_new)
# For PyTorch winner — the model class must be importable, and loading a
# fully pickled model needs weights_only=False (the default flipped to
# True in PyTorch 2.6):
model = torch.load("automl_output/best_model.pt", weights_only=False)
model.eval()
```