Tabular_AutoML / SYSTEM_STRUCTURE.md
# AutoML System — Complete Working Structure
---
## OVERVIEW: How the System Flows
```
CSV File
[STEP 1] CLI Parsing → reads --data --target --task --time_budget etc.
[STEP 2] Dataset Analyzer → inspects rows, cols, types, missing, imbalance
[STEP 3] Train/Val/Test Split → 70% / 15% / 15%
[STEP 4] Preprocessing → impute → scale / encode → tensor
[STEP 5] Feature Engineering → original + log + sqrt + polynomial → SelectKBest
[STEP 6] Model Training Loop
├── sklearn models → Optuna HPO → final fit → evaluate
└── pytorch models → Optuna HPO → full train loop → evaluate
[STEP 7] Leaderboard → rank all models by primary score
[STEP 8] Best Model Selected → highest primary score wins
[STEP 9] Explainability → SHAP or feature_importances_
[STEP 10] Save Artifacts → model + pipeline + leaderboard.csv
```
---
## CASE A: CLASSIFICATION TASK
```bash
python main.py --data data.csv --target churn --task classification
```
### Models that run (in this order):
```
1. RandomForest (RandomForestClassifier)
2. GradientBoosting (GradientBoostingClassifier)
3. LogisticRegression (LogisticRegression)
4. XGBoost (XGBClassifier) ← only if xgboost installed
5. LightGBM (LGBMClassifier) ← only if lightgbm installed
6. FeedforwardNN (PyTorch) ← only if torch installed
7. ResidualMLP (PyTorch) ← only if torch installed
```
### Metrics computed per model:
```
accuracy = correct predictions / total
precision = TP / (TP + FP) → binary: direct | multiclass: macro average
recall = TP / (TP + FN) → binary: direct | multiclass: macro average
f1 = 2 * (P * R) / (P + R)
roc_auc = area under ROC curve → binary: normal | multiclass: OvR macro
```
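These formulas map one-to-one onto `sklearn.metrics`. A minimal sketch (the function name and `average` handling are illustrative, not the project's exact code):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def classification_metrics(y_true, y_pred, y_proba=None, average="binary"):
    """Per-model metrics; pass average="macro" for multiclass."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average),
    }
    if y_proba is not None:
        # binary: positive-class probabilities; multiclass would pass the full
        # probability matrix with multi_class="ovr", average="macro"
        metrics["roc_auc"] = roc_auc_score(y_true, y_proba)
    return metrics

m = classification_metrics([0, 1, 1, 0], [0, 1, 0, 0], y_proba=[0.1, 0.9, 0.4, 0.2])
```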
### How best model is selected (classification):
```
primary_score = roc_auc (if available)
= f1 (fallback, when predict_proba not available)
→ Model with HIGHEST primary_score wins
```
Example leaderboard:
```
Rank Model accuracy f1 roc_auc → primary_score
1 LogisticRegression 0.907 0.909 0.972 0.972 ← WINNER
2 GradientBoosting 0.880 0.868 0.958 0.958
3 RandomForest 0.853 0.836 0.945 0.945
4 FeedforwardNN 0.871 0.855 0.941 0.941
```
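The selection rule and ranking above reduce to a few lines (dict keys mirror the leaderboard columns; the row values are illustrative):

```python
def primary_score(metrics):
    # roc_auc when predict_proba was available, otherwise fall back to f1
    return metrics["roc_auc"] if metrics.get("roc_auc") is not None else metrics["f1"]

leaderboard = [
    {"model": "GradientBoosting", "f1": 0.868, "roc_auc": 0.958},
    {"model": "LogisticRegression", "f1": 0.909, "roc_auc": 0.972},
    {"model": "RandomForest", "f1": 0.836, "roc_auc": 0.945},
]
for row in leaderboard:
    row["primary_score"] = primary_score(row)

# highest primary_score wins
leaderboard.sort(key=lambda row: row["primary_score"], reverse=True)
winner = leaderboard[0]["model"]
```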
---
## CASE B: REGRESSION TASK
```bash
python main.py --data data.csv --target price --task regression
```
### Models that run (in this order):
```
1. RandomForest (RandomForestRegressor)
2. GradientBoosting (GradientBoostingRegressor)
3. Ridge (Ridge regression)
4. LinearRegression (LinearRegression)
5. XGBoost (XGBRegressor) ← only if xgboost installed
6. LightGBM (LGBMRegressor) ← only if lightgbm installed
7. FeedforwardNN (PyTorch) ← only if torch installed
8. ResidualMLP (PyTorch) ← only if torch installed
```
### Metrics computed per model:
```
RMSE = sqrt( mean( (y_pred - y_true)² ) ) → lower is better
MAE = mean( |y_pred - y_true| ) → lower is better
R² = 1 - SS_residual / SS_total → higher is better (1.0 = perfect)
```
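All three metrics can be computed directly with NumPy; a sketch, not the project's exact code:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),  # lower is better
        "mae": float(np.mean(np.abs(err))),         # lower is better
        "r2": 1.0 - ss_res / ss_tot,                # higher is better, 1.0 = perfect
    }

m = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```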
### How best model is selected (regression):
```
primary_score = -RMSE (negated so that higher = better, like classification)
→ Model with LOWEST RMSE = HIGHEST primary_score = WINS
```
Example leaderboard:
```
Rank Model rmse mae r2 primary_score
1 Ridge 34.44 26.92 0.954 -34.44 ← WINNER
2 LinearReg 35.36 27.69 0.951 -35.36
3 FeedforwardNN 61.22 48.11 0.857 -61.22
4 RandomForest 100.81 77.93 0.607 -100.81
```
---
## FEATURE ENGINEERING — Full Breakdown
This is what happens step by step when your preprocessed matrix hits the
FeatureEngineer. Example: you have 15 preprocessed features.
```
Input: X shape = (N rows, 15 features)
```
### Step 1 — Original features kept as-is:
```
[f0, f1, f2, ..., f14] → 15 columns
```
### Step 2 — Log Transform (use_log=True):
```
For each column:
X_shifted = X - X.min(axis=0) + 0.000001 ← shift all values to positive
log_feat = log1p(X_shifted) ← log(1 + x)
Result: 15 new log-transformed columns
[log(f0), log(f1), ..., log(f14)] → +15 columns = 30 total
```
### Step 3 — Sqrt Transform (use_sqrt=True):
```
For each column:
X_shifted = X - X.min(axis=0) ← shift to non-negative
sqrt_feat = sqrt(X_shifted)
Result: 15 new sqrt-transformed columns
[√f0, √f1, ..., √f14] → +15 columns = 45 total
```
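Steps 2–3 reduce to two NumPy one-liners. A sketch (the `1e-6` shift matches the constant above; the helper name is illustrative):

```python
import numpy as np

def add_log_sqrt(X):
    """Append log- and sqrt-transformed copies of every column."""
    X = np.asarray(X, dtype=float)
    log_feat = np.log1p(X - X.min(axis=0) + 1e-6)  # shift strictly positive, then log(1 + x)
    sqrt_feat = np.sqrt(X - X.min(axis=0))         # shift non-negative, then sqrt
    return np.hstack([X, log_feat, sqrt_feat])

X = np.array([[-1.0, 4.0],
              [ 3.0, 0.0]])
out = add_log_sqrt(X)  # 2 columns in, 6 columns out
```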
### Step 4 — Polynomial Features (use_polynomial=True, degree=2):
```
Runs on first 20 columns only (cap to avoid memory explosion)
With 15 features, runs on all 15.
PolynomialFeatures(degree=2, include_bias=False) generates:
- Original terms: f0, f1, ..., f14 (15 cols — REMOVED, already have them)
- Squared terms: f0², f1², ..., f14² (15 cols) ← NEW
- Cross terms: f0·f1, f0·f2, ..., f13·f14 (105 cols) ← NEW
Formula: new_cols = N*(N+1)/2 = 15*16/2 = 120
Example with just 3 features A, B, C:
Input: [A, B, C]
After poly:
A² = A*A
A·B = A*B
A·C = A*C
B² = B*B
B·C = B*C
C² = C*C
→ 6 new columns (originals removed since already in base set)
With 15 features → +120 new polynomial columns = 165 total
```
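Step 4 with scikit-learn, slicing off the repeated originals (a sketch with 3 toy features):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # 3 features: A, B, C

poly = PolynomialFeatures(degree=2, include_bias=False)
full = poly.fit_transform(X)              # [A, B, C, A², AB, AC, B², BC, C²]
new_terms = full[:, X.shape[1]:]          # drop originals — already in the base set

n = 15
expected_new = n * (n + 1) // 2           # the N*(N+1)/2 formula above
```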
### Step 5 — SelectKBest (select_k=60):
```
All 165 features are scored against target y using:
- f_regression (for regression tasks) → F-statistic
- f_classif (for classification) → ANOVA F-value
Top 60 features with highest scores are KEPT.
All others are DROPPED.
Output: X shape = (N rows, 60 features)
```
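Step 5 as a standalone sketch on synthetic data — only column 3 carries signal, so it must survive selection:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 3] > 0).astype(int)        # target depends on column 3 only

# f_classif for classification; swap in f_regression for regression tasks
selector = SelectKBest(score_func=f_classif, k=4)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```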
### Summary Table:
```
Stage Columns Cumulative
─────────────────────────────────────────────
Original preprocessed 15 15
+ Log transforms 15 30
+ Sqrt transforms 15 45
+ Polynomial (degree-2) 120 165
─────────────────────────────────────────────
After SelectKBest(k=60) ——→ 60 ← final feature matrix
```
### Why you "didn't see" polynomial features:
The polynomial features ARE generated, but then SelectKBest filters the
full 165-feature matrix down to 60. The output is indexed as:
feature_0, feature_1, ..., feature_59
These are not labeled as "poly" or "log" — they're just the top 60
regardless of type. If a polynomial feature ranked high, it's in there.
If it ranked low, SelectKBest dropped it. The system doesn't show which
type each surviving feature came from.
---
## MODEL SELECTION — Detailed Decision Tree
```
For each model in the candidate pool:
├── Does it have a search space? (len(search_space) > 0)
│ ├── YES → Run Optuna HPO
│ │ │
│ │ ├── Is Optuna installed?
│ │ │ ├── YES → Bayesian optimization (TPE sampler)
│ │ │ │ Each trial: suggest params → fit → score val → report
│ │ │ └── NO → Random search fallback
│ │ │ Each trial: random sample → fit → score val → track best
│ │ │
│ │ └── Returns best_params dict
│ │
│ └── NO → Skip HPO (e.g. LinearRegression has no hyperparams)
│ Use default params directly
├── Final fit on FULL train set with best_params
├── Predict on TEST set (never seen during HPO)
├── Compute metrics (accuracy/f1/auc OR rmse/mae/r2)
└── Compute primary_score → add to Leaderboard
After ALL models finish:
├── Sort leaderboard by primary_score DESCENDING
├── Rank 1 = BEST MODEL
└── best_model = leaderboard[0]["_model"]
```
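The random-search fallback in the tree above can be sketched with the standard library (the search-space encoding here is illustrative, not the project's exact format):

```python
import math
import random

SEARCH_SPACE = {
    "n_estimators": ("int", 50, 500),
    "learning_rate": ("log_float", 0.001, 0.3),
}

def sample_params(space, rng):
    """One random-search trial: draw each hyperparameter from its range."""
    params = {}
    for name, (kind, lo, hi) in space.items():
        if kind == "int":
            params[name] = rng.randint(lo, hi)
        else:  # log_float: uniform in log space, matching the "log scale" notes
            params[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return params

p = sample_params(SEARCH_SPACE, random.Random(42))
```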
---
## HPO SEARCH SPACES — Every Parameter
### RandomForest (classification AND regression):
```
n_estimators int [50, 500] → number of trees
max_depth int [3, 20] → max tree depth
min_samples_split int [2, 20] → min samples to split a node
min_samples_leaf int [1, 10] → min samples in leaf node
```
### GradientBoosting (classification AND regression):
```
n_estimators int [50, 300] → boosting rounds
max_depth int [2, 8] → tree depth (shallower than RF)
learning_rate float [0.001, 0.3] → log scale (shrinkage)
subsample float [0.5, 1.0] → fraction of samples per tree
```
### LogisticRegression (classification only):
```
C float [0.0001, 10.0] → log scale, inverse regularization
```
### Ridge (regression only):
```
alpha float [0.001, 100.0] → log scale, regularization strength
```
### LinearRegression: NO hyperparams → skips HPO entirely
### XGBoost (if installed):
```
n_estimators int [50, 500]
max_depth int [2, 10]
learning_rate float [0.001, 0.3] → log scale
subsample float [0.5, 1.0]
colsample_bytree float [0.5, 1.0] → fraction of features per tree
```
### LightGBM (if installed):
```
n_estimators int [50, 500]
max_depth int [2, 10]
learning_rate float [0.001, 0.3] → log scale
num_leaves int [16, 256] → controls model complexity
subsample float [0.5, 1.0]
```
### FeedforwardNN (PyTorch):
```
lr float [0.0001, 0.01] → log scale (Adam learning rate)
hidden_dim choice [64, 128, 256] → units per hidden layer
n_layers int [2, 4] → number of hidden layers
dropout float [0.1, 0.5] → dropout probability
batch_size choice [32, 64, 128] → mini-batch size
weight_decay float [1e-6, 0.001] → L2 regularization
```
### ResidualMLP (PyTorch):
```
lr float [0.0001, 0.01] → log scale
hidden_dim choice [64, 128, 256] → ALL residual blocks share this width
n_blocks int [2, 5] → number of residual blocks
dropout float [0.1, 0.5]
batch_size choice [32, 64, 128]
weight_decay float [1e-6, 0.001]
```
---
## PYTORCH ARCHITECTURE DETAILS
### FeedforwardNN:
```
Input(n_features)
[Linear(n, hidden_dim) → BatchNorm1d → ReLU → Dropout] ← repeated n_layers times
Linear(hidden_dim, output_dim)
Output: logits (classification) OR scalar (regression)
```
### ResidualMLP:
```
Input(n_features)
Linear(n, hidden_dim) → BatchNorm1d → ReLU ← input projection
┌── ResidualBlock ──────────────────────┐
│ x_in ──────────────────────────────┐ │
│ │ │ │
│ Linear → BN → ReLU → Dropout │ │ ← repeated n_blocks times
│ │ │ │
│ Linear → BN │ │
│ │ │ │
│ ReLU( x_block + x_in ) ←──────────┘ │ ← skip connection
└───────────────────────────────────────┘
Linear(hidden_dim, output_dim)
```
The skip connection (x_in + x_block) is what makes the network "residual":
gradients flow directly backward through the addition, which mitigates
vanishing gradients in deeper networks.
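A minimal NumPy forward pass of one block (BatchNorm and Dropout omitted for brevity; the actual PyTorch module has the same shape):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x_in, W1, b1, W2, b2):
    """ReLU(Linear2(ReLU(Linear1(x))) + x) — the skip connection is the '+ x'."""
    h = relu(x_in @ W1 + b1)       # Linear -> ReLU   (BN/Dropout omitted)
    x_block = h @ W2 + b2          # Linear           (BN omitted)
    return relu(x_block + x_in)    # skip connection, then ReLU

d = 4
x = np.array([[1.0, -2.0, 3.0, -4.0]])
# with zero weights the block reduces to ReLU(x): the identity path survives
out = residual_block(x, np.zeros((d, d)), np.zeros(d), np.zeros((d, d)), np.zeros(d))
```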
---
## TRAINING ENGINE — Each Epoch
```
For each epoch (max 80):
├── TRAIN PHASE
│ For each mini-batch:
│ ├── Forward: output = model(X_batch)
│ ├── Loss:
│ │ classification → CrossEntropyLoss(output, y_batch)
│ │ regression → MSELoss(output.squeeze(), y_batch)
│ ├── Backward: loss.backward()
│ ├── Gradient clip: clip_grad_norm_(max=1.0) ← safety guard
│ └── Step: Adam optimizer updates weights
├── VALIDATION PHASE
│ ├── model.eval() + torch.no_grad()
│ ├── Compute val_loss on full val set
│ └── model.train()
├── LR SCHEDULER
│ └── ReduceLROnPlateau: if val_loss doesn't improve for 5 epochs
│ → halve the learning rate
└── EARLY STOPPING
└── If val_loss doesn't improve for 10 epochs:
├── Restore best weights seen so far
└── Stop training
```
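The early-stopping rule can be isolated as a pure function (a sketch; the trainer also restores the best weights, which is omitted here):

```python
def should_stop(val_losses, patience=10):
    """True once val_loss hasn't improved for `patience` consecutive epochs."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (len(val_losses) - 1 - best_epoch) >= patience

history = [0.50, 0.42, 0.40] + [0.41] * 10   # no improvement after epoch 2
```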
---
## PREPROCESSING DECISION TREE
```
For each column:
├── Is it numeric dtype? (int, float)
│ └── YES → Impute(median) → StandardScaler
├── Is it object/categorical dtype?
│ ├── nunique ≤ 15 → Impute(most_frequent) → OneHotEncoder
│ │ Creates binary 0/1 columns for each category
│ │
│ └── nunique > 15 → Impute(most_frequent) → OrdinalEncoder (if no category_encoders)
│ OR TargetEncoder (if category_encoders installed)
│ Maps categories to integers or target-mean values
└── Is it text? (avg_string_len > 30 AND nunique > 50)
└── TfidfVectorizer(max_features=50, ngram_range=(1,2))
Creates 50 TF-IDF score columns from text
```
---
## WHAT HAPPENS WITH TIME BUDGET
```
--time_budget=300 (5 minutes)
Time allocation per model (approximate):
Each sklearn model: ~20% of remaining budget, capped at 60s for HPO
Each pytorch model: ~30% of remaining budget, capped at 90s for HPO
If budget expires mid-loop:
├── Current model finishes its current HPO trial
├── No new models are started
└── Whatever finished goes to leaderboard → best is selected
No time budget (default):
└── Every model runs to completion with all 15 HPO trials
```
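The allocation rule above reduces to one function (a sketch of the approximate policy, not the project's exact code):

```python
def hpo_budget(remaining_seconds, is_pytorch):
    """Per-model HPO time: ~20% of remaining budget (cap 60s) for sklearn
    models, ~30% (cap 90s) for PyTorch models."""
    frac, cap = (0.30, 90.0) if is_pytorch else (0.20, 60.0)
    return min(frac * remaining_seconds, cap)
```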
---
## EXPLAINABILITY — How It Decides
```
After best model is selected:
Is best model a tree-based sklearn model?
(RandomForest, GradientBoosting, XGBoost, LightGBM)
└── YES → use model.feature_importances_ directly (fast, no SHAP needed)
OR shap.TreeExplainer if SHAP installed (exact, tree-native)
Is best model a linear sklearn model?
(LogisticRegression, Ridge, LinearRegression)
└── YES → use abs(model.coef_) directly
OR shap.LinearExplainer if SHAP installed
Is best model a PyTorch model?
└── YES → requires SHAP installed
shap.KernelExplainer with 50 background samples (approximate)
→ if SHAP not installed: no importance available for neural nets
Output: dict of {feature_name: importance_score}
sorted descending → top 15 printed as bar chart
```
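The non-SHAP branches reduce to attribute checks (a sketch; the dummy class below is purely illustrative):

```python
import numpy as np

def importances(model, feature_names):
    """Tree models expose feature_importances_; linear models expose coef_."""
    if hasattr(model, "feature_importances_"):
        scores = np.asarray(model.feature_importances_)
    elif hasattr(model, "coef_"):
        coef = np.atleast_2d(model.coef_)
        scores = np.abs(coef).mean(axis=0)   # average over classes if multiclass
    else:
        return {}                            # neural net without SHAP: nothing available
    ranked = sorted(zip(feature_names, scores), key=lambda kv: -kv[1])
    return dict(ranked)

class DummyLinear:                           # stand-in for a fitted linear model
    coef_ = np.array([0.2, -1.5, 0.7])

imp = importances(DummyLinear(), ["a", "b", "c"])
```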
---
## OUTPUT FILES
```
./automl_output/
├── best_model.joblib ← sklearn model (if sklearn won)
├── best_model.pt ← PyTorch model weights (if neural net won)
├── preprocessing.joblib ← fitted ColumnTransformer (imputers + scalers + encoders)
├── feature_engineering.joblib ← fitted FeatureEngineer (poly + log + sqrt + selector)
└── leaderboard.csv ← all models with all metrics, ranked
```
To reload and use the best model later:
```python
import joblib
import torch
# For sklearn winner:
model = joblib.load("automl_output/best_model.joblib")
prep = joblib.load("automl_output/preprocessing.joblib")
feat = joblib.load("automl_output/feature_engineering.joblib")
X_new = prep.transform(new_df)
X_new = feat.transform(X_new)
preds = model.predict(X_new)
# For PyTorch winner — the model class must be importable, and loading a
# fully pickled model needs weights_only=False (the default flipped to
# True in PyTorch 2.6):
model = torch.load("automl_output/best_model.pt", weights_only=False)
model.eval()
```