Tabular_AutoML / SYSTEM_STRUCTURE.md
he99codes's picture
deploy automl
211c37c

AutoML System — Complete Working Structure


OVERVIEW: How the System Flows

CSV File
   │
   ▼
[STEP 1] CLI Parsing         → reads --data --target --task --time_budget etc.
   │
   ▼
[STEP 2] Dataset Analyzer    → inspects rows, cols, types, missing, imbalance
   │
   ▼
[STEP 3] Train/Val/Test Split → 70% / 15% / 15%
   │
   ▼
[STEP 4] Preprocessing       → impute → scale / encode → tensor
   │
   ▼
[STEP 5] Feature Engineering → original + log + sqrt + polynomial → SelectKBest
   │
   ▼
[STEP 6] Model Training Loop
   ├── sklearn models  → Optuna HPO → final fit → evaluate
   └── pytorch models  → Optuna HPO → full train loop → evaluate
   │
   ▼
[STEP 7] Leaderboard         → rank all models by primary score
   │
   ▼
[STEP 8] Best Model Selected → highest primary score wins
   │
   ▼
[STEP 9] Explainability      → SHAP or feature_importances_
   │
   ▼
[STEP 10] Save Artifacts     → model + pipeline + leaderboard.csv

CASE A: CLASSIFICATION TASK

python main.py --data data.csv --target churn --task classification

Models that run (in this order):

1. RandomForest        (RandomForestClassifier)
2. GradientBoosting    (GradientBoostingClassifier)
3. LogisticRegression  (LogisticRegression)
4. XGBoost             (XGBClassifier)          ← only if xgboost installed
5. LightGBM            (LGBMClassifier)          ← only if lightgbm installed
6. FeedforwardNN       (PyTorch)                 ← only if torch installed
7. ResidualMLP         (PyTorch)                 ← only if torch installed

Metrics computed per model:

accuracy   = correct predictions / total
precision  = TP / (TP + FP)       → binary: direct | multiclass: macro average
recall     = TP / (TP + FN)       → binary: direct | multiclass: macro average
f1         = 2 * (P * R) / (P + R)
roc_auc    = area under ROC curve  → binary: normal | multiclass: OvR macro

How best model is selected (classification):

primary_score = roc_auc   (if available)
              = f1         (fallback, when predict_proba not available)

→ Model with HIGHEST primary_score wins

Example leaderboard:

Rank  Model               accuracy  f1      roc_auc   → primary_score
 1    LogisticRegression  0.907     0.909   0.972      0.972   ← WINNER
 2    GradientBoosting    0.880     0.868   0.958      0.958
 3    FeedforwardNN       0.871     0.855   0.941      0.941
 4    RandomForest        0.853     0.836   0.945      0.945

CASE B: REGRESSION TASK

python main.py --data data.csv --target price --task regression

Models that run (in this order):

1. RandomForest        (RandomForestRegressor)
2. GradientBoosting    (GradientBoostingRegressor)
3. Ridge               (Ridge regression)
4. LinearRegression    (LinearRegression)
5. XGBoost             (XGBRegressor)            ← only if xgboost installed
6. LightGBM            (LGBMRegressor)            ← only if lightgbm installed
7. FeedforwardNN       (PyTorch)                  ← only if torch installed
8. ResidualMLP         (PyTorch)                  ← only if torch installed

Metrics computed per model:

RMSE = sqrt( mean( (y_pred - y_true)² ) )   → lower is better
MAE  = mean( |y_pred - y_true| )             → lower is better
R²   = 1 - SS_residual / SS_total            → higher is better (1.0 = perfect)

How best model is selected (regression):

primary_score = -RMSE    (negated so that higher = better, like classification)

→ Model with LOWEST RMSE = HIGHEST primary_score = WINS

Example leaderboard:

Rank  Model            rmse      mae      r2      primary_score
 1    Ridge            34.44    26.92    0.954    -34.44    ← WINNER
 2    LinearReg        35.36    27.69    0.951    -35.36
 3    FeedforwardNN    61.22    48.11    0.857    -61.22
 4    RandomForest    100.81    77.93    0.607   -100.81

FEATURE ENGINEERING — Full Breakdown

This is what happens step by step when your preprocessed matrix hits the FeatureEngineer. Example: you have 15 preprocessed features.

Input:  X shape = (N rows, 15 features)

Step 1 — Original features kept as-is:

[f0, f1, f2, ..., f14]     → 15 columns

Step 2 — Log Transform (use_log=True):

For each column:
  X_shifted = X - X.min(axis=0) + 0.000001   ← shift all values to positive
  log_feat  = log1p(X_shifted)                ← log(1 + x)

Result: 15 new log-transformed columns
  [log(f0), log(f1), ..., log(f14)]           → +15 columns = 30 total

Step 3 — Sqrt Transform (use_sqrt=True):

For each column:
  X_shifted = X - X.min(axis=0)     ← shift to non-negative
  sqrt_feat = sqrt(X_shifted)

Result: 15 new sqrt-transformed columns
  [√f0, √f1, ..., √f14]             → +15 columns = 45 total

Step 4 — Polynomial Features (use_polynomial=True, degree=2):

Runs on first 20 columns only (cap to avoid memory explosion)
With 15 features, runs on all 15.

PolynomialFeatures(degree=2, include_bias=False) generates:
  - Original terms:    f0, f1, ..., f14         (15 cols — REMOVED, already have them)
  - Squared terms:     f0², f1², ..., f14²       (15 cols)  ← NEW
  - Cross terms:       f0·f1, f0·f2, ..., f13·f14 (105 cols) ← NEW

Formula: new_cols = N*(N+1)/2 = 15*16/2 = 120

Example with just 3 features A, B, C:
  Input:  [A, B, C]
  After poly:
    A²   = A*A
    A·B  = A*B
    A·C  = A*C
    B²   = B*B
    B·C  = B*C
    C²   = C*C
  → 6 new columns (originals removed since already in base set)

With 15 features → +120 new polynomial columns = 165 total

Step 5 — SelectKBest (select_k=60):

All 165 features are scored against target y using:
  - f_regression  (for regression tasks)   → F-statistic
  - f_classif     (for classification)     → ANOVA F-value

Top 60 features with highest scores are KEPT.
All others are DROPPED.

Output: X shape = (N rows, 60 features)

Summary Table:

Stage                   Columns   Cumulative
─────────────────────────────────────────────
Original preprocessed      15         15
+ Log transforms           15         30
+ Sqrt transforms          15         45
+ Polynomial (degree-2)   120        165
─────────────────────────────────────────────
After SelectKBest(k=60)   ——→         60   ← final feature matrix

Why you "didn't see" polynomial features:

The polynomial features ARE generated, but then SelectKBest filters the full 165-feature matrix down to 60. The output is indexed as: feature_0, feature_1, ..., feature_59

These are not labeled as "poly" or "log" — they're just the top 60 regardless of type. If a polynomial feature ranked high, it's in there. If it ranked low, SelectKBest dropped it. The system doesn't show which type each surviving feature came from.


MODEL SELECTION — Detailed Decision Tree

For each model in the candidate pool:
│
├── Does it have a search space? (len(search_space) > 0)
│   ├── YES → Run Optuna HPO
│   │         │
│   │         ├── Is Optuna installed?
│   │         │   ├── YES → Bayesian optimization (TPE sampler)
│   │         │   │         Each trial: suggest params → fit → score val → report
│   │         │   └── NO  → Random search fallback
│   │         │             Each trial: random sample → fit → score val → track best
│   │         │
│   │         └── Returns best_params dict
│   │
│   └── NO  → Skip HPO (e.g. LinearRegression has no hyperparams)
│             Use default params directly
│
├── Final fit on FULL train set with best_params
│
├── Predict on TEST set (never seen during HPO)
│
├── Compute metrics (accuracy/f1/auc OR rmse/mae/r2)
│
└── Compute primary_score → add to Leaderboard


After ALL models finish:
│
├── Sort leaderboard by primary_score DESCENDING
├── Rank 1 = BEST MODEL
└── best_model = leaderboard[0]["_model"]

HPO SEARCH SPACES — Every Parameter

RandomForest (classification AND regression):

n_estimators     int    [50, 500]      → number of trees
max_depth        int    [3, 20]        → max tree depth
min_samples_split int   [2, 20]        → min samples to split a node
min_samples_leaf  int   [1, 10]        → min samples in leaf node

GradientBoosting (classification AND regression):

n_estimators     int    [50, 300]      → boosting rounds
max_depth        int    [2, 8]         → tree depth (shallower than RF)
learning_rate    float  [0.001, 0.3]   → log scale (shrinkage)
subsample        float  [0.5, 1.0]     → fraction of samples per tree

LogisticRegression (classification only):

C               float  [0.0001, 10.0]  → log scale, inverse regularization

Ridge (regression only):

alpha           float  [0.001, 100.0]  → log scale, regularization strength

LinearRegression: NO hyperparams → skips HPO entirely

XGBoost (if installed):

n_estimators     int    [50, 500]
max_depth        int    [2, 10]
learning_rate    float  [0.001, 0.3]   → log scale
subsample        float  [0.5, 1.0]
colsample_bytree float  [0.5, 1.0]    → fraction of features per tree

LightGBM (if installed):

n_estimators     int    [50, 500]
max_depth        int    [2, 10]
learning_rate    float  [0.001, 0.3]   → log scale
num_leaves       int    [16, 256]      → controls model complexity
subsample        float  [0.5, 1.0]

FeedforwardNN (PyTorch):

lr              float  [0.0001, 0.01]  → log scale (Adam learning rate)
hidden_dim      choice [64, 128, 256]  → units per hidden layer
n_layers        int    [2, 4]          → number of hidden layers
dropout         float  [0.1, 0.5]      → dropout probability
batch_size      choice [32, 64, 128]   → mini-batch size
weight_decay    float  [1e-6, 0.001]   → L2 regularization

ResidualMLP (PyTorch):

lr              float  [0.0001, 0.01]  → log scale
hidden_dim      choice [64, 128, 256]  → ALL residual blocks share this width
n_blocks        int    [2, 5]          → number of residual blocks
dropout         float  [0.1, 0.5]
batch_size      choice [32, 64, 128]
weight_decay    float  [1e-6, 0.001]

PYTORCH ARCHITECTURE DETAILS

FeedforwardNN:

Input(n_features)
    │
    ▼
[Linear(n, hidden_dim) → BatchNorm1d → ReLU → Dropout]  ← repeated n_layers times
    │
    ▼
Linear(hidden_dim, output_dim)
    │
    ▼
Output: logits (classification) OR scalar (regression)

ResidualMLP:

Input(n_features)
    │
    ▼
Linear(n, hidden_dim) → BatchNorm1d → ReLU      ← input projection
    │
    ▼
┌── ResidualBlock ──────────────────────┐
│  x_in ──────────────────────────────┐ │
│      │                              │ │
│  Linear → BN → ReLU → Dropout       │ │    ← repeated n_blocks times
│      │                              │ │
│  Linear → BN                        │ │
│      │                              │ │
│  ReLU( x_block + x_in ) ←──────────┘ │   ← skip connection
└───────────────────────────────────────┘
    │
    ▼
Linear(hidden_dim, output_dim)

The skip connection (x_in + x_block) is what makes it "residual" — gradients flow directly backward through the addition, preventing vanishing gradient in deeper networks.


TRAINING ENGINE — Each Epoch

For each epoch (max 80):
│
├── TRAIN PHASE
│   For each mini-batch:
│   ├── Forward: output = model(X_batch)
│   ├── Loss:
│   │     classification → CrossEntropyLoss(output, y_batch)
│   │     regression     → MSELoss(output.squeeze(), y_batch)
│   ├── Backward: loss.backward()
│   ├── Gradient clip: clip_grad_norm_(max=1.0)   ← safety guard
│   └── Step: Adam optimizer updates weights
│
├── VALIDATION PHASE
│   ├── model.eval() + torch.no_grad()
│   ├── Compute val_loss on full val set
│   └── model.train()
│
├── LR SCHEDULER
│   └── ReduceLROnPlateau: if val_loss doesn't improve for 5 epochs
│       → halve the learning rate
│
└── EARLY STOPPING
    └── If val_loss doesn't improve for 10 epochs:
        ├── Restore best weights seen so far
        └── Stop training

PREPROCESSING DECISION TREE

For each column:
│
├── Is it numeric dtype? (int, float)
│   └── YES → Impute(median) → StandardScaler
│
├── Is it object/categorical dtype?
│   ├── nunique ≤ 15  → Impute(most_frequent) → OneHotEncoder
│   │                   Creates binary 0/1 columns for each category
│   │
│   └── nunique > 15  → Impute(most_frequent) → OrdinalEncoder (if no category_encoders)
│                        OR TargetEncoder (if category_encoders installed)
│                        Maps categories to integers or target-mean values
│
└── Is it text? (avg_string_len > 30 AND nunique > 50)
    └── TfidfVectorizer(max_features=50, ngram_range=(1,2))
        Creates 50 TF-IDF score columns from text

WHAT HAPPENS WITH TIME BUDGET

--time_budget=300  (5 minutes)

Time allocation per model (approximate):
  Each sklearn model:  ~20% of remaining budget, capped at 60s for HPO
  Each pytorch model:  ~30% of remaining budget, capped at 90s for HPO

If budget expires mid-loop:
  ├── Current model finishes its current HPO trial
  ├── No new models are started
  └── Whatever finished goes to leaderboard → best is selected

No time budget (default):
  └── Every model runs to completion with all 15 HPO trials

EXPLAINABILITY — How It Decides

After best model is selected:

Is best model a tree-based sklearn model?
  (RandomForest, GradientBoosting, XGBoost, LightGBM)
  └── YES → use model.feature_importances_ directly (fast, no SHAP needed)
            OR shap.TreeExplainer if SHAP installed (exact, tree-native)

Is best model a linear sklearn model?
  (LogisticRegression, Ridge, LinearRegression)
  └── YES → use abs(model.coef_) directly
            OR shap.LinearExplainer if SHAP installed

Is best model a PyTorch model?
  └── YES → requires SHAP installed
            shap.KernelExplainer with 50 background samples (approximate)
            → if SHAP not installed: no importance available for neural nets

Output: dict of {feature_name: importance_score}
        sorted descending → top 15 printed as bar chart

OUTPUT FILES

./automl_output/
├── best_model.joblib          ← sklearn model (if sklearn won)
├── best_model.pt              ← PyTorch model weights (if neural net won)
├── preprocessing.joblib       ← fitted ColumnTransformer (imputers + scalers + encoders)
├── feature_engineering.joblib ← fitted FeatureEngineer (poly + log + sqrt + selector)
└── leaderboard.csv            ← all models with all metrics, ranked

To reload and use the best model later:

import joblib
import torch

# For sklearn winner:
model = joblib.load("automl_output/best_model.joblib")
prep  = joblib.load("automl_output/preprocessing.joblib")
feat  = joblib.load("automl_output/feature_engineering.joblib")

X_new = prep.transform(new_df)
X_new = feat.transform(X_new)
preds = model.predict(X_new)

# For PyTorch winner:
model = torch.load("automl_output/best_model.pt")
model.eval()