Spaces:
Running
Running
| # AutoML System — Complete Working Structure | |
| --- | |
| ## OVERVIEW: How the System Flows | |
| ``` | |
| CSV File | |
| │ | |
| ▼ | |
| [STEP 1] CLI Parsing → reads --data --target --task --time_budget etc. | |
| │ | |
| ▼ | |
| [STEP 2] Dataset Analyzer → inspects rows, cols, types, missing, imbalance | |
| │ | |
| ▼ | |
| [STEP 3] Train/Val/Test Split → 70% / 15% / 15% | |
| │ | |
| ▼ | |
| [STEP 4] Preprocessing → impute → scale / encode → tensor | |
| │ | |
| ▼ | |
| [STEP 5] Feature Engineering → original + log + sqrt + polynomial → SelectKBest | |
| │ | |
| ▼ | |
| [STEP 6] Model Training Loop | |
| ├── sklearn models → Optuna HPO → final fit → evaluate | |
| └── pytorch models → Optuna HPO → full train loop → evaluate | |
| │ | |
| ▼ | |
| [STEP 7] Leaderboard → rank all models by primary score | |
| │ | |
| ▼ | |
| [STEP 8] Best Model Selected → highest primary score wins | |
| │ | |
| ▼ | |
| [STEP 9] Explainability → SHAP or feature_importances_ | |
| │ | |
| ▼ | |
| [STEP 10] Save Artifacts → model + pipeline + leaderboard.csv | |
| ``` | |
| --- | |
| ## CASE A: CLASSIFICATION TASK | |
| ```bash | |
| python main.py --data data.csv --target churn --task classification | |
| ``` | |
| ### Models that run (in this order): | |
| ``` | |
| 1. RandomForest (RandomForestClassifier) | |
| 2. GradientBoosting (GradientBoostingClassifier) | |
| 3. LogisticRegression (LogisticRegression) | |
| 4. XGBoost (XGBClassifier) ← only if xgboost installed | |
| 5. LightGBM (LGBMClassifier) ← only if lightgbm installed | |
| 6. FeedforwardNN (PyTorch) ← only if torch installed | |
| 7. ResidualMLP (PyTorch) ← only if torch installed | |
| ``` | |
| ### Metrics computed per model: | |
| ``` | |
| accuracy = correct predictions / total | |
| precision = TP / (TP + FP) → binary: direct | multiclass: macro average | |
| recall = TP / (TP + FN) → binary: direct | multiclass: macro average | |
| f1 = 2 * (P * R) / (P + R) | |
| roc_auc = area under ROC curve → binary: normal | multiclass: OvR macro | |
| ``` | |
| ### How best model is selected (classification): | |
| ``` | |
| primary_score = roc_auc (if available) | |
| = f1 (fallback, when predict_proba not available) | |
| → Model with HIGHEST primary_score wins | |
| ``` | |
| Example leaderboard: | |
| ``` | |
| Rank Model accuracy f1 roc_auc → primary_score | |
| 1 LogisticRegression 0.907 0.909 0.972 0.972 ← WINNER | |
| 2 GradientBoosting 0.880 0.868 0.958 0.958 | |
| 3 FeedforwardNN 0.871 0.855 0.941 0.941 | |
| 4 RandomForest 0.853 0.836 0.945 0.945 | |
| ``` | |
| --- | |
| ## CASE B: REGRESSION TASK | |
| ```bash | |
| python main.py --data data.csv --target price --task regression | |
| ``` | |
| ### Models that run (in this order): | |
| ``` | |
| 1. RandomForest (RandomForestRegressor) | |
| 2. GradientBoosting (GradientBoostingRegressor) | |
| 3. Ridge (Ridge regression) | |
| 4. LinearRegression (LinearRegression) | |
| 5. XGBoost (XGBRegressor) ← only if xgboost installed | |
| 6. LightGBM (LGBMRegressor) ← only if lightgbm installed | |
| 7. FeedforwardNN (PyTorch) ← only if torch installed | |
| 8. ResidualMLP (PyTorch) ← only if torch installed | |
| ``` | |
| ### Metrics computed per model: | |
| ``` | |
| RMSE = sqrt( mean( (y_pred - y_true)² ) ) → lower is better | |
| MAE = mean( |y_pred - y_true| ) → lower is better | |
| R² = 1 - SS_residual / SS_total → higher is better (1.0 = perfect) | |
| ``` | |
| ### How best model is selected (regression): | |
| ``` | |
| primary_score = -RMSE (negated so that higher = better, like classification) | |
| → Model with LOWEST RMSE = HIGHEST primary_score = WINS | |
| ``` | |
| Example leaderboard: | |
| ``` | |
| Rank Model rmse mae r2 primary_score | |
| 1 Ridge 34.44 26.92 0.954 -34.44 ← WINNER | |
| 2 LinearReg 35.36 27.69 0.951 -35.36 | |
| 3 FeedforwardNN 61.22 48.11 0.857 -61.22 | |
| 4 RandomForest 100.81 77.93 0.607 -100.81 | |
| ``` | |
| --- | |
| ## FEATURE ENGINEERING — Full Breakdown | |
| This is what happens step by step when your preprocessed matrix hits the | |
| FeatureEngineer. Example: you have 15 preprocessed features. | |
| ``` | |
| Input: X shape = (N rows, 15 features) | |
| ``` | |
| ### Step 1 — Original features kept as-is: | |
| ``` | |
| [f0, f1, f2, ..., f14] → 15 columns | |
| ``` | |
| ### Step 2 — Log Transform (use_log=True): | |
| ``` | |
| For each column: | |
| X_shifted = X - X.min(axis=0) + 0.000001 ← shift all values to positive | |
| log_feat = log1p(X_shifted) ← log(1 + x) | |
| Result: 15 new log-transformed columns | |
| [log(f0), log(f1), ..., log(f14)] → +15 columns = 30 total | |
| ``` | |
| ### Step 3 — Sqrt Transform (use_sqrt=True): | |
| ``` | |
| For each column: | |
| X_shifted = X - X.min(axis=0) ← shift to non-negative | |
| sqrt_feat = sqrt(X_shifted) | |
| Result: 15 new sqrt-transformed columns | |
| [√f0, √f1, ..., √f14] → +15 columns = 45 total | |
| ``` | |
| ### Step 4 — Polynomial Features (use_polynomial=True, degree=2): | |
| ``` | |
| Runs on first 20 columns only (cap to avoid memory explosion) | |
| With 15 features, runs on all 15. | |
| PolynomialFeatures(degree=2, include_bias=False) generates: | |
| - Original terms: f0, f1, ..., f14 (15 cols — REMOVED, already have them) | |
| - Squared terms: f0², f1², ..., f14² (15 cols) ← NEW | |
| - Cross terms: f0·f1, f0·f2, ..., f13·f14 (105 cols) ← NEW | |
| Formula: new_cols = N*(N+1)/2 = 15*16/2 = 120 | |
| Example with just 3 features A, B, C: | |
| Input: [A, B, C] | |
| After poly: | |
| A² = A*A | |
| A·B = A*B | |
| A·C = A*C | |
| B² = B*B | |
| B·C = B*C | |
| C² = C*C | |
| → 6 new columns (originals removed since already in base set) | |
| With 15 features → +120 new polynomial columns = 165 total | |
| ``` | |
| ### Step 5 — SelectKBest (select_k=60): | |
| ``` | |
| All 165 features are scored against target y using: | |
| - f_regression (for regression tasks) → F-statistic | |
| - f_classif (for classification) → ANOVA F-value | |
| Top 60 features with highest scores are KEPT. | |
| All others are DROPPED. | |
| Output: X shape = (N rows, 60 features) | |
| ``` | |
| ### Summary Table: | |
| ``` | |
| Stage Columns Cumulative | |
| ───────────────────────────────────────────── | |
| Original preprocessed 15 15 | |
| + Log transforms 15 30 | |
| + Sqrt transforms 15 45 | |
| + Polynomial (degree-2) 120 165 | |
| ───────────────────────────────────────────── | |
| After SelectKBest(k=60) ——→ 60 ← final feature matrix | |
| ``` | |
| ### Why you "didn't see" polynomial features: | |
| The polynomial features ARE generated, but then SelectKBest filters the | |
| full 165-feature matrix down to 60. The output is indexed as: | |
| feature_0, feature_1, ..., feature_59 | |
| These are not labeled as "poly" or "log" — they're just the top 60 | |
| regardless of type. If a polynomial feature ranked high, it's in there. | |
| If it ranked low, SelectKBest dropped it. The system doesn't show which | |
| type each surviving feature came from. | |
| --- | |
| ## MODEL SELECTION — Detailed Decision Tree | |
| ``` | |
| For each model in the candidate pool: | |
| │ | |
| ├── Does it have a search space? (len(search_space) > 0) | |
| │ ├── YES → Run Optuna HPO | |
| │ │ │ | |
| │ │ ├── Is Optuna installed? | |
| │ │ │ ├── YES → Bayesian optimization (TPE sampler) | |
| │ │ │ │ Each trial: suggest params → fit → score val → report | |
| │ │ │ └── NO → Random search fallback | |
| │ │ │ Each trial: random sample → fit → score val → track best | |
| │ │ │ | |
| │ │ └── Returns best_params dict | |
| │ │ | |
| │ └── NO → Skip HPO (e.g. LinearRegression has no hyperparams) | |
| │ Use default params directly | |
| │ | |
| ├── Final fit on FULL train set with best_params | |
| │ | |
| ├── Predict on TEST set (never seen during HPO) | |
| │ | |
| ├── Compute metrics (accuracy/f1/auc OR rmse/mae/r2) | |
| │ | |
| └── Compute primary_score → add to Leaderboard | |
| After ALL models finish: | |
| │ | |
| ├── Sort leaderboard by primary_score DESCENDING | |
| ├── Rank 1 = BEST MODEL | |
| └── best_model = leaderboard[0]["_model"] | |
| ``` | |
| --- | |
| ## HPO SEARCH SPACES — Every Parameter | |
| ### RandomForest (classification AND regression): | |
| ``` | |
| n_estimators int [50, 500] → number of trees | |
| max_depth int [3, 20] → max tree depth | |
| min_samples_split int [2, 20] → min samples to split a node | |
| min_samples_leaf int [1, 10] → min samples in leaf node | |
| ``` | |
| ### GradientBoosting (classification AND regression): | |
| ``` | |
| n_estimators int [50, 300] → boosting rounds | |
| max_depth int [2, 8] → tree depth (shallower than RF) | |
| learning_rate float [0.001, 0.3] → log scale (shrinkage) | |
| subsample float [0.5, 1.0] → fraction of samples per tree | |
| ``` | |
| ### LogisticRegression (classification only): | |
| ``` | |
| C float [0.0001, 10.0] → log scale, inverse regularization | |
| ``` | |
| ### Ridge (regression only): | |
| ``` | |
| alpha float [0.001, 100.0] → log scale, regularization strength | |
| ``` | |
| ### LinearRegression: NO hyperparams → skips HPO entirely | |
| ### XGBoost (if installed): | |
| ``` | |
| n_estimators int [50, 500] | |
| max_depth int [2, 10] | |
| learning_rate float [0.001, 0.3] → log scale | |
| subsample float [0.5, 1.0] | |
| colsample_bytree float [0.5, 1.0] → fraction of features per tree | |
| ``` | |
| ### LightGBM (if installed): | |
| ``` | |
| n_estimators int [50, 500] | |
| max_depth int [2, 10] | |
| learning_rate float [0.001, 0.3] → log scale | |
| num_leaves int [16, 256] → controls model complexity | |
| subsample float [0.5, 1.0] | |
| ``` | |
| ### FeedforwardNN (PyTorch): | |
| ``` | |
| lr float [0.0001, 0.01] → log scale (Adam learning rate) | |
| hidden_dim choice [64, 128, 256] → units per hidden layer | |
| n_layers int [2, 4] → number of hidden layers | |
| dropout float [0.1, 0.5] → dropout probability | |
| batch_size choice [32, 64, 128] → mini-batch size | |
| weight_decay float [1e-6, 0.001] → L2 regularization | |
| ``` | |
| ### ResidualMLP (PyTorch): | |
| ``` | |
| lr float [0.0001, 0.01] → log scale | |
| hidden_dim choice [64, 128, 256] → ALL residual blocks share this width | |
| n_blocks int [2, 5] → number of residual blocks | |
| dropout float [0.1, 0.5] | |
| batch_size choice [32, 64, 128] | |
| weight_decay float [1e-6, 0.001] | |
| ``` | |
| --- | |
| ## PYTORCH ARCHITECTURE DETAILS | |
| ### FeedforwardNN: | |
| ``` | |
| Input(n_features) | |
| │ | |
| ▼ | |
| [Linear(n, hidden_dim) → BatchNorm1d → ReLU → Dropout] ← repeated n_layers times | |
| │ | |
| ▼ | |
| Linear(hidden_dim, output_dim) | |
| │ | |
| ▼ | |
| Output: logits (classification) OR scalar (regression) | |
| ``` | |
| ### ResidualMLP: | |
| ``` | |
| Input(n_features) | |
| │ | |
| ▼ | |
| Linear(n, hidden_dim) → BatchNorm1d → ReLU ← input projection | |
| │ | |
| ▼ | |
| ┌── ResidualBlock ──────────────────────┐ | |
| │ x_in ──────────────────────────────┐ │ | |
| │ │ │ │ | |
| │ Linear → BN → ReLU → Dropout │ │ ← repeated n_blocks times | |
| │ │ │ │ | |
| │ Linear → BN │ │ | |
| │ │ │ │ | |
| │ ReLU( x_block + x_in ) ←──────────┘ │ ← skip connection | |
| └───────────────────────────────────────┘ | |
| │ | |
| ▼ | |
| Linear(hidden_dim, output_dim) | |
| ``` | |
| The skip connection (x_in + x_block) is what makes it "residual" — | |
| gradients flow directly backward through the addition, preventing | |
| vanishing gradient in deeper networks. | |
| --- | |
| ## TRAINING ENGINE — Each Epoch | |
| ``` | |
| For each epoch (max 80): | |
| │ | |
| ├── TRAIN PHASE | |
| │ For each mini-batch: | |
| │ ├── Forward: output = model(X_batch) | |
| │ ├── Loss: | |
| │ │ classification → CrossEntropyLoss(output, y_batch) | |
| │ │ regression → MSELoss(output.squeeze(), y_batch) | |
| │ ├── Backward: loss.backward() | |
| │ ├── Gradient clip: clip_grad_norm_(max=1.0) ← safety guard | |
| │ └── Step: Adam optimizer updates weights | |
| │ | |
| ├── VALIDATION PHASE | |
| │ ├── model.eval() + torch.no_grad() | |
| │ ├── Compute val_loss on full val set | |
| │ └── model.train() | |
| │ | |
| ├── LR SCHEDULER | |
| │ └── ReduceLROnPlateau: if val_loss doesn't improve for 5 epochs | |
| │ → halve the learning rate | |
| │ | |
| └── EARLY STOPPING | |
| └── If val_loss doesn't improve for 10 epochs: | |
| ├── Restore best weights seen so far | |
| └── Stop training | |
| ``` | |
| --- | |
| ## PREPROCESSING DECISION TREE | |
| ``` | |
| For each column: | |
| │ | |
| ├── Is it numeric dtype? (int, float) | |
| │ └── YES → Impute(median) → StandardScaler | |
| │ | |
| ├── Is it object/categorical dtype? | |
| │ ├── nunique ≤ 15 → Impute(most_frequent) → OneHotEncoder | |
| │ │ Creates binary 0/1 columns for each category | |
| │ │ | |
| │ └── nunique > 15 → Impute(most_frequent) → OrdinalEncoder (if no category_encoders) | |
| │ OR TargetEncoder (if category_encoders installed) | |
| │ Maps categories to integers or target-mean values | |
| │ | |
| └── Is it text? (avg_string_len > 30 AND nunique > 50) | |
| └── TfidfVectorizer(max_features=50, ngram_range=(1,2)) | |
| Creates 50 TF-IDF score columns from text | |
| ``` | |
| --- | |
| ## WHAT HAPPENS WITH TIME BUDGET | |
| ``` | |
| --time_budget=300 (5 minutes) | |
| Time allocation per model (approximate): | |
| Each sklearn model: ~20% of remaining budget, capped at 60s for HPO | |
| Each pytorch model: ~30% of remaining budget, capped at 90s for HPO | |
| If budget expires mid-loop: | |
| ├── Current model finishes its current HPO trial | |
| ├── No new models are started | |
| └── Whatever finished goes to leaderboard → best is selected | |
| No time budget (default): | |
| └── Every model runs to completion with all 15 HPO trials | |
| ``` | |
| --- | |
| ## EXPLAINABILITY — How It Decides | |
| ``` | |
| After best model is selected: | |
| Is best model a tree-based sklearn model? | |
| (RandomForest, GradientBoosting, XGBoost, LightGBM) | |
| └── YES → use model.feature_importances_ directly (fast, no SHAP needed) | |
| OR shap.TreeExplainer if SHAP installed (exact, tree-native) | |
| Is best model a linear sklearn model? | |
| (LogisticRegression, Ridge, LinearRegression) | |
| └── YES → use abs(model.coef_) directly | |
| OR shap.LinearExplainer if SHAP installed | |
| Is best model a PyTorch model? | |
| └── YES → requires SHAP installed | |
| shap.KernelExplainer with 50 background samples (approximate) | |
| → if SHAP not installed: no importance available for neural nets | |
| Output: dict of {feature_name: importance_score} | |
| sorted descending → top 15 printed as bar chart | |
| ``` | |
| --- | |
| ## OUTPUT FILES | |
| ``` | |
| ./automl_output/ | |
| ├── best_model.joblib ← sklearn model (if sklearn won) | |
| ├── best_model.pt ← PyTorch model weights (if neural net won) | |
| ├── preprocessing.joblib ← fitted ColumnTransformer (imputers + scalers + encoders) | |
| ├── feature_engineering.joblib ← fitted FeatureEngineer (poly + log + sqrt + selector) | |
| └── leaderboard.csv ← all models with all metrics, ranked | |
| ``` | |
| To reload and use the best model later: | |
| ```python | |
| import joblib | |
| import torch | |
| # For sklearn winner: | |
| model = joblib.load("automl_output/best_model.joblib") | |
| prep = joblib.load("automl_output/preprocessing.joblib") | |
| feat = joblib.load("automl_output/feature_engineering.joblib") | |
| X_new = prep.transform(new_df) | |
| X_new = feat.transform(X_new) | |
| preds = model.predict(X_new) | |
| # For PyTorch winner: | |
| model = torch.load("automl_output/best_model.pt") | |
| model.eval() | |
| ``` | |