predictlm-base-26m

A 26.2M-parameter transformer-based tabular foundation model that uses in-context learning to solve regression and classification in a single forward pass. Pass a small training table as the context, and the model predicts on new rows — no fine-tuning, no model selection, no hyperparameter sweep.

Looking for a more compact variant? PredictLM Mini (13.5M) is distilled from this Base model via warm-start knowledge transfer. It is statistically tied with Base on classification accuracy (paired-bootstrap delta -0.001, 95% CI crosses zero) and ~4 pp lower R² on regression at half the parameter count.

Getting started — the published 0.751 cls / 0.609 reg recipe, by default

pip install predictlm

from predictlm import PredictLM

model = PredictLM.from_pretrained("zerooneresearch/predictlm-base-26m")  # cpu / mps / cuda all OK

# Regression — pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

# Classification — same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)            # [n, n_classes]

That's it. On the first .predict() call the package silently downloads its partner checkpoint (predictlm-mini-13m), forms the published Duo + TTT ensemble under the hood, and returns the 0.751 cls / 0.609 reg result on the locked 25-dataset OpenML eval. You never manage the ensemble; the partner is cached in ~/.cache/huggingface/.

| Recipe (chosen via auto_duo= flag) | cls mean acc | reg mean R² |
|---|---|---|
| Default .predict() (Duo + TTT under the hood) | 0.751 | 0.609 |
| auto_duo=False (Base-only, zero-tuning) | 0.685 | 0.589 |
| auto_duo=False + fit_and_predict_with_ttt() (Base-only TTT) | 0.748 | 0.608 |

Edge cases:

  • No internet / air-gapped. Pass auto_duo=False at load to disable partner download — .predict() returns the single-model in-context result.
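  • Real-time inference (<10 ms latency)? Use auto_duo=False zero-tuning; Duo + TTT adds ~1-60 s per query depending on table size. Single-model inference is itself listed at ~30 ms on an H100 / A100 for n_train=500 (see Inference latency), so check your budget against that table.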

TTT (Test-Time Training) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which adds per-task specialization on top of a generic ICL prior. 19 / 20 datasets improved vs zero-tuning; no dataset regressed by more than 0.006.

PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch on synthetic data and shipped under Apache-2.0.

Architecture

Unified architecture: a shared backbone with two task heads (regression via a 1024-bin BarDistribution, classification via per-task masked softmax). The model auto-detects task type from the dtype of y_train and routes through the matching head. One fit/predict API for both. This unified framing follows TabICLv2 (Soda Inria, Feb 2026); the closest non-unified precedent is TabPFN v2, which ships separate classifier and regressor checkpoints.
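To make the regression head more concrete, here is a minimal sketch of how a binned ("bar") distribution can turn per-bin logits into a point prediction. The equal-width bins, the target range, and the function name bar_distribution_mean are assumptions for illustration; the model's actual BarDistribution may place its 1024 bins and read out predictions differently.

```python
import numpy as np

def bar_distribution_mean(logits, y_min, y_max):
    """Expected value of a piecewise-constant density over equal-width bins.

    Illustrative only: bin placement and readout are assumptions.
    """
    logits = np.asarray(logits, dtype=np.float64)
    n_bins = logits.shape[-1]
    edges = np.linspace(y_min, y_max, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Numerically stable softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # The mean of the binned distribution is the probability-weighted bin center.
    return probs @ centers

# Equal logits spread the mass evenly, so the mean sits at the midpoint of the range.
flat_logits = np.zeros(1024)
flat_pred = bar_distribution_mean(flat_logits, 0.0, 10.0)

# One dominant bin pulls the prediction to that bin's center.
peaked_logits = np.full(1024, -20.0)
peaked_logits[100] = 20.0
peaked_pred = bar_distribution_mean(peaked_logits, 0.0, 10.0)
```

A binned head also gives a full predictive distribution, so quantiles or intervals could in principle be read from the same probabilities.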

X_train and X_test are numeric np.ndarray or torch.Tensor. y_train controls task routing: float → regression, int / string → classification.
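The routing rule can be pictured with a small sketch. The helper infer_task is hypothetical and is not part of the predictlm package; it only mirrors the stated rule (float y means regression, int or string y means classification) so you can check which head a given y_train would select.

```python
import numpy as np

def infer_task(y_train):
    """Mirror the documented routing rule (hypothetical helper, not the package API)."""
    y = np.asarray(y_train)
    if np.issubdtype(y.dtype, np.floating):
        return "regression"
    # Integer and string labels are treated as class labels.
    return "classification"

quality_float = np.array([5.0, 6.0, 5.0, 7.0])
quality_int = quality_float.astype(np.int64)  # casting changes the selected head
labels = np.array(["spam", "ham", "spam"])

tasks = [infer_task(quality_float), infer_task(quality_int), infer_task(labels)]
```

Checking the dtype before calling fit is a cheap way to avoid sending an ordinal target through the regression head by accident.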

Developers and affiliations

  • Developed by: ZeroOne Research
  • Model card contact: message the org on the Hub
  • License: Apache 2.0 (permissive, commercial use allowed; redistributions must retain the license and notices)

Intended use

predictlm-base-26m is a drop-in tabular predictor for small-to-medium tables when you want one model that handles both regression and classification:

  • Direct use: fit/predict on numeric tabular data with ≤ 128 features, ≤ 1024 training rows, and (for classification) ≤ 10 classes. Best in zero-tuning settings.
  • Downstream use: as a baseline foundation model in tabular benchmarking, or as an ICL backbone for derivative work (the trunk weights are released under Apache 2.0).
  • Most useful when: you have few training rows, you have mixed reg + cls tasks in one pipeline, you want zero hyperparameter tuning, or you want a single model artifact rather than maintaining separate per-task models.

Not intended use

Do not use this model for:

  • High-stakes decisions in medicine, lending, hiring, criminal justice, or any context where a wrong prediction causes individual harm — without domain-specific validation, calibration audit, and human review. Like any tabular predictor, predictlm-base-26m will reflect biases present in the labeled context the user provides.
  • Wide tables (> 128 features). The input projection truncates extra columns.
  • Many-class classification (> 10 classes). Will raise an error.
  • Very large training sets (> ~10,000 rows). Performance saturates around the 1024-row context cap; gradient-boosted trees (XGBoost / LightGBM) will outperform here.
  • Non-numeric features without prior encoding. One-hot / target-encode categoricals first.
  • Latency-critical inference under ~10 ms on CPU. A trained XGBoost is faster on small problems.
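The categorical-encoding step mentioned in the list above can be sketched with plain NumPy. The one_hot helper and the toy columns are illustrative assumptions; any encoder that yields a numeric matrix with at most 128 columns would serve the same purpose.

```python
import numpy as np

def one_hot(column):
    """One-hot encode a 1-D array of categories; returns (matrix, sorted categories)."""
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    codes = codes.reshape(-1)
    encoded = np.zeros((codes.size, categories.size), dtype=np.float32)
    encoded[np.arange(codes.size), codes] = 1.0
    return encoded, categories

# Toy table: one numeric column and one categorical column.
sizes = np.array([1.5, 2.0, 0.5, 3.0], dtype=np.float32)
colors = np.array(["red", "green", "red", "blue"])

color_block, color_names = one_hot(colors)
X = np.column_stack([sizes, color_block])  # numeric matrix, ready for fit/predict
```

In practice the category list should be fitted on the training rows and reused for the test rows so both matrices share the same column layout.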

Model architecture

| field | value |
|---|---|
| Parameters | 26.2 M |
| Layers | 12 (8 shared trunk + 2 reg-head + 2 cls-head) |
| d_model | 256 |
| n_heads | 8 |
| max_features | 128 |
| max_classes | 10 |
| max_context (training rows passed at inference) | 1024 |
| max_query (test rows scored per call) | 256 |
| Regression head | BarDistribution, 1024 bins |
| Classification head | Per-task masked softmax |
| Attention | row-axis transformer; queries cross-attend to context only (deterministic given the context) |
| Feature embedding | Periodic-frequency, 8 bands (scale-invariant, no explicit standardization required) |
| Inference precision | bf16 |

The trunk is a row-axis transformer over the training context concatenated with query rows. Queries cross-attend over the context but not over each other, which makes predictions deterministic given the context.
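The attention pattern described above can be illustrated with a toy single-head layer in NumPy. This is a sketch under assumptions (no learned projections, one head, hypothetical function names), not the model's implementation. It shows that when every row may only attend to context columns, a query's output does not change when other queries share the batch.

```python
import numpy as np

def context_only_mask(n_context, n_query):
    """True where attention is allowed: every row may look at context columns only."""
    n = n_context + n_query
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_context] = True
    return mask

def masked_attention(x, mask):
    """Toy single-head self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
context = rng.normal(size=(5, 4))
queries = rng.normal(size=(3, 4))

# Score the first query together with two other queries, then on its own.
batched = masked_attention(np.vstack([context, queries]), context_only_mask(5, 3))
alone = masked_attention(np.vstack([context, queries[:1]]), context_only_mask(5, 1))
first_query_unchanged = np.allclose(batched[5], alone[5])
```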

Training data and priors

predictlm-base-26m was trained on synthetic priors with cleared real-data augmentation. No raw OpenML rows were ever shown to the model.

Training-task mix per step:

  • 70% structural causal model (SCM) tasks — mixed-node SCMs (linear / MLP / tree / periodic / discretizer), with heavy-tail noise, MNAR missingness, target censoring, hierarchical groups, Pitman-Yor categoricals, and covariate shift between context and query.
  • 30% Gaussian-copula tasks fit on cleared real tables — 99 bundles total, sampled from UCI and EU government open-data sources.
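As a rough illustration of the copula half of the mix, the sketch below fits a Gaussian copula to a numeric table and samples synthetic rows from it. It is a textbook version under assumptions (rank-based normal scores, empirical quantiles for the marginals, hypothetical function names) and is not the training pipeline's code; the toy seed table is generated on the spot.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def fit_gaussian_copula(table):
    """Return (correlation of normal scores, column-wise sorted values)."""
    table = np.asarray(table, dtype=np.float64)
    n_rows = table.shape[0]
    # Ranks 1..n per column, scaled into the open interval (0, 1).
    ranks = table.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)
    normal_scores = np.vectorize(_STD_NORMAL.inv_cdf)(uniforms)
    corr = np.corrcoef(normal_scores, rowvar=False)
    return corr, np.sort(table, axis=0)

def sample_gaussian_copula(corr, sorted_columns, n_samples, rng):
    """Draw correlated normals, then map each column back through its empirical quantiles."""
    n_cols = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u = np.vectorize(_STD_NORMAL.cdf)(z)
    positions = np.linspace(0.0, 1.0, sorted_columns.shape[0])
    return np.column_stack([
        np.interp(u[:, j], positions, sorted_columns[:, j]) for j in range(n_cols)
    ])

# Toy seed table with two strongly related columns.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
table = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=500)])

corr, sorted_columns = fit_gaussian_copula(table)
synthetic = sample_gaussian_copula(corr, sorted_columns, 1000, rng)
```

The synthetic rows keep the marginal ranges and the dependence structure of the seed table without copying any of its rows verbatim, which is the property that makes copula seeds useful for augmentation.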

Real datasets used as copula seeds were screened by a 3-rule contamination auditor (MinHash + character-n-gram + target-name match) against the full locked OpenML eval set before being admitted to the training pool. The auditor's clearance manifest is the load-bearing artifact for our no-leakage claim — see Reproducibility below.
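One way to picture the MinHash and character-n-gram rules is the sketch below, which estimates the Jaccard similarity between two dataset descriptors (here, header strings). The choice of what gets hashed, the n-gram length, the number of hash functions, and all function names are assumptions; the actual auditor's inputs and thresholds are not described in this card.

```python
import hashlib

def char_ngrams(text, n=5):
    """Set of lowercase character n-grams; short strings yield a single gram."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical header strings used only to exercise the functions.
candidate = "sepal_length,sepal_width,petal_length,petal_width,species"
unrelated = "median_income,house_age,ave_rooms,latitude,longitude"

sig_candidate = minhash_signature(char_ngrams(candidate))
same_score = estimated_jaccard(sig_candidate, minhash_signature(char_ngrams(candidate)))
diff_score = estimated_jaccard(sig_candidate, minhash_signature(char_ngrams(unrelated)))
```

A candidate seed whose score against any eval dataset exceeds a chosen threshold would be rejected; the threshold itself is a policy decision not covered here.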

A DifficultyCurriculum + HardExampleBuffer (5,000-task capacity, 30% replay rate) accelerates training on tasks where the model under-performs a copula baseline.
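The replay mechanism can be sketched as a fixed-capacity buffer that is consulted with a fixed probability. The class below reuses the name HardExampleBuffer for readability, but its methods, the first-in-first-out eviction, and the scoring convention (higher is better) are assumptions, and the DifficultyCurriculum side is not modeled.

```python
import random
from collections import deque

class HardExampleBuffer:
    """Fixed-capacity store of hard tasks with probabilistic replay (illustrative)."""

    def __init__(self, capacity=5000, replay_rate=0.30, seed=0):
        self.tasks = deque(maxlen=capacity)  # oldest entries are evicted first
        self.replay_rate = replay_rate
        self.rng = random.Random(seed)

    def record(self, task, model_score, baseline_score):
        """Keep a task only if the model under-performed the baseline on it."""
        if model_score < baseline_score:
            self.tasks.append(task)

    def next_task(self, sample_fresh_task):
        """Replay a stored hard task with probability replay_rate, else draw a fresh one."""
        if self.tasks and self.rng.random() < self.replay_rate:
            return self.tasks[self.rng.randrange(len(self.tasks))]
        return sample_fresh_task()

buffer = HardExampleBuffer()
buffer.record("task-a", model_score=0.40, baseline_score=0.55)  # stored: model is behind
buffer.record("task-b", model_score=0.80, baseline_score=0.55)  # skipped: model is ahead
stored = list(buffer.tasks)
```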

Performance benchmarks

Locked OpenML eval (held-out, contamination-audited)

Benchmark suites: CC-18 + CTR-23 + AMLB + TabPFN-extras (153 unique OpenML IDs, manifest SHA-256 below). 30-dataset stratified sample, seed=42, n=1500 rows max per task, fair-set filter n_features ≤ 128. 4-way comparison run 2026-05-14 against open and hosted SOTA baselines.

Fair set (n_features ≤ 128):

| model | reg-R² (n=13) | cls-acc (n=12) |
|---|---|---|
| predictlm-base-26m (this model, 26M) | +0.589 | 0.685 |
| XGBoost (200 trees, depth 6) | +0.516 | 0.743 |
| TabPFN-2.5 (hosted, ~100M, non-commercial license) | +0.662 | 0.780 |
| TabICLv2 (open, BSD-3, ~50M) | n/a (cls-only) | 0.792 |

Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

Per-dataset deltas (predictlm-base-26m minus baseline):

| comparison | mean Δ | 95% CI | n | significant? |
|---|---|---|---|---|
| Reg vs XGBoost | +0.073 | [-0.041, +0.196] | 13 | within noise |
| Reg vs TabPFN-2.5 | -0.073 | [-0.108, -0.038] | 13 | ✅ significant loss |
| Cls vs XGBoost | -0.058 | [-0.094, -0.024] | 12 | ✅ significant loss |
| Cls vs TabPFN-2.5 | -0.096 | [-0.133, -0.059] | 12 | ✅ significant loss |
| Cls vs TabICLv2 | -0.108 | [-0.150, -0.066] | 12 | ✅ significant loss |

Honest read on the headline number. The +7.3 pp mean R² advantage over XGBoost on regression is only the point estimate; the 95% paired-bootstrap CI is [−4.1 pp, +19.6 pp], which includes zero, so the regression win is not statistically significant on this 13-dataset sample. Variance across datasets is large (on some datasets predictlm wins by 10+ pp, on others XGBoost wins by 5+ pp). What we can say: on this evaluation, predictlm-base-26m trends ahead of XGBoost on regression with a positive point estimate, while neither method has a statistically dominant advantage.
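For readers who want to reproduce this kind of interval on their own results, a paired bootstrap over per-dataset deltas can be sketched as follows. The resample count and seed follow the settings quoted above; the percentile method, the function name, and the placeholder scores are assumptions, and the numbers in the example are not the eval's per-dataset results.

```python
import numpy as np

def paired_bootstrap_ci(model_scores, baseline_scores,
                        n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile CI for the mean per-dataset delta (model minus baseline)."""
    deltas = np.asarray(model_scores, dtype=float) - np.asarray(baseline_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample datasets with replacement, keeping each model/baseline pair together.
    idx = rng.integers(0, deltas.size, size=(n_resamples, deltas.size))
    boot_means = deltas[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return deltas.mean(), low, high

# Placeholder scores on five hypothetical datasets (illustrative values only).
model = [0.70, 0.40, 0.65, 0.30, 0.80]
baseline = [0.40, 0.60, 0.55, 0.55, 0.60]

mean_delta, ci_low, ci_high = paired_bootstrap_ci(model, baseline)
crosses_zero = ci_low < 0.0 < ci_high  # if True, the difference is within noise
```

Pairing matters because both methods are scored on the same datasets; resampling datasets rather than individual scores keeps that dependence intact.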

Significant losses (real, not noise): loses to XGBoost on classification (-5.8 pp, CI [-9.4, -2.4]); loses to TabPFN-2.5 on both regression and classification, and to TabICLv2 on classification (TabICLv2 was evaluated cls-only). These baselines are hosted / SOTA models with roughly 2-4× our parameter count.

Out-of-regime (n_features > 128, n=2 datasets): predictlm-base-26m degrades sharply (~0.15 R² / 0.50 cls) — the input projection truncates extra columns. Use a different method for wide tables (see Limitations).

Empirical examples on real-world datasets (not in the eval set)

Same fit/predict call, default settings, up to 1000 train rows / 200 test rows, single seed:

| dataset | task | n_train | predictlm | XGBoost | winner |
|---|---|---|---|---|---|
| California housing | reg (R²) | 1000 | 0.728 | 0.727 | tied |
| Abalone | reg (R²) | 1000 | 0.562 | 0.459 | predictlm (+10 pp) |
| Wine quality, as float | reg (R²) | 1000 | 0.129 | 0.441 | XGBoost (mean reversion on ordinal) |
| Wine quality, as int | cls (acc) | 1000 | 0.530 | n/a | (cls mode resolves the failure mode above) |
| Kin8nm | reg (R²) | 1000 | 0.625 | 0.594 | predictlm |
| Titanic | cls (acc) | 1000 | 0.905 | 0.940 | XGBoost (small gap) |
| Glass | cls (acc) | 14 | 0.610 | 0.400 | predictlm (+21 pp) |
| Segment | cls (acc) | 1000 | 0.940 | 0.960 | XGBoost (small gap) |

The glass result is the foundation-model signal: with only 14 training rows on a 6-class problem, the pretrained ICL prior generalizes; XGBoost has nothing to fit. Conversely, on wine quality the BarDistribution regression head collapses to mean predictions on a near-categorical target — casting y to int switches the model into classification mode and recovers utility.

When to prefer GBDTs (XGBoost / LightGBM)

This is the operating-envelope guidance, not a confession. predictlm-base-26m is not the right tool when:

  • Training set is large (≳ 10,000 rows) — gradient-boosted trees scale better on data and saturate the predictlm context cap.
  • Wide tables (> 128 features) — out of model regime; use trees or wait for v12.
  • High-cardinality categoricals (e.g. ZIP codes, product IDs) — encode-and-truncate fights ICL pretraining.
  • Latency budget < 10 ms on CPU for many small predictions — a trained tree is faster.
  • Single, well-defined task with tuning budget — a tuned XGBoost almost always wins by a few points if you have time to grid-search.

Ethical considerations

  • No personal data in training: the model was not trained on any dataset containing personally identifying information. The 99 copula bundles are drawn from public UCI / EU government open-data sources.
  • No benchmark leakage: the locked eval set was never used in training, and any real dataset used as a copula seed was screened against the eval manifest before admission. Manifest SHA-256: fe4da8ccc...4fe0841 (see Reproducibility).
  • Bias inheritance: in classification, predictions reflect the labeled context the user supplies at inference time. Like any other tabular prediction method, when applied to high-risk use cases, users should ensure the labeled data is free of biases.
  • Interpretability: this is a black-box transformer over context+query; do not use without a human-in-the-loop in regulated decision contexts.

Limitations

  • max_features = 128 — wider tables truncate columns.
  • max_classes = 10 — many-class classification raises an error.
  • max_context = 1024 rows — larger training sets are randomly subsampled per call.
  • Numeric features only — encode categoricals before passing.
  • Rows are treated as exchangeable — no time-series / sequence inductive bias.
  • Eval numbers above are single-seed; expect per-dataset variation of about ±5 pp.
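The limits above can be checked before calling the model. The sketch below is a hypothetical preflight helper (prepare_context is not part of the predictlm package): it raises on inputs outside the documented envelope instead of relying on silent truncation, and it subsamples oversized contexts with a fixed seed so repeated runs see the same rows.

```python
import numpy as np

MAX_FEATURES, MAX_CLASSES, MAX_CONTEXT = 128, 10, 1024  # limits quoted in this card

def prepare_context(X_train, y_train, seed=0):
    """Validate a training table against the documented limits (hypothetical helper)."""
    X = np.asarray(X_train)
    y = np.asarray(y_train)
    if not np.issubdtype(X.dtype, np.number):
        raise TypeError("features must be numeric; encode categoricals first")
    if X.shape[1] > MAX_FEATURES:
        raise ValueError(f"{X.shape[1]} features exceeds the {MAX_FEATURES}-feature limit")
    is_classification = not np.issubdtype(y.dtype, np.floating)
    if is_classification and np.unique(y).size > MAX_CLASSES:
        raise ValueError(f"more than {MAX_CLASSES} classes")
    if X.shape[0] > MAX_CONTEXT:
        # Seeded subsample so the same context rows are used on every run.
        keep = np.random.default_rng(seed).choice(X.shape[0], size=MAX_CONTEXT, replace=False)
        X, y = X[keep], y[keep]
    return X, y

X_big = np.zeros((2000, 5))
y_big = np.arange(2000) % 3
X_ctx, y_ctx = prepare_context(X_big, y_big)
```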

Inference latency

Single device, n_train=500, n_test=100, n_features=20:

| device | latency |
|---|---|
| H100 / A100 | ~30 ms |
| L4 / RTX 3090 | ~80 ms |
| Apple M-series MPS | ~150 ms |
| CPU | 2–5 s |

Reproducibility

  • Weights file: v11_final.pt
  • SHA-256: e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26
  • Size: 105 MB (EMA weights + architecture cfg only — training-only state stripped)
  • Training step: 75,000 (the best held-out checkpoint per the locked eval; subsequent fine-tunes did not improve)
  • Training seed: 42
  • Eval-lock manifest SHA-256: fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841 (frozen 2026-04-25, 153 OpenML IDs)
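The checksums above can be verified locally with the standard library. This sketch assumes you have already downloaded the weights file to a local path; the helper names are illustrative, and the expected digest is the weights SHA-256 listed above.

```python
import hashlib
from pathlib import Path

EXPECTED_WEIGHTS_SHA256 = "e787b783f4ad06c55367d1912ec105626e94c82d399909aa98d93c446dc03e26"

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path, expected=EXPECTED_WEIGHTS_SHA256):
    """Return True when the file's digest matches the published one."""
    return sha256_of_file(path) == expected

# Example (assumes the file exists at this local path):
# print(verify_weights("v11_final.pt"))
```

The same sha256_of_file call can be pointed at the eval-lock manifest and compared against its published digest.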

Licensing

Apache 2.0; see LICENSE. Permissive, commercial use allowed; redistributions must retain the license and notices.

Version

  • v11.0 (current) — first public release. Step-75k checkpoint of the v11 training run.
  • predictlm-mini-13m — distilled compact sibling shipping alongside this release. 13.5M params, T4-trainable. Recommended for deployment on commodity GPUs.
  • Future releases will ship as new HF model repos under the same predictlm Python package.

Citation

BibTeX

@misc{predictlm2026,
  author       = {ZeroOne Research},
  title        = {predictlm-base-26m: a unified tabular foundation model for in-context regression and classification},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-base-26m}}
}

APA

ZeroOne Research. (2026). predictlm-base-26m: a unified tabular foundation model for in-context regression and classification. Hugging Face. https://huggingface.co/zerooneresearch/predictlm-base-26m

Evaluation results (self-reported)

  • mean accuracy (n=12, seed=42) on Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128: 0.685
  • mean R² (n=13, seed=42) on Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128: 0.589
  • mean accuracy with Duo + TTT recipe (Mini + Base + test-time training) on Locked OpenML eval (CC-18 + AMLB + TabPFN-extras), fair-set n_features ≤ 128: 0.751
  • mean R² with Duo + TTT recipe (Mini + Base + test-time training) on Locked OpenML eval (CTR-23 + AMLB), fair-set n_features ≤ 128: 0.609