predictlm-mini-13m

A 13.5M-parameter distilled tabular foundation model. Half the parameters of PredictLM Base (26M); statistically tied with Base on classification accuracy and within ~4 pp R² on regression.

This is the compact deployment variant of PredictLM, designed to run inference on any modern laptop or commodity GPU. Same single-forward-pass in-context-learning API as Base, same architecture family — just smaller, distilled, and re-trainable on hardware most teams already have.

Getting started — the published 0.751 cls / 0.609 reg recipe, by default

pip install predictlm

from predictlm import PredictLM

model = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m")  # cpu / mps / cuda all OK

# Regression — pass float y, get continuous predictions
preds = model.fit(X_train_reg, y_train_reg).predict(X_test_reg)

# Classification — same model, same API; auto-routed via y_train dtype
preds = model.fit(X_train_cls, y_train_cls).predict(X_test_cls)
probs = model.predict_proba(X_test_cls)

That's it. On the first .predict() call the package silently downloads its partner checkpoint (predictlm-base-26m) and forms the published Duo + TTT ensemble under the hood. This is the recipe that scored 0.751 cls / 0.609 reg on the locked 25-dataset OpenML eval. You never manage the ensemble; the partner is cached in ~/.cache/huggingface/.

Recipe (chosen via `auto_duo=` flag)	cls mean acc	reg mean R²
Default `.predict()` (Duo + TTT under the hood)	0.751	0.609
`auto_duo=False` (Mini-only, zero-tuning)	0.673	0.536
`auto_duo=False` + `fit_and_predict_with_ttt()` (Mini-only TTT)	0.742	0.595

Edge cases:

No internet / air-gapped. Pass auto_duo=False at load to disable partner download — .predict() returns the single-model in-context result.
Real-time inference (<10 ms latency)? Use auto_duo=False zero-tuning. Duo + TTT adds ~1-60 s per query depending on table size.
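The recipe table and edge cases above can be folded into a small selection helper. This is a sketch only: `Recipe` and `choose_recipe` are hypothetical names, not part of the predictlm package. The 1-second threshold is an assumption taken from the lower end of the "~1-60 s per query" figure, and the sketch also assumes that Mini-only TTT works offline and is too slow for sub-second budgets, which the card does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    auto_duo: bool   # value to pass as auto_duo= at load
    ttt: bool        # whether test-time training runs before predicting
    cls_acc: float   # published mean accuracy for this recipe
    reg_r2: float    # published mean R^2 for this recipe


# Published numbers copied from the recipe table above.
DUO_TTT = Recipe(auto_duo=True, ttt=True, cls_acc=0.751, reg_r2=0.609)
MINI_TTT = Recipe(auto_duo=False, ttt=True, cls_acc=0.742, reg_r2=0.595)
MINI_ZERO = Recipe(auto_duo=False, ttt=False, cls_acc=0.673, reg_r2=0.536)


def choose_recipe(offline: bool, latency_budget_s: float) -> Recipe:
    """Pick a recipe from deployment constraints (hypothetical helper)."""
    # Assumed threshold: any TTT recipe is too slow for sub-second budgets.
    if latency_budget_s < 1.0:
        return MINI_ZERO
    # Without network access the partner checkpoint cannot be downloaded.
    if offline:
        return MINI_TTT
    return DUO_TTT
```

For example, `choose_recipe(offline=True, latency_budget_s=30.0)` selects the Mini-only TTT row, while a 10 ms budget always falls back to zero-tuning.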

TTT (Test-Time Training) runs ~15 inner Adam steps of self-supervised fine-tuning on the user's in-context examples before predicting, which specializes the generic ICL prior to each task. It improved 19 of 20 datasets over zero-tuning, and no dataset regressed by more than 0.006.

PredictLM's TTT is an independent implementation of the published technique. This repo does not include or derive from TabPFN code or weights — PredictLM weights are trained from scratch (Mini distilled from PredictLM-Base) and shipped under Apache-2.0.

Developers and affiliations

Developed by: ZeroOne Research
Distilled from: predictlm-base-26m (v11.0)
Model card contact: message the org on the Hub
License: Apache 2.0 — permissive, commercial use allowed

Why Mini (when to prefer this over Base)

GPU memory budget < 8 GB at inference — Mini fits comfortably on a consumer GPU or M-series MPS
You want to re-distill / fine-tune yourself — Mini's training recipe runs on a single consumer GPU; Base requires an A100/H100
You want a smaller artifact to ship inside a product — 55 MB inference weights vs Base's 105 MB
You're running many concurrent inference jobs — 4× as many parallel Mini instances fit per GPU vs Base
You can tolerate ~4 pp lower regression R² (95% CI [-6.5, -1.5] pp; cls accuracy is statistically tied with Base)

Prefer Base instead if you have an A100/H100, value the last ~4 pp of regression accuracy, and don't need to re-distill.

Performance benchmarks

Locked OpenML eval (held-out, contamination-audited)

Same 30-dataset stratified sample, seed=42, fair-set filter n_features ≤ 128, 4-way comparison. Same eval pipeline as Base (scripts/eval_v11.py).

	reg-R² (n=13)	cls-acc (n=12)
predictlm-base-26m (teacher)	+0.589	0.685
predictlm-mini-13m (this model, 13.5M)	+0.551	0.684
XGBoost (200 trees, depth 6)	+0.516	0.743
TabPFN-2.5 (hosted, ~100M, non-commercial license)	+0.662	0.780
TabICLv2 (open, BSD-3, ~50M)	(cls-only)	0.792

Paired-bootstrap 95% CIs (10,000 resamples, seed=42)

Per-dataset deltas (predictlm-mini-13m minus baseline):

comparison	mean Δ	95% CI	n	significant?
Mini vs Base (compression cost)
Reg R²	-0.038	[-0.065, -0.015]	13	✅ real (~4 pp loss)
Cls acc	-0.001	[-0.027, +0.029]	12	✅ statistical tie
vs other peers (Mini)
Reg vs XGBoost	+0.035	[-0.076, +0.158]	13	within noise
Reg vs TabPFN-2.5	-0.111	[-0.152, -0.067]	13	✅ significant loss
Cls vs XGBoost	-0.059	[-0.089, -0.031]	12	✅ significant loss
Cls vs TabPFN-2.5	-0.097	[-0.132, -0.059]	12	✅ significant loss
Cls vs TabICLv2	-0.109	[-0.147, -0.069]	12	✅ significant loss
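A paired bootstrap over per-dataset deltas, as used for the table above, can be sketched as follows. This is a minimal percentile-bootstrap version under stated assumptions: the card does not say which bootstrap variant or random number generator the eval script uses, `paired_bootstrap_ci` is a hypothetical name, and the deltas below are made up for illustration rather than taken from the eval.

```python
import random


def paired_bootstrap_ci(deltas, n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-dataset paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement; each delta is already a
        # (model minus baseline) pair on one dataset, so pairs stay intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi


# Made-up deltas for illustration only (not the eval's per-dataset numbers).
example_deltas = [-0.05, -0.02, -0.04, -0.01, -0.06, -0.03]
point, lo, hi = paired_bootstrap_ci(example_deltas)

# A difference counts as significant when the interval excludes zero.
significant = hi < 0 or lo > 0
```

With only 12 or 13 datasets per task, intervals from this procedure are wide, which is why several rows in the table are reported as ties or within noise.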

Retention vs Base — the headline compression story:

Classification: statistical tie with Base (delta -0.001, CI [-0.027, +0.029]). At half the parameters, Mini is indistinguishable from the 26M teacher on classification accuracy.
Regression: ~4 pp R² cost vs Base, 95% CI [-6.5, -1.5] pp (statistically real but small).

Honest read on the peer comparisons. Like Base, Mini's regression-vs-XGBoost point estimate is positive (+3.5 pp), but the 95% CI on this 13-dataset sample crosses zero. We can't claim a statistically significant XGBoost win on regression from this single-seed eval. What we can say: Mini and XGBoost are competitive on regression on this benchmark, with Mini's mean R² slightly higher but not significantly so.

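Significant losses (real, not noise): Mini loses to XGBoost on classification (-5.9 pp), to TabPFN-2.5 on both axes (-11.1 pp R², -9.7 pp accuracy), and to TabICLv2 on classification (-10.9 pp; TabICLv2 is cls-only). TabPFN-2.5 and TabICLv2 are commercial / SOTA models with roughly 4-7× Mini's parameter count.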

Model size vs accuracy

model	params	params (%)	reg-R²	cls-acc
TabPFN-2.5	~100M	740%	0.662	0.780
TabICLv2	~50M	370%	—	0.792
predictlm-base-26m	26M	192%	0.589	0.685
predictlm-mini-13m	13.5M	100% (baseline)	0.551	0.684

Mini is the smallest open-source ICL tabular FM in this comparison and the only one that trains on a single commodity GPU.

Architecture

Identical architecture family to PredictLM Base, with cross-layer parameter sharing (ALBERT-style) to halve the trunk parameter count.

field	value
Parameters	13.5 M
Layers (effective depth)	12 (4 unique × 3 shares — ALBERT-style sharing in shared trunk; 2 unique × 2 shares per task head)
d_model	256
n_heads	8
max_features	128
max_classes	10
max_context	1024
max_query	256
Regression head	BarDistribution, 1024 bins (bins identical to Base — required for KL distillation)
Classification head	Per-task masked softmax
Attention	row-axis transformer (same as Base)
Inference precision	fp16 (T4-compatible — Base uses bf16 on A100/H100)

Cross-layer sharing means Mini has 4 unique trunk blocks each applied 3 times during forward pass (vs Base's 8 unique blocks each applied once). The effective compute graph depth is preserved; only the parameter count is halved.
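The effect of cross-layer sharing on parameter count versus applied depth can be illustrated with a toy model. This is a sketch, not PredictLM code: `Block` and `SharedTrunk` are hypothetical stand-ins, the per-block parameter count is a placeholder, and the order in which shared blocks are applied (each block repeated consecutively) is an assumption the card does not specify.

```python
class Block:
    """Stand-in for one transformer block; tracks size and call count only."""

    def __init__(self, n_params: int):
        self.n_params = n_params
        self.calls = 0

    def __call__(self, x: float) -> float:
        self.calls += 1
        return x + 1.0  # placeholder computation: counts applied layers


class SharedTrunk:
    def __init__(self, n_unique: int, n_shares: int, params_per_block: int):
        self.blocks = [Block(params_per_block) for _ in range(n_unique)]
        self.n_shares = n_shares

    @property
    def effective_depth(self) -> int:
        return len(self.blocks) * self.n_shares

    @property
    def n_params(self) -> int:
        # Shared blocks are counted once, however often they are applied.
        return sum(block.n_params for block in self.blocks)

    def forward(self, x: float) -> float:
        # Assumed ordering: each unique block runs n_shares times in a row.
        for block in self.blocks:
            for _ in range(self.n_shares):
                x = block(x)
        return x


# 4 unique blocks x 3 shares, as in the architecture table.
# params_per_block is a placeholder, not Mini's real block size.
trunk = SharedTrunk(n_unique=4, n_shares=3, params_per_block=1_000_000)
applied_layers = trunk.forward(0.0)
```

Here `trunk.effective_depth` is 12 while `trunk.n_params` counts only the 4 unique blocks, which is the trade the trunk makes: compute per forward pass follows the applied depth, storage follows the unique blocks.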

Training recipe (distillation from Base)

Mini was trained via warm-start sliced distillation: a novel recipe for compressing in-context-learning models that preserves real-data transfer ability.

Three-stage recipe:

Warm-start by slicing. Copy every Nth layer from the Base model (26M) into Mini's smaller unique-block list. Non-layer modules (feature embeddings, normalization, heads) are copied verbatim. This initializes Mini at ~v11.0-half quality, so the student starts with the teacher's transfer ability already in place.
Distill via teacher logits. Train Mini on synthetic SCM tasks using Base as a frozen teacher. Loss = 0.7 × KL(student || teacher, T=2) + 0.2 × hard-label CE + 0.1 × feature MSE. Online distillation with replay buffer.
30,000 training steps with AdamW, cosine lr 3e-5 → 3e-6, fp16 mixed precision.
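The weighted loss and the learning-rate schedule from the steps above can be written out in a few lines. This is a dependency-free sketch under stated assumptions: the function names are hypothetical, the KL term follows the argument order as written in the recipe (student || teacher) without any T² rescaling, the hard label is assumed to be a class or bin index, "feature MSE" is assumed to compare one student and one teacher feature vector, and the cosine schedule assumes no warmup.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_logits, teacher_logits, label,
                 student_feat, teacher_feat, T=2.0):
    # Soft-target term at temperature T, argument order as in the recipe.
    kl = kl_divergence(softmax(student_logits, T), softmax(teacher_logits, T))
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)
    # Mean squared error between student and teacher features.
    mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))
    mse /= len(student_feat)
    return 0.7 * kl + 0.2 * ce + 0.1 * mse


def cosine_lr(step, total_steps=30_000, lr_max=3e-5, lr_min=3e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

When student and teacher agree exactly, the KL and MSE terms vanish and only the 0.2-weighted cross-entropy remains, so the hard labels keep a small influence throughout training.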

The critical insight: distillation from scratch (Option A in our experiments) failed to transfer to real OpenML data — student matched teacher on synthetic but couldn't generalize. Warm-start sliced distillation (Option B, this release) succeeded because the student inherits the teacher's transfer ability as the starting point; distillation only needs to refine.
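The warm-start step can be sketched as a state-dict remapping. This is an illustration only: the `layers.<index>.` key naming, the function names, and the choice to start the stride at layer 0 are assumptions, since the card says only that every Nth layer is copied and that non-layer modules copy verbatim.

```python
def slice_indices(n_teacher_layers: int, n_student_layers: int) -> list:
    """Indices of the teacher layers kept when taking every Nth layer."""
    if n_teacher_layers % n_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    stride = n_teacher_layers // n_student_layers
    return list(range(0, n_teacher_layers, stride))


def warm_start(teacher_state: dict, n_teacher_layers: int,
               n_student_layers: int) -> dict:
    """Build a student state dict from a teacher state dict by slicing."""
    keep = slice_indices(n_teacher_layers, n_student_layers)
    student_state = {}
    for name, tensor in teacher_state.items():
        if name.startswith("layers."):
            # Hypothetical key layout: "layers.<index>.<parameter name>"
            _, idx, rest = name.split(".", 2)
            if int(idx) in keep:
                new_idx = keep.index(int(idx))
                student_state[f"layers.{new_idx}.{rest}"] = tensor
        else:
            # Embeddings, normalization, and heads are copied verbatim.
            student_state[name] = tensor
    return student_state


# Toy teacher with 8 layers and one non-layer module; values are placeholders.
toy_teacher = {f"layers.{i}.weight": f"w{i}" for i in range(8)}
toy_teacher["embed.weight"] = "emb"
toy_student = warm_start(toy_teacher, n_teacher_layers=8, n_student_layers=4)
```

With 8 teacher blocks and 4 student blocks the stride is 2, so the student's blocks 0 to 3 start from teacher blocks 0, 2, 4, and 6.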

Intended use, limitations, ethical considerations

Identical to predictlm-base-26m — see that model card for full details:

Intended: drop-in tabular predictor for ≤128 features, ≤1024 training rows, ≤10 classes
Not intended: high-stakes decisions without domain validation; wide tables (>128 features); many-class cls (>10); very large training sets (>10K rows); non-numeric features without encoding
No personal data in training: distilled from Base, which was trained on synthetic priors + cleared real-data copulas. No raw eval-set rows seen.
Bias inheritance: predictions reflect the labeled context the user supplies at inference time
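The size limits listed above can be checked before calling the model. The helper below is a hypothetical pre-flight check, not a predictlm API; it uses the ≤128 features, ≤1024 training rows, and ≤10 classes limits from this card and only reports problems, leaving the decision to subsample or reduce features to the caller.

```python
from typing import List, Optional

MAX_FEATURES = 128   # max_features
MAX_CONTEXT = 1024   # max_context (training rows supplied in context)
MAX_CLASSES = 10     # max_classes


def check_limits(n_rows: int, n_features: int,
                 n_classes: Optional[int] = None) -> List[str]:
    """Return human-readable problems; an empty list means within limits."""
    problems = []
    if n_features > MAX_FEATURES:
        problems.append(f"{n_features} features exceeds {MAX_FEATURES}")
    if n_rows > MAX_CONTEXT:
        problems.append(f"{n_rows} training rows exceeds {MAX_CONTEXT}")
    # n_classes is None for regression tasks.
    if n_classes is not None and n_classes > MAX_CLASSES:
        problems.append(f"{n_classes} classes exceeds {MAX_CLASSES}")
    return problems
```

For instance, `check_limits(n_rows=5000, n_features=40, n_classes=3)` reports only the row count, which suggests subsampling the training set before fitting.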

The known weaknesses (cls below XGBoost; below TabPFN-2.5 on both axes and below TabICLv2 on classification) are inherited from Base; Mini does not amplify them but cannot fix them either.

Reproducibility

Weights file: v11_06_tiny_final.pt (inference-only, EMA-preferred state)
SHA-256: e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17
Size: 54.4 MB (inference-only, optimizer + curriculum + buffer + L2-SP state stripped from 217 MB raw)
Training step: 30,000 (final)
Training seed: 42
Teacher: predictlm-base-26m (v11.0)
Distillation recipe: warm-start slice + online KL distillation
Eval-lock manifest SHA-256: fe4da8cccfc78fc3c7746579f604154af7d37e525c4fd575965ba77ce4fe0841 (identical to Base)
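A downloaded weights file can be checked against the SHA-256 listed above with the standard library. This is a minimal sketch: the function names are hypothetical, and the local path to v11_06_tiny_final.pt depends on where the file was saved.

```python
import hashlib

# Checksum of the weights file as published in this card.
EXPECTED_SHA256 = "e27c8af6cda7a3426ffed33cb98eb8338966a8190712b5d37ff9e5f442b75a17"


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected=EXPECTED_SHA256) -> bool:
    """True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Calling `verify_weights("v11_06_tiny_final.pt")` on a local copy returns False if the file is truncated or differs from the published artifact.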

Licensing

Apache 2.0 — see LICENSE. Permissive, commercial use allowed.

The distillation recipe uses our own predictlm-base-26m (Apache 2.0) as the teacher — no third-party license obligations propagate to this model. Mini is fully commercially usable.

Version

v11.0.6-tiny (current) — first public release of the compact distilled variant.
Sibling: predictlm-base-26m (full-size, 26M)
Future releases under the same predictlm Python package.

Citation

BibTeX

@misc{predictlm_mini_2026,
  author       = {ZeroOne Research},
  title        = {predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/zerooneresearch/predictlm-mini-13m}}
}

APA

ZeroOne Research. (2026). predictlm-mini-13m: a compact distilled tabular foundation model for commodity hardware. Hugging Face. https://huggingface.co/zerooneresearch/predictlm-mini-13m

Paper for zerooneresearch/predictlm-mini-13m

Test-Time Training Provably Improves Transformers as In-context Learners (paper 2503.11842, published Feb 21)

Evaluation results

metric	recipe	eval set (fair-set n_features ≤ 128)	value	source
mean accuracy (n=12, seed=42)	this model	Locked OpenML eval (CC-18 + AMLB + TabPFN-extras)	0.684	self-reported
mean R² (n=13, seed=42)	this model	Locked OpenML eval (CTR-23 + AMLB)	0.551	self-reported
mean accuracy	Duo + TTT (Mini + Base + test-time training)	Locked OpenML eval (CC-18 + AMLB + TabPFN-extras)	0.751	self-reported
mean R²	Duo + TTT (Mini + Base + test-time training)	Locked OpenML eval (CTR-23 + AMLB)	0.609	self-reported