
metadata
license: mit
library_name: skops
tags:
  - tabular-regression
  - xgboost
  - llm-inference
  - performance-prediction

FitCheck speed predictor

Predicts decode speed (tokens/sec) for locally run LLMs from hardware and model features. Part of FitCheck, the honest "what AI can your computer run" advisor.

Method

Gradient-boosted regression (XGBoost), following the methodology of LLM-Pilot (IBM, SC'24; arXiv:2410.02425), which predicts LLM inference performance on unseen hardware. Validation is leave-one-accelerator-out, so the error reported below is measured on hardware the model never saw in training.
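To make the validation scheme concrete, here is a minimal sketch of a leave-one-accelerator-out split in plain Python. It is not FitCheck's training code: the function name and the accelerator labels are hypothetical, and the only property it illustrates is that every row for the held-out accelerator lands in the test fold and none in the training fold.

```python
from collections import defaultdict


def leave_one_accelerator_out(accelerators):
    """Yield (held_out, train_idx, test_idx), holding out each accelerator once.

    `accelerators` holds one label per benchmark row (hypothetical names below).
    """
    by_accel = defaultdict(list)
    for i, name in enumerate(accelerators):
        by_accel[name].append(i)
    for held_out, test_idx in by_accel.items():
        # Training rows are all rows that do NOT belong to the held-out device.
        train_idx = [i for i, name in enumerate(accelerators) if name != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical accelerator labels, one per measurement row.
rows = ["gpu_a", "gpu_a", "gpu_b", "cpu_c", "gpu_b"]
splits = list(leave_one_accelerator_out(rows))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

A regressor would be fitted on `train_idx` and scored on `test_idx` for each fold, with the per-row errors pooled before taking the median.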

Features: effective memory bandwidth, bytes read per token (weights + KV), weights size, KV size, MoE active fraction, offload fraction, and the analytical roofline prior (effective bandwidth divided by bytes read per token). Decode is memory-bandwidth-bound, so the roofline gives the ideal speed; the model learns the residual between that ideal and the measured speed.
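The roofline prior can be sketched in a few lines. The function and parameter names are hypothetical and the numbers are illustrative, not taken from the training data. The source does not say how the residual is encoded (difference, ratio, or log-ratio), so the efficiency ratio below is one possible choice, shown as an assumption.

```python
def roofline_tokens_per_sec(bandwidth_gb_s, weights_gb, kv_gb):
    """Ideal decode speed: every token reads the weights and the KV cache once."""
    bytes_per_token_gb = weights_gb + kv_gb
    return bandwidth_gb_s / bytes_per_token_gb


# Illustrative values only: 400 GB/s effective bandwidth, 4 GB weights, 1 GB KV.
prior = roofline_tokens_per_sec(bandwidth_gb_s=400.0, weights_gb=4.0, kv_gb=1.0)

# Hypothetical measured speed; the ratio is an assumed form of the "residual".
measured = 60.0
efficiency = measured / prior

print(f"roofline prior: {prior:.1f} tok/s, efficiency: {efficiency:.2f}")
```

Under this assumption, a learned model would predict something like `efficiency` from the features and multiply it back onto the prior.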

Training data

6,633 real measurements across 595 distinct accelerators (consumer CPUs, Apple Silicon, NVIDIA/AMD GPUs), from the LocalScore community benchmark (Mozilla Builders / cjpais). Thank you to its maintainers and contributors; the data is attributed, not owned, and takedown requests are honoured. Trained 2026-06-10.

Honest holdout results (leave-one-accelerator-out)

| metric | roofline baseline | this model |
|---|---|---|
| median APE (bandwidth-known hardware) | 28.1% | 17.5% |
| median abs error (tok/s) | 11.63 | 9.55 |
| median APE (all hardware incl. CPUs) | no baseline possible | 23.6% |
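For reference, the two metrics in the table can be computed as follows. This is a minimal sketch using only the standard library; the function names and the sample values are hypothetical and unrelated to the reported results.

```python
from statistics import median


def median_ape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return 100.0 * median(abs(p - a) / a for a, p in zip(actual, predicted))


def median_abs_error(actual, predicted):
    """Median absolute error, in the same unit as the inputs (tok/s here)."""
    return median(abs(p - a) for a, p in zip(actual, predicted))


# Hypothetical held-out measurements and predictions, in tokens/sec.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 50.0]

ape = median_ape(actual, predicted)
mae = median_abs_error(actual, predicted)
print(f"median APE: {ape:.1f}%, median abs error: {mae:.2f} tok/s")
```

The median is less sensitive than the mean to a few badly mispredicted devices, which matters when the held-out set spans very different hardware.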

Shipping rule: this model is only deployed because it beat the analytical baseline on held-out hardware. If a retrain ever fails that gate, FitCheck falls back to the labelled roofline estimate.
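The shipping rule amounts to a simple gate. The sketch below assumes the comparison is made on held-out median APE, which the source does not state explicitly; the function name and the return labels are hypothetical.

```python
def choose_estimator(model_median_ape, baseline_median_ape):
    """Ship the learned model only if it strictly beats the roofline baseline.

    Both arguments are held-out median APE values in percent (assumed metric).
    """
    if model_median_ape < baseline_median_ape:
        return "learned_model"
    # A tie or a regression falls back to the labelled roofline estimate.
    return "roofline_fallback"


# Holdout figures reported in this README for bandwidth-known hardware.
choice = choose_estimator(model_median_ape=17.5, baseline_median_ape=28.1)
print(choice)
```

Treating a tie as a failure keeps the simpler, explainable estimate in place unless the learned model shows a real improvement.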

Limits (read this)

  • Trained on dense LLMs running fully on-device (LocalScore's fixed grid: 1B / 8B / 14B at Q4_K_M, varied context). Generalisation to other model sizes and quantisations relies on the bytes-per-token feature, not on diversity in the training data.
  • MoE and GPU->RAM offload are corrected analytically upstream, and the corrected features are then fed to the model. Those corrections are engineering estimates and are labelled as such.
  • Does NOT cover vision/diffusion models (compute-bound, different physics).