# SmolLM2-135M-FlashNorm

FlashNorm-prepared compatibility checkpoint of HuggingFaceTB/SmolLM2-135M.

The FlashNorm transformation is mathematically exact (see paper, Propositions 1 & 2). This checkpoint loads in stock transformers and vLLM with no code changes.

## What is FlashNorm?

An exact reformulation of the RMSNorm → Linear pattern that (i) folds the per-channel normalization weights into the following linear layer (W* = W · diag(g)) and (ii) defers the scalar 1/RMS(x) factor until after the matmul. On hardware with distinct vector and matrix units, the matrix multiplication and the RMS reduction can then execute in parallel.
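The equivalence can be checked numerically with a minimal NumPy sketch (toy shapes; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)          # input activation
g = rng.standard_normal(d)          # learned per-channel RMSNorm weights
W = rng.standard_normal((16, d))    # following linear layer

rms = np.sqrt(np.mean(x ** 2))

# Standard order: normalize, scale by g, then matmul.
y_ref = W @ ((x / rms) * g)

# FlashNorm: fold g into W offline, defer the scalar 1/rms past the matmul.
W_star = W * g                      # == W @ np.diag(g) (column-wise scaling)
y_flash = (W_star @ x) / rms

assert np.allclose(y_ref, y_flash)
```

In fp64 the two orderings agree to rounding error; the checkpoint precomputes `W_star` offline so inference never touches `g`.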

See the paper and the transformer-tricks repo for details.

## What's different from the source checkpoint?

| Tensor | Source | This checkpoint |
|---|---|---|
| `model.layers.*.input_layernorm.weight` | learned per-channel g | all ones |
| `model.layers.*.self_attn.{q,k,v}_proj.weight` | W (bf16) | W · diag(g_input_layernorm) (fp32) |
| `model.layers.*.post_attention_layernorm.weight` | learned per-channel g | all ones |
| `model.layers.*.mlp.{gate,up}_proj.weight` | W (bf16) | W · diag(g_post_attention_layernorm) (fp32) |
| `config.flashnorm` | — | `true` |
| `config.flashnorm_mode` | — | `"compat"` |
| `config.flashnorm_version` | — | `1` |

model.norm.weight is unchanged: with tied embeddings, the final norm's weights cannot be folded into the shared embedding/output matrix. Merged projection weights are stored in fp32 to preserve precision for downstream lower-precision inference.
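The per-layer rewrite in the table above amounts to replacing two tensors at a time. A sketch in NumPy (fp32 toy shapes; `fold` is a hypothetical helper, not the actual flashify_repo implementation):

```python
import numpy as np

def fold(norm_g, proj_w):
    """Fold RMSNorm weights g into the following projection.

    Returns (all-ones replacement for the norm weight,
             merged projection W · diag(g), kept in fp32).
    Columns of proj_w index input channels, so scaling column j by g[j]
    is a broadcast multiply over the last axis.
    """
    merged = proj_w.astype(np.float32) * norm_g.astype(np.float32)
    return np.ones_like(norm_g), merged

rng = np.random.default_rng(1)
g = rng.standard_normal(4).astype(np.float32)       # e.g. input_layernorm.weight
q = rng.standard_normal((6, 4)).astype(np.float32)  # e.g. q_proj.weight

ones, q_merged = fold(g, q)
assert np.all(ones == 1.0)

# The merged layer, fed the un-scaled normalized input, matches the original.
x = rng.standard_normal(4).astype(np.float32)
rms = np.sqrt(np.mean(x ** 2))
assert np.allclose(q @ (x / rms * g), (q_merged @ x) / rms, atol=1e-5)
```

Because the norm weights become all ones, a stock RMSNorm module applied to the transformed checkpoint reduces to the pure 1/RMS(x) scaling, which is why no code changes are needed.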

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained('open-machine/SmolLM2-135M-FlashNorm')
model = AutoModelForCausalLM.from_pretrained('open-machine/SmolLM2-135M-FlashNorm')

ids = tok('Once upon a time there was', return_tensors='pt').input_ids
out = model.generate(ids, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```

With vLLM:

```shell
vllm serve open-machine/SmolLM2-135M-FlashNorm
```

## Verification

Generated via flashify_repo('HuggingFaceTB/SmolLM2-135M') from transformer-tricks (branch flashnorm-neurips-plan).

| Metric | Value |
|---|---|
| HuggingFace Transformers greedy generation, 50 tokens | character-identical to source (fp16 and fp32) |
| Cosine similarity of logits (fp32) | 1.0 |
| Cosine similarity of logits (fp16) | 0.99998 |

## Framework behavior

The FlashNorm transformation is mathematically exact. Concrete runtime behavior:

- HuggingFace Transformers (any precision): greedy generation matches the source model byte-for-byte.
- vLLM (any precision): a one-token argmax flip is possible at tight decision points. vLLM's PagedAttention kernel uses different reduction semantics than HF's SDPA and amplifies the tiny numerical differences introduced by precomputing W · diag(g); greedy decoding then propagates a single flipped token into full text divergence.

This is a general property of precomputing weight-folded tensors for lossy-inference kernels — not specific to FlashNorm. A native fused RMSNorm + QKV kernel (deferring g to runtime rather than precomputing it into W) eliminates the framework dependency and is in progress for vLLM / FlashInfer; once landed, inference on this checkpoint under those kernels will also be bit-identical to the source.
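The "tight decision point" effect can be illustrated with a toy example: when the top two logits are nearly tied, a perturbation on the order of reduction-order rounding error can flip the greedy choice (numbers are purely illustrative):

```python
import numpy as np

logits = np.array([4.1230, 4.1232, 1.0])  # top two logits nearly tied
assert int(np.argmax(logits)) == 1

# A kernel with a different reduction order perturbs each logit slightly.
perturbed = logits + np.array([3e-4, -3e-4, 0.0])
assert int(np.argmax(perturbed)) == 0     # greedy choice flips

# Away from a tie, the same perturbation changes nothing.
safe = np.array([5.0, 2.0, 1.0]) + np.array([3e-4, -3e-4, 0.0])
assert int(np.argmax(safe)) == 0
```

Once a single token flips, every subsequent greedy step conditions on the divergent prefix, so the texts separate entirely even though the per-step logits remain nearly identical.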

## License

Apache-2.0, inherited from the source model.
