# SmolLM2-135M-FlashNorm
A FlashNorm-prepared compatibility checkpoint of HuggingFaceTB/SmolLM2-135M. The FlashNorm transformation is mathematically exact (see the paper, Propositions 1 & 2), and this checkpoint loads in stock transformers and vLLM with no code changes.
## What is FlashNorm?
An exact reformulation of RMSNorm → Linear that (i) folds the per-channel normalization weights into the following linear layer (W* = W · diag(g)) and (ii) defers the scalar 1/RMS(x) normalization to after the matmul. On hardware with distinct vector and matrix units, the matrix multiplication and the RMS reduction can execute in parallel.
See the paper and the transformer-tricks repo for details.
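The exactness of the reformulation can be checked numerically. A minimal NumPy sketch (epsilon omitted for clarity; the shapes are toy stand-ins, not the model's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)           # input activation
g = rng.standard_normal(d)           # learned per-channel RMSNorm weights
W = rng.standard_normal((16, d))     # following linear projection

rms = np.sqrt(np.mean(x * x))

# Standard order: normalize, scale by g, then project
y_ref = W @ ((x / rms) * g)

# FlashNorm: fold g into W offline, defer the scalar 1/RMS to after the matmul
W_star = W * g                       # elementwise broadcast == W @ np.diag(g)
y_flash = (W_star @ x) / rms

assert np.allclose(y_ref, y_flash)
```

Because 1/RMS(x) is a scalar, it commutes with the matrix product, which is what lets the RMS reduction run on the vector unit while the matmul runs on the matrix unit.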
## What's different from the source checkpoint?
| Tensor | Source | This checkpoint |
|---|---|---|
| `model.layers.*.input_layernorm.weight` | learned per-channel g | all ones |
| `model.layers.*.self_attn.{q,k,v}_proj.weight` | W (bf16) | W · diag(g_input_layernorm) (fp32) |
| `model.layers.*.post_attention_layernorm.weight` | learned per-channel g | all ones |
| `model.layers.*.mlp.{gate,up}_proj.weight` | W (bf16) | W · diag(g_post_attention_layernorm) (fp32) |
| `config.flashnorm` | — | `true` |
| `config.flashnorm_mode` | — | `"compat"` |
| `config.flashnorm_version` | — | `1` |
`model.norm.weight` is unchanged (tied embeddings). Merged projection weights are stored in fp32 to preserve precision for downstream lower-precision inference.
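The per-layer rewrite in the table above can be sketched as follows. Note that `fold_norm_into_linear` is an illustrative helper written for this card, not the actual `flashify_repo` implementation:

```python
import numpy as np

def fold_norm_into_linear(g, W):
    """Fold per-channel RMSNorm weights g into the following projection W.

    Returns the merged weight upcast to fp32 (to preserve precision, as the
    checkpoint does) and the replacement norm weights, which are all ones.
    """
    W_merged = W.astype(np.float32) * g.astype(np.float32)  # == W @ diag(g)
    return W_merged, np.ones_like(g)

# Toy tensors standing in for one layer's input_layernorm + q_proj pair
g = np.linspace(0.5, 1.5, 4, dtype=np.float16)   # learned norm weights (bf16/fp16 in practice)
W = np.ones((6, 4), dtype=np.float16)            # projection weight

W_merged, g_new = fold_norm_into_linear(g, W)
assert W_merged.dtype == np.float32
assert np.all(g_new == 1)
```

Applied to every `input_layernorm`/`{q,k,v}_proj` and `post_attention_layernorm`/`{gate,up}_proj` pair, this yields exactly the tensor diff listed above.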
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained('open-machine/SmolLM2-135M-FlashNorm')
model = AutoModelForCausalLM.from_pretrained('open-machine/SmolLM2-135M-FlashNorm')

ids = tok('Once upon a time there was', return_tensors='pt').input_ids
out = model.generate(ids, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
With vLLM:

```shell
vllm serve open-machine/SmolLM2-135M-FlashNorm
```
## Verification

Generated via `flashify_repo('HuggingFaceTB/SmolLM2-135M')` from transformer-tricks (branch `flashnorm-neurips-plan`).
| Metric | Value |
|---|---|
| HuggingFace Transformers greedy generation, 50 tokens | character-identical to source (fp16 and fp32) |
| Cosine similarity of logits (fp32) | 1.0 |
| Cosine similarity of logits (fp16) | 0.99998 |
## Framework behavior
The FlashNorm transformation is mathematically exact. Concrete runtime behavior:
- HuggingFace Transformers (any precision): greedy generation matches the source model byte-for-byte.
- vLLM (any precision): a one-token argmax flip is possible at tight decision points. vLLM's PagedAttention kernel has different reduction semantics than HF's SDPA and magnifies the tiny numerical differences introduced by precomputing W · diag(g); downstream greedy decoding then propagates a single flipped token into full text divergence.
This is a general property of precomputing weight-folded tensors for lossy-inference kernels — not specific to FlashNorm. A native fused RMSNorm + QKV kernel (deferring g to runtime rather than precomputing it into W) eliminates the framework dependency and is in progress for vLLM / FlashInfer; once landed, inference on this checkpoint under those kernels will also be bit-identical to the source.
## License
Apache-2.0, inherited from the source model.