---
language:
- en
license: mit
library_name: transformers
tags:
- causal-lm
- quartet-ii
- nvfp4
- low-precision-training
- pretrained
datasets:
- nvidia/ClimbMix
pipeline_tag: text-generation
---

# CloverLM

CloverLM is a **4-billion-parameter** dense decoder-only language model pretrained entirely in **native NVFP4** precision using the [Quartet II](https://github.com/IST-DASLab/Quartet-II) algorithm.
Trained on the [ClimbMix](https://arxiv.org/abs/2504.13161) data mixture for approximately **310 billion tokens** on 8 NVIDIA B300 GPUs in roughly 8 days, CloverLM reaches zero-shot accuracy competitive with OPT-175B on a standard evaluation suite, at a fraction of the cost.

## Model Details

| Property | Value |
|---|---|
| **Parameters** | ~4.06 B (29 blocks, 28 attention heads, d_head=128) |
| **Hidden dimension** | 3,584 |
| **GQA ratio** | 4 (7 KV heads) |
| **Context length** | 1,024 tokens |
| **Vocabulary** | 32,000 ([TokenMonster](https://github.com/alasdairforsythe/tokenmonster), `englishcode-32000-strict-nocapcode-v1`) |
| **Normalization** | RMSNorm (post-attention, post-MLP) |
| **Activation** | Squared ReLU |
| **Position encoding** | Rotary (RoPE) |
| **Weight tying** | Yes (embedding = output projection) |
| **Precision** | Quartet II NVFP4 linear layers; embeddings, norms in BF16 |
| **Attention** | Configurable: PyTorch SDPA, Flash Attention 2/3/4 |
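
As a rough sanity check, the shapes in the table approximately reproduce the quoted parameter count. The sketch below assumes a 4× MLP expansion (hidden size 14,336), which the card does not state, so the exact layer shapes are an assumption:

```python
# Rough parameter count from the Model Details table.
# ASSUMPTION: 4x MLP expansion (14336) -- not stated in the card.
d_model, n_layers, n_heads, d_head, n_kv, vocab = 3584, 29, 28, 128, 7, 32000
d_mlp = 4 * d_model  # assumed expansion factor

attn = d_model * n_heads * d_head       # Q projection
attn += 2 * d_model * n_kv * d_head     # K and V (GQA: 7 KV heads)
attn += n_heads * d_head * d_model      # output projection
mlp = 2 * d_model * d_mlp               # up- and down-projection
norms = 2 * d_model                     # two RMSNorms per block

# Tied embedding/output head is counted once.
total = n_layers * (attn + mlp + norms) + vocab * d_model
print(f"{total / 1e9:.2f} B parameters")  # 4.03 B, close to the quoted ~4.06 B
```

The small residual versus ~4.06 B presumably comes from details not listed in the table (e.g. the final norm or a different MLP width).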

## Training

| Property | Value |
|---|---|
| **Data** | [ClimbMix](https://arxiv.org/abs/2504.13161) (from Nemotron-CC + SmolLM-Corpus), ~305 B tokens |
| **Tokenizer** | [TokenMonster](https://huggingface.co/gvlassis/tokenmonster/resolve/main/englishcode-32000-strict-nocapcode-v1-eot%3D14199.vocab) (ungreedy subword, not BPE) |
| **Sampled tokens** | ~309.3 B (590k steps) |
| **Optimizer** | Adam, peak LR 3×10⁻³ |
| **Hardware** | 1 × 8-GPU NVIDIA B300 SXM6 node |
| **Wall-clock time** | ~8 days |
| **Throughput** | ~50–54k tokens/s/GPU |
| **Quantization** | Quartet II native NVFP4 training ([Panferov et al., 2026](https://arxiv.org/abs/2601.22813)) |
| **Estimated cost** | $4,600–$10,700 depending on spot vs. on-demand pricing ([Verda](https://verda.com/b300)) |
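
The throughput and wall-clock figures are mutually consistent with the token count. A quick cross-check, taking 52k tokens/s/GPU as an illustrative midpoint of the reported range:

```python
# Cross-check: tokens / (throughput * GPUs) should land near the quoted wall-clock time.
tokens = 309.3e9           # sampled tokens
gpus = 8
tok_per_sec_gpu = 52_000   # midpoint of the reported 50-54k range (assumption)

days = tokens / (tok_per_sec_gpu * gpus * 86_400)
print(f"{days:.1f} days")  # 8.6 days, matching the quoted ~8 days
```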

## Evaluation Results

All evaluations are zero-shot using the [EleutherAI lm-eval harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11.
The model is loaded via a custom `CloverLMHFLM` wrapper in BF16 with Quartet II kernels.

### Compact Zero-Shot Suite

| Task | Metric | CloverLM (590k) | OPT-175B | GPT-3 175B |
|---|---|---:|---:|---:|
| ARC-Challenge | acc | **46.3** | 41.2 | – |
| ARC-Challenge | acc_mutual_info | 50.9 | – | **51.4** |
| ARC-Easy | acc | **80.0** | 75.1 | – |
| ARC-Easy | acc_mutual_info | **72.4** | – | 68.8 |
| HellaSwag | acc_norm | 71.7 | **78.3** | **78.9** |
| PIQA | acc_norm | 80.6 | **81.2** | 81.0 |
| **Avg (OPT-style)** | | **69.6** | 69.0 | – |
| **Avg (GPT-3-style)** | | 68.9 | – | **70.0** |

**OPT-style average** = mean(ARC-C `acc`, ARC-E `acc`, HellaSwag `acc_norm`, PIQA `acc_norm`).
**GPT-3-style average** = mean(ARC-C `acc_mutual_info`, ARC-E `acc_mutual_info`, HellaSwag `acc_norm`, PIQA `acc_norm`).

OPT-175B baselines from the [BigScience evaluation repository](https://github.com/bigscience-workshop/bigscience/blob/master/evaluation/results/tr11/opt/bslmeval.json).
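
Both averages can be recomputed directly from the table (a quick check, nothing model-specific):

```python
# Recompute the two headline averages from the per-task scores above.
opt_style = [46.3, 80.0, 71.7, 80.6]    # acc, acc, acc_norm, acc_norm
gpt3_style = [50.9, 72.4, 71.7, 80.6]   # mutual-info variants for ARC

avg_opt = sum(opt_style) / len(opt_style)
avg_gpt3 = sum(gpt3_style) / len(gpt3_style)
print(f"{avg_opt:.2f} {avg_gpt3:.2f}")  # 69.65 68.90; the card rounds to 69.6 / 68.9
```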

### Extended Benchmarks (590k checkpoint)

| Task | Metric | CloverLM | GPT-3 175B |
|---|---|---:|---:|
| Wikitext | bits per byte ↓ | 0.723 | – |
| LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
| NQ | exact match ↑ | 7.8 | **14.6** |

### MMLU (590k checkpoint)

| Category | 0-shot | Few-shot |
|---|---:|---:|
| Humanities | 35.4 | 35.7 |
| Social Sciences | 42.1 | 47.1 |
| STEM | 37.2 | 39.0 |
| Other | 45.2 | 49.1 |
| **Overall** | 39.4 | **41.9** |
| *OPT-175B* | – | *31.8* |
| *GPT-3 175B* | – | *43.9* |

Few-shot MMLU accuracy (41.9%) substantially exceeds OPT-175B (31.8%) and approaches GPT-3 175B (43.9%).

### Full lm-eval Output (Quartet II kernels)

```
|     Tasks      |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------------|---|-----:|---|-----:|
|arc_challenge_mi|      1|none  |     0|acc            |↑  |0.4625|±  |0.0146|
|                |       |none  |     0|acc_mutual_info|↑  |0.5094|±  |0.0146|
|                |       |none  |     0|acc_norm       |↑  |0.4923|±  |0.0146|
|arc_easy_mi     |      1|none  |     0|acc            |↑  |0.7997|±  |0.0082|
|                |       |none  |     0|acc_mutual_info|↑  |0.7239|±  |0.0092|
|                |       |none  |     0|acc_norm       |↑  |0.7731|±  |0.0086|
|hellaswag       |      1|none  |     0|acc            |↑  |0.5392|±  |0.0050|
|                |       |none  |     0|acc_norm       |↑  |0.7167|±  |0.0045|
|piqa            |      1|none  |     0|acc            |↑  |0.7922|±  |0.0095|
|                |       |none  |     0|acc_norm       |↑  |0.8058|±  |0.0092|
```

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
    dtype="bfloat16",
    quartet_2_impl="quartet2",  # native NVFP4 kernels; use "pseudoquant" on non-Blackwell GPUs
).to("cuda")  # or .to("cpu") for CPU usage

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids.to(model.device), max_new_tokens=32)
print(tokenizer.decode(output[0]))
```

### Running Evaluations

See the [`lm_eval/`](lm_eval/) directory for the full evaluation setup.

```bash
cd lm_eval
uv sync
source .venv/bin/activate

accelerate launch eval.py \
    --model cloverlm \
    --model_args "pretrained=daslab-testing/CloverLM,dtype=bfloat16,quartet_2_impl=quartet2,attn_backend=pytorch" \
    --tasks "arc_easy_mi,arc_challenge_mi,hellaswag,piqa" \
    --num_fewshot 0 \
    --include_path ./ \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto
```

Use `quartet_2_impl=pseudoquant` on non-Blackwell GPUs (uses Triton-based FP4 emulation).
Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.
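
If you want to pick the implementation automatically, one option is to branch on the CUDA compute capability. The helper below is hypothetical (not part of this repo) and assumes Blackwell corresponds to compute capability 10.0 and newer:

```python
def pick_quartet_impl(capability: tuple[int, int]) -> str:
    """Return "quartet2" on Blackwell-class GPUs, else "pseudoquant".

    Hypothetical helper: `capability` is the pair returned by
    torch.cuda.get_device_capability(); Blackwell is assumed to be SM 10.0+.
    """
    return "quartet2" if capability >= (10, 0) else "pseudoquant"

# e.g. an H100 reports SM 9.0 -> Triton-based FP4 emulation
print(pick_quartet_impl((9, 0)))   # pseudoquant
print(pick_quartet_impl((10, 0)))  # quartet2
```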

### Dependencies

- Python ≥ 3.11
- PyTorch 2.10+ with CUDA 13.0
- `transformers ≥ 5.3.0`
- `tokenmonster ≥ 1.1.12`
- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II) (for native FP4; `pseudoquant` mode works without them)

## Architecture Details

CloverLM is a decoder-only Transformer loosely following the OLMo2 design.
Each block applies multi-head self-attention (with grouped-query attention at ratio 4) followed by a squared-ReLU MLP, both with post-sublayer RMSNorm and residual connections.
Query and key projections use RoPE and are sphere-normalized before scaling.
All dense linear layers (Q, K, V, O projections and MLP layers) use Quartet II NVFP4 quantization during both training and inference.
Embeddings, layer norms, and the output head remain in BF16.

The model uses 264 weight tensors totaling ~4.14 B parameters.
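
The two less common choices above, squared ReLU and post-sublayer RMSNorm, are simple to state. A framework-free sketch of both (illustrative only; the real implementation operates on batched tensors and includes learned gain parameters):

```python
import math

def squared_relu(xs):
    # max(0, x)^2, applied elementwise
    return [max(0.0, x) ** 2 for x in xs]

def rms_norm(xs, eps=1e-6):
    # x / sqrt(mean(x^2) + eps); learned scale omitted for brevity
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms for x in xs]

# Post-norm residual wiring: normalize the sublayer output, then add the residual
# (OLMo2-style; here squared_relu stands in for the whole MLP sublayer).
hidden = [1.0, -2.0, 3.0]
out = [h + y for h, y in zip(hidden, rms_norm(squared_relu(hidden)))]
```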

## Limitations

- **Short context**: Trained with a 1,024-token context window. Performance on long-context or open-ended generation tasks may be limited.
- **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
- **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
- **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.

## Citation

```bibtex
@article{cloverlm2026,
  title  = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply
            by Leveraging Native NVFP4},
  author = {Erik Schultheis and Matin Ansaripour and Andrei Panferov and
            Georgios Vlassis and Dan Alistarh},
  year   = {2026},
}
```