---
language:
- en
license: mit
library_name: transformers
tags:
- causal-lm
- quartet-ii
- nvfp4
- low-precision-training
- pretrained
datasets:
- nvidia/ClimbMix
pipeline_tag: text-generation
---

# CloverLM

CloverLM is a **4-billion-parameter** dense decoder-only language model pretrained entirely in **native NVFP4** precision using the [Quartet II](https://github.com/IST-DASLab/Quartet-II) algorithm. Trained on the [ClimbMix](https://arxiv.org/abs/2504.13161) data mixture for approximately **310 billion tokens** on 8 NVIDIA B300 GPUs in roughly 8 days, CloverLM reaches zero-shot accuracy competitive with OPT-175B on a standard evaluation suite, at a fraction of the cost.

## Model Details

| Property | Value |
|---|---|
| **Parameters** | ~4.06 B (29 blocks, 28 attention heads, d_head=128) |
| **Hidden dimension** | 3,584 |
| **GQA ratio** | 4 (7 KV heads) |
| **Context length** | 1,024 tokens |
| **Vocabulary** | 32,000 ([TokenMonster](https://github.com/alasdairforsythe/tokenmonster), `englishcode-32000-strict-nocapcode-v1`) |
| **Normalization** | RMSNorm (post-attention, post-MLP) |
| **Activation** | Squared ReLU |
| **Position encoding** | Rotary (RoPE) |
| **Weight tying** | Yes (embedding = output projection) |
| **Precision** | Quartet II NVFP4 linear layers; embeddings, norms in BF16 |
| **Attention** | Configurable: PyTorch SDPA, Flash Attention 2/3/4 |

## Training

| Property | Value |
|---|---|
| **Data** | [ClimbMix](https://arxiv.org/abs/2504.13161) (from Nemotron-CC + SmolLM-Corpus), ~305 B tokens |
| **Tokenizer** | [TokenMonster](https://huggingface.co/gvlassis/tokenmonster/resolve/main/englishcode-32000-strict-nocapcode-v1-eot%3D14199.vocab) (ungreedy subword, not BPE) |
| **Sampled tokens** | ~309.3 B (590k steps) |
| **Optimizer** | Adam, peak LR 3×10⁻³ |
| **Hardware** | 1 × 8-GPU NVIDIA B300 SXM6 node |
| **Wall-clock time** | ~8 days |
| **Throughput** | ~50–54k tokens/s/GPU |
| **Quantization** | Quartet II native NVFP4 training ([Panferov et al., 2026](https://arxiv.org/abs/2601.22813)) |
| **Estimated cost** | $4,600–$10,700 depending on spot vs. on-demand pricing ([Verda](https://verda.com/b300)) |

## Evaluation Results

Evaluations use the [EleutherAI lm-eval harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11 and are zero-shot unless noted otherwise. The model is loaded via a custom `CloverLMHFLM` wrapper in BF16 with Quartet II kernels.

### Compact Zero-Shot Suite

| Task | Metric | CloverLM (590k) | OPT-175B | GPT-3 175B |
|---|---|---:|---:|---:|
| ARC-Challenge | acc | **46.3** | 41.2 | — |
| ARC-Challenge | acc_mutual_info | 50.9 | — | **51.4** |
| ARC-Easy | acc | **80.0** | 75.1 | — |
| ARC-Easy | acc_mutual_info | **72.4** | — | 68.8 |
| HellaSwag | acc_norm | 71.7 | 78.3 | **78.9** |
| PIQA | acc_norm | 80.6 | **81.2** | 81.0 |
| **Avg (OPT-style)** | | **69.6** | 69.0 | — |
| **Avg (GPT-3-style)** | | 68.9 | — | **70.0** |

**OPT-style average** = mean(ARC-C `acc`, ARC-E `acc`, HellaSwag `acc_norm`, PIQA `acc_norm`). **GPT-3-style average** = mean(ARC-C `acc_mutual_info`, ARC-E `acc_mutual_info`, HellaSwag `acc_norm`, PIQA `acc_norm`). OPT-175B baselines from the [BigScience evaluation repository](https://github.com/bigscience-workshop/bigscience/blob/master/evaluation/results/tr11/opt/bslmeval.json).
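As a sanity check, both composite scores follow directly from the per-task numbers. A minimal sketch in plain Python, using the full-precision values from the "Full lm-eval Output" section below:

```python
# Reproduce the two composite averages from the per-task scores
# reported in the "Full lm-eval Output" section (values in percent).
scores = {
    "arc_challenge/acc": 46.25,
    "arc_challenge/acc_mutual_info": 50.94,
    "arc_easy/acc": 79.97,
    "arc_easy/acc_mutual_info": 72.39,
    "hellaswag/acc_norm": 71.67,
    "piqa/acc_norm": 80.58,
}

def mean(keys):
    return sum(scores[k] for k in keys) / len(keys)

opt_style = mean(["arc_challenge/acc", "arc_easy/acc",
                  "hellaswag/acc_norm", "piqa/acc_norm"])
gpt3_style = mean(["arc_challenge/acc_mutual_info", "arc_easy/acc_mutual_info",
                   "hellaswag/acc_norm", "piqa/acc_norm"])

print(f"OPT-style average:   {opt_style:.1f}")   # 69.6
print(f"GPT-3-style average: {gpt3_style:.1f}")  # 68.9
```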
### Extended Benchmarks (590k checkpoint)

| Task | Metric | CloverLM | GPT-3 175B |
|---|---|---:|---:|
| Wikitext | bits per byte ↓ | 0.723 | — |
| LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
| NQ | exact match ↑ | 7.8 | **14.6** |

### MMLU (590k checkpoint)

| Category | 0-shot | Few-shot |
|---|---:|---:|
| Humanities | 35.4 | 35.7 |
| Social Sciences | 42.1 | 47.1 |
| STEM | 37.2 | 39.0 |
| Other | 45.2 | 49.1 |
| **Overall** | 39.4 | **41.9** |
| *OPT-175B* | — | *31.8* |
| *GPT-3 175B* | — | *43.9* |

Few-shot MMLU accuracy (41.9%) substantially exceeds OPT-175B (31.8%) and approaches GPT-3 175B (43.9%).

### Full lm-eval Output (Quartet II kernels)

```
|     Tasks      |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------------|---|-----:|---|-----:|
|arc_challenge_mi|      1|none  |     0|acc            |↑  |0.4625|±  |0.0146|
|                |       |none  |     0|acc_mutual_info|↑  |0.5094|±  |0.0146|
|                |       |none  |     0|acc_norm       |↑  |0.4923|±  |0.0146|
|arc_easy_mi     |      1|none  |     0|acc            |↑  |0.7997|±  |0.0082|
|                |       |none  |     0|acc_mutual_info|↑  |0.7239|±  |0.0092|
|                |       |none  |     0|acc_norm       |↑  |0.7731|±  |0.0086|
|hellaswag       |      1|none  |     0|acc            |↑  |0.5392|±  |0.0050|
|                |       |none  |     0|acc_norm       |↑  |0.7167|±  |0.0045|
|piqa            |      1|none  |     0|acc            |↑  |0.7922|±  |0.0095|
|                |       |none  |     0|acc_norm       |↑  |0.8058|±  |0.0092|
```

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids.to(model.device), max_new_tokens=32)
print(tokenizer.decode(output[0]))
```

### Running Evaluations

See the [`lm_eval/`](lm_eval/) directory for the full evaluation setup.

```bash
cd lm_eval
uv sync
source .venv/bin/activate
accelerate launch eval.py \
    --model cloverlm \
    --model_args "pretrained=daslab-testing/CloverLM,dtype=bfloat16,quartet_2_impl=quartet2,attn_backend=pytorch" \
    --tasks "arc_easy_mi,arc_challenge_mi,hellaswag,piqa" \
    --num_fewshot 0 \
    --include_path ./ \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto
```

Use `quartet_2_impl=pseudoquant` on non-Blackwell GPUs (uses Triton-based FP4 emulation). Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.

### Dependencies

- Python ≥ 3.11
- PyTorch 2.10+ with CUDA 13.0
- `transformers ≥ 5.3.0`
- `tokenmonster ≥ 1.1.12`
- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II) (for native FP4; `pseudoquant` mode works without them)

## Architecture Details

CloverLM is a decoder-only Transformer loosely following the OLMo2 design. Each block applies multi-head self-attention (with grouped-query attention at ratio 4) followed by a squared-ReLU MLP, both with post-sublayer RMSNorm and residual connections. Query and key projections use RoPE and are sphere-normalized before scaling. All dense linear layers (Q, K, V, O projections and MLP layers) use Quartet II NVFP4 quantization during both training and inference. Embeddings, layer norms, and the output head remain in BF16. The model uses 264 weight tensors totaling ~4.14 B parameters.
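For intuition, here is a minimal, self-contained sketch of one such block in plain PyTorch. It is illustrative only, not the repository's implementation: the MLP width and the exact placement of the Q/K sphere-normalization are assumptions, and `nn.Linear` stands in for the Quartet II NVFP4 linear layers.

```python
# Illustrative sketch of one CloverLM-style block (BF16 compute path only).
# Dimensions follow the Model Details table; MLP width and Q/K-norm placement
# are assumptions, and the real model quantizes every nn.Linear to NVFP4.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_HEADS, N_KV_HEADS, D_HEAD = 3584, 28, 7, 128

def rope(x, base=10000.0):
    """Apply rotary position embeddings to (batch, heads, seq, d_head)."""
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, device=x.device) / dim)
    angles = torch.arange(seq, device=x.device)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(D_MODEL, N_HEADS * D_HEAD, bias=False)
        self.wk = nn.Linear(D_MODEL, N_KV_HEADS * D_HEAD, bias=False)
        self.wv = nn.Linear(D_MODEL, N_KV_HEADS * D_HEAD, bias=False)
        self.wo = nn.Linear(N_HEADS * D_HEAD, D_MODEL, bias=False)
        self.attn_norm = nn.RMSNorm(D_MODEL)  # post-attention norm
        self.mlp_norm = nn.RMSNorm(D_MODEL)   # post-MLP norm
        d_ff = 4 * D_MODEL                    # assumed MLP width
        self.up = nn.Linear(D_MODEL, d_ff, bias=False)
        self.down = nn.Linear(d_ff, D_MODEL, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, N_HEADS, D_HEAD).transpose(1, 2)
        k = self.wk(x).view(b, s, N_KV_HEADS, D_HEAD).transpose(1, 2)
        v = self.wv(x).view(b, s, N_KV_HEADS, D_HEAD).transpose(1, 2)
        # RoPE, then project q/k onto the unit sphere (ordering is an assumption).
        q = F.normalize(rope(q), dim=-1)
        k = F.normalize(rope(k), dim=-1)
        # GQA: 28 query heads share 7 KV heads (PyTorch SDPA backend).
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
        a = self.wo(a.transpose(1, 2).reshape(b, s, -1))
        x = x + self.attn_norm(a)               # post-sublayer RMSNorm
        m = self.down(F.relu(self.up(x)) ** 2)  # squared-ReLU MLP
        return x + self.mlp_norm(m)

blk = Block()
print(blk(torch.randn(1, 16, D_MODEL)).shape)  # torch.Size([1, 16, 3584])
```

In the released checkpoint, every linear projection above is a Quartet II NVFP4 layer; only the embeddings, norms, and tied output head stay in BF16, as described in the tables above.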
## Limitations

- **Short context**: Trained with a 1,024-token context window. Performance on long-context or open-ended generation tasks may be limited.
- **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
- **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
- **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.

## Citation

```bibtex
@article{cloverlm2026,
  title  = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply by Leveraging Native NVFP4},
  author = {Erik Schultheis and Matin Ansaripour and Andrei Panferov and Georgios Vlassis and Dan Alistarh},
  year   = {2026},
}
```