---
license: apache-2.0
tags:
- diffusion
- llada
- gguf
- cpu-inference
- diffuse-cpp
language:
- en
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct-GGUF

GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), the first C++ inference engine for Diffusion Language Models.

LLaDA is a masked diffusion language model built on a Llama backbone. Unlike autoregressive models, which generate one token at a time, LLaDA generates all tokens in parallel through iterative refinement: generation starts from a fully masked sequence and progressively commits tokens over a small number of denoising steps. This makes inference compute-bound rather than memory-bound on CPU.
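
The refinement loop can be sketched as follows. This is a toy illustration, not diffuse-cpp's implementation (its actual scheduler, entropy_exit remasking, is more sophisticated), and `diffusion_decode`, `score_fn`, and `predict_fn` are illustrative names, not the library's API: every position starts as the mask token, and each step the most confident masked positions are committed.

```python
import random

MASK = -1  # stand-in for LLaDA's real mask token id (126336)

def diffusion_decode(seq_len, steps, score_fn, predict_fn):
    """Toy masked-diffusion decoding: start fully masked and, at each
    step, commit the highest-confidence masked positions until none remain."""
    tokens = [MASK] * seq_len
    per_step = max(1, seq_len // steps)  # positions to commit per step
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break  # converged early: nothing left to unmask
        # In a real model this would be one full forward pass over the
        # whole sequence; here score_fn/predict_fn are placeholders.
        best_first = sorted(masked, key=score_fn, reverse=True)
        for i in best_first[:per_step]:
            tokens[i] = predict_fn(i)
    return tokens

random.seed(0)
out = diffusion_decode(
    seq_len=8, steps=4,
    score_fn=lambda i: random.random(),  # fake per-position confidence
    predict_fn=lambda i: 100 + i,        # fake token prediction
)
```

Because every step touches the whole sequence, fewer steps means proportionally less compute, which is why the early-exit behavior in the benchmarks below matters so much.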

**On a 12-core CPU, LLaDA with diffuse-cpp reaches 27.7 tok/s on translation tasks — 3.3x faster than llama.cpp (8.51 tok/s) on the same hardware.**


## Available Quantizations

| File | Type | Size | Description |
|------|------|------|-------------|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed, best speed/quality ratio |

**Recommended:** Q4_K_M for most users.

## Quick Start

```bash
# Download
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

# Build diffuse-cpp
git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
cd diffuse-cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run
./build/diffuse-cli -m ../llada-8b-q4km.gguf \
    --tokens "128000,3923,374,279,6864,315,9822,30" \
    -n 256 -s 16 -t 12 --remasking entropy_exit
```

## Performance

Benchmarked on a 12-core AMD EPYC 4465P, Q4_K_M, entropy_exit remasking with inter-step cache, B=256:

| Prompt | No-cache (tok/s) | Cache (tok/s) | Steps | vs llama.cpp |
|--------|------------------|---------------|-------|--------------|
| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
| Translate to French | 25.9 | **27.7** | 2 | **3.3x** |
| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
| List the planets | 3.3 | **9.4** | 15 | 1.1x |
| **Average** | **9.6** | **15.3** | | **1.8x** |

- Inter-step cache: 1.6x average speedup with no quality degradation
- 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- LLaDA excels at translation tasks, converging in as few as 2 steps
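
The summary figures follow directly from the table. A quick sanity check (the 8.51 tok/s llama.cpp baseline is from the benchmark above; variable names are illustrative):

```python
# Per-prompt throughput (tok/s) from the table above, in row order
cache_tps = [24.4, 27.7, 15.7, 22.9, 4.9, 5.3, 12.0, 9.4]
no_cache_tps = [17.5, 25.9, 12.8, 7.6, 3.2, 3.2, 3.3, 3.3]
LLAMA_CPP_BASELINE = 8.51  # tok/s on the same hardware

avg_cache = sum(cache_tps) / len(cache_tps)          # reported as 15.3
avg_no_cache = sum(no_cache_tps) / len(no_cache_tps)  # reported as 9.6

cache_speedup = avg_cache / avg_no_cache        # ~1.6x from the cache
avg_vs_llama = avg_cache / LLAMA_CPP_BASELINE   # ~1.8x average
best_vs_llama = max(cache_tps) / LLAMA_CPP_BASELINE  # ~3.3x (translation)
wins = sum(t > LLAMA_CPP_BASELINE for t in cache_tps)  # 6 of 8 prompts
```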

## Model Details

- **Architecture:** Llama backbone with bidirectional (non-causal) attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads)
- **FFN:** SwiGLU, intermediate size 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336
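
The 8B figure is consistent with the dimensions above. A rough back-of-envelope count, ignoring norms and biases and assuming untied input/output embeddings (an assumption, not something this card states):

```python
# Dimensions from the Model Details list above
vocab, hidden, layers, ffn = 126_464, 4096, 32, 12_288

embed = vocab * hidden        # input embedding matrix
lm_head = vocab * hidden      # output projection (assumed untied)
attn = 4 * hidden * hidden    # Q, K, V, O projections (MHA: all full-size)
swiglu = 3 * hidden * ffn     # gate, up, and down projections
per_layer = attn + swiglu

total = embed + lm_head + layers * per_layer
print(f"~{total / 1e9:.2f}B parameters")  # → ~8.02B parameters
```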

## Also Available

- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone with GQA. Excels at math and code (21.6 tok/s; correctly solves arithmetic in 2 steps).

## Citation

```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Esteban},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```

## License

Apache 2.0