# arcLM-0.8B

A 0.8B-parameter diffusion language model built by converting Qwen3.5-0.8B from autoregressive to discrete diffusion using the dLLM framework with BD3LM (Block Discrete Denoising Diffusion Language Model).

Instead of generating tokens one at a time, left to right, arcLM generates blocks of 64 tokens simultaneously through iterative denoising, trading some output quality for parallel generation.
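The block-denoising loop can be sketched roughly as follows. This is an illustration only, not the dLLM sampler: `denoise_block`, `logits_fn`, and the linear unmasking schedule are stand-in names, and real samplers differ in how they pick which positions to commit.

```python
import numpy as np

def denoise_block(logits_fn, block_len=64, steps=8, mask_id=0, temperature=0.6):
    """Schematic block-diffusion sampling: start from an all-[MASK] block and,
    over several denoising steps, commit the most confident predictions while
    leaving the rest masked. Illustration only, not the dLLM API."""
    block = np.full(block_len, mask_id, dtype=np.int64)
    for step in range(steps):
        z = logits_fn(block) / temperature          # (block_len, vocab_size)
        probs = np.exp(z - z.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf, pred = probs.max(-1), probs.argmax(-1)
        filled = block != mask_id
        conf[filled] = -np.inf                      # never overwrite committed tokens
        n_target = block_len * (step + 1) // steps  # tokens committed after this step
        k_new = n_target - int(filled.sum())
        if k_new > 0:
            idx = np.argsort(conf)[-k_new:]         # most confident masked positions
            block[idx] = pred[idx]
    return block
```

Each step fills in a growing fraction of the block, so after the final step every position holds a committed token; all positions in a block are predicted in parallel at every step.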
## Model Details

### Model Description
arcLM-0.8B is an experimental diffusion language model created by applying the A2D (Autoregressive-to-Diffusion) conversion pipeline to Qwen3.5-0.8B. The conversion replaces causal attention with bidirectional attention in the standard attention layers (6/24 layers), while keeping the Gated Delta Network (GDN) layers causal due to mathematical constraints of the delta-rule recurrence.
The model was jointly fine-tuned on reasoning, tool-calling, and general instruction-following data using the BD3LM training objective.
- Developed by: cosmicallyrun
- Model type: Discrete diffusion language model (BD3LM)
- Language(s): English
- License: Apache 2.0
- Base model: Qwen/Qwen3.5-0.8B
- Conversion method: A2D (AR-to-Diffusion) via dLLM
### Model Sources
- Framework: https://github.com/ZHZisZZ/dllm
- BD3LM Paper: Block Discrete Denoising Diffusion Language Models
- A2D Paper: Autoregressive-to-Diffusion
## Uses

### Direct Use

Experimental text generation via block diffusion sampling. Supports multi-turn chat with `<think>` reasoning blocks.
```python
import types

import dllm

# dllm's helpers expect an args object with attribute access, so a
# SimpleNamespace stands in for the usual argument dataclass.
model_args = types.SimpleNamespace(
    model_name_or_path="sd17js2/arcLM-0.8B",
    tokenizer_name_or_path=None,
)
model = dllm.utils.get_model(model_args=model_args).eval().cuda()
tokenizer = dllm.utils.get_tokenizer(model_args=model_args)

sampler = dllm.core.samplers.BD3LMSampler(model=model, tokenizer=tokenizer)
config = dllm.core.samplers.BD3LMSamplerConfig(
    steps=256, max_new_tokens=256, block_size=64, temperature=0.6
)
```
Or run the interactive chat script:

```shell
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path sd17js2/arcLM-0.8B \
    --block_size 64 --max_new_tokens 256 --steps 256 --temperature 0.6
```
### Out-of-Scope Use
This is an early-stage research model. It is not suitable for production use. Outputs are frequently incoherent or repetitive. Do not use for factual queries, safety-critical applications, or any deployment where reliability matters.
## Bias, Risks, and Limitations
- Output quality is significantly below autoregressive models of equivalent size
- The model often produces garbled or repetitive text, especially for complex prompts
- Training data includes publicly available datasets which may contain biases
- The diffusion decoding process can amplify model uncertainty into degenerate outputs (e.g. floods of special tokens)
## Training Details

### Training Data
Jointly trained on a mix of reasoning, tool-calling, and general instruction data:
| Dataset | Split | Type |
|---|---|---|
| tatsu-lab/alpaca | train[:8000] | General instruction |
| HuggingFaceH4/ultrachat_200k | train[:5000] | Multi-turn chat |
| NousResearch/hermes-function-calling-v1 | train[:5000] | Tool calling |
| Jofthomas/hermes-function-calling-thinking-V1 | full | Tool calling + reasoning |
| open-thoughts/OpenThoughts-114k | train[:5000] | Long reasoning |
| simplescaling/s1K | full | Long reasoning |
### Training Procedure
- A2D Conversion: Qwen3.5-0.8B weights copied into the converted architecture. Standard attention layers (6/24) made bidirectional; GDN layers (18/24) kept causal.
- BD3LM SFT: joint fine-tuning with the block diffusion objective (block_size=64) and a linear noise schedule with 1/t loss weighting.
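The per-layer mask change in the conversion step can be sketched as follows. The 3 GDN + 1 standard attention cycle matches this card's architecture description; the helper name and signature are hypothetical, not the dLLM conversion code.

```python
import numpy as np

def a2d_attention_mask(seq_len, layer_idx, cycle=4):
    """Attention mask for layer `layer_idx` after A2D conversion (sketch).

    Layers repeat in cycles of 3 GDN layers followed by 1 standard
    attention layer. GDN layers must stay causal (lower-triangular mask);
    every 4th layer, the standard attention one, becomes bidirectional.
    """
    full = np.ones((seq_len, seq_len), dtype=bool)
    is_standard_attention = layer_idx % cycle == cycle - 1
    return full if is_standard_attention else np.tril(full)

# Over 24 layers this yields 6 bidirectional and 18 causal masks,
# matching the 6/24 split above.
```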
#### Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW
- Learning rate: 5e-5 with 200-step linear warmup
- Batch size: 1 per device, 16 gradient accumulation steps (effective batch 16)
- Max sequence length: 2048
- Block size: 64
- Epochs: 10
- Max grad norm: 0.5
- Gradient checkpointing: enabled
#### Speeds, Sizes, Times
- Training time: ~9 hours on 1x RTX 4090
- Final training loss: ~20 (BD3LM-weighted; corresponds to a per-token NLL of ~3.0)
- Checkpoint size: ~1.6 GB
## Technical Specifications

### Model Architecture and Objective
Architecture: Qwen3.5-0.8B hybrid (GDN + standard attention), 24 layers in 6 cycles of (3 GDN + 1 standard attention).
- GDN layers: causal (delta-rule recurrence requires lower-triangular structure)
- Standard attention layers: bidirectional (full attention mask for diffusion denoising)
Training objective: BD3LM — block discrete denoising diffusion. Generates 64-token blocks with iterative denoising. Loss = cross-entropy weighted by 1/t from a linear alpha schedule.
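A rough sketch of the masking and 1/t weighting for one training example. The helper and its normalization are illustrative; the actual dLLM loss may differ in constants and in how it aggregates over positions.

```python
import numpy as np

def bd3lm_weighted_loss(tokens, nll_fn, mask_id, t, rng):
    """One example at noise level t under a linear schedule: each token is
    masked independently with probability t, and the per-token cross-entropy
    on masked positions is scaled by 1/t, so lightly noised examples are
    weighted up. Sketch only, not the dLLM implementation."""
    mask = rng.random(tokens.shape) < t
    noised = np.where(mask, mask_id, tokens)
    per_token_nll = nll_fn(noised, tokens)     # model's CE against clean tokens
    n_masked = max(int(mask.sum()), 1)
    return float((per_token_nll * mask).sum() / (t * n_masked))
```

The 1/t factor is what makes the weighted training loss (~20 here) much larger than the underlying per-token NLL.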
Known architectural limitation: the GDN delta-rule computes `(I - L)^{-1}` via forward substitution on a lower-triangular matrix `L`. Making this bidirectional (a full matrix) causes numerical instability: decay-mask overflow, `O((1 + ||A||)^63)` gradient explosion across a 64-token chunk, and inter-chunk recurrence amplification. Solving this for bidirectional GDN is an open research problem (potential approaches: Neumann-series truncation, `torch.linalg.solve`, or bidirectional linear attention).
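The triangular solve at the heart of the causal case can be illustrated numerically. Sizes and scales here are arbitrary, and a general solve stands in for the forward substitution the delta-rule actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # toy chunk length (the real chunks are 64 tokens)

# Causal case: L is strictly lower-triangular, so (I - L) is unit
# lower-triangular and always invertible; forward substitution solves
# it stably row by row.
L = np.tril(rng.normal(scale=0.5, size=(n, n)), k=-1)
x = np.linalg.solve(np.eye(n) - L, np.ones(n))

# Bidirectional case: with a full matrix A there is no triangular structure;
# the Neumann series (I - A)^{-1} = I + A + A^2 + ... converges only when
# ||A|| < 1, and errors compound roughly like (1 + ||A||)^k over a k-token
# chunk -- the O((1 + ||A||)^63) blow-up described above.
A = rng.normal(scale=0.5, size=(n, n))
series_growth = np.linalg.norm(np.linalg.matrix_power(np.eye(n) + A, n))
```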
### Compute Infrastructure

#### Hardware

1x NVIDIA RTX 4090 (24 GB) on RunPod

#### Software
- dLLM v0.1.0
- PyTorch 2.x
- Transformers 5.x
- Accelerate (DDP config, single process)
## Environmental Impact
- Hardware Type: 1x RTX 4090
- Hours used: ~25 hours (including v1 training, debugging, v2 training)
- Cloud Provider: RunPod
- Compute Region: US
- Carbon Emitted: ~3.8 kg CO2eq (estimated)