# arcLM-0.8B

A 0.8B-parameter diffusion language model built by converting Qwen3.5-0.8B from autoregressive to discrete diffusion using the dLLM framework with BD3LM (Block Discrete Denoising Diffusion Language Model).

Instead of generating tokens one at a time, left to right, arcLM generates blocks of 64 tokens simultaneously through iterative denoising, trading some output quality for parallel generation.
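The block-denoising loop can be sketched roughly as follows. This is an illustration only, not the dLLM sampler: `denoise_block`, `logits_fn`, and the linear unmasking schedule are stand-in names, and real samplers differ in how they pick which positions to commit.

```python
import numpy as np

def denoise_block(logits_fn, block_len=64, steps=8, mask_id=0, temperature=0.6):
    """Schematic block-diffusion sampling: start from an all-[MASK] block and,
    over several denoising steps, commit the most confident predictions while
    leaving the rest masked. Illustration only, not the dLLM API."""
    block = np.full(block_len, mask_id, dtype=np.int64)
    for step in range(steps):
        z = logits_fn(block) / temperature          # (block_len, vocab_size)
        probs = np.exp(z - z.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf, pred = probs.max(-1), probs.argmax(-1)
        filled = block != mask_id
        conf[filled] = -np.inf                      # never overwrite committed tokens
        n_target = block_len * (step + 1) // steps  # tokens committed after this step
        k_new = n_target - int(filled.sum())
        if k_new > 0:
            idx = np.argsort(conf)[-k_new:]         # most confident masked positions
            block[idx] = pred[idx]
    return block
```

Each step fills in a growing fraction of the block, so after the final step every position holds a committed token; all positions in a block are predicted in parallel at every step.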
## Model Details

### Model Description
arcLM-0.8B is an experimental diffusion language model created by applying the A2D (Autoregressive-to-Diffusion) conversion pipeline to Qwen3.5-0.8B. The conversion replaces causal attention with bidirectional attention in the standard attention layers (6/24 layers), while keeping the Gated Delta Network (GDN) layers causal due to mathematical constraints of the delta-rule recurrence.
The model was jointly fine-tuned on reasoning, tool-calling, and general instruction-following data using the BD3LM training objective.
- Developed by: cosmicallyrun
- Model type: Discrete diffusion language model (BD3LM)
- Language(s): English
- License: Apache 2.0
- Base model: Qwen/Qwen3.5-0.8B
- Conversion method: A2D (AR-to-Diffusion) via dLLM
### Model Sources
- Framework: https://github.com/ZHZisZZ/dllm
- BD3LM Paper: Block Discrete Denoising Diffusion Language Models
- A2D Paper: Autoregressive-to-Diffusion
## Uses

### Direct Use

Experimental text generation via block diffusion sampling. Supports multi-turn chat with `<think>` reasoning blocks.
```python
import types

import dllm

# dllm's helpers expect an args object with attribute access, so a
# SimpleNamespace stands in for the usual argument dataclass.
model_args = types.SimpleNamespace(
    model_name_or_path="sd17js2/arcLM-0.8B",
    tokenizer_name_or_path=None,
)
model = dllm.utils.get_model(model_args=model_args).eval().cuda()
tokenizer = dllm.utils.get_tokenizer(model_args=model_args)

sampler = dllm.core.samplers.BD3LMSampler(model=model, tokenizer=tokenizer)
config = dllm.core.samplers.BD3LMSamplerConfig(
    steps=256, max_new_tokens=256, block_size=64, temperature=0.6
)
```
Or run the interactive chat script:

```shell
python -u examples/a2d/bd3lm/chat.py \
    --model_name_or_path sd17js2/arcLM-0.8B \
    --block_size 64 --max_new_tokens 256 --steps 256 --temperature 0.6
```
### Out-of-Scope Use
This is an early-stage research model. It is not suitable for production use. Outputs are frequently incoherent or repetitive. Do not use for factual queries, safety-critical applications, or any deployment where reliability matters.
## Bias, Risks, and Limitations
- Output quality is significantly below autoregressive models of equivalent size
- The model often produces garbled or repetitive text, especially for complex prompts
- Training data includes publicly available datasets which may contain biases
- The diffusion decoding process can amplify model uncertainty into degenerate outputs (e.g. floods of special tokens)
## Training Details

### Training Data
Jointly trained on a mix of reasoning, tool-calling, and general instruction data:
| Dataset | Split | Type |
|---|---|---|
| tatsu-lab/alpaca | train[:8000] | General instruction |
| HuggingFaceH4/ultrachat_200k | train[:5000] | Multi-turn chat |
| NousResearch/hermes-function-calling-v1 | train[:5000] | Tool calling |
| Jofthomas/hermes-function-calling-thinking-V1 | full | Tool calling + reasoning |
| open-thoughts/OpenThoughts-114k | train[:5000] | Long reasoning |
| simplescaling/s1K | full | Long reasoning |
### Training Procedure
- A2D Conversion: Qwen3.5-0.8B weights copied into the converted architecture. Standard attention layers (6/24) made bidirectional; GDN layers (18/24) kept causal.
- BD3LM SFT: joint fine-tuning with the block diffusion objective (block_size=64) and a linear noise schedule with 1/t loss weighting.
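The per-layer mask change in the conversion step can be sketched as follows. The 3 GDN + 1 standard attention cycle matches this card's architecture description; the helper name and signature are hypothetical, not the dLLM conversion code.

```python
import numpy as np

def a2d_attention_mask(seq_len, layer_idx, cycle=4):
    """Attention mask for layer `layer_idx` after A2D conversion (sketch).

    Layers repeat in cycles of 3 GDN layers followed by 1 standard
    attention layer. GDN layers must stay causal (lower-triangular mask);
    every 4th layer, the standard attention one, becomes bidirectional.
    """
    full = np.ones((seq_len, seq_len), dtype=bool)
    is_standard_attention = layer_idx % cycle == cycle - 1
    return full if is_standard_attention else np.tril(full)

# Over 24 layers this yields 6 bidirectional and 18 causal masks,
# matching the 6/24 split above.
```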
#### Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW
- Learning rate: 5e-5 with 200-step linear warmup
- Batch size: 1 per device, 16 gradient accumulation steps (effective batch 16)
- Max sequence length: 2048
- Block size: 64
- Epochs: 10
- Max grad norm: 0.5
- Gradient checkpointing: enabled
#### Speeds, Sizes, Times
- Training time: ~9 hours on 1x RTX 4090
- Final training loss: ~20 (BD3LM-weighted; corresponds to a per-token NLL of ~3.0)
- Checkpoint size: ~1.6 GB
## Technical Specifications

### Model Architecture and Objective
Architecture: Qwen3.5-0.8B hybrid (GDN + standard attention), 24 layers in 6 cycles of (3 GDN + 1 standard attention).
- GDN layers: causal (delta-rule recurrence requires lower-triangular structure)
- Standard attention layers: bidirectional (full attention mask for diffusion denoising)
Training objective: BD3LM — block discrete denoising diffusion. Generates 64-token blocks with iterative denoising. Loss = cross-entropy weighted by 1/t from a linear alpha schedule.
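A rough sketch of the masking and 1/t weighting for one training example. The helper and its normalization are illustrative; the actual dLLM loss may differ in constants and in how it aggregates over positions.

```python
import numpy as np

def bd3lm_weighted_loss(tokens, nll_fn, mask_id, t, rng):
    """One example at noise level t under a linear schedule: each token is
    masked independently with probability t, and the per-token cross-entropy
    on masked positions is scaled by 1/t, so lightly noised examples are
    weighted up. Sketch only, not the dLLM implementation."""
    mask = rng.random(tokens.shape) < t
    noised = np.where(mask, mask_id, tokens)
    per_token_nll = nll_fn(noised, tokens)     # model's CE against clean tokens
    n_masked = max(int(mask.sum()), 1)
    return float((per_token_nll * mask).sum() / (t * n_masked))
```

The 1/t factor is what makes the weighted training loss (~20 here) much larger than the underlying per-token NLL.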
Known architectural limitation: the GDN delta-rule computes `(I - L)^{-1}` via forward substitution on a lower-triangular matrix `L`. Making this bidirectional (a full matrix) causes numerical instability: decay-mask overflow, `O((1 + ||A||)^63)` gradient explosion across a 64-token chunk, and inter-chunk recurrence amplification. Solving this for bidirectional GDN is an open research problem (potential approaches: Neumann-series truncation, `torch.linalg.solve`, or bidirectional linear attention).
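The triangular solve at the heart of the causal case can be illustrated numerically. Sizes and scales here are arbitrary, and a general solve stands in for the forward substitution the delta-rule actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # toy chunk length (the real chunks are 64 tokens)

# Causal case: L is strictly lower-triangular, so (I - L) is unit
# lower-triangular and always invertible; forward substitution solves
# it stably row by row.
L = np.tril(rng.normal(scale=0.5, size=(n, n)), k=-1)
x = np.linalg.solve(np.eye(n) - L, np.ones(n))

# Bidirectional case: with a full matrix A there is no triangular structure;
# the Neumann series (I - A)^{-1} = I + A + A^2 + ... converges only when
# ||A|| < 1, and errors compound roughly like (1 + ||A||)^k over a k-token
# chunk -- the O((1 + ||A||)^63) blow-up described above.
A = rng.normal(scale=0.5, size=(n, n))
series_growth = np.linalg.norm(np.linalg.matrix_power(np.eye(n) + A, n))
```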
### Compute Infrastructure

#### Hardware

1x NVIDIA RTX 4090 (24 GB) on RunPod

#### Software
- dLLM v0.1.0
- PyTorch 2.x
- Transformers 5.x
- Accelerate (DDP config, single process)
## Environmental Impact
- Hardware Type: 1x RTX 4090
- Hours used: ~25 hours (including v1 training, debugging, v2 training)
- Cloud Provider: RunPod
- Compute Region: US
- Carbon Emitted: ~3.8 kg CO2eq (estimated)