arcLM-0.8B

A 0.8B parameter diffusion language model built by converting Qwen3.5-0.8B from autoregressive to discrete diffusion using the dLLM framework with BD3LM (Block Discrete Denoising Diffusion Language Model).

Instead of generating tokens one at a time left-to-right, arcLM generates blocks of 64 tokens simultaneously through iterative denoising — trading some quality for parallel generation.
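Conceptually, one block is decoded like the following toy sketch (this is not dLLM's actual sampler; the fixed unmasking schedule and max-confidence commit rule are illustrative assumptions):

```python
import torch

def denoise_block(logits_fn, block_size=64, steps=8, mask_id=0):
    """Toy sketch of block-diffusion decoding: start from an all-[MASK]
    block and, at each step, commit the most confident predictions for
    positions that are still masked. logits_fn stands in for the model."""
    tokens = torch.full((block_size,), mask_id)
    masked = torch.ones(block_size, dtype=torch.bool)
    per_step = max(1, block_size // steps)
    while masked.any():
        probs = logits_fn(tokens).softmax(-1)   # (block_size, vocab)
        conf, pred = probs.max(-1)
        conf[~masked] = -1.0                    # only fill still-masked slots
        k = min(per_step, int(masked.sum()))
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
        masked[idx] = False
    return tokens

# With a random stand-in "model" over a 100-token vocabulary:
out = denoise_block(lambda t: torch.randn(t.shape[0], 100))
```

With block_size=64 and steps=8, eight forward passes fill the whole block, versus 64 passes for token-by-token autoregressive decoding.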

Model Details

Model Description

arcLM-0.8B is an experimental diffusion language model created by applying the A2D (Autoregressive-to-Diffusion) conversion pipeline to Qwen3.5-0.8B. The conversion replaces causal attention with bidirectional attention in the standard attention layers (6/24 layers), while keeping the Gated Delta Network (GDN) layers causal due to mathematical constraints of the delta-rule recurrence.

The model was jointly fine-tuned on reasoning, tool-calling, and general instruction-following data using the BD3LM training objective.

  • Developed by: cosmicallyrun
  • Model type: Discrete diffusion language model (BD3LM)
  • Language(s): English
  • License: Apache 2.0
  • Base model: Qwen/Qwen3.5-0.8B
  • Conversion method: A2D (AR-to-Diffusion) via dLLM

Model Sources

Uses

Direct Use

Experimental text generation via block diffusion sampling. Supports multi-turn chat with <think> reasoning blocks.

import dllm
from types import SimpleNamespace

# Minimal argument object expected by dLLM's loader utilities.
model_args = SimpleNamespace(
    model_name_or_path="sd17js2/arcLM-0.8B",
    tokenizer_name_or_path=None,
)

model = dllm.utils.get_model(model_args=model_args).eval().cuda()
tokenizer = dllm.utils.get_tokenizer(model_args=model_args)

sampler = dllm.core.samplers.BD3LMSampler(model=model, tokenizer=tokenizer)
config = dllm.core.samplers.BD3LMSamplerConfig(
    steps=256, max_new_tokens=256, block_size=64, temperature=0.6
)
# Generation then runs through the sampler with this config
# (see the interactive chat example below for an end-to-end run).

Or interactively:

python -u examples/a2d/bd3lm/chat.py \
  --model_name_or_path sd17js2/arcLM-0.8B \
  --block_size 64 --max_new_tokens 256 --steps 256 --temperature 0.6

Out-of-Scope Use

This is an early-stage research model. It is not suitable for production use. Outputs are frequently incoherent or repetitive. Do not use for factual queries, safety-critical applications, or any deployment where reliability matters.

Bias, Risks, and Limitations

  • Output quality is significantly below autoregressive models of equivalent size
  • The model often produces garbled or repetitive text, especially for complex prompts
  • Training data includes publicly available datasets which may contain biases
  • The diffusion decoding process can amplify model uncertainty into degenerate outputs (e.g. floods of special tokens)

Training Details

Training Data

Jointly trained on a mix of reasoning, tool-calling, and general instruction data:

Dataset                                         Split          Type
tatsu-lab/alpaca                                train[:8000]   General instruction
HuggingFaceH4/ultrachat_200k                    train[:5000]   Multi-turn chat
NousResearch/hermes-function-calling-v1         train[:5000]   Tool calling
Jofthomas/hermes-function-calling-thinking-V1   full           Tool calling + reasoning
open-thoughts/OpenThoughts-114k                 train[:5000]   Long reasoning
simplescaling/s1K                               full           Long reasoning

Training Procedure

  1. A2D Conversion: Qwen3.5-0.8B weights copied into bidirectional architecture. Standard attention layers (6/24) made bidirectional; GDN layers (18/24) kept causal.
  2. BD3LM SFT: Joint fine-tuning with block diffusion objective (block_size=64), linear noise schedule with 1/t loss weighting.

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Optimizer: AdamW
  • Learning rate: 5e-5 with 200-step linear warmup
  • Batch size: 1 per device, 16 gradient accumulation steps (effective batch 16)
  • Max sequence length: 2048
  • Block size: 64
  • Epochs: 10
  • Max grad norm: 0.5
  • Gradient checkpointing: enabled
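Collected as a plain config sketch (key names are illustrative, not dLLM's actual trainer interface):

```python
# Hypothetical config dict mirroring the hyperparameters listed above.
train_config = dict(
    bf16=True,
    optimizer="adamw",
    learning_rate=5e-5,
    warmup_steps=200,                  # linear warmup
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_seq_length=2048,
    block_size=64,
    num_train_epochs=10,
    max_grad_norm=0.5,
    gradient_checkpointing=True,
)

effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
assert effective_batch == 16  # 1 per device x 16 accumulation steps
```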

Speeds, Sizes, Times

  • Training time: ~9 hours on 1x RTX 4090
  • Final training loss: ~20 (BD3LM weighted; corresponds to NLL ~3.0)
  • Checkpoint size: ~1.6 GB

Technical Specifications

Model Architecture and Objective

Architecture: Qwen3.5-0.8B hybrid (GDN + standard attention), 24 layers in 6 cycles of (3 GDN + 1 standard attention).

  • GDN layers: causal (delta-rule recurrence requires lower-triangular structure)
  • Standard attention layers: bidirectional (full attention mask for diffusion denoising)
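The alternating pattern can be sketched as follows (the helper and its names are hypothetical, not dLLM's API):

```python
# Hypothetical sketch of the 24-layer hybrid pattern: 6 cycles of
# (3 GDN + 1 standard attention); only the attention layers become
# bidirectional after A2D conversion.
def layer_attention_modes(n_layers=24, cycle=("gdn", "gdn", "gdn", "attn")):
    modes = []
    for i in range(n_layers):
        kind = cycle[i % len(cycle)]
        # GDN layers must stay causal (delta-rule recurrence);
        # standard attention layers get a full bidirectional mask.
        modes.append((kind, "causal" if kind == "gdn" else "bidirectional"))
    return modes

modes = layer_attention_modes()
assert sum(m == "bidirectional" for _, m in modes) == 6  # 6/24 attention layers
```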

Training objective: BD3LM — block discrete denoising diffusion. Generates 64-token blocks with iterative denoising. Loss = cross-entropy weighted by 1/t from a linear alpha schedule.
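The objective can be sketched as follows (a minimal illustration assuming per-sequence noise levels t and a 0/1 mask over noised positions; dLLM's actual implementation may differ in normalization and clamping):

```python
import torch
import torch.nn.functional as F

def bd3lm_loss(logits, targets, t, mask):
    """Cross-entropy on masked positions, weighted by 1/t.
    Under a linear schedule (alpha_t = 1 - t), each token is masked
    with probability t; the 1/t weight turns the masked cross-entropy
    into a bound on the data log-likelihood.
    logits: (B, L, V), targets: (B, L), t: (B,), mask: (B, L) in {0, 1}."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    ce = ce * mask  # only masked (noised) tokens contribute
    return (ce / t.clamp_min(1e-4).unsqueeze(-1)).sum() / mask.sum().clamp_min(1)

logits = torch.randn(2, 16, 50)
targets = torch.randint(0, 50, (2, 16))
t = torch.tensor([0.3, 0.7])
mask = (torch.rand(2, 16) < 0.5).float()
loss = bd3lm_loss(logits, targets, t, mask)
```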

Known architectural limitation: The GDN delta-rule computes (I - L)^{-1} via forward substitution on a lower-triangular matrix. Making this bidirectional (full matrix) causes numerical instability: decay mask overflow, O((1+||A||)^63) gradient explosion, and inter-chunk recurrence amplification. Solving this for bidirectional GDN is an open research problem (potential approaches: Neumann series truncation, torch.linalg.solve, or bidirectional linear attention).
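The triangular structure is exactly what makes the solve cheap and stable: forward substitution resolves each position from earlier ones only, which has no analogue once the matrix is full. A minimal NumPy illustration (not the GDN kernel itself):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve (I - L) x = b when L is strictly lower-triangular:
    rearranging row i gives x[i] = b[i] + L[i, :i] @ x[:i], so each
    entry depends only on already-computed earlier entries."""
    x = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        x[i] = b[i] + L[i, :i] @ x[:i]
    return x

rng = np.random.default_rng(0)
L = np.tril(rng.normal(size=(6, 6)), -1)  # strictly lower-triangular
b = rng.normal(size=6)
x = forward_substitution(L, b)
```

With a full matrix this recurrence is circular, forcing a general solve (e.g. `torch.linalg.solve`) or a series approximation, which is where the instabilities described above arise.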

Compute Infrastructure

Hardware

1x NVIDIA RTX 4090 (24GB) on RunPod

Software

  • dLLM v0.1.0
  • PyTorch 2.x
  • Transformers 5.x
  • Accelerate (DDP config, single process)

Environmental Impact

  • Hardware Type: 1x RTX 4090
  • Hours used: ~25 hours (including v1 training, debugging, v2 training)
  • Cloud Provider: RunPod
  • Compute Region: US
  • Carbon Emitted: ~3.8 kg CO2eq (estimated)
