lll2343 committed on Sep 29, 2025

Commit 79fbdab (verified) · 1 Parent(s): a4ef5eb

Update README.md

Files changed (1)

README.md +154 -3

@@ -1,3 +1,154 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+license_name: qwen
+license_link: https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE
+pipeline_tag: text-generation
+library_name: transformers
+base_model:
+- Qwen/Qwen2.5-3B
+base_model_relation: finetune
+language:
+- en
+tags:
+- sdlm
+- diffusion language model
+- custom_code
+datasets:
+- dyyyyyyyy/ScaleQuest-Math
+- OpenCoder-LLM/opc-sft-stage2
+- allenai/tulu-3-sft-mixture
+- HuggingFaceTB/smoltalk2
+- LipengCS/Table-GPT
+- allenai/SciRIFF
+---
+# SDLM-3B-D4
+[\[📂 GitHub\]](https://github.com/OpenGVLab/SDLM)  [\[📜 Tech Report\]](https://huggingface.co/papers/xxx)  [\[🤗 HuggingFace\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552)
+## Introduction
+We propose the <b>S</b>equential <b>D</b>iffusion <b>L</b>anguage <b>M</b>odel (<b>SDLM</b>), which cheaply unlocks the parallel prediction capabilities of diffusion models. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length, and it enforces decoding order through longest prefix decoding, which significantly improves prediction efficiency while preserving generation quality. The method can be viewed as a generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be migrated to the diffusion framework with only minimal instruction fine-tuning.
+![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/three_framework.png)
+## SDLM Family
+In the following table, we provide an overview of the SDLM series.
+| Model Name  | Base Model 🤗                                                 | HF Link 🤗                                    |
+| ----------- | ------------------------------------------------------------ | -------------------------------------------- |
+| SDLM-3B-D4  | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D4  |
+| SDLM-3B-D8  | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D8  |
+| SDLM-32B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen2.5-32B</a> | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |
+## Model Architecture
+We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.
+* **(a) Training pipeline.** The reordered input enables a structured mask with a causal prefix (top-left), a visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right).
+* **(b) Sampling pipeline.** Confidence-based dynamic block decoding with KV cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks. The longest high-confidence prefix is selected as the dynamic output, and cached KV states enable efficient decoding.
+![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/framework.png)
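The mask layout described for the training pipeline (a) can be illustrated with a small sketch. It assumes the reordered input is the clean sequence followed by the masked blocks, and that each block is told how many clean tokens it may see. The function name `build_sdlm_mask` and the argument `visible_prefix_lens` are hypothetical and are not taken from the SDLM code base; the actual training code may build its mask differently.

```python
from typing import List


def build_sdlm_mask(
    n_clean: int, visible_prefix_lens: List[int], block_size: int
) -> List[List[int]]:
    """Return a 0/1 attention mask (1 = query row may attend to key column).

    Assumed layout (illustrative only): n_clean clean tokens come first,
    followed by one masked block of block_size positions for every entry
    of visible_prefix_lens.
    """
    n_blocks = len(visible_prefix_lens)
    total = n_clean + n_blocks * block_size
    mask = [[0] * total for _ in range(total)]

    # Top-left: causal attention among the clean tokens.
    for q in range(n_clean):
        for k in range(q + 1):
            mask[q][k] = 1

    for b, prefix_len in enumerate(visible_prefix_lens):
        start = n_clean + b * block_size
        for q in range(start, start + block_size):
            # Bottom-left: the block sees the clean prefix that precedes it.
            for k in range(prefix_len):
                mask[q][k] = 1
            # Bottom-right: bidirectional attention inside the block only.
            for k in range(start, start + block_size):
                mask[q][k] = 1
    return mask


if __name__ == "__main__":
    # 4 clean tokens, then two blocks of size 2 that see 1 and 3 clean tokens.
    for row in build_sdlm_mask(4, [1, 3], 2):
        print("".join(str(v) for v in row))
```

Clean tokens never attend to masked blocks in this sketch (the top-right quadrant stays zero), and different blocks never attend to each other.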
+## Performance
+### Long-Form Benchmarks
+SDLM delivers strong performance with significantly faster decoding. It runs approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to 5x speedup over other diffusion language models, as shown by results on the MATH-500 benchmark.
+![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/main_exp1.png)
+### General Multiple-Choice Benchmarks
+![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/main_exp2.png)
+### Block Size & Self-Speculative Decoding
+![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/self_speculative_decoding.png)
+## Trade-off Between Performance and Speed
+The figure below shows the trade-off between performance and speed under different confidence thresholds τ for SDLM-3B (B=4) and SDLM-3B (B=8). Adjusting τ gives a controllable trade-off between speed and performance. SpeedUp denotes the average number of tokens output per forward pass.
+![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/ablation_tau.png)
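To make the role of τ concrete, here is a minimal sketch of threshold-based longest-prefix acceptance and of the SpeedUp metric. It rests on assumptions the README does not confirm: the first position of a block is always accepted so that decoding makes progress, and each later position is accepted only while its own confidence is at least τ. The function names and the confidence values are made up for illustration; the acceptance rule in `sdlm_inference.py` may differ.

```python
from typing import List, Sequence


def longest_confident_prefix(confidences: Sequence[float], tau: float) -> int:
    """Number of leading block positions accepted in one decoding step.

    Assumptions (illustrative): position 0 is always accepted, and every
    later position is accepted only while its confidence is >= tau.
    """
    accepted = 1
    for conf in confidences[1:]:
        if conf < tau:
            break
        accepted += 1
    return accepted


def average_tokens_per_forward(steps: List[Sequence[float]], tau: float) -> float:
    """SpeedUp: mean number of tokens emitted per forward pass."""
    counts = [longest_confident_prefix(step, tau) for step in steps]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Made-up per-position confidences for three forward passes, B = 4.
    toy_steps = [
        [0.99, 0.95, 0.40, 0.90],
        [0.97, 0.60, 0.55, 0.30],
        [0.98, 0.92, 0.91, 0.85],
    ]
    for tau in (0.3, 0.5, 0.9):
        speedup = average_tokens_per_forward(toy_steps, tau)
        print(f"tau={tau}: {speedup:.2f} tokens per forward pass")
```

A lower τ accepts longer prefixes and therefore emits more tokens per forward pass, at the risk of keeping low-confidence tokens. Replaying the same confidences under several thresholds is a simplification: in a real run, the accepted tokens change the context of every later step.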
+## Inference
+1. Install Dependencies
+   Key package versions:
+   ```
+   transformers==4.37.2
+   torch>=2.5.0
+   ```
+2. Download the model generation script [sdlm_inference.py](https://github.com/OpenGVLab/SDLM/blob/main/sdlm_inference.py) to your working directory.
+3. The following example runs `SDLM-3B-D4` using `transformers`.
+   ```python
+   import torch
+   from transformers import AutoModelForCausalLM, AutoTokenizer
+   from sdlm_inference import SDLM_generate
+   if __name__ == "__main__":
+       ckpt_hf = 'OpenGVLab/SDLM-3B-D4'
+       model = AutoModelForCausalLM.from_pretrained(
+           ckpt_hf,
+           attn_implementation="eager",
+           trust_remote_code=True
+       ).to(dtype=torch.float16)
+       tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)
+       prompt = 'Write a Fibonacci function in Python.'
+       messages = [
+           {"role": "system", "content": "You are a helpful assistant."},
+           {"role": "user", "content": prompt}
+       ]
+       text = tokenizer.apply_chat_template(
+           messages,
+           tokenize=False,
+           add_generation_prompt=True
+       )
+       model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+       response, history = SDLM_generate(
+           model,
+           tokenizer,
+           model_inputs,
+           max_gen_len=1024,
+           temperature=0,
+           threshold=0.5,        # confidence threshold
+           n_future_tokens=4,    # block size; 4 matches the D4 checkpoint
+           alg='prob_conf',      # prob_conf | entropy_conf | self_speculative
+           save_history=True,    # keep intermediate outputs for inspection
+           use_cache=True
+       )
+       print('response: ', response[0])
+       print('=======history')
+       for item in history:
+           print('cur total token ', item[1])
+           print(item[0][0])
+           print('--------')
+   ```
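The `history` printed at the end of the example can also be summarized numerically. The sketch below assumes, based only on how the example prints it, that each entry is a pair of a list of decoded texts and a running total token count. The helper names and the `fake_history` data are hypothetical, and the real structure returned by `SDLM_generate` may differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for one `history` entry:
# (list_of_decoded_texts, running_total_token_count)
HistoryItem = Tuple[Sequence[str], int]


def tokens_added_per_step(history: List[HistoryItem]) -> List[int]:
    """Difference between consecutive running totals, one value per later step."""
    totals = [int(total) for _, total in history]
    return [curr - prev for prev, curr in zip(totals, totals[1:])]


def mean_tokens_per_step(history: List[HistoryItem]) -> float:
    """Average tokens gained per recorded step (0.0 with fewer than 2 entries)."""
    deltas = tokens_added_per_step(history)
    return sum(deltas) / len(deltas) if deltas else 0.0


if __name__ == "__main__":
    # Made-up history, for illustration only.
    fake_history = [
        (["def"], 1),
        (["def fib(n"], 4),
        (["def fib(n):"], 6),
        (["def fib(n):\n    if"], 9),
    ]
    print(tokens_added_per_step(fake_history))
    print(mean_tokens_per_step(fake_history))
```

Working with differences between consecutive totals keeps the summary valid whether or not the running total includes the prompt tokens.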
+## Citation
+If you find this project useful in your research, please consider citing:
+```BibTeX
+@article{SDLM,
+  title={Sequential Diffusion Language Models},
+  author={},
+  journal={arXiv preprint arXiv:2025.xxxxx},
+  year={2025}
+}
+```