Instructions to use QuixiAI/DeepSeek-R1-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuixiAI/DeepSeek-R1-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuixiAI/DeepSeek-R1-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuixiAI/DeepSeek-R1-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuixiAI/DeepSeek-R1-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
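The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8000`, and the helper names `build_chat_request` and `chat` are hypothetical, chosen for illustration. The response is read using the standard OpenAI chat-completions layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, model="QuixiAI/DeepSeek-R1-AWQ",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint.

    base_url is an assumption: adjust it to wherever the server listens.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    # Generous timeout: a reasoning model can take a while to answer.
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("What is the capital of France?"))
```

For a server started with different options, pass `base_url` and `model` explicitly, for example `chat("Hi", base_url="http://localhost:30000")` for the SGLang port used later on this page.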

SGLang

How to use QuixiAI/DeepSeek-R1-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuixiAI/DeepSeek-R1-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuixiAI/DeepSeek-R1-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuixiAI/DeepSeek-R1-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuixiAI/DeepSeek-R1-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuixiAI/DeepSeek-R1-AWQ with Docker Model Runner:
```
docker model run hf.co/QuixiAI/DeepSeek-R1-AWQ
```

Are there any updates to the recommended commands?

#27

by NaiveYan - opened Mar 19, 2025

Discussion

NaiveYan

Mar 19, 2025

edited Mar 19, 2025

I tested the command in the current README with vLLM v0.8.0 (on 8 x A800 GPUs), but it only returns garbled text.
Are there any updates to the recommended commands, or are there other inference engines you would suggest?

v2ray

Mar 19, 2025

Merge these three PRs, then build it yourself, then it should work.

VenomEY

Mar 25, 2025

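Hardware: 8 x H800
Software: vllm==0.8.1
When I run the command below with the V1 engine, the following error appears whenever the input text is longer than 4k: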

File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 688, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     model_output = self.forward(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 626, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     def forward(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return fn(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     raise e
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "<eval_with_key>.124", line 2186, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     raise e
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "<eval_with_key>.2", line 5, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(x_7, x_11, k_pe, output_5, 'model.layers.0.self_attn.attn');  x_7 = x_11 = k_pe = output_5 = unified_attention_with_output = None
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/attention/layer.py", line 363, in unified_attention_with_output
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     self.impl.forward(self,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 929, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     output[num_decode_tokens:] = self._forward_prefill(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 826, in _forward_prefill
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     context_output, context_lse = self._compute_prefill_context( \
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 742, in _compute_prefill_context
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     kv_nope = self.kv_b_proj(kv_c_normed)[0].view( \
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 303, in apply
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return apply_awq_marlin_linear(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 379, in apply_awq_marlin_linear
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     output = ops.gptq_marlin_gemm(reshaped_x,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/vllm/_custom_ops.py", line 741, in gptq_marlin_gemm
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]   File "/usr/local/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] RuntimeError: A is not contiguous

Command used:

export VLLM_USE_TRITON_FLASH_ATTN=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_FLASHINFER_FORCE_TENSOR_CORES=1
export VLLM_USE_V1=1
export VLLM_ENABLE_V1_MULTIPROCESSING=1
export VLLM_ATTENTION_BACKEND=FLASHMLA
export TORCH_CUDA_ARCH_LIST=9.0
vllm serve /DeepSeek-R1-awq \
    --host 0.0.0.0 \
    --port 8080 \
    --trust-remote-code \
    --max-model-len 65536 \
    --max-num-batched-tokens 65536 \
    --max-seq-len-to-capture 65536 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64 \
    --served-model-name DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --enable-reasoning \
    --reasoning-parser deepseek_r1 \
    -q awq_marlin

v2ray

Mar 25, 2025

@VenomEY https://github.com/vllm-project/vllm/pull/14658

v2ray changed discussion status to closed Mar 25, 2025

NaiveYan

Mar 26, 2025

@v2ray v0.8.2 merged this PR but did not resolve the issue.

v2ray

Mar 26, 2025

You need to merge all three PRs; one of them switches to the Marlin kernel, which supports non-contiguous input.
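For readers unfamiliar with the term in `RuntimeError: A is not contiguous`: a tensor is contiguous when its elements sit in memory in plain row-major order with no gaps, and a view such as a column slice usually is not. The sketch below illustrates that rule in pure Python. It is not vLLM or PyTorch code, and the helpers `row_major_strides` and `is_contiguous` are hypothetical names used only for illustration.

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major array."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """Return True if (shape, strides) describes a dense row-major layout."""
    if 0 in shape:
        return True  # an empty array has no elements to misplace
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        # Dimensions of size 1 never move the pointer, so their stride is irrelevant.
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


full_shape = (4, 6)
full_strides = row_major_strides(full_shape)  # (6, 1)

# Keeping only the first 3 columns of each row shrinks the shape,
# but the view still steps 6 elements between rows.
view_shape, view_strides = (4, 3), full_strides

print(is_contiguous(full_shape, full_strides))                   # True
print(is_contiguous(view_shape, view_strides))                   # False
print(is_contiguous(view_shape, row_major_strides(view_shape)))  # True (layout after a dense copy)
```

A kernel that requires contiguous input rejects the second case unless the data is first copied into a dense layout; a kernel that accepts strided input, as described for the Marlin kernel above, can read the view directly.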
