Are there any updates to the recommended commands?
#27
by NaiveYan - opened
I tested the command in the current README with vLLM v0.8.0 (on 8 x A800 GPUs), but it only returns garbled text.
Are there any updates to the recommended commands, or are there other inference engines you would suggest?
Hardware: 8 x H800
Software: vllm==0.8.1
When I run the command below with the V1 engine, inputs longer than 4k fail with the following error:
File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 688, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] model_output = self.forward(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 626, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] def forward(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return fn(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] raise e
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "<eval_with_key>.124", line 2186, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] raise e
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "<eval_with_key>.2", line 5, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(x_7, x_11, k_pe, output_5, 'model.layers.0.self_attn.attn'); x_7 = x_11 = k_pe = output_5 = unified_attention_with_output = None
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/attention/layer.py", line 363, in unified_attention_with_output
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] self.impl.forward(self,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 929, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] output[num_decode_tokens:] = self._forward_prefill(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 826, in _forward_prefill
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] context_output, context_lse = self._compute_prefill_context( \
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/v1/attention/backends/mla/common.py", line 742, in _compute_prefill_context
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] kv_nope = self.kv_b_proj(kv_c_normed)[0].view( \
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 303, in apply
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return apply_awq_marlin_linear(
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 379, in apply_awq_marlin_linear
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] output = ops.gptq_marlin_gemm(reshaped_x,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/vllm/_custom_ops.py", line 741, in gptq_marlin_gemm
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, b_zeros,
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] File "/usr/local/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] return self._op(*args, **(kwargs or {}))
(VllmWorker rank=0 pid=94399) ERROR 03-25 02:28:56 [multiproc_executor.py:375] RuntimeError: A is not contiguous
Command used:
export VLLM_USE_TRITON_FLASH_ATTN=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_FLASHINFER_FORCE_TENSOR_CORES=1
export VLLM_USE_V1=1
export VLLM_ENABLE_V1_MULTIPROCESSING=1
export VLLM_ATTENTION_BACKEND=FLASHMLA
export TORCH_CUDA_ARCH_LIST=9.0
vllm serve /DeepSeek-R1-awq --host 0.0.0.0 --port 8080 --trust-remote-code --max-model-len 65536 --max-num-batched-tokens 65536 --max-seq-len-to-capture 65536 --gpu-memory-utilization 0.95 --max-num-seqs 64 --served-model-name DeepSeek-R1 --tensor-parallel-size 8 --enable-reasoning --reasoning-parser deepseek_r1 -q awq_marlin
v2ray changed discussion status to closed
You need to merge all 3 PRs; one of them switches to the Marlin kernel, which supports non-contiguous input.
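For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "contiguous" means for a tensor layout. It only illustrates the concept behind the `RuntimeError: A is not contiguous` message, where `A` appears to be the activation passed to `gptq_marlin_gemm` in the traceback. It is not vLLM's or PyTorch's actual check, and the helper names `contiguous_strides` and `is_contiguous` are made up for this example. In PyTorch itself the corresponding calls are `Tensor.is_contiguous()` and `Tensor.contiguous()`.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major strides (in elements) of a densely packed tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Simplified check: strides must match the dense row-major layout.

    Dimensions of size 1 are ignored because their stride is never used.
    """
    expected = contiguous_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


# A dense [4, 8] matrix has strides (8, 1).
dense_shape = (4, 8)
dense_strides = contiguous_strides(dense_shape)

# Taking the first 4 columns gives a [4, 4] view that still steps
# 8 elements per row, so it is no longer densely packed.
view_shape = (4, 4)
view_strides = dense_strides

print(is_contiguous(dense_shape, dense_strides))  # True
print(is_contiguous(view_shape, view_strides))    # False
```

A kernel that assumes a dense layout must either reject such a view, as the error above does, or be handed a packed copy of it. A kernel that accepts strided input avoids both, which is presumably why switching kernels resolves the error.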