Instructions to use unsloth/Qwen3-Coder-Next-FP8-Dynamic with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use unsloth/Qwen3-Coder-Next-FP8-Dynamic with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/Qwen3-Coder-Next-FP8-Dynamic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-Next-FP8-Dynamic")
model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3-Coder-Next-FP8-Dynamic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use unsloth/Qwen3-Coder-Next-FP8-Dynamic with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/Qwen3-Coder-Next-FP8-Dynamic"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/Qwen3-Coder-Next-FP8-Dynamic",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Use Docker
```bash
docker model run hf.co/unsloth/Qwen3-Coder-Next-FP8-Dynamic
```
- SGLang
How to use unsloth/Qwen3-Coder-Next-FP8-Dynamic with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "unsloth/Qwen3-Coder-Next-FP8-Dynamic" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/Qwen3-Coder-Next-FP8-Dynamic",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "unsloth/Qwen3-Coder-Next-FP8-Dynamic" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/Qwen3-Coder-Next-FP8-Dynamic",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
- Unsloth Studio
How to use unsloth/Qwen3-Coder-Next-FP8-Dynamic with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3-Coder-Next-FP8-Dynamic to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3-Coder-Next-FP8-Dynamic to start chatting
```
Using HuggingFace Spaces for Unsloth
No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for unsloth/Qwen3-Coder-Next-FP8-Dynamic to start chatting.
Load model with FastModel
```bash
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-Next-FP8-Dynamic",
    max_seq_length=2048,
)
```
- Docker Model Runner
How to use unsloth/Qwen3-Coder-Next-FP8-Dynamic with Docker Model Runner:
```bash
docker model run hf.co/unsloth/Qwen3-Coder-Next-FP8-Dynamic
```
Inconsistent output (resolved)
UPD: Sorry, guys. It was my setup causing the inconsistent outputs. vLLM produces garbage when running on cards from different generations. Qwen's FP8 just had fewer errors, possibly due to different data types and less overflow / precision loss in Ampere <-> Ada communication.
Compared to Qwen's "official" FP8 quant, this one tends to add redundant characters to text output.
For example, I tested with a vLLM nightly build and the recommended sampling parameters on the following question:
is /users/me endpoint a bad practice?
This results in the following issues in the output:
Forgetting to require auth → anyone gets someonesomeone'’s data*Use Vary: Authorization, avoid server-side caching per endpoint without per-user granularitycache keys�💡 Alternatives & Complements:�✅ Best Practices for /users/meHowever, whether it's *appropriate* depends on **context, **security considerations**, **consistency**, and **implementation quality**. Here’s a balanced breakdown:
There are broken unicode characters, missing closing tags (**context without a closing **), repetitions inside words (someonesomeone), and missing spaces.
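To make the comparison less eyeball-driven, here is a quick heuristic checker for these artifact types — my own sketch, not a standard tool, and the patterns are rough approximations:

```python
# Heuristic artifact checker for the issues listed above -- a rough
# sketch, not a standard tool; expect some false positives.
import re

def flag_artifacts(text: str) -> list[str]:
    issues = []
    if "\ufffd" in text:                        # U+FFFD replacement character
        issues.append("broken unicode (replacement char)")
    if re.search(r"[\u4e00-\u9fff]", text):     # stray CJK codepoints
        issues.append("out-of-place CJK characters")
    if text.count("**") % 2:                    # unbalanced bold markers
        issues.append("unclosed ** tag")
    if re.search(r"\b(\w{4,})\1\b", text):      # doubled fragments, e.g. someonesomeone
        issues.append("repeated word fragment")
    return issues

print(flag_artifacts("someonesomeone's data **context"))
# -> ['unclosed ** tag', 'repeated word fragment']
```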
Changing the sampling parameters doesn't affect these issues. With temp=0.0 the output has far more mistakes than with temp=1.0.
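For reference, this is roughly how to sweep temperatures against the OpenAI-compatible endpoint started above — a sketch, assuming the sampling values recommended for the Qwen3-Coder family (temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05; verify against the model card). vLLM accepts its extra sampling knobs through `extra_body`:

```python
# Sketch of a temperature sweep against the vLLM server from the
# instructions above; sampling values are assumed, not authoritative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for temp in (0.0, 0.7, 1.0):
    resp = client.chat.completions.create(
        model="unsloth/Qwen3-Coder-Next-FP8-Dynamic",
        messages=[{"role": "user", "content": "is /users/me endpoint a bad practice?"}],
        temperature=temp,
        top_p=0.8,
        # vLLM-specific sampling parameters go through extra_body:
        extra_body={"top_k": 20, "repetition_penalty": 1.05},
    )
    print(f"--- temp={temp} ---")
    print(resp.choices[0].message.content)
```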
But despite this, the model still performs well in agentic tasks with OpenCode, and I don't know how 🫥
Oh hey! Yes this is expected a bit - Qwen or https://huggingface.co/unsloth/Qwen3-Coder-Next-FP8 uses block [128, 128] FP8 whilst this one uses FP8 per channel - this is I think 8-10% faster.
We actually ran a benchmark for Qwen3-8B, e.g.: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning
We plan in the future to mix block and per-row/column scaling to make it slightly more accurate.
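For intuition, here is a minimal NumPy sketch of the trade-off (purely illustrative, not Unsloth's actual quantization code): per-channel keeps one scale per output row, which folds cheaply into the GEMM epilogue, while block [128, 128] confines each outlier to its own tile. The matrix shape, the injected outlier column, and the uniform-grid rounding are all illustrative assumptions.

```python
# Minimal NumPy sketch of the two FP8 scaling granularities -- purely
# illustrative, not Unsloth's actual quantization code.
import numpy as np

FP8_MAX = 448.0  # max representable value of float8_e4m3
W = np.random.randn(4096, 4096).astype(np.float32)
W[:, 0] = 50.0   # simulate an outlier channel, common in real LLM weights

def quant_error(W, scale):
    # Crude uniform-grid stand-in for the e4m3 cast: scale, round, rescale.
    Wq = np.clip(np.round(W / scale), -FP8_MAX, FP8_MAX) * scale
    return np.abs(Wq - W).mean()

# Per-channel: one scale per output row. Cheap to fold into the GEMM
# epilogue (hence the ~8-10% speedup), but the outlier column stretches
# the quantization step of every single row.
s_channel = np.abs(W).max(axis=1, keepdims=True) / FP8_MAX

# Block [128, 128]: one scale per tile, so the outlier only coarsens the
# 32 tiles it actually lives in (1/32 of the matrix).
tiles = W.reshape(32, 128, 32, 128)  # (row tile, row, col tile, col)
s_block = np.abs(tiles).max(axis=(1, 3), keepdims=True) / FP8_MAX
s_block = np.broadcast_to(s_block, tiles.shape).reshape(W.shape)

print("per-channel error:", quant_error(W, s_channel))
print("block [128,128] error:", quant_error(W, s_block))
```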
I didn't notice any formatting/spelling issues yet. However, I haven't used the model outside an agent harness, meaning there are always 10k+ tokens of instructions in my context, including instructions about the expected output format. The only potentially related issue I have is that, despite detailed instructions, qwen3-coder-next-fp8-dynamic isn't very consistent with Codex's 'apply_patch' tool. It doesn't mess up the tool call itself, but the tool input argument (essentially a diff file) is often wrong. I'll try the block-wise FP8 to be able to compare...
Yes this is expected a bit [..]
So you also observed these formatting/spelling issues? Are other unsloth qwen3-coder-next quants also showing this? To me it's unexpected. I assumed minor accuracy issues in larger models would show up differently (a slightly higher tendency to confuse something, rambling, an increased chance of failed tool calls, etc.). Maybe this is something else (an inference bug)?
FYI: I encountered 2 lone out-of-place Chinese characters in the output of the Qwen-provided FP8 version. Against my intuition, it therefore might just be a property of this model to show such token-based/formatting errors under loss of accuracy; after all, it only has 3B active parameters.
Sorry, guys. It was my setup causing the inconsistent outputs. vLLM produces garbage when running on cards from different generations. Qwen's FP8 just had fewer errors, possibly due to different data types and less overflow / precision loss in Ampere <-> Ada communication.
If you could update your parent thread that would be awesome thanks! :)