# Qwen3-VL-30B-A3B-Instruct-NVFP4
NVFP4 quantization using llm-compressor v0.8.2.dev28+g0f346cf7 (and transformers v4.57.1), based on the official NVFP4 example script for Qwen3-VL-235B-A22B-Instruct.
## Dataset adjustments
- The model ID has obviously been changed from `Qwen/Qwen3-VL-235B-A22B-Instruct` to `Qwen/Qwen3-VL-30B-A3B-Instruct`
- The number of calibration samples has been increased from 20 to 512 (see the sketch below)
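For reference, a minimal sketch of what the adjusted script looks like, following the general structure of the official llm-compressor NVFP4 examples. The model class, ignore list, and calibration dataset handling are simplified assumptions here, not the exact script:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Changed from "Qwen/Qwen3-VL-235B-A22B-Instruct"
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"
# Increased from 20 in the original example
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 weight+activation quantization of the Linear layers; the vision tower
# and MoE router gates are kept in higher precision (this ignore list is an
# assumption, the real patterns come from the upstream example script).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$"],
)

# Calibration dataset handling is simplified; the upstream multimodal examples
# preprocess the dataset and pass a data collator as well.
oneshot(
    model=model,
    dataset="flickr30k",
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Qwen3-VL-30B-A3B-Instruct-NVFP4", save_compressed=True)
processor.save_pretrained("Qwen3-VL-30B-A3B-Instruct-NVFP4")
```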
## vLLM execution
As of v0.13.0, no special execution configuration is needed anymore. You can simply launch it this way:

```bash
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4-v0.13.0 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -p 8000:8000 "vllm/vllm-openai:v0.13.0" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --async-scheduling --enable-auto-tool-choice --tool-call-parser hermes
```
The version is pinned to v0.13.0 because that is what has been tested on our side; feel free to try newer versions of vLLM when they come out.
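Once the container is up, you can smoke-test the OpenAI-compatible endpoint. A minimal sketch using the `openai` Python client; the API key is a dummy value since vLLM does not require authentication by default:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on the published port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Describe NVFP4 quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```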
## A note for 5090 owners
While you can execute the model on this card, you will have to limit its context, as the model will not completely fit on the card. This is especially true if you run it under WSL, since you need to keep some VRAM free for the host OS.
Example for Windows/WSL:

```powershell
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4-v0.13.0 --gpus all -v 'E:\cache:/root/.cache' -p 8000:8000 "vllm/vllm-openai:v0.13.0" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --gpu-memory-utilization 0.8 --async-scheduling --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 56K --limit-mm-per-prompt.image 3 --limit-mm-per-prompt.video 0
```
- Execute the PowerShell command after reviewing it:
  a. Adjust `E:\cache` to a folder of your liking. It will contain the Hugging Face download cache and the vLLM cache (mostly for torch compilation), but also a bunch of other folders you want to keep between starts.
  b. `gpu-memory-utilization` and `max-model-len` have been adjusted to the 32 GiB limit of the RTX 5090 and the fact that the host system still needs a piece of it.
  c. `limit-mm-per-prompt` has been adjusted to match the model length limitation (max 3 images and 0 videos); see the sketch after these steps.
- Once the service has successfully started, CTRL-C the execution to stop the container. You can close the PowerShell terminal; it was only needed to set the container start flags.
- Now open Docker Desktop and simply press the start button of the `Qwen3-VL-30B-A3B-NVFP4-v0.13.0` container. You can now simply manage it through the UI whenever you need it.
- Enjoy fast NVFP4 inference!
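To verify the multimodal limits from step c, here is a minimal sketch that sends one image through the same endpoint; the image URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With --limit-mm-per-prompt.image 3, up to three image entries may be
# included per request; any video entry would be rejected.
response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```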