Qwen3-VL-30B-A3B-Instruct-NVFP4

NVFP4 quantization using llm-compressor v0.8.2.dev28+g0f346cf7 (and transformers v4.57.1), based on the official NVFP4 example script for Qwen3-VL-235B-A22B-Instruct.

Dataset adjustments

  • The model ID was changed from Qwen/Qwen3-VL-235B-A22B-Instruct to Qwen/Qwen3-VL-30B-A3B-Instruct
  • The number of calibration samples was increased from 20 to 512 (see the sketch below)
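
For reference, a minimal sketch of those two deltas; the constant names and llm-compressor calls mirror the example script's conventions, and everything not shown (model/processor loading, calibration dataset, data collator, exact ignore list) is assumed to stay as in the official script:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# The two values that differ from the official 235B example script:
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"   # was "Qwen/Qwen3-VL-235B-A22B-Instruct"
NUM_CALIBRATION_SAMPLES = 512                 # was 20

# NVFP4 recipe on Linear layers; take the exact ignore list (lm_head, vision
# tower, MoE gates) from the official example rather than from this sketch.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# oneshot() is then run with the model and calibration data prepared as in the
# official example, e.g.:
# oneshot(model=model, dataset=ds, recipe=recipe,
#         num_calibration_samples=NUM_CALIBRATION_SAMPLES,
#         data_collator=data_collator)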

vLLM execution

As of v0.13.0, no special execution configuration is needed anymore. You can simply launch it this way:

docker run -ti --name Qwen3-VL-30B-A3B-NVFP4-v0.13.0 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -p 8000:8000 "vllm/vllm-openai:v0.13.0" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --async-scheduling --enable-auto-tool-choice --tool-call-parser hermes

The version is pinned to v0.13.0 because that is what we tested on our side; feel free to try newer vLLM releases as they come out.
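
Once the server is up, it exposes the standard OpenAI-compatible API on port 8000. A minimal sketch of a multimodal request with the openai Python client (the image URL and prompt are placeholders, not part of this model card):

from openai import OpenAI

# The model name must match --served-model-name from the docker command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/some_image.jpg"}},  # placeholder
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
)
print(response.choices[0].message.content)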

A note for 5090 owners

While you can run the model on this card, you will have to limit its context, as the model with its full context will not completely fit on the card. This is especially true if you run it under WSL, since you need to keep some VRAM free for the host OS.

Example for Windows/WSL:

docker run -ti --name Qwen3-VL-30B-A3B-NVFP4-v0.13.0 --gpus all -v 'E:\cache:/root/.cache' -p 8000:8000 "vllm/vllm-openai:v0.13.0" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --gpu-memory-utilization 0.8 --async-scheduling --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 56K --limit-mm-per-prompt.image 3 --limit-mm-per-prompt.video 0
  1. Execute the PowerShell command after reviewing it.

a. Adjust E:\cache to a folder of your liking. It will contain the Hugging Face download cache, the vLLM cache (mostly for torch compilation), and a few other folders you will want to keep between starts.

b. gpu-memory-utilization and max-model-len have been adjusted for the 32 GiB limit of the RTX 5090 and the fact that the host system still needs a piece of it.

c. limit-mm-per-prompt has been adjusted to match the model length limitation (max 3 images and 0 videos).

  2. Once the service has successfully started, press CTRL-C to stop the container. You can close the PowerShell terminal; it was only needed to set the container start flags.
  3. Now open Docker Desktop and simply press the start button of the Qwen3-VL-30B-A3B-NVFP4-v0.13.0 container. You can then manage it from the UI whenever you need it (a quick sanity check is sketched after this list).
  4. Enjoy fast NVFP4 inference!
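
If you want to confirm the restarted container is serving correctly, here is a quick sanity check against the same OpenAI-compatible endpoint (assumes the openai Python package is installed):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Lists the models exposed by the vLLM server; you should see the name set
# with --served-model-name (Qwen3-VL-30B-A3B).
for m in client.models.list():
    print(m.id)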