Update vllm_plugin/SERVING.md

f359aa5 verified about 2 months ago

3.82 kB

	# Serving CloverLM with vLLM (Quartet II NVFP4)

	## Prerequisites

	Before following this guide, first set up the environment as described in `lm_eval/README.md`.

	- NVIDIA Blackwell GPU (B300 / B200 / RTX 5090) for real Quartet II NVFP4 kernels
	- CUDA 13.0+
	- Python 3.11+
	- The Quartet II kernels (`quartet2` package) installed

	## 1. Environment Setup

	```bash
	# Activate the existing environment
	source .venv/bin/activate

	# Set CUDA paths
	export CUDA_HOME=/usr/local/cuda-13.0/
	export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
	export PATH=/usr/local/cuda/bin:$PATH
	export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
	```

	## 2. Install vLLM

	```bash
	export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest \
	\| jq -r .tag_name \| sed 's/^v//')
	export CUDA_VERSION=130
	export CPU_ARCH=$(uname -m)

	uv pip install \
	"https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl" \
	--extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
	```

	## 3. Serve the Model

	### Offline inference (quick test)

	```bash
	cd CloverLM/vllm_plugin
	python serve.py
	```

	### OpenAI-compatible API server

	```bash
	cd CloverLM/vllm_plugin
	python serve.py --api --port 8000
	```

	Then query:

	```bash
	curl http://localhost:8000/v1/completions \
	-H "Content-Type: application/json" \
	-d '{
	"model": "path/to/CloverLM",
	"prompt": "The capital of France is",
	"max_tokens": 64,
	"temperature": 0.8
	}'
	```

	### Options

	\| Flag \| Default \| Description \|
	\|------\|---------\|-------------\|
	\| `--model` \| `../` (CloverLM dir) \| Path to CloverLM model directory \|
	\| `--api` \| off \| Start OpenAI-compatible API server \|
	\| `--port` \| 8000 \| API server port \|
	\| `--host` \| 0.0.0.0 \| API server host \|
	\| `--tp` \| 1 \| Tensor parallel size \|
	\| `--max-model-len` \| 1024 \| Maximum context length \|
	\| `--gpu-memory-utilization` \| 0.9 \| GPU memory fraction to use \|

	## Architecture

	The vLLM integration consists of three components:

	1. `quartet2_quant.py` -- Quartet II quantization plugin registered as `"quartet2"`.
	Wraps the Quartet II on-the-fly FP4 quantization (`quant_fp4` + `flashinfer.mm_fp4`)
	into vLLM's `LinearMethodBase` interface. Weights stay in bf16; quantization happens
	at each forward pass.

	2. `cloverlm_vllm.py` -- Full vLLM model implementation with paged KV cache.
	Reimplements CloverLM's architecture using vLLM primitives:
	- `ColumnParallelLinear` / `RowParallelLinear` for Q/K/V/O and MLP projections
	- vLLM `Attention` for paged KV caching and efficient attention
	- Custom RoPE (base 1024, repeat_interleave pattern)
	- Sphere normalization on Q/K before attention
	- Per-head learnable scale parameter
	- Squared ReLU activation in MLP
	- Post-sublayer RMSNorm (not pre-norm)

	3. `serve.py` -- Entry point that registers both the quantization plugin and model,
	then launches vLLM in offline or API mode.

	## Known Limitations

	- CUDA graphs: Currently `enforce_eager=True` is required because the Quartet II
	on-the-fly quantization kernels (`quant_fp4` + `mm_fp4`) are not compatible with
	CUDA graph capture. This means slightly higher per-token latency compared to
	CUDA-graph-enabled models. A future update to the Quartet II kernels could remove
	this limitation.

	## Troubleshooting

	"No module named 'quartet2'": Ensure the Quartet II kernels are installed:
	```bash
	uv pip install "quartet2 @ git+https://github.com/IST-DASLab/Quartet-II.git#subdirectory=kernels"
	```

	CUDA errors: Make sure `CUDA_HOME` points to CUDA 13.0+ and `TRITON_PTXAS_PATH` is set.

	Out of memory: Reduce `--gpu-memory-utilization` or use `--tp 2` for tensor parallelism.