# Serving CloverLM with vLLM (Quartet II NVFP4)

## Prerequisites

Before following this guide, set up the environment as described in `lm_eval/README.md`. You also need:

- NVIDIA Blackwell GPU (B300 / B200 / RTX 5090) for real Quartet II NVFP4 kernels
- CUDA 13.0+
- Python 3.11+
- The Quartet II kernels (`quartet2` package) installed
## 1. Environment Setup

```bash
# Activate the existing environment
source .venv/bin/activate

# Point every CUDA-related path at the same CUDA 13.0 installation
export CUDA_HOME=/usr/local/cuda-13.0
export TRITON_PTXAS_PATH="${CUDA_HOME}/bin/ptxas"
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH:-}"
```
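Before installing anything on top of this environment, it can help to confirm that the variables point at real files. The following is a minimal sketch using only the Python standard library; the helper name `check_cuda_env` is hypothetical and not part of the CloverLM repository.

```python
import os
from pathlib import Path


def check_cuda_env(env=None):
    """Return a list of problems found in the CUDA-related environment variables.

    Hypothetical helper: it only inspects CUDA_HOME and TRITON_PTXAS_PATH,
    the two variables set in the Environment Setup step.
    """
    env = os.environ if env is None else env
    problems = []

    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    elif not Path(cuda_home).is_dir():
        problems.append(f"CUDA_HOME is not a directory: {cuda_home}")

    ptxas = env.get("TRITON_PTXAS_PATH")
    if not ptxas:
        problems.append("TRITON_PTXAS_PATH is not set")
    elif not (Path(ptxas).is_file() and os.access(ptxas, os.X_OK)):
        problems.append(f"TRITON_PTXAS_PATH is not an executable file: {ptxas}")

    return problems


if __name__ == "__main__":
    for problem in check_cuda_env():
        print("WARNING:", problem)
```

Running the script in the activated shell prints one warning per misconfigured variable and nothing when both paths resolve.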
## 2. Install vLLM

```bash
# Resolve the latest vLLM release tag (requires curl and jq)
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest \
  | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=130
export CPU_ARCH=$(uname -m)

# Install the matching CUDA 13.0 wheel from the GitHub release
uv pip install \
  "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
```
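The install command resolves whatever release is latest at the time it runs. To pin a specific release instead, the wheel URL can be built from the same template. The sketch below mirrors that template in Python; the function name `vllm_wheel_url` is hypothetical, and the version in the usage line is a placeholder rather than a tested release.

```python
def vllm_wheel_url(version, cuda="130", arch="x86_64"):
    """Build a vLLM release-wheel URL from the template used in the install step.

    Hypothetical helper: `version` may be given with or without a leading "v".
    """
    version = version.removeprefix("v")
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{version}/vllm-{version}+cu{cuda}-cp38-abi3-manylinux_2_35_{arch}.whl"
    )


if __name__ == "__main__":
    # "1.2.3" is a placeholder version, not a recommendation.
    print(vllm_wheel_url("v1.2.3"))
```

The printed URL can be passed to `uv pip install` together with the same `--extra-index-url` option.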
## 3. Serve the Model

### Offline inference (quick test)

```bash
cd CloverLM/vllm_plugin
python serve.py
```

### OpenAI-compatible API server

```bash
cd CloverLM/vllm_plugin
python serve.py --api --port 8000
```

Then query the server:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "path/to/CloverLM",
    "prompt": "The capital of France is",
    "max_tokens": 64,
    "temperature": 0.8
  }'
```
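The same request can be sent from Python without extra dependencies. This is a minimal sketch using `urllib` from the standard library; the function names are hypothetical, the `model` value is the same placeholder as in the curl example, and the response parsing assumes the usual OpenAI-style completions layout (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt, model="path/to/CloverLM",
                             base_url="http://localhost:8000",
                             max_tokens=64, temperature=0.8):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the API server running, `complete("The capital of France is")` returns the continuation as a string.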
### Options

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | `../` (CloverLM dir) | Path to CloverLM model directory |
| `--api` | off | Start OpenAI-compatible API server |
| `--port` | 8000 | API server port |
| `--host` | 0.0.0.0 | API server host |
| `--tp` | 1 | Tensor parallel size |
| `--max-model-len` | 1024 | Maximum context length |
| `--gpu-memory-utilization` | 0.9 | GPU memory fraction to use |
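To make the defaults concrete, here is a hypothetical `argparse` mirror of the flags listed above, together with one plausible mapping onto vLLM engine keyword arguments (`tensor_parallel_size`, `max_model_len`, `gpu_memory_utilization`, `enforce_eager`). It is a sketch only: the actual `serve.py` may parse and forward its options differently.

```python
import argparse


def build_parser():
    """Hypothetical parser reproducing the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the serve.py options")
    parser.add_argument("--model", default="../", help="Path to CloverLM model directory")
    parser.add_argument("--api", action="store_true", help="Start OpenAI-compatible API server")
    parser.add_argument("--port", type=int, default=8000, help="API server port")
    parser.add_argument("--host", default="0.0.0.0", help="API server host")
    parser.add_argument("--tp", type=int, default=1, help="Tensor parallel size")
    parser.add_argument("--max-model-len", type=int, default=1024, help="Maximum context length")
    parser.add_argument("--gpu-memory-utilization", type=float, default=0.9,
                        help="GPU memory fraction to use")
    return parser


def to_engine_kwargs(args):
    """Map parsed flags to vLLM engine keyword arguments (assumed mapping)."""
    return {
        "model": args.model,
        "tensor_parallel_size": args.tp,
        "max_model_len": args.max_model_len,
        "gpu_memory_utilization": args.gpu_memory_utilization,
        # Eager mode is required according to the Known Limitations section.
        "enforce_eager": True,
    }
```

For example, `to_engine_kwargs(build_parser().parse_args(["--tp", "2"]))` yields a dictionary with `tensor_parallel_size` set to 2 and every other value at its documented default.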
## Architecture

The vLLM integration consists of three components:

1. **`quartet2_quant.py`** -- Quartet II quantization plugin registered as `"quartet2"`.
   Wraps the Quartet II on-the-fly FP4 quantization (`quant_fp4` + `flashinfer.mm_fp4`)
   into vLLM's `LinearMethodBase` interface. Weights stay in bf16; quantization happens
   at each forward pass.
2. **`cloverlm_vllm.py`** -- Full vLLM model implementation with paged KV cache.
   Reimplements CloverLM's architecture using vLLM primitives:
   - `ColumnParallelLinear` / `RowParallelLinear` for Q/K/V/O and MLP projections
   - vLLM `Attention` for paged KV caching and efficient attention
   - Custom RoPE (base 1024, repeat_interleave pattern)
   - Sphere normalization on Q/K before attention
   - Per-head learnable scale parameter
   - Squared ReLU activation in MLP
   - Post-sublayer RMSNorm (not pre-norm)
3. **`serve.py`** -- Entry point that registers both the quantization plugin and model,
   then launches vLLM in offline or API mode.
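Three of the building blocks named in the component list (sphere normalization, squared ReLU, and RoPE with base 1024) are small enough to illustrate in plain Python. The sketch below is not the code in `cloverlm_vllm.py`: the function names are hypothetical, it operates on single vectors rather than batched tensors, and it assumes that "repeat_interleave pattern" means each frequency is shared by an adjacent pair of dimensions.

```python
import math


def sphere_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm, i.e. onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def squared_relu(vec):
    """Squared ReLU: max(x, 0) ** 2 applied elementwise."""
    return [max(x, 0.0) ** 2 for x in vec]


def rope_interleaved(vec, position, base=1024.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles.

    Assumption: pair i // 2 uses frequency base ** (-i / dim), which matches
    repeating each frequency twice in an interleaved fashion.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * cos_t - b * sin_t, a * sin_t + b * cos_t])
    return out


if __name__ == "__main__":
    q = sphere_normalize([3.0, 4.0, 0.0, 0.0])
    print(rope_interleaved(q, position=5))
    print(squared_relu([-1.0, 2.0]))
```

Because RoPE is a pure rotation, it preserves the unit norm produced by sphere normalization, so Q and K stay on the sphere regardless of which of the two steps is applied first.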
## Known Limitations

- **CUDA graphs**: Currently `enforce_eager=True` is required because the Quartet II
  on-the-fly quantization kernels (`quant_fp4` + `mm_fp4`) are not compatible with
  CUDA graph capture. This means slightly higher per-token latency compared to
  CUDA-graph-enabled models. A future update to the Quartet II kernels could remove
  this limitation.
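For readers unfamiliar with what an on-the-fly FP4 quantization step computes, the following is a simplified "fake quantization" sketch in plain Python. It assumes the commonly described NVFP4 layout: values rounded to the FP4 E2M1 grid in blocks of 16 elements with one scale per block. It deliberately keeps the scale as an ordinary float, omits any second-level tensor scale, and makes no claim about what `quant_fp4` does internally.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size


def fake_quantize_block(block):
    """Round one block to the FP4 grid and map it back to floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Scale so that the largest magnitude lands on the top of the grid (6.0).
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for x in block:
        # Nearest grid point; ties go to the smaller magnitude in this sketch.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

Comparing `fake_quantize(xs)` with `xs` gives a rough feel for the rounding error that a block-scaled 4-bit format introduces on each forward pass.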
## Troubleshooting

**"No module named 'quartet2'"**: Ensure the Quartet II kernels are installed:

```bash
uv pip install "quartet2 @ git+https://github.com/IST-DASLab/Quartet-II.git#subdirectory=kernels"
```

**CUDA errors**: Make sure `CUDA_HOME` points to CUDA 13.0+ and `TRITON_PTXAS_PATH` is set.

**Out of memory**: Reduce `--gpu-memory-utilization` or use `--tp 2` for tensor parallelism.
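A quick way to catch the missing-module case before launching the server is to probe for the required packages without importing them. This is a minimal sketch; the function name `missing_modules` is hypothetical, and the default list is taken from the package names mentioned in this guide (`quartet2`, `flashinfer`, `vllm`).

```python
import importlib.util


def missing_modules(names=("quartet2", "flashinfer", "vllm")):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If `quartet2` appears in the output, install the kernels with the `uv pip install` command shown above.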