| --- |
| license: llama3.3 |
| --- |
| |
| The original [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model quantized using AutoAWQ. Follow the instruction [here](https://docs.vllm.ai/en/latest/quantization/auto_awq.html). |
|
|
| ``` |
| from awq import AutoAWQForCausalLM |
| from transformers import AutoTokenizer |
| |
| model_path = 'meta-llama/Llama-3.3-70B-Instruct' |
| quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit' |
| quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } |
| |
| # Load model |
| model = AutoAWQForCausalLM.from_pretrained( |
| model_path, **{"low_cpu_mem_usage": True, "use_cache": False} |
| ) |
| tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
| |
| # Quantize |
| model.quantize(tokenizer, quant_config=quant_config) |
| |
| # Save quantized model |
| model.save_quantized(quant_path) |
| tokenizer.save_pretrained(quant_path) |
| ``` |
|
|
|
|
| vLLM serve |
| ``` |
| vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \ |
| --swap-space 16 \ |
| --disable-log-requests \ |
| --tokenizer meta-llama/Llama-3.3-70B-Instruct \ |
| --tensor-parallel-size 2 |
| ``` |
|
|
|
|
| Benchmark |
| ``` |
| python benchmark_serving.py \ |
| --backend vllm \ |
| --model lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \ |
| --tokenizer meta-llama/Meta-Llama-3-70B \ |
| --dataset-name sharegpt \ |
| --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \ |
| --num-prompts 1000 |
| |
| ============ Serving Benchmark Result ============ |
| Successful requests: 902 |
| Benchmark duration (s): 128.07 |
| Total input tokens: 177877 |
| Total generated tokens: 182359 |
| Request throughput (req/s): 7.04 |
| Output token throughput (tok/s): 1423.85 |
| Total Token throughput (tok/s): 2812.71 |
| ---------------Time to First Token---------------- |
| Mean TTFT (ms): 47225.59 |
| Median TTFT (ms): 43313.95 |
| P99 TTFT (ms): 105587.66 |
| -----Time per Output Token (excl. 1st token)------ |
| Mean TPOT (ms): 141.01 |
| Median TPOT (ms): 148.94 |
| P99 TPOT (ms): 174.16 |
| ---------------Inter-token Latency---------------- |
| Mean ITL (ms): 131.55 |
| Median ITL (ms): 150.82 |
| P99 ITL (ms): 344.50 |
| ================================================== |
| ``` |