| # Stack 2.9 Optimization Guide |
|
|
| This guide covers optimizing Stack 2.9 for fast, efficient inference while maintaining quality. |
|
|
| ## Overview |
|
|
| Stack 2.9 can be quantized from 64 GB (bfloat16) down to ~18 GB (4-bit) while preserving roughly 90-95% of its quality, which makes deployment on a single 24 GB consumer GPU possible. |
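The size figures above follow from simple arithmetic: parameters times bits per weight. The sketch below assumes a model of roughly 32 billion parameters (an assumption inferred from the 64 GB bfloat16 size, not a number stated in this guide). The raw 4-bit figure comes out below ~18 GB because quantized formats typically also store per-block scales and keep some layers in higher precision.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: ~32B parameters, inferred from 64 GB at 16 bits per weight.
NUM_PARAMS = 32e9

for name, bits in [("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    size = estimate_weight_size_gb(NUM_PARAMS, bits)
    print(f"{name}: {size:.0f} GB of raw weights")
```

Treat the result as a lower bound: activations and the KV cache need additional memory at inference time.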
|
|
| ## Quick Start |
|
|
| ```bash |
| # 1. Quantize the model |
| python quantize.py \ |
| --model-path ./output/stack-2.9-merged \ |
| --output-path ./output/stack-2.9-quantized \ |
| --method bnb \ |
| --bits 4 |
| |
| # 2. Benchmark the optimized model |
| python benchmark_optimized.py \ |
| --optimized-model ./output/stack-2.9-quantized |
| |
| # 3. Upload to HuggingFace |
| python upload_hf.py \ |
| --model-path ./output/stack-2.9-quantized \ |
| --repo-id your-username/stack-2.9 |
| ``` |
|
|
| ## Quantization Methods |
|
|
| ### 1. BitsAndBytes (Recommended) |
|
|
| Most compatible, good quality, fast inference. |
|
|
| ```bash |
| python quantize.py --method bnb --bits 4 |
| ``` |
|
|
| **Pros:** |
| - Works on most CUDA-capable GPUs |
| - Fast inference |
| - No calibration data needed |
| - Good quality preservation |
|
|
| **Cons:** |
| - ~3.5x compression at 4-bit (64 GB down to ~18 GB), not the smallest output |
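For readers who want to see what a BitsAndBytes load looks like in code, here is a minimal sketch using the `transformers` library. It is not necessarily what `quantize.py` does internally; the helper names `bnb_kwargs` and `load_quantized` are hypothetical, and the NF4 and double-quantization settings are common defaults chosen for illustration.

```python
def bnb_kwargs(bits: int) -> dict:
    """Build keyword arguments for transformers.BitsAndBytesConfig."""
    if bits == 4:
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # assumption: NF4 data type
            "bnb_4bit_use_double_quant": True,  # assumption: also quantize the scales
        }
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError(f"bitsandbytes supports 4 or 8 bits, got {bits}")


def load_quantized(model_path: str, bits: int = 4):
    """Load a causal LM with weights quantized on the fly by bitsandbytes."""
    # Imported lazily so bnb_kwargs() stays usable without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = bnb_kwargs(bits)
    if bits == 4:
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16
    config = BitsAndBytesConfig(**kwargs)
    return AutoModelForCausalLM.from_pretrained(
        model_path, quantization_config=config, device_map="auto"
    )
```

A call such as `load_quantized("./output/stack-2.9-merged")` would quantize at load time, which is why this method needs no calibration data.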
|
|
| ### 2. AWQ (Activation-Aware Weight Quantization) |
|
|
| Best quality/performance ratio, but requires specific hardware. |
|
|
| ```bash |
| python quantize.py --method awq |
| ``` |
|
|
| **Pros:** |
| - Best quality preservation |
| - Hardware-aware |
| - Good for specific tasks |
|
|
| **Cons:** |
| - Requires recent GPU |
| - May need calibration data |
|
|
| ### 3. GPTQ |
|
|
| Good compression, slower inference. |
|
|
| ```bash |
| python quantize.py --method gptq --bits 4 |
| ``` |
|
|
| **Pros:** |
| - Excellent compression |
| - Well-studied method |
|
|
| **Cons:** |
| - Requires calibration |
| - Slower inference than AWQ/BNB |
|
|
| ## Model Sizes |
|
|
| | Precision | Size | Min GPU VRAM | Quality | |
| |------------|------|--------------|---------| |
| | bfloat16 | 64 GB | 80 GB | 100% | |
| float16 | 64 GB | 80 GB | 99% | |
| | int8 | 32 GB | 40 GB | 95% | |
| | int4 | 18 GB | 24 GB | 90-95% | |
|
|
| ## Benchmarking |
|
|
| Compare the optimized model against the base model: |
|
|
| ```bash |
| python benchmark_optimized.py \ |
| --base-model Qwen/Qwen2.5-Coder-32B \ |
| --optimized-model ./output/stack-2.9-quantized \ |
| --num-runs 5 \ |
| --test-mmlu |
| ``` |
|
|
| Expected results (int4 vs bf16): |
| - **Speed**: 2-3x faster |
| - **Memory**: ~70% reduction (64 GB down to ~18 GB) |
| - **Quality**: ~90-95% preserved |
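The speed comparison above can be reproduced with a small timing harness. This is a sketch of one way to measure throughput, not the logic of `benchmark_optimized.py`; the `generate` callable is a hypothetical stand-in for whatever wraps your model and returns the number of tokens it produced.

```python
import statistics
import time
from typing import Callable


def measure_tokens_per_second(
    generate: Callable[[str], int], prompt: str, num_runs: int = 5
) -> float:
    """Return the median tokens/sec over num_runs calls to generate(prompt)."""
    rates = []
    for _ in range(num_runs):
        start = time.perf_counter()
        num_tokens = generate(prompt)  # hypothetical: returns tokens generated
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(num_tokens / elapsed)
    return statistics.median(rates)


def speedup(base_tps: float, optimized_tps: float) -> float:
    """How many times faster the optimized model is than the base model."""
    return optimized_tps / base_tps
```

The median is used instead of the mean so that a slow first run (warm-up, kernel compilation) does not skew the result.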
|
|
| ## API Server |
|
|
| Deploy an OpenAI-compatible API: |
|
|
| ```bash |
| # Install dependencies |
| pip install fastapi uvicorn transformers torch |
| |
| # Start server |
| python convert_openai.py \ |
| --model-path ./output/stack-2.9-quantized \ |
| --port 8000 |
| |
| # Test |
| curl -X POST http://localhost:8000/v1/chat/completions \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "model": "stack-2.9", |
| "messages": [{"role": "user", "content": "Hello!"}] |
| }' |
| ``` |
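The same request can be sent from Python using only the standard library. The sketch below assumes the server returns the usual OpenAI chat-completions response shape (`choices[0].message.content`), which the "OpenAI-compatible" description suggests but this guide does not show; the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str, timeout: float = 60.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response schema.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "stack-2.9", "Hello!")` mirrors the `curl` call above.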
|
|
| ## vLLM Deployment |
|
|
| For production, use vLLM: |
|
|
| ```bash |
| pip install vllm |
| |
| vllm serve ./output/stack-2.9-quantized \ |
| --dtype half \ |
| --tensor-parallel-size 2 \ |
| --max-model-len 32768 |
| ``` |
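The `--max-model-len` value affects memory as well as capability, because each token in the context occupies space in the KV cache. The sketch below estimates that cost for a single sequence. The architecture numbers are assumptions (64 layers, 8 key/value heads, head dimension 128, as commonly reported for Qwen2.5 32B models); check the `config.json` of your checkpoint before relying on them.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_len: int, **arch: int) -> float:
    """KV cache size in GiB for one sequence of context_len tokens."""
    return context_len * kv_cache_bytes_per_token(**arch) / 2**30


# Assumption: architecture values, verify against the model's config.json.
ASSUMED_ARCH = {"num_layers": 64, "num_kv_heads": 8, "head_dim": 128}

print(f"{kv_cache_gib(32768, **ASSUMED_ARCH):.1f} GiB per 32768-token sequence")
```

Under these assumptions a full 32768-token sequence needs about 8 GiB of cache in 16-bit precision, shared across GPUs when tensor parallelism is enabled, so lowering `--max-model-len` is a direct way to free memory.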
|
|
| ## HuggingFace Upload |
|
|
| ```bash |
| # Upload model |
| python upload_hf.py \ |
| --model-path ./output/stack-2.9-quantized \ |
| --repo-id your-username/stack-2.9 \ |
| --token hf_your_token |
| |
| # Upload with Gradio Spaces demo |
| python upload_hf.py \ |
| --model-path ./output/stack-2.9-quantized \ |
| --repo-id your-username/stack-2.9 \ |
| --add-spaces |
| ``` |
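Passing a token on the command line leaves it in shell history. As an alternative, the upload can be scripted with the `huggingface_hub` library and the token read from an environment variable. This is a sketch, not the contents of `upload_hf.py`; the function names and the `HF_TOKEN` variable choice are assumptions.

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from the environment instead of argv."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} before uploading")
    return token


def upload_model(model_path: str, repo_id: str) -> None:
    """Create the model repo if needed and upload the folder contents."""
    # Imported lazily so resolve_hf_token() works without the library installed.
    from huggingface_hub import HfApi

    api = HfApi(token=resolve_hf_token())
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=model_path, repo_id=repo_id, repo_type="model")
```

Usage would look like `upload_model("./output/stack-2.9-quantized", "your-username/stack-2.9")` after exporting the token in your shell.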
|
|
| ## Expected Performance |
|
|
| With int4 quantization: |
|
|
| | Metric | Value | |
| |--------|-------| |
| | Tokens/sec | 30-50 | |
| | Memory (GPU) | 18-22 GB | |
| | Model size | ~18 GB | |
| | Cold start | 10-20s | |
|
|
| ## Quality Preservation |
|
|
| Stack 2.9 maintains ~90-95% of its quality after int4 quantization, depending on the task: |
|
|
| - Code generation: ~95% (excellent for most tasks) |
| - Reasoning: ~90% (may struggle with complex logic) |
| - General knowledge: ~92% |
|
|
| ## Troubleshooting |
|
|
| ### Out of Memory |
|
|
| ```bash |
| # Try int8 instead of int4 |
| python quantize.py --method bnb --bits 8 |
| |
| # Or place the model on the CPU (much slower, but avoids GPU memory limits) |
| python convert_openai.py --device-map cpu |
| ``` |
|
|
| ### Slow Inference |
|
|
| - Use vLLM for 2-3x speedup |
| - Enable flash attention (if supported) |
| - Use shorter context |
|
|
| ### Quality Issues |
|
|
| - Try AWQ or GPTQ instead of BNB |
| - Use int8 instead of int4 |
| - Raise the maximum number of generated tokens if outputs are being cut off |
|
|
| ## Production Checklist |
|
|
| - [ ] Quantize model |
| - [ ] Benchmark against base |
| - [ ] Run quality tests |
| - [ ] Test API endpoints |
| - [ ] Set up monitoring |
| - [ ] Configure rate limiting |
| - [ ] Set up autoscaling |
| - [ ] Document deployment |
|
|
| ## Resources |
|
|
| - [AWQ Paper](https://arxiv.org/abs/2306.00978) |
| - [GPTQ Paper](https://arxiv.org/abs/2210.17323) |
| - [vLLM Documentation](https://docs.vllm.ai/) |
| - [HuggingFace Hub](https://huggingface.co/docs/hub/) |