	# Stack 2.9 Optimization Guide

	This guide covers optimizing Stack 2.9 for fast, efficient inference while maintaining quality.

	## Overview

	Stack 2.9 can be quantized from about 64 GB (bfloat16) to about 18 GB (4-bit) while preserving roughly 90-95% of its quality, which lets it run on a single 24 GB consumer GPU.

	## Quick Start

	```bash
	# 1. Quantize the model
	python quantize.py \
	--model-path ./output/stack-2.9-merged \
	--output-path ./output/stack-2.9-quantized \
	--method bnb \
	--bits 4

	# 2. Benchmark the optimized model
	python benchmark_optimized.py \
	--optimized-model ./output/stack-2.9-quantized

	# 3. Upload to HuggingFace
	python upload_hf.py \
	--model-path ./output/stack-2.9-quantized \
	--repo-id your-username/stack-2.9
	```

	## Quantization Methods

	### 1. BitsAndBytes (Recommended)

	Most compatible, good quality, fast inference.

	```bash
	python quantize.py --method bnb --bits 4
	```

	Pros:
	- Works on most CUDA-capable GPUs
	- Fast inference
	- No calibration data needed
	- Good quality preservation

	Cons:
	- ~4x compression (not the best)
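The internals of `quantize.py` are not shown in this guide. As a rough sketch of what BitsAndBytes loading typically looks like with the `transformers` library, assuming `torch`, `transformers`, and `bitsandbytes` are installed, something like the following could be used. The helper names `bnb_config_kwargs` and `load_quantized` are hypothetical, and the NF4 and double-quantization settings are common defaults rather than values confirmed by this project.

```python
def bnb_config_kwargs(bits: int = 4) -> dict:
    """Return keyword arguments for transformers.BitsAndBytesConfig."""
    if bits == 4:
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # assumed default, not from quantize.py
            "bnb_4bit_use_double_quant": True,  # also quantizes the scaling constants
        }
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError(f"bitsandbytes supports 4 or 8 bits, got {bits}")


def load_quantized(model_path: str, bits: int = 4):
    """Load a checkpoint with on-the-fly BitsAndBytes quantization."""
    # Heavy imports are deferred so the helper above works without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = bnb_config_kwargs(bits)
    if bits == 4:
        # Matrix multiplications run in bfloat16; weights stay stored in 4-bit.
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16
    config = BitsAndBytesConfig(**kwargs)
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=config,
        device_map="auto",  # let accelerate place layers on the available devices
    )
```

Calling `load_quantized("./output/stack-2.9-merged")` would then return a 4-bit model ready for `generate`, without needing any calibration data.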

	### 2. AWQ (Activation-Aware Weight Quantization)

	Best quality/performance ratio, but requires specific hardware.

	```bash
	python quantize.py --method awq
	```

	Pros:
	- Best quality preservation
	- Activation-aware: protects the weights that matter most for accuracy
	- Good for specific tasks

	Cons:
	- Requires recent GPU
	- Requires a small calibration dataset

	### 3. GPTQ

	Good compression, slower inference.

	```bash
	python quantize.py --method gptq --bits 4
	```

	Pros:
	- Excellent compression
	- Well-studied method

	Cons:
	- Requires calibration
	- Slower inference than AWQ/BNB

	## Model Sizes

	| Precision | Size | Min GPU VRAM | Quality |
	|-----------|------|--------------|---------|
	| bfloat16 | 64 GB | 80 GB | 100% |
	| float16 | 64 GB | 80 GB | 99% |
	| int8 | 32 GB | 40 GB | 95% |
	| int4 | 18 GB | 24 GB | 90-95% |
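The sizes in the table follow from simple arithmetic: parameter count times bits per weight. The sketch below assumes roughly 32 billion parameters, based on the Qwen2.5-Coder-32B base model named in the Benchmarking section. It counts weights only, so the raw int4 figure comes out lower than the 18 GB in the table; the difference is presumably quantization metadata and layers kept in higher precision.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 32e9  # assumption: ~32B parameters, as in Qwen2.5-Coder-32B

for name, bits in [("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {weight_size_gb(NUM_PARAMS, bits):.0f} GB")
```

The minimum VRAM column is higher than the weight size because activations and the KV cache also need room at inference time.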

	## Benchmarking

	Compare optimized vs base model:

	```bash
	python benchmark_optimized.py \
	--base-model Qwen/Qwen2.5-Coder-32B \
	--optimized-model ./output/stack-2.9-quantized \
	--num-runs 5 \
	--test-mmlu
	```

	Expected results (int4 vs bf16):
	- Speed: 2-3x faster
	- Memory: 60-70% reduction
	- Quality: ~92-95% preserved
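To check the speed figure on your own hardware without relying on `benchmark_optimized.py`, a small timing harness is enough. This is a minimal sketch, not the script's actual method: `measure_tokens_per_sec` is a hypothetical helper, and it expects a `generate` callable that you supply and that returns the number of new tokens it produced.

```python
import statistics
import time
from typing import Callable


def measure_tokens_per_sec(
    generate: Callable[[str], int], prompt: str, num_runs: int = 5
) -> float:
    """Return the median generation throughput over num_runs timed runs.

    generate(prompt) must run one generation and return the number of
    newly generated tokens.
    """
    generate(prompt)  # warm-up run, excluded from timing (kernel compilation, caches)
    rates = []
    for _ in range(num_runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)
```

Running the same harness against the base and the quantized model, with the same prompt and generation length, gives a like-for-like speed ratio. The median is used so a single slow run does not skew the result.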

	## API Server

	Deploy an OpenAI-compatible API:

	```bash
	# Install dependencies
	pip install fastapi uvicorn transformers torch

	# Start server
	python convert_openai.py \
	--model-path ./output/stack-2.9-quantized \
	--port 8000

	# Test
	curl -X POST http://localhost:8000/v1/chat/completions \
	-H "Content-Type: application/json" \
	-d '{
	"model": "stack-2.9",
	"messages": [{"role": "user", "content": "Hello!"}]
	}'
	```
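The same request can be sent from Python using only the standard library. The sketch below mirrors the `curl` call above; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the server returns the usual OpenAI shape (`choices[0].message.content`), which this guide implies but does not show.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "stack-2.9",
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply.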

	## vLLM Deployment

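	For production, serve the model with vLLM, which exposes an OpenAI-compatible HTTP API. The example shards the model across two GPUs (`--tensor-parallel-size 2`); use `1` on a single GPU: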

	```bash
	pip install vllm

	vllm serve ./output/stack-2.9-quantized \
	--dtype half \
	--tensor-parallel-size 2 \
	--max-model-len 32768
	```
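vLLM can also be used in-process for batch jobs, without starting a server. The following sketch passes the same settings as the command above to vLLM's `LLM` class. The helper names and the sampling values (`temperature=0.2`, `max_tokens=256`) are illustrative assumptions, and whether the quantized checkpoint loads without extra quantization options depends on the method used and the vLLM version.

```python
def engine_kwargs(model_path: str) -> dict:
    """Engine settings mirroring the `vllm serve` flags shown above."""
    return {
        "model": model_path,
        "dtype": "half",
        "tensor_parallel_size": 2,  # number of GPUs to shard the model across
        "max_model_len": 32768,     # maximum context length in tokens
    }


def generate(prompt: str, model_path: str = "./output/stack-2.9-quantized") -> str:
    """Run a single offline generation and return the completion text."""
    # Deferred import: vLLM is only needed when generation actually runs.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(model_path))
    params = SamplingParams(temperature=0.2, max_tokens=256)  # illustrative values
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

For repeated calls, create the `LLM` object once and reuse it, since loading the model dominates startup time.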

	## HuggingFace Upload

	```bash
	# Upload model
	python upload_hf.py \
	--model-path ./output/stack-2.9-quantized \
	--repo-id your-username/stack-2.9 \
	--token hf_your_token

	# Upload with Gradio Spaces demo
	python upload_hf.py \
	--model-path ./output/stack-2.9-quantized \
	--repo-id your-username/stack-2.9 \
	--add-spaces
	```
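If you prefer not to pass the token on the command line, where it ends up in shell history, the upload can be done with the `huggingface_hub` library and a token read from the environment. This is a sketch of one way to do it, not a description of what `upload_hf.py` does internally; `resolve_token` and `upload_model` are hypothetical helpers, and `HF_TOKEN` is the environment variable assumed to hold the token.

```python
import os
from typing import Mapping


def resolve_token(env: Mapping[str, str] = os.environ) -> str:
    """Read the Hugging Face token from the environment."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before uploading")
    return token


def upload_model(folder: str, repo_id: str) -> None:
    """Create the model repo if needed and upload every file in folder."""
    # Deferred import so resolve_token can be used without huggingface_hub.
    from huggingface_hub import HfApi

    api = HfApi(token=resolve_token())
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")
```

Usage would be `upload_model("./output/stack-2.9-quantized", "your-username/stack-2.9")` after exporting `HF_TOKEN` in the shell.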

	## Expected Performance

	With int4 quantization:

	| Metric | Value |
	|--------|-------|
	| Tokens/sec | 30-50 |
	| Memory (GPU) | 18-22 GB |
	| Model size | ~18 GB |
	| Cold start | 10-20s |

	## Quality Preservation

	Stack 2.9 maintains ~92-95% quality after int4 quantization:

	- Code generation: ~95% (excellent for most tasks)
	- Reasoning: ~90% (may struggle with complex logic)
	- General knowledge: ~92%

	## Troubleshooting

	### Out of Memory

	```bash
	# Use int4 instead of int8 (int4 needs roughly half the memory)
	python quantize.py --method bnb --bits 4

	# Or place the model on the CPU (much slower)
	python convert_openai.py --device-map cpu
	```
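Before quantizing, it can help to check which precision fits your GPU at all. The sketch below encodes the minimum-VRAM figures from the Model Sizes table and picks the highest precision that fits. `pick_precision` and `gpu_vram_gb` are hypothetical helpers, and the check only looks at the first CUDA device.

```python
from typing import Optional

# Minimum VRAM in GB per precision, taken from the Model Sizes table,
# ordered from highest to lowest quality.
MIN_VRAM_GB = [("bfloat16", 80), ("int8", 40), ("int4", 24)]


def pick_precision(vram_gb: float) -> Optional[str]:
    """Return the highest-quality precision that fits, or None if none does."""
    for name, needed in MIN_VRAM_GB:
        if vram_gb >= needed:
            return name
    return None


def gpu_vram_gb(device_index: int = 0) -> float:
    """Total memory of one CUDA device in GB (requires torch with CUDA)."""
    import torch  # deferred so pick_precision works without torch installed

    return torch.cuda.get_device_properties(device_index).total_memory / 1e9
```

If `pick_precision(gpu_vram_gb())` returns `None`, the model will not fit on that GPU even at int4, and CPU placement or a multi-GPU setup is the remaining option.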

	### Slow Inference

	- Use vLLM for 2-3x speedup
	- Enable flash attention (if supported)
	- Use shorter context

	### Quality Issues

	- Try AWQ or GPTQ instead of BNB
	- Use int8 instead of int4
	- Increase the maximum number of generated tokens if outputs are cut off

	## Production Checklist

	- [ ] Quantize model
	- [ ] Benchmark against base
	- [ ] Run quality tests
	- [ ] Test API endpoints
	- [ ] Set up monitoring
	- [ ] Configure rate limiting
	- [ ] Set up autoscaling
	- [ ] Document deployment

	## Resources

	- [AWQ Paper](https://arxiv.org/abs/2306.00978)
	- [GPTQ Paper](https://arxiv.org/abs/2210.17323)
	- [vLLM Documentation](https://docs.vllm.ai/)
	- [HuggingFace Hub](https://huggingface.co/docs/hub/)