Stack 2.9 Optimization Guide

This guide covers optimizing Stack 2.9 for fast, efficient inference while maintaining quality.

Overview

Stack 2.9 can be quantized from 64 GB (bfloat16) down to about 18 GB (4-bit) while retaining roughly 90-95% of baseline quality, which makes deployment on a single 24 GB consumer GPU possible.
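The sizes quoted above follow from simple arithmetic on the parameter count. The sketch below is a rough estimate only; the 32B parameter count is an assumption inferred from the 32B base model, and the function name is illustrative.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: about 32 billion parameters (not stated explicitly in this guide).
NUM_PARAMS = 32e9

for label, bits in [("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{estimate_weight_size_gb(NUM_PARAMS, bits):.0f} GB")
# bfloat16: ~64 GB
# int8: ~32 GB
# int4: ~16 GB
```

The 4-bit estimate comes out below the ~18 GB quoted above because a quantized checkpoint also stores scaling metadata and usually keeps some layers in higher precision.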

Quick Start

# 1. Quantize the model
python quantize.py \
    --model-path ./output/stack-2.9-merged \
    --output-path ./output/stack-2.9-quantized \
    --method bnb \
    --bits 4

# 2. Benchmark the optimized model
python benchmark_optimized.py \
    --optimized-model ./output/stack-2.9-quantized

# 3. Upload to HuggingFace
python upload_hf.py \
    --model-path ./output/stack-2.9-quantized \
    --repo-id your-username/stack-2.9

Quantization Methods

1. BitsAndBytes (Recommended)

Most compatible, good quality, fast inference.

python quantize.py --method bnb --bits 4

Pros:

Works on most CUDA-capable GPUs
Fast inference
No calibration data needed
Good quality preservation

Cons:

~4x compression (not the best)
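For readers who want to load a model with BitsAndBytes directly from Python, a minimal sketch using the `transformers` API might look like the following. It is not necessarily what `quantize.py` does internally; the helper names and the NF4/double-quantization settings are assumptions chosen for illustration.

```python
def bnb_config_kwargs(bits: int) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (4- or 8-bit)."""
    if bits == 4:
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # assumed choice, not from the guide
            "bnb_4bit_use_double_quant": True,  # assumed choice, not from the guide
        }
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError("BitsAndBytes supports 4 or 8 bits only")


def load_quantized(model_path: str, bits: int = 4):
    """Load a checkpoint with on-the-fly BitsAndBytes quantization."""
    # Imported lazily so bnb_config_kwargs can be used without GPU libraries.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = bnb_config_kwargs(bits)
    if bits == 4:
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=BitsAndBytesConfig(**kwargs),
        device_map="auto",
    )
```

Calling `load_quantized("./output/stack-2.9-merged")` would need `torch`, `transformers`, `accelerate`, and `bitsandbytes` installed, plus a supported GPU.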

2. AWQ (Activation-Aware Weight Quantization)

Best quality/performance ratio, but requires specific hardware.

python quantize.py --method awq

Pros:

Best quality preservation
Hardware-aware
Good for specific tasks

Cons:

Requires recent GPU
Needs a small calibration dataset to collect activation statistics

3. GPTQ

Good compression, slower inference.

python quantize.py --method gptq --bits 4

Pros:

Excellent compression
Well-studied method

Cons:

Requires calibration
Slower inference than AWQ/BNB

Model Sizes

Precision	Size	Min GPU VRAM	Quality
bfloat16	64 GB	80 GB	100%
float16	64 GB	80 GB	99%
int8	32 GB	40 GB	95%
int4	18 GB	24 GB	90-95%
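The table can be turned into a small lookup that suggests the highest precision fitting a given GPU. This is only a sketch: the function name is hypothetical, and float16 is left out because its weights are the same size as bfloat16.

```python
from typing import Optional

# (precision, minimum VRAM in GB) from the table above, highest precision first.
VRAM_REQUIREMENTS_GB = [("bfloat16", 80), ("int8", 40), ("int4", 24)]


def pick_precision(available_vram_gb: float) -> Optional[str]:
    """Return the highest precision whose minimum VRAM fits, or None."""
    for precision, min_vram in VRAM_REQUIREMENTS_GB:
        if available_vram_gb >= min_vram:
            return precision
    return None


print(pick_precision(24))  # int4
print(pick_precision(48))  # int8
print(pick_precision(16))  # None
```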

Benchmarking

Compare optimized vs base model:

python benchmark_optimized.py \
    --base-model Qwen/Qwen2.5-Coder-32B \
    --optimized-model ./output/stack-2.9-quantized \
    --num-runs 5 \
    --test-mmlu

Expected results (int4 vs bf16):

Speed: 2-3x faster
Memory: 60-70% reduction
Quality: ~92-95% preserved
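To check the speed figure on your own hardware, throughput can be measured with a small timing harness. The sketch below is independent of `benchmark_optimized.py`; it assumes you supply a `generate` callable that runs one generation and returns the number of tokens produced.

```python
import time
from statistics import mean
from typing import Callable


def tokens_per_second(generate: Callable[[str], int], prompt: str, num_runs: int = 5) -> float:
    """Average tokens/sec over num_runs calls of generate(prompt)."""
    rates = []
    for _ in range(num_runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        # Guard against a zero interval on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(n_tokens / elapsed)
    return mean(rates)


def speedup(optimized_tps: float, base_tps: float) -> float:
    """How many times faster the optimized model is than the base model."""
    return optimized_tps / base_tps
```

Measure both models with the same prompts and generation settings, and discard the first run if it includes one-time warm-up costs.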

API Server

Deploy an OpenAI-compatible API:

# Install dependencies
pip install fastapi uvicorn transformers torch accelerate bitsandbytes

# Start server
python convert_openai.py \
    --model-path ./output/stack-2.9-quantized \
    --port 8000

# Test
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "stack-2.9",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
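The same request can be sent from Python using only the standard library. This sketch assumes the server returns the standard OpenAI chat-completion response shape; the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "stack-2.9") -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply.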

vLLM Deployment

For production, use vLLM:

pip install vllm

vllm serve ./output/stack-2.9-quantized \
    --dtype half \
    --tensor-parallel-size 2 \
    --max-model-len 32768
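The `--max-model-len` setting matters for memory because the key/value cache grows linearly with context length. The estimate below is a sketch; the layer count, KV-head count, and head dimension are assumptions about the base model's architecture and should be checked against the model's `config.json`.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence of seq_len tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed architecture values (not stated in this guide).
NUM_LAYERS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 128

total = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, seq_len=32768)
print(f"{total / 2**30:.1f} GiB per full-length sequence")
# 8.0 GiB per full-length sequence
```

Under these assumptions, one full 32768-token sequence needs about 8 GiB of 16-bit cache on top of the weights, shared across the GPUs when tensor parallelism is used. Lowering `--max-model-len` reduces this proportionally.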

HuggingFace Upload

# Upload model
python upload_hf.py \
    --model-path ./output/stack-2.9-quantized \
    --repo-id your-username/stack-2.9 \
    --token hf_your_token

# Upload with Gradio Spaces demo
python upload_hf.py \
    --model-path ./output/stack-2.9-quantized \
    --repo-id your-username/stack-2.9 \
    --add-spaces
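Passing the token on the command line leaves it in shell history. As an alternative, the sketch below reads the token from the `HF_TOKEN` environment variable and uploads with the `huggingface_hub` library directly. It is not a description of how `upload_hf.py` works, and the function names are illustrative.

```python
import os
from typing import Mapping, Optional


def resolve_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the access token from HF_TOKEN, failing loudly if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable first")
    return token


def upload_model(model_path: str, repo_id: str) -> None:
    """Create the model repo if needed and upload the whole folder."""
    # Imported lazily so resolve_token works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=resolve_token())
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=model_path, repo_id=repo_id, repo_type="model")
```

Usage would be `upload_model("./output/stack-2.9-quantized", "your-username/stack-2.9")` after exporting `HF_TOKEN` in the shell.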

Expected Performance

With int4 quantization:

Metric	Value
Tokens/sec	30-50
Memory (GPU)	18-22 GB
Model size	~18 GB
Cold start	10-20s

Quality Preservation

Stack 2.9 retains roughly 90-95% of its bf16 quality after int4 quantization, depending on the task:

Code generation: ~95% (excellent for most tasks)
Reasoning: ~90% (may struggle with complex logic)
General knowledge: ~92%
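One way to read these percentages is as the quantized model's benchmark score divided by the base model's score on the same benchmark. That interpretation is an assumption, and the scores in the sketch below are placeholders used only to show the arithmetic, not measured results.

```python
def quality_retained(base_score: float, quantized_score: float) -> float:
    """Quantized score as a percentage of the base model's score."""
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    return 100.0 * quantized_score / base_score


# Placeholder scores for illustration only (not real benchmark results).
example_scores = {"example-benchmark": (80.0, 76.0)}

for name, (base, quant) in example_scores.items():
    print(f"{name}: {quality_retained(base, quant):.1f}% retained")
# example-benchmark: 95.0% retained
```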

Troubleshooting

Out of Memory

# Try int8 instead of int4
python quantize.py --method bnb --bits 8

# Or place the model on the CPU instead of the GPU (much slower)
python convert_openai.py --device-map cpu

Slow Inference

Use vLLM for 2-3x speedup
Enable flash attention (if supported)
Use a shorter context length (for example, a lower --max-model-len in vLLM)

Quality Issues

Try GPTQ instead of BNB
Use int8 instead of int4
Increase the maximum number of tokens per generation if outputs are being cut off

Production Checklist

Quantize model
Benchmark against base
Run quality tests
Test API endpoints
Set up monitoring
Configure rate limiting
Set up autoscaling
Document deployment
