Production deployment and quantization trade-offs

#58
by O96a - opened

The 14B parameter count hits a sweet spot for many deployment scenarios: too large for edge devices, but small enough to fit comfortably in single-GPU setups. We've been running quantized phi-4 variants in production RAG pipelines and seeing good results.

One observation: the reasoning benchmarks in the paper are impressive, but I'm curious about real-world instruction following in multi-turn conversations. Have you evaluated degradation across conversation length? In agent workflows we've noticed that some models lose context coherence after 10-15 turns even with proper token limits.
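For context, here's roughly the kind of probe we use to measure this: plant a fact early, pad the conversation with filler turns, then ask for the fact back and track recall versus conversation length. This is a minimal sketch, not from the paper; `chat` is any callable over a message list, and the stub below (which only "sees" the last 8 messages) just demonstrates the harness shape — swap in a real model client.

```python
def needle_probe(chat, num_turns, needle="the deploy token is ZX-41"):
    """Plant a fact in turn 1, pad with filler turns, then ask for it back.

    `chat` is any callable taking a message list and returning a reply
    string (stubbed here; swap in a real model client).
    """
    messages = [
        {"role": "user", "content": f"Remember this: {needle}."},
        {"role": "assistant", "content": "Noted."},
    ]
    for i in range(num_turns):
        messages.append({"role": "user", "content": f"Filler question {i}: summarize step {i}."})
        messages.append({"role": "assistant", "content": chat(messages)})
    messages.append({"role": "user", "content": "What was the deploy token?"})
    return "ZX-41" in chat(messages)

# Toy stand-in for a model: only "remembers" the last 8 messages, so recall
# degrades with conversation length, illustrating what the harness detects.
def stub_chat(messages):
    if "token" in messages[-1]["content"].lower():
        for m in messages[-8:]:
            if "ZX-41" in m["content"]:
                return "The deploy token is ZX-41."
        return "I don't recall."
    return "Okay."

for n in (2, 10, 20):
    print(f"turns={n:2d}  recalled={needle_probe(stub_chat, n)}")
```

Running several trials per length with paraphrased needles gives a recall-vs-turns curve that makes the degradation point concrete.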

Also, any guidance on optimal quantization choices? AWQ vs. GPTQ vs. GGUF: the trade-offs between inference speed and quality aren't always clear from the benchmark tables alone.
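One way to build intuition here: the methods differ mainly in calibration (AWQ's activation-aware scaling, GPTQ's Hessian-guided rounding), but the shared core is per-group int4 rounding, and the group size is a big lever on quality. A toy NumPy sketch of symmetric round-to-nearest group-wise int4 — my own naming, not any library's API — shows how reconstruction error shrinks as groups get smaller:

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric round-to-nearest int4 quantization per group, then dequantize.

    This is only the naive shared core; real AWQ/GPTQ schemes add
    calibration on top of it.
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)             # 4-bit codes
    return q * scale                                     # reconstructed weights

rng = np.random.default_rng(0)
w = rng.normal(size=(4096 * 128,)).astype(np.float32)   # fake weight tensor

for g in (32, 128, 1024):
    w_hat = quantize_int4_groupwise(w, g).reshape(w.shape)
    rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
    print(f"group_size={g:4d}  relative error={rel_err:.4f}")
```

Smaller groups cut error but cost more scale-metadata per weight (and slightly slower kernels), which is one concrete axis of the speed/quality trade-off the benchmark tables tend to hide.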
