Production deployment at scale

#55 opened by O96a

We've been testing MiniMax-M2.5 for a multi-tenant RAG pipeline serving 50+ concurrent users. The 1M+ context window is genuinely useful for long-document retrieval, but we've noticed memory pressure spikes when processing concurrent requests at that scale. Curious if anyone has benchmarked the FP8 variant against the BF16 version for latency-sensitive applications? The 1300 likes suggest strong community adoption, but I'd love to see real-world deployment benchmarks beyond the model card.
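For anyone wanting to reproduce this kind of comparison, here is a minimal latency-benchmark sketch, not tied to any specific deployment. The `query_model` coroutine is a placeholder you would swap for a real call to your FP8 or BF16 serving endpoint; the concurrency level, request count, and percentile choices are assumptions, not numbers from the model card.

```python
import asyncio
import statistics
import time

async def query_model(prompt: str) -> str:
    # Placeholder: replace with a real request to the serving endpoint
    # under test (FP8 or BF16 deployment). Here we just simulate work
    # so the harness is runnable on its own.
    await asyncio.sleep(0.01)
    return "ok"

async def bench(concurrency: int, requests: int) -> dict:
    # Cap in-flight requests with a semaphore to model N concurrent users.
    sem = asyncio.Semaphore(concurrency)
    latencies: list[float] = []

    async def one(i: int) -> None:
        async with sem:
            start = time.perf_counter()
            await query_model(f"request {i}")
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one(i) for i in range(requests)))
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
        "max": latencies[-1],
    }

if __name__ == "__main__":
    # Assumed load shape: 50 concurrent users, 200 total requests.
    stats = asyncio.run(bench(concurrency=50, requests=200))
    print({k: round(v, 4) for k, v in stats.items()})
```

Running the same harness against both quantization variants with identical prompts would give a like-for-like p50/p95 comparison; tail latency (p95/max) is usually where memory pressure under concurrency shows up first.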
