Complementary quality evaluation: accuracy + hallucination alongside inference benchmarks
#1
by vigneshwar234 - opened
Hi Inferless team!
Your LLM inference benchmark covers the throughput side beautifully. I built a complementary framework that covers model quality, because fast inference from a hallucinating model is still a bad product.
The LLM Evaluation Framework measures:
- Accuracy on MMLU + TruthfulQA (quality baseline)
- Hallucination Rate (quality failure mode)
- Reasoning Quality (chain-of-thought depth, scored 1-10)
- Cost per 1K tokens (pairs with your throughput data for total cost of ownership)
- p95 latency (95th-percentile latency, measured on the quality-evaluation side)
Inference speed (Inferless) + model quality (this framework) = a complete basis for deployment decisions.
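To make the metrics above concrete, here is a minimal sketch of how they could be aggregated from per-prompt results. It is not the framework's actual API: `EvalRecord`, `p95`, and `summarize` are hypothetical names, the per-prompt `correct` and `hallucinated` labels are assumed to come from some upstream grader, and the p95 uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One graded prompt (hypothetical schema, for illustration only)."""
    correct: bool        # answer matched the benchmark reference
    hallucinated: bool   # grader flagged unsupported or fabricated content
    latency_s: float     # end-to-end latency in seconds
    tokens: int          # total tokens billed for this prompt
    cost_usd: float      # cost of this prompt in US dollars


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-prompt records into the headline metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    total_tokens = sum(r.tokens for r in records)
    if total_tokens == 0:
        raise ValueError("total token count must be positive")
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "latency_p95_s": p95([r.latency_s for r in records]),
        "cost_per_1k_tokens_usd": 1000 * sum(r.cost_usd for r in records) / total_tokens,
    }
```

Calling `summarize(records)` on a list of `EvalRecord` objects returns a dictionary that can sit next to throughput numbers from an inference benchmark when comparing deployments.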
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Open source, free forever!