Complementary quality evaluation: accuracy + hallucination alongside inference benchmarks

#1
by vigneshwar234 - opened

Hi Inferless team!

Your LLM inference benchmark covers the throughput side beautifully. I built a complementary framework that covers the model quality side, because the fastest inference of a hallucinating model is still a bad product.

LLM Evaluation Framework:

  • Accuracy on MMLU + TruthfulQA (quality baseline)
  • Hallucination Rate (quality failure mode)
  • Reasoning Quality (chain-of-thought depth, scored 1-10)
  • Cost per 1K tokens (pairs with your throughput data for total cost of ownership)
  • Latency p95 (95th-percentile latency, measured on the quality-evaluation side)
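
For anyone wondering how metrics like these are typically aggregated, here is a minimal sketch in Python. It is not the framework's actual code: `EvalRecord`, `p95`, and `summarize` are hypothetical names, the per-request fields are assumptions about what an evaluation run might log, and the nearest-rank percentile is just one common convention.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated request (hypothetical schema, assumed for illustration)."""
    correct: bool        # answer matched the benchmark reference
    hallucinated: bool   # answer was flagged as unsupported by some checker
    latency_s: float     # wall-clock seconds for the request
    tokens: int          # prompt + completion tokens billed
    cost_usd: float      # amount billed for the request


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-request records into the headline metrics."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "latency_p95_s": p95([r.latency_s for r in records]),
        "cost_per_1k_tokens_usd": 1000 * total_cost / total_tokens,
    }
```

Calling `summarize` on the records from one model gives a single row that can sit next to a throughput number for the same model and hardware.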

Inference speed (Inferless) + model quality (this) = complete deployment decision framework.

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source, free forever!
