LLM-Inference-Benchmark

Complementary quality evaluation: accuracy + hallucination alongside inference benchmarks

by vigneshwar234 - opened Jun 8

Hi Inferless team!

Your LLM inference benchmark covers the throughput side beautifully. I built a complementary framework that covers the model quality side — because the fastest inference of a hallucinating model is still a bad product.

LLM Evaluation Framework:

Accuracy on MMLU + TruthfulQA (quality baseline)
Hallucination Rate (quality failure mode)
Reasoning Quality (chain-of-thought depth, scored 1-10)
Cost per 1K tokens (pairs with your throughput data for total cost of ownership)
Latency p95 (quality-side latency measurement)
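
For anyone wondering how metrics like these are usually derived from per-sample results, here is a minimal sketch. It is not taken from the framework's code: the `EvalRecord` fields, the `summarize` helper, and the nearest-rank percentile method are all assumptions made for illustration, and it presumes each response has already been graded as correct/incorrect and hallucinated/not.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One graded model response (hypothetical schema, not the framework's)."""
    correct: bool        # answer matched the benchmark's reference answer
    hallucinated: bool   # response was flagged as containing a fabricated claim
    latency_s: float     # wall-clock time for this request, in seconds


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate graded records into accuracy, hallucination rate, and p95 latency."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "latency_p95_s": percentile([r.latency_s for r in records], 95),
    }
```

Nearest-rank is only one of several percentile definitions; interpolating methods give slightly different p95 values on small samples, so it is worth stating which one a report uses.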

Inference speed (Inferless) + model quality (this) = complete deployment decision framework.
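
As a rough illustration of how throughput data can pair with a cost metric, the sketch below converts a measured tokens-per-second figure and an hourly hardware price into cost per 1K tokens. The function name and both parameters are hypothetical, and it assumes the hardware is fully utilized at the measured throughput, which real deployments rarely achieve.

```python
def cost_per_1k_tokens(hourly_cost_usd, tokens_per_second):
    """Cost in USD to generate 1,000 tokens at a sustained throughput.

    Assumes the instance is billed hourly and runs at the given
    throughput for the whole hour (an idealized upper bound on efficiency).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1000
```

With a cost figure like this next to accuracy and hallucination rate, two deployments can be compared on both price and quality instead of speed alone.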

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source, free forever!
