Benchmark Command A reasoning: cost + hallucination + quality vs other models

#1
by vigneshwar234 - opened

Hi Cohere team 👋

Command A's reasoning capabilities look excellent! For teams evaluating whether to switch from GPT-4o to Command A for reasoning tasks, I built a tool that makes the comparison quantitative.

LLM Evaluation Framework benchmarks any Cohere model alongside alternatives:

→ 🧠 Reasoning Quality: chain-of-thought depth scored from 1 to 10
→ 💰 Cost per 1K tokens: compared across Cohere, OpenAI, and Anthropic
→ ⚡ p95 Latency: reasoning-heavy prompts tend to have long-tail latency
→ 🎯 Accuracy: MMLU + TruthfulQA
→ 🔍 Hallucination Rate: confidently stated but wrong reasoning

Works with any LiteLLM-compatible model including the full Cohere lineup.
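
As a rough illustration of what a side-by-side run involves, the sketch below sends one prompt to several models and records the wall-clock time of each call. It is an assumption about the general approach, not the framework's actual code. `call_model` is a hypothetical callable supplied by the user; one way to write it would be a thin wrapper around `litellm.completion(model=..., messages=[...])`. The model names in the demo are placeholders.

```python
import time


def timed_call(call_model, model, prompt):
    """Call one model and return (reply, elapsed_seconds)."""
    start = time.perf_counter()
    reply = call_model(model, prompt)
    return reply, time.perf_counter() - start


def compare_models(call_model, models, prompt):
    """Run the same prompt against each model name.

    call_model(model, prompt) -> str is a hypothetical user-supplied
    function, for example a wrapper around a LiteLLM completion call.
    """
    results = {}
    for model in models:
        reply, elapsed = timed_call(call_model, model, prompt)
        results[model] = {"reply": reply, "latency_s": elapsed}
    return results


if __name__ == "__main__":
    # Stand-in for a real API call so the sketch runs offline.
    def fake_call(model, prompt):
        return f"[{model}] answer to: {prompt}"

    # Placeholder names, not verified model identifiers.
    out = compare_models(fake_call, ["model-a", "model-b"], "What is 17 * 23?")
    for name, row in out.items():
        print(name, round(row["latency_s"], 6), row["reply"])
```

Passing the model call in as a function keeps the comparison loop independent of any one provider SDK and makes it easy to test without network access.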

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source. Would love to include Command A in our benchmarks!
