Benchmark Command A reasoning: cost + hallucination + quality vs other models

by vigneshwar234 - opened Jun 8

Hi Cohere team 👋

Command A's reasoning capabilities look excellent! For teams evaluating whether to switch from GPT-4o to Command A for reasoning tasks, I built a tool that makes the comparison quantitative.

LLM Evaluation Framework benchmarks any Cohere model alongside alternatives:

→ 🧠 Reasoning Quality: chain-of-thought depth, scored 1-10
→ 💰 Cost per 1K tokens: Cohere vs OpenAI vs Anthropic
→ ⚡ Latency p95: reasoning-heavy prompts tend to have long-tail latency
→ 🎯 Accuracy: MMLU + TruthfulQA
→ 🔍 Hallucination Rate: how often the model gives confident but wrong reasoning
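
For anyone curious how numbers like the ones listed above are usually derived, here is a minimal sketch in plain Python. It is not the framework's actual code: `RunRecord`, `p95`, and `summarize` are hypothetical names made up for illustration, the nearest-rank percentile is just one common choice, and treating hallucination rate as the share of confident-but-wrong answers is an assumption based on the description above.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One benchmark call. Every field here is hypothetical, for illustration."""
    latency_s: float     # wall-clock seconds for the call
    total_tokens: int    # prompt + completion tokens
    cost_usd: float      # amount billed for the call
    confident: bool      # answer was stated without hedging
    correct: bool        # answer matched the reference


def p95(values):
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-call records into the headline metrics."""
    if not records:
        raise ValueError("need at least one record")
    total_tokens = sum(r.total_tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    # Assumed definition: a hallucination is a confident answer that is wrong.
    confident_wrong = sum(1 for r in records if r.confident and not r.correct)
    return {
        "latency_p95_s": p95([r.latency_s for r in records]),
        "cost_per_1k_tokens_usd": 1000 * total_cost / total_tokens,
        "accuracy": sum(1 for r in records if r.correct) / len(records),
        "hallucination_rate": confident_wrong / len(records),
    }
```

Reporting p95 rather than the mean matters for reasoning models because a handful of very long chains of thought can dominate the tail while barely moving the average.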

Works with any LiteLLM-compatible model including the full Cohere lineup.

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source. Would love to include Command A in our benchmarks!
