---
title: AutoBench Leaderboard
emoji: π
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
license: mit
short_description: Multi-run AutoBench leaderboard with historical navigation
---

# AutoBench LLM Leaderboard

Interactive leaderboard for AutoBench, where Large Language Models (LLMs) evaluate and rank responses from other LLMs. This application supports multiple benchmark runs with seamless navigation between different time periods.

## Features

### Multi-Run Navigation
- **Run Selector**: Switch between different AutoBench runs using the dropdown menu
- **Historical Data**: View and compare results across different time periods
- **Reactive Interface**: All tabs and visualizations update automatically when switching runs (see the sketch after this list)
- **Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs
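
The reactive behavior is plain Gradio event wiring: a dropdown drives a loader function whose outputs are the tables and plots. Below is a minimal sketch of that pattern; the run paths, the `load_summary` helper, and the component layout are illustrative assumptions, not the app's actual code.

```python
# Minimal sketch of a reactive run selector (illustrative, not the app's actual code).
import gradio as gr
import pandas as pd

RUNS = {"run_2025-08-14": "runs/run_2025-08-14", "run_2025-02-01": "runs/run_2025-02-01"}

def load_summary(run_id: str) -> pd.DataFrame:
    """Reload the leaderboard table for the selected run."""
    return pd.read_csv(f"{RUNS[run_id]}/summary_data.csv")

with gr.Blocks() as demo:
    run_selector = gr.Dropdown(choices=list(RUNS), value="run_2025-08-14", label="Benchmark run")
    leaderboard = gr.Dataframe(value=load_summary("run_2025-08-14"))
    # Every output wired to this event refreshes when the dropdown value changes.
    run_selector.change(fn=load_summary, inputs=run_selector, outputs=leaderboard)

demo.launch()
```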

### Comprehensive Analysis
- **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics
- **Benchmark Comparison**: Correlations with Chatbot Arena, AAI Index, and MMLU benchmarks
- **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs
- **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles
- **Domain Performance**: Model rankings across specific knowledge areas

### Dynamic Features
- **Benchmark Correlations**: Displays correlation percentages with other popular benchmarks
- **Cost Conversion**: Automatic conversion to cents for better readability
- **Performance Metrics**: Average and P99 latency measurements
- **Fail Rate Tracking**: Model reliability metrics (for supported runs)
- **Iteration Counts**: Number of evaluations per model (for supported runs)

## How to Use

### Navigation
1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs
2. **Explore Tabs**: Navigate through different analysis views using the tab interface
3. **Interactive Tables**: Sort and filter data by clicking on column headers
4. **Hover for Details**: Get additional information by hovering over chart elements

### Understanding the Data
- **AutoBench Score**: Higher scores indicate better performance
- **Cost**: Lower values are better (displayed in cents per response)
- **Latency**: Lower response times are better (average and P99 percentiles)
- **Fail Rate**: Lower percentages indicate more reliable models
- **Iterations**: Number of evaluation attempts per model

## Adding New Runs

### Directory Structure
```
runs/
└── run_YYYY-MM-DD/
    ├── metadata.json       # Run information and metadata
    ├── correlations.json   # Benchmark correlation data
    ├── summary_data.csv    # Main leaderboard data
    ├── domain_ranks.csv    # Domain-specific rankings
    ├── cost_data.csv       # Cost breakdown by domain
    ├── avg_latency.csv     # Average latency by domain
    └── p99_latency.csv     # P99 latency by domain
```

### Required Files

#### 1. metadata.json
```json
{
  "run_id": "run_2025-08-14",
  "title": "AutoBench Run 3 - August 2025",
  "date": "2025-08-14",
  "description": "Latest AutoBench run with enhanced metrics",
  "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
  "model_count": 34,
  "is_latest": true
}
```

#### 2. correlations.json
```json
{
  "correlations": {
    "Chatbot Arena": 82.51,
    "Artificial Analysis Intelligence Index": 83.74,
    "MMLU": 71.51
  },
  "description": "Correlation percentages between AutoBench scores and other benchmark scores"
}
```

#### 3. summary_data.csv
Required columns:
- `Model`: Model name
- `AutoBench`: AutoBench score
- `Costs (USD)`: Cost per response in USD
- `Avg Answer Duration (sec)`: Average response time
- `P99 Answer Duration (sec)`: 99th percentile response time

Optional columns (for enhanced metrics):
- `Iterations`: Number of evaluation iterations
- `Fail Rate %`: Percentage of failed responses
- `LMArena` or `Chatbot Ar.`: Chatbot Arena scores
- `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores
- `AAI Index`: Artificial Analysis Intelligence Index scores
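
A quick way to verify a new `summary_data.csv` before publishing a run is to check the required columns with pandas. The helper below is a sketch of that check, not part of the app.

```python
# Sanity-check a new summary_data.csv (illustrative helper, not part of the app).
import pandas as pd

REQUIRED = ["Model", "AutoBench", "Costs (USD)",
            "Avg Answer Duration (sec)", "P99 Answer Duration (sec)"]
OPTIONAL = ["Iterations", "Fail Rate %", "LMArena", "Chatbot Ar.",
            "MMLU-Pro", "MMLU Index", "AAI Index"]

def check_summary(path):
    df = pd.read_csv(path)
    missing = [col for col in REQUIRED if col not in df.columns]
    if missing:
        raise ValueError(f"summary_data.csv is missing required columns: {missing}")
    present_optional = [col for col in OPTIONAL if col in df.columns]
    print(f"{len(df)} models; optional columns present: {present_optional or 'none'}")

check_summary("runs/run_2025-08-14/summary_data.csv")
```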

### Adding a New Run

1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD`
2. **Add Data Files**: Copy your CSV files to the new directory
3. **Create Metadata**: Add `metadata.json` with run information
4. **Add Correlations**: Create `correlations.json` with benchmark correlations
5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata
6. **Restart App**: The new run will be automatically discovered (steps 1, 3, and 5 can also be scripted, as sketched below)
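
For repeat runs, scaffolding the directory and demoting the previous latest run can be scripted. The snippet below is a sketch of such a helper; it is not shipped with the app, and the field values are placeholders.

```python
# Scaffold a new run directory and demote earlier runs (illustrative script, not part of the app).
import json
from pathlib import Path

def add_run(run_date, title, model_count, runs_dir="runs"):
    runs = Path(runs_dir)
    new_run = runs / f"run_{run_date}"
    new_run.mkdir(parents=True, exist_ok=True)

    # Step 5: set "is_latest": false on every previously latest run.
    for meta_path in runs.glob("run_*/metadata.json"):
        meta = json.loads(meta_path.read_text())
        if meta.get("is_latest") and meta_path.parent != new_run:
            meta["is_latest"] = False
            meta_path.write_text(json.dumps(meta, indent=2))

    # Step 3: write metadata for the new run; copy the CSVs and correlations.json in afterwards.
    metadata = {
        "run_id": f"run_{run_date}",
        "title": title,
        "date": run_date,
        "description": "",
        "blog_url": "",
        "model_count": model_count,
        "is_latest": True,
    }
    (new_run / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return new_run

add_run("2025-08-14", "AutoBench Run 3 - August 2025", model_count=34)
```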

### Column Compatibility

The application automatically adapts to different column structures:
- **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency)
- **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %)
- **Flexible Naming**: Handles variations in benchmark column names (see the sketch after this list)
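
One way to picture the flexible naming is an alias map applied when a run is loaded, so that `LMArena`/`Chatbot Ar.` and `MMLU-Pro`/`MMLU Index` each collapse to a single canonical column. The snippet is a sketch of that idea, not the app's actual logic.

```python
# Normalize benchmark column-name variants to canonical names (illustrative, not the app's actual code).
import pandas as pd

COLUMN_ALIASES = {
    "LMArena": "Chatbot Arena",
    "Chatbot Ar.": "Chatbot Arena",
    "MMLU-Pro": "MMLU",
    "MMLU Index": "MMLU",
}

def normalize_columns(df):
    """Rename known variants; leave every other column untouched."""
    present = {old: new for old, new in COLUMN_ALIASES.items() if old in df.columns}
    return df.rename(columns=present)

df = normalize_columns(pd.read_csv("runs/run_2025-08-14/summary_data.csv"))
enhanced = [c for c in ("Iterations", "Fail Rate %") if c in df.columns]  # present only in enhanced runs
print(df.columns.tolist(), enhanced)
```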

## Development

### Requirements
- Python 3.8+
- Gradio 5.27.0+
- Pandas
- Plotly

### Installation
```bash
pip install -r requirements.txt
```

### Running Locally
```bash
python app.py
```

### Killing All Python Processes
```bash
taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill"
```

The app will automatically discover available runs and launch on a local port.
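
Run discovery amounts to scanning `runs/` for subdirectories that contain a `metadata.json`. The function below sketches how such discovery might look; it is an assumption for illustration, not the app's actual implementation.

```python
# Discover available runs by scanning runs/*/metadata.json (illustrative, not the app's actual code).
import json
from pathlib import Path

def discover_runs(runs_dir="runs"):
    runs = []
    for meta_path in sorted(Path(runs_dir).glob("run_*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        meta["path"] = str(meta_path.parent)
        runs.append(meta)
    # Newest first; the run flagged "is_latest" becomes the default selection.
    return sorted(runs, key=lambda m: m.get("date", ""), reverse=True)

runs = discover_runs()
default = next((r for r in runs if r.get("is_latest")), runs[0] if runs else None)
print([r["run_id"] for r in runs], "default:", default and default["run_id"])
```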

## Data Sources

AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench).

## License

MIT License - see LICENSE file for details.

---

Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options.