MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
Abstract
MiniAppBench is the first comprehensive benchmark for principle-driven, interactive application generation, addressing the gap left by existing benchmarks, which focus on static correctness rather than dynamic, real-world interaction.
With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models not only to render visual interfaces but also to construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available at https://github.com/MiniAppBench/miniappbench.
Community
Hi everyone! We are excited to introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation by LLMs.
While traditional benchmarks focus on static layouts or algorithmic snippets, we shift the paradigm toward MiniApps: evaluating whether models can generate HTML-based applications requiring both visual rendering and complex interaction logic (e.g., physics simulators, interactive games).
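To make the concept concrete, here is a minimal sketch (our own illustration, not a task from the benchmark) of the kind of single-file artifact a model might emit for a MiniApp query: one HTML file combining a rendered interface with custom interaction logic. All names and the clamping rule are hypothetical.

```python
from pathlib import Path

# A minimal, hypothetical MiniApp: a click counter that clamps at 10,
# pairing a visual interface with interaction logic in one HTML file.
MINIAPP_HTML = """<!DOCTYPE html>
<html>
<body>
  <h1>Counter</h1>
  <p id="count">0</p>
  <button id="inc">+1</button>
  <script>
    // Interaction logic: increment the counter, but never exceed 10.
    const count = document.getElementById('count');
    document.getElementById('inc').addEventListener('click', () => {
      count.textContent = Math.min(10, Number(count.textContent) + 1);
    });
  </script>
</body>
</html>
"""

def write_miniapp(path: str = "counter_miniapp.html") -> Path:
    """Write the MiniApp to disk so a browser (or an agent) can open it."""
    out = Path(path)
    out.write_text(MINIAPP_HTML, encoding="utf-8")
    return out
```

Evaluating such an artifact requires more than diffing the markup: the button's behavior only exists once the page is live, which is what motivates agentic, in-browser testing.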
The Shocking Reality: GLM-5 Dethrones GPT-5.4
We evaluated 20 top models using our agentic framework. The results on complex interactive tasks are eye-opening:
- 🥇 GLM-5 (61.80%) narrowly beats 🥈 Claude-Opus-4.6 (61.60%).
- 🥉 GPT-5.4 (56.60%) shows a severe "difficulty cliff": while it dominates Easy tasks (82.31%), it crashes to 35.03% on Hard tasks (which require complex state transitions). Meanwhile, GLM-5 and Claude remain robust at ~45% on Hard tasks.
- The gap between static coding ability and interactive application generation is massive.
(Feel free to check our interactive Leaderboard below for the full breakdown!)
Why MiniAppBench?
- Real-World Scale: Distilled from 10M+ in-the-wild human-AI interaction traces.
- Agentic Evaluation (MiniAppEval): We don't just string-match code. Our framework uses a browser-automation Agent to click, drag, and test the live generated apps, capturing DOM states and sequential logic (Pearson r > 0.85 with human judges).
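As a rough illustration of how the three evaluation dimensions named in the abstract (Intention, Static, Dynamic) could combine into one score, here is a sketch; the 0–1 scale, the weights, and the aggregation rule are our own assumptions, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class MiniAppScores:
    """Per-dimension scores in [0, 1]. Dimension names follow the paper;
    everything else here is an illustrative assumption."""
    intention: float  # does the app match the user's intent?
    static: float     # visual/layout quality of the rendered page
    dynamic: float    # correctness of interaction logic under agent testing

def aggregate(s: MiniAppScores, weights=(0.3, 0.3, 0.4)) -> float:
    """Weighted mean over the three dimensions (weights are hypothetical)."""
    w_i, w_s, w_d = weights
    return w_i * s.intention + w_s * s.static + w_d * s.dynamic

# Example: an app that renders well but has buggy interactions.
score = aggregate(MiniAppScores(intention=0.9, static=0.8, dynamic=0.4))
```

A scheme like this makes the "difficulty cliff" interpretable: a model can score highly on the Static dimension yet be dragged down by Dynamic failures on tasks with complex state transitions.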
Zero Integration Cost: Test Your Model in 5 Mins
We know configuring evaluation harnesses is a pain. That's why we open-sourced the entire end-to-end scaffolding.
Just bring your OpenAI-compatible API Key. No extra parsing scripts needed.
# 1. Clone & Install
git clone https://github.com/MiniAppBench/miniappbench.git
cd miniappbench && pip install -r requirements.txt
playwright install chromium
# 2. Run the full pipeline (Generation -> Agentic Evaluation)
python -m examples.pipeline --query-file data/query_validation_100.json --batch "1-5"
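Under the hood, "OpenAI-compatible" means the generation step speaks the standard `/v1/chat/completions` request shape. As a hedged sketch (the model name, system prompt, and field values are placeholders; the repo's own client code may differ), a request body looks like this:

```python
import json

def build_chat_request(query: str, model: str = "your-model-name") -> str:
    """Build a standard OpenAI-compatible chat-completions payload asking
    for a single-file HTML MiniApp. Prompt wording is illustrative."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Return one self-contained HTML file implementing the app."},
            {"role": "user", "content": query},
        ],
    }
    return json.dumps(payload)

body = build_chat_request("Build a simple pendulum physics simulator.")
```

Because the payload is the industry-standard shape, any endpoint that accepts it (hosted or self-served) can plug into the pipeline with just an API key and base URL.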
Links
- Project Page: https://miniappbench.github.io/
- Interactive Leaderboard: https://huggingface.co/spaces/MiniAppBench/Leaderboard
- GitHub Repo: https://github.com/MiniAppBench/miniappbench
- Dataset: https://huggingface.co/datasets/MiniAppBench/Dataset
We'd love to hear your thoughts, especially on the performance divergence between models on "Hard" interactive tasks!