arxiv:2604.21375

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Published on Apr 23 · Submitted by Cihang Xie on Apr 24

Abstract

AI-generated summary: VLAA-GUI is a modular GUI agent framework that addresses early stopping and repetitive loop issues through integrated components for verification, loop breaking, and search capabilities.

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step, backed by an agent-level verifier that cross-examines completion claims with decision rules and rejects those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand. We evaluate VLAA-GUI across five top-tier backbones, including Claude Opus 4.5, Claude Opus 4.6, and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
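
As a rough illustration of the Completeness Verifier idea (accept a finish claim only when every success criterion has direct visual evidence), the sketch below uses a simple rule-based gate. It is not the paper's implementation: in VLAA-GUI the check is performed by a verifier model, and the names `Criterion` and `verify_finish`, as well as the rule that a claim with no criteria is rejected, are assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Criterion:
    description: str          # a UI-observable success criterion (hypothetical structure)
    visually_confirmed: bool  # True only if directly visible in the latest screenshot


def verify_finish(criteria: List[Criterion]) -> Tuple[bool, List[str]]:
    """Gate a 'done' claim: accept only if every criterion has visual evidence."""
    if not criteria:
        # Assumed rule: a claim with no stated criteria is never accepted.
        return False, ["no success criteria were stated"]
    missing = [c.description for c in criteria if not c.visually_confirmed]
    return (not missing), missing


claim = [
    Criterion("file 'report.pdf' appears in the Downloads folder", True),
    Criterion("export dialog has closed", False),
]
accepted, missing = verify_finish(claim)
print(accepted, missing)  # False ['export dialog has closed']
```

A rejected claim would send the agent back to acting rather than ending the episode, which is the behavior the abstract describes as rejecting claims that lack direct visual evidence.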

Community

Paper submitter

Autonomous GUI agents suffer from two chronic failure modes: early stopping (declaring success before the task is actually done) and repetitive loops (cycling through the same failing action without recovering). VLAA-GUI is a modular framework with three integrated components that tell the agent when to STOP, RECOVER, and SEARCH:

  1. A mandatory Completeness Verifier enforces UI-observable success criteria at every finish step, double-checked by an independent verifier model.
  2. A mandatory Loop Breaker detects repeated actions / recurring screen states / reflection-signaled stalls and escalates across three tiers.
  3. An on-demand Search Agent queries a search-grounded LLM directly in text, skipping the overhead of browser-based visual search.
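
The three Loop Breaker triggers listed above (repeated actions, recurring screen states, reflection-signaled stalls) can be pictured with a small sketch. Everything here is an assumption for illustration, not the authors' code: the class name `LoopBreaker`, the thresholds, the use of a screen hash string to stand in for a screen state, and the returned action labels are all hypothetical.

```python
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class LoopBreaker:
    # All thresholds are illustrative assumptions, not values from the paper.
    max_repeated_actions: int = 2    # consecutive identical actions tolerated
    max_screen_recurrences: int = 4  # times one screen state may recur in the window
    window: int = 10                 # number of recent steps remembered
    actions: Deque[str] = field(default_factory=deque)
    screens: Deque[str] = field(default_factory=deque)

    def observe(self, action: str, screen_hash: str,
                reflection_stalled: bool = False) -> Optional[str]:
        """Record one step and return an escalation label, or None."""
        self.actions.append(action)
        self.screens.append(screen_hash)
        while len(self.actions) > self.window:
            self.actions.popleft()
            self.screens.popleft()

        # Reflection reports a stall: bind it to a strategy shift.
        if reflection_stalled:
            return "change_strategy"
        # The same screen state keeps recurring: force a strategy change.
        if Counter(self.screens)[screen_hash] >= self.max_screen_recurrences:
            return "change_strategy"
        # The same action was repeated back to back: switch interaction mode.
        recent = list(self.actions)[-self.max_repeated_actions:]
        if len(recent) == self.max_repeated_actions and len(set(recent)) == 1:
            return "switch_interaction_mode"
        return None


lb = LoopBreaker()
results = [
    lb.observe("click(10,20)", "s1"),
    lb.observe("click(10,20)", "s1"),
    lb.observe("key(enter)", "s1"),
    lb.observe("type(x)", "s1"),
]
print(results)  # [None, 'switch_interaction_mode', None, 'change_strategy']
```

The sketch checks the stronger signals first so that a persistent stall is not masked by the milder "switch interaction mode" response; how the real system orders and combines its tiers is not specified in this post.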

Paired with five top-tier backbones, VLAA-GUI reaches 77.5% on OSWorld-Verified with Claude Opus 4.6 and 61.0% on WindowsAgentArena. Three of five backbones surpass human-level (72.4%) on OSWorld in a single pass, and VLAA-GUI with Sonnet 4.6 at only 15 action steps already outperforms the best published 50-step system.


