arxiv:2603.24440

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Published on Mar 25 · Submitted by taesiri on Mar 26

AI-generated summary

CUA-Suite introduces a large-scale ecosystem of expert video demonstrations and annotations for computer-use agents, providing continuous screen recordings and detailed reasoning annotations to advance desktop automation capabilities.

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equivalent to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
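
To make the superset claim concrete, here is a minimal Python sketch of the kind of lossless downsampling the abstract describes: collapsing a continuous 30 fps cursor trace into the sparse (frame, x, y) click tuples that screenshot-based frameworks consume. The record type and field names are illustrative assumptions, not VideoCUA's actual schema.

```python
from dataclasses import dataclass

# Illustrative record type -- an assumption, not VideoCUA's actual schema.
@dataclass
class CursorSample:
    t: float      # seconds since recording start
    x: int        # cursor x position in pixels
    y: int        # cursor y position in pixels
    click: bool   # True if a click event fired at this sample

def to_sparse_pairs(trace: list[CursorSample], fps: int = 30) -> list[tuple[int, int, int]]:
    """Collapse a continuous cursor trace into sparse (frame, x, y) click
    tuples, the screenshot-plus-coordinates format of sparse datasets.
    The reverse reconstruction is impossible: the trace is a strict superset."""
    return [(round(s.t * fps), s.x, s.y) for s in trace if s.click]

# Example: a 2-second drag sampled at 30 fps, ending in a click.
trace = [CursorSample(t=i / 30, x=320 + 5 * i, y=360, click=(i == 60))
         for i in range(61)]
print(to_sparse_pairs(trace))  # [(60, 620, 360)]
```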

Community

Paper submitter

CUA-Suite provides ~55 hours of continuous expert video demonstrations with rich annotations for desktop computer-use agents, enabling continuous spatial control, grounding, planning, and world-model research.

We built CUA-Suite to capture the full signal. Human experts performed ~10,000 tasks across 87 desktop applications while we recorded everything at 30 fps: the cursor paths, the visual transitions, the drag feedback, the menu animations. The result:

🎥 55 hours / 6M frames of continuous expert video
🖱️ Kinematic cursor traces with millisecond-precision action logs
🧠 Multi-layered reasoning annotations per step
🎯 56K screenshots with 3.6M+ human-verified UI element bounding boxes (GroundCUA)
📊 450-task evaluation benchmark testing grounding + action prediction (UI-Vision)

All data, models, and benchmarks are fully open-source 👇

📄 Paper: https://arxiv.org/abs/2603.24440
🌐 Website: https://cua-suite.github.io
🤗 Dataset: https://huggingface.co/datasets/ServiceNow/VideoCUA
💻 Code: https://github.com/ServiceNow/GroundCUA/tree/main/VideoCUA
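
If you want to peek at the data before committing to a 55-hour download, a quick streaming pass with the 🤗 datasets library works. The split name below is an assumption, so inspect the printed keys rather than trusting any particular schema.

```python
from datasets import load_dataset

# Stream instead of downloading ~55 hours of video up front.
# The "train" split is an assumption; check the dataset card for the real layout.
ds = load_dataset("ServiceNow/VideoCUA", split="train", streaming=True)

for example in ds.take(1):
    # Print the actual field names before relying on any schema.
    print(sorted(example.keys()))
```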
