arxiv:2603.28766

HandX: Scaling Bimanual Motion and Interaction Generation

Published on Mar 30 · Submitted by Sirui Xu on Mar 31

Abstract

HandX presents a comprehensive foundation for bimanual hand motion synthesis, comprising a new dataset, an annotation method, and evaluation metrics for dexterous motion generation.

AI-generated summary

Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior: finger articulation, contact timing, and inter-hand coordination. Existing resources likewise lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that first extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
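
The paper provides no code here, but the decoupled annotation strategy is concrete enough to sketch. The snippet below illustrates the two stages: first deriving symbolic kinematic features (contact events, finger flexion) from raw joint trajectories, then grounding an LLM prompt in those features rather than in raw motion. Every function, threshold, and joint-layout choice is a hypothetical illustration, not HandX's actual pipeline.

```python
# Minimal sketch of the decoupled annotation idea: derive symbolic
# kinematic features first, then let an LLM verbalize only what those
# features support. All names, thresholds, and joint layouts here are
# hypothetical illustrations, not HandX's actual pipeline.
import json
import numpy as np

def extract_features(joints_l, joints_r, contact_thresh=0.015):
    """Summarize two (T, J, 3) joint trajectories as symbolic features."""
    # Treat the last five joints of each hand as fingertips (an assumption).
    tips_l, tips_r = joints_l[:, -5:], joints_r[:, -5:]

    # Contact events: frames where any left fingertip comes within
    # `contact_thresh` meters of any right fingertip (a crude proxy).
    dists = np.linalg.norm(tips_l[:, :, None] - tips_r[:, None, :], axis=-1)
    contact_frames = np.where(dists.min(axis=(1, 2)) < contact_thresh)[0]

    # Finger flexion: mean fingertip-to-wrist distance as a rough
    # open/closed signal, assuming joint 0 is the wrist.
    flex_l = float(np.linalg.norm(tips_l - joints_l[:, :1], axis=-1).mean())
    flex_r = float(np.linalg.norm(tips_r - joints_r[:, :1], axis=-1).mean())

    return {
        "contact_frames": contact_frames.tolist(),
        "left_mean_extension_m": round(flex_l, 3),
        "right_mean_extension_m": round(flex_r, 3),
    }

def build_prompt(features):
    """Ground the LLM in extracted features instead of raw motion."""
    return (
        "Describe this bimanual hand motion in one fine-grained sentence. "
        "State only what the features below support:\n"
        + json.dumps(features, indent=2)
    )
```

Keeping the LLM downstream of explicit feature extraction is what anchors the descriptions to real contact events, a point the community note below also stresses.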

Community

Paper submitter

Most motion generation methods treat hands as rigid afterthoughts, yet hands are how we interact with the world — precise finger articulation, contact timing, and bimanual coordination all matter. The bottleneck is data: existing datasets either have full-body scale with no finger detail, or hand detail with no interaction richness. HandX bridges this gap.

🔬 54.2 hours / 5.9M frames of high-fidelity bimanual motion with dense finger articulation and rich inter-hand contact
✍️ 490K text annotations produced by decoupling kinematic feature extraction from LLM reasoning, grounded in real contact events rather than hallucinated narratives
📈 Clear log-linear scaling trend (R² = 0.96): matched increases in data and model capacity improve generation, while over-scaling the model alone hurts (see the fitting sketch after this list)
🤖 Generated motions demonstrated in IsaacGym and MuJoCo, then transferred to a real humanoid with dexterous hands: from text to physical dexterity
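
The R² = 0.96 figure above comes from the paper; for readers unfamiliar with how such a log-linear trend is quantified, here is a short sketch of the fitting procedure. The (scale, score) pairs are made-up placeholders, not HandX's measurements.

```python
# How a log-linear scaling trend and its R^2 can be checked with NumPy.
# The (scale, score) pairs are made-up placeholders, NOT the paper's
# measurements; only the fitting procedure is illustrated.
import numpy as np

scale = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # e.g. training compute
score = np.array([0.41, 0.48, 0.56, 0.62, 0.70])  # e.g. a semantic metric

x = np.log10(scale)
slope, intercept = np.polyfit(x, score, 1)  # fit: score = a * log10(scale) + b

pred = slope * x + intercept
ss_res = ((score - pred) ** 2).sum()
ss_tot = ((score - score.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

print(f"score ~ {slope:.3f} * log10(scale) + {intercept:.3f}, R^2 = {r2:.3f}")
```

An R² near 1 on the log axis is what "clear log-linear scaling" means in practice; the caveat that over-scaling the model alone hurts implies the trend holds only when data and capacity grow together.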

A single model supports versatile generation tasks: text-to-motion, inbetweening, trajectory control, keyframe guidance, hand-reaction synthesis, and long-horizon generation.
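
The page does not say how one model serves all these conditioning modes, but a common mechanism in motion-generation work is inpainting-style masking, where observed frames or channels are held fixed and the rest are generated. A minimal sketch under that assumption follows; the shapes, mask layouts, and sampler interface are all hypothetical, not HandX's actual API.

```python
# One common way a single model serves all of these tasks is
# inpainting-style conditioning: a binary mask marks which parts of the
# motion tensor are observed, and the model generates the rest. The
# shapes, mask layouts, and sampler interface below are assumptions for
# illustration, not HandX's actual API.
import torch

T, D = 120, 96  # frames and per-frame pose dimension (hypothetical)

def task_mask(task: str) -> torch.Tensor:
    mask = torch.zeros(T, D)  # 1 = observed/conditioned, 0 = to generate
    if task == "inbetweening":
        mask[:10] = 1    # first 10 frames fixed
        mask[-10:] = 1   # last 10 frames fixed
    elif task == "keyframe_guidance":
        mask[::30] = 1   # every 30th frame is a keyframe
    elif task == "trajectory_control":
        mask[:, :3] = 1  # e.g. wrist-trajectory channels fixed
    # "text_to_motion": all-zero mask; the text prompt alone conditions.
    return mask

# A conditional sampler would then blend observed values with generations:
#   x = mask * observed + (1 - mask) * model.sample(text, observed, mask)
```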


Get this paper in your agent:

hf papers read 2603.28766
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
