Abstract
HandX presents a comprehensive foundation for bimanual hand motion synthesis, including a new dataset, an annotation method, and evaluation metrics for dexterous motion generation.
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior: finger articulation, contact timing, and inter-hand coordination. Existing resources likewise lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that first extracts representative motion features, e.g., contact events and finger flexion, and then leverages the reasoning of large language models to produce fine-grained, semantically rich descriptions aligned with those features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
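The abstract mentions newly proposed hand-focused metrics without defining them here. As a rough illustration of what a contact-aware hand metric can look like, the sketch below scores contact-timing agreement between a generated and a reference sequence; the 1 cm threshold, array shapes, and function names are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical sketch of a hand-focused metric: contact-timing F1.
# Thresholds and names are illustrative assumptions, not HandX's definitions.
import numpy as np

def contact_events(fingertip_dist, thresh=0.01):
    # Binary per-frame contact mask from fingertip-object distances (meters).
    return fingertip_dist < thresh          # (T, n_tips) boolean array

def contact_timing_f1(gen_dist, ref_dist, thresh=0.01):
    # F1 between generated and reference contact masks: rewards matching
    # both which fingertips touch and when they touch.
    gen_c = contact_events(gen_dist, thresh)
    ref_c = contact_events(ref_dist, thresh)
    tp = np.logical_and(gen_c, ref_c).sum()
    precision = tp / max(gen_c.sum(), 1)
    recall = tp / max(ref_c.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Toy check: 120 frames, 10 fingertips (two hands), distances in meters.
rng = np.random.default_rng(0)
ref = rng.uniform(0.0, 0.05, size=(120, 10))
gen = ref + rng.normal(0.0, 0.005, size=ref.shape)  # mildly noisy "generation"
print(f"contact-timing F1: {contact_timing_f1(gen, ref):.3f}")
```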
🔗Website: https://handx-project.github.io/
📄Paper: https://arxiv.org/abs/2603.28766
🧑🏻‍💻Code & Data: https://github.com/handx-project/HandX
Most motion generation methods treat hands as rigid afterthoughts, yet hands are how we interact with the world: precise finger articulation, contact timing, and bimanual coordination all matter. The bottleneck is data: existing datasets offer either full-body scale with no finger detail, or hand detail without interaction richness. HandX bridges this gap.
🔬 54.2 hours / 5.9M frames of high-fidelity bimanual motion with dense finger articulation and rich inter-hand contact
✍️ 490K text annotations produced by decoupling kinematic feature extraction from LLM reasoning, so descriptions are grounded in real contact events rather than hallucinated narratives (sketched in code after this list)
📈 A clear log-linear scaling trend (R² = 0.96): matched increases in data and model capacity improve generation, while over-scaling the model alone hurts (see the fit sketch after this list)
🤖 Generated motions demonstrated in IsaacGym and MuJoCo, then transferred to a real humanoid with dexterous hands: from text to physical dexterity
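The decoupled annotation strategy from the second highlight can be sketched in a few lines: derive symbolic kinematic events (contacts, flexion) from the mocap first, then let an LLM phrase only those events. The joint indices, thresholds, and prompt below are illustrative assumptions, not the released pipeline.

```python
# Illustrative sketch of decoupled annotation: kinematic features feed an
# LLM prompt. Joint layout, thresholds, and prompt text are assumptions.
import numpy as np

TIP_IDX = [4, 8, 12, 16, 20]                 # assumed MANO-style fingertip ids
CHAINS = [(1, 2, 4), (5, 6, 8), (9, 10, 12),
          (13, 14, 16), (17, 18, 20)]        # assumed (MCP, PIP, tip) triplets

def flexion_angle(mcp, pip, tip):
    # Angle at the PIP joint: ~pi when the finger is straight, small when bent.
    u, v = mcp - pip, tip - pip
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def extract_features(joints, obj_pos, contact_thresh=0.01):
    # joints: (T, n_joints, 3); obj_pos: (T, 3). Emit symbolic per-frame events.
    feats = []
    for t, frame in enumerate(joints):
        touching = np.linalg.norm(frame[TIP_IDX] - obj_pos[t], axis=1) < contact_thresh
        flexed = [flexion_angle(frame[m], frame[p], frame[d]) < np.pi / 2
                  for m, p, d in CHAINS]     # "strongly flexed" cutoff, assumed
        feats.append({"frame": t, "contacts": touching.tolist(), "flexed": flexed})
    return feats

def build_prompt(feats):
    # Only measured events reach the LLM, so its descriptions stay grounded
    # in real contacts instead of hallucinated narrative.
    events = "\n".join(f"frame {f['frame']}: contacts={f['contacts']}, "
                       f"flexed={f['flexed']}" for f in feats)
    return ("Describe this bimanual hand motion concisely, using only the "
            "listed kinematic events:\n" + events)
```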
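The log-linear trend in the third highlight amounts to an ordinary least-squares fit of a quality metric against log data scale. Here is a minimal reproduction recipe with placeholder numbers, since the paper's raw points are not listed on this page.

```python
# Minimal sketch of fitting a log-linear scaling trend; the data points
# below are placeholders, not HandX's measurements.
import numpy as np

hours = np.array([3.4, 6.8, 13.6, 27.1, 54.2])    # hypothetical training subsets
fid   = np.array([1.92, 1.41, 1.05, 0.74, 0.52])  # hypothetical quality metric

x = np.log(hours)
slope, intercept = np.polyfit(x, fid, deg=1)      # quality linear in log-data
pred = slope * x + intercept
r2 = 1 - np.sum((fid - pred) ** 2) / np.sum((fid - fid.mean()) ** 2)
print(f"slope={slope:.3f}, R^2={r2:.3f}")         # R^2 near 1 => log-linear fit
```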
A single model supports versatile generation tasks: text-to-motion, inbetweening, trajectory control, keyframe guidance, hand-reaction synthesis, and long-horizon generation.
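Multi-task setups like this are often exposed as a single generate call whose optional conditions select the task. The sketch below shows that pattern under stated assumptions; all names and shapes are hypothetical, not HandX's released API.

```python
# Hypothetical interface sketch: one model, task selected by which
# conditions are supplied. Names and shapes are assumptions, not HandX's API.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Conditions:
    text: Optional[str] = None                # text-to-motion
    prefix: Optional[np.ndarray] = None       # long-horizon continuation
    keyframes: Optional[dict] = None          # {frame_idx: pose} guidance
    trajectory: Optional[np.ndarray] = None   # (T, 3) wrist-path control
    other_hand: Optional[np.ndarray] = None   # hand-reaction synthesis

def generate(model, cond: Conditions, num_frames: int = 120) -> np.ndarray:
    # A real system would encode whichever conditions are present into one
    # conditioning stream for the diffusion/autoregressive backbone.
    assert any(v is not None for v in vars(cond).values()), "need a condition"
    return model.sample(cond, num_frames)     # hypothetical backbone call

# Same entry point, different tasks:
#   generate(model, Conditions(text="both hands unscrew a jar lid"))
#   generate(model, Conditions(keyframes={0: pose_a, 119: pose_b}))  # inbetweening
```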
Get this paper in your agent:
```
hf papers read 2603.28766
```

Don't have the latest CLI? Install it with:

```
curl -LsSf https://hf.co/cli/install.sh | bash
```