Abstract
EgoForge is an egocentric goal-directed world simulator that generates coherent first-person video rollouts from minimal static inputs using trajectory-level reward-guided refinement during diffusion sampling.
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision such as camera trajectories, long video prefixes, or synchronized multi-camera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show that EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, as well as robust performance in real-world smart-glasses experiments.
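The abstract does not spell out how VideoDiffusionNFT's trajectory-level reward guidance is implemented, so the snippet below is only a minimal, generic sketch of reward-guided refinement during diffusion sampling: at each denoising step, a composite reward on the predicted clean rollout (placeholder surrogates for goal completion, temporal causality, and scene consistency; perceptual fidelity is omitted) nudges the sample along its gradient. The denoiser, noise schedule, reward terms, and guidance rule are all illustrative assumptions, not EgoForge's actual method.

```python
# Minimal, assumption-laden sketch of trajectory-level reward-guided refinement
# during diffusion sampling; it is NOT EgoForge's VideoDiffusionNFT.
import torch

NUM_STEPS = 50  # toy number of denoising steps

def denoise_step(model, x_t, t):
    """One toy DDIM-style step: predict the clean trajectory and move toward it."""
    x0_hat = model(x_t, t)                      # predicted clean video rollout
    alpha = 1.0 - t / NUM_STEPS                 # placeholder noise schedule
    x_prev = alpha * x0_hat + (1.0 - alpha) * x_t
    return x_prev, x0_hat

def trajectory_reward(x0_hat, goal_emb):
    """Composite reward over the whole rollout; every term is a stand-in surrogate."""
    goal_completion    = -(x0_hat.mean(dim=(1, 2, 3, 4)) - goal_emb.mean()).pow(2)
    temporal_causality = -(x0_hat[:, 1:] - x0_hat[:, :-1]).pow(2).mean(dim=(1, 2, 3, 4))
    scene_consistency  = -(x0_hat - x0_hat[:, :1]).pow(2).mean(dim=(1, 2, 3, 4))
    return goal_completion + temporal_causality + scene_consistency

def sample_with_reward_guidance(model, goal_emb, shape, guidance_scale=0.1):
    """Sample a rollout while nudging each step along the trajectory-reward gradient."""
    x_t = torch.randn(shape)                    # (B, T, C, H, W) noisy trajectory
    for t in reversed(range(1, NUM_STEPS + 1)):
        x_t = x_t.detach().requires_grad_(True)
        x_prev, x0_hat = denoise_step(model, x_t, t)
        reward = trajectory_reward(x0_hat, goal_emb).sum()
        grad = torch.autograd.grad(reward, x_t)[0]
        x_t = (x_prev + guidance_scale * grad).detach()
    return x_t

# Toy usage: a stand-in "denoiser" and goal embedding, just to run the loop.
dummy_model = lambda x, t: 0.9 * x
goal_emb = torch.randn(8)
rollout = sample_with_reward_guidance(dummy_model, goal_emb, shape=(1, 16, 3, 32, 32))
```

The gradient-guidance rule here is one common way to fold a reward into sampling; the paper's refinement could equally be implemented as fine-tuning or reranking, which the abstract does not specify.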
Community
Given a single smart-glasses egocentric image, a high-level goal instruction, and an auxiliary exocentric view, EgoForge generates egocentric rollouts that follow user intent and preserve scene structure, without requiring dense supervision such as camera trajectories, poses, long video prefixes, or synchronized multi-view capture streams.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis (2026)
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures (2026)
- HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles (2026)
- ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics (2026)
- Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints (2026)
- PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance (2026)
- FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation (2026)