Abstract
A^2RD, an Agentic Auto-Regressive Diffusion architecture, addresses the challenges of long video synthesis through a closed-loop process with memory tracking, adaptive generation, and hierarchical self-improvement.
Synthesizing consistent and coherent long videos remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves a video segment by segment through a Retrieve-Synthesize-Refine-Update cycle. It comprises three core components: (i) a Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes to balance natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that refines each segment at both the frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions designed to stress-test long-horizon consistency. Across public benchmarks and LVBench-C, spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains and also highlight notable improvements in motion and transition smoothness.
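As an illustration of the cycle described above, the following is a minimal sketch of how a Retrieve-Synthesize-Refine-Update loop could be organized. It is not the authors' implementation: every class and function name here (`MultimodalMemory`, `synthesize`, `refine`, `choose_mode`) is a hypothetical placeholder for the paper's components.

```python
# Hypothetical sketch of the Retrieve-Synthesize-Refine-Update cycle from the
# abstract. All names and interfaces are assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class MultimodalMemory:
    """Stands in for the Multimodal Video Memory (component i)."""
    entries: list = field(default_factory=list)

    def retrieve(self, goal):
        # Return context relevant to the next segment (entities,
        # environments, narrative state). Placeholder: return everything.
        return self.entries

    def update(self, segment_summary):
        self.entries.append(segment_summary)

def choose_mode(context):
    # Stand-in for Adaptive Segment Generation's mode switch (component ii),
    # e.g., bootstrapping a new scene vs. continuing the current one.
    return "continuation" if context else "bootstrap"

def synthesize(goal, context, mode):
    # Placeholder for the diffusion-based segment generator.
    return {"goal": goal, "mode": mode, "frames": []}

def refine(segment):
    # Stand-in for Hierarchical Test-Time Self-Improvement (component iii):
    # frame-level then video-level checks to prevent error propagation.
    return segment

def generate_long_video(script_segments):
    memory = MultimodalMemory()
    video = []
    for goal in script_segments:                  # segment-by-segment loop
        context = memory.retrieve(goal)           # Retrieve
        mode = choose_mode(context)
        segment = synthesize(goal, context, mode) # Synthesize
        segment = refine(segment)                 # Refine
        memory.update({"goal": goal, "mode": mode})  # Update
        video.append(segment)
    return video
```

In the paper's terms, `retrieve`/`update` would correspond to the Multimodal Video Memory, `choose_mode` and `synthesize` to Adaptive Segment Generation, and `refine` to Hierarchical Test-Time Self-Improvement.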
Community
We introduce A^2RD, an agentic autoregressive diffusion architecture for long video synthesis that allows diffusion models to synthesize and self-improve long videos, achieving state-of-the-art consistency and narrative coherence over long horizons.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Co-Director: Agentic Generative Video Storytelling (2026)
- Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis (2026)
- DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation (2026)
- Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation (2026)
- Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation (2026)
- Stream-T1: Test-Time Scaling for Streaming Video Generation (2026)
- Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend