SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
Abstract
SwiftI2V is an efficient high-resolution image-to-video generation framework that uses conditional segment-wise generation and bidirectional contextual interaction to achieve scalable, input-faithful video synthesis with reduced computational requirements.
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution the task becomes extremely challenging, and existing solutions suffer from clear weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution model tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To address this, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency–fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by that motion to recover input-faithful details with controlled overhead. To make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG), which synthesizes videos segment by segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202×. In particular, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
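To make the segment-wise idea concrete, the sketch below shows one plausible shape of such a loop. Every name in it (`denoise_segment`, the segment length, the overlap handling) is an illustrative assumption, not the released SwiftI2V implementation.

```python
# Illustrative sketch of Conditional Segment-wise Generation (CSG).
# All names (denoise_segment, generate_video, ...) are hypothetical placeholders.
import torch

def generate_video(input_image, motion_reference, denoise_segment,
                   num_segments=4, seg_len=16, overlap=4):
    """Generate a long video segment-by-segment with a bounded per-step token budget.

    input_image:      conditioning image latents (anchor), shape [C, H, W]
    motion_reference: low-resolution Stage-I motion latents, shape [T, C, h, w]
    denoise_segment:  callable that denoises one segment given its conditions
    """
    segments = []
    context = input_image.unsqueeze(0)  # the anchor seeds the first segment

    for s in range(num_segments):
        # Slice of the motion reference aligned with this segment.
        t0 = s * (seg_len - overlap)
        motion_s = motion_reference[t0:t0 + seg_len]

        # Bidirectional contextual interaction: every frame in the segment attends
        # to the anchor, the motion slice, and the other frames of the same segment
        # (not only causally to past frames).
        seg = denoise_segment(noise=torch.randn(seg_len, *input_image.shape),
                              anchor=input_image,
                              motion=motion_s,
                              context=context)

        # Carry the last `overlap` frames forward as context for the next segment,
        # so the per-step token budget stays bounded regardless of video length.
        context = seg[-overlap:]
        segments.append(seg if s == 0 else seg[overlap:])

    return torch.cat(segments, dim=0)
```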
Community
We propose SwiftI2V, an efficient framework for high-resolution (2K) I2V generation that decouples motion modeling from detail synthesis via progressive segment-wise generation and bidirectional contextual interaction. SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202×, enabling practical 2K I2V on a single RTX 4090.
The segment-wise generation with bidirectional conditioning is the most interesting bit here: it keeps the per-step token budget bounded while letting the anchored input and motion reference talk to the evolving latents. I wonder about the 3D VQ stage that reduces to 2k tokens for Stage II; that compression feels like the tightrope between fidelity and capacity, and I'd love to see a solid ablation on how much detail is sacrificed in practice. Btw, the arxivlens breakdown helped me parse the trickier bits, especially how the overlap between segments mitigates drift without blowing up compute. Have you tried removing the bidirectional loop within a segment to see if coherence collapses, or is most of the benefit coming from the motion conditioning and the anchor? If this scales to 2K in a real prod workflow on a single consumer GPU, we might finally have a practical path from diffusion to high-res video editing.
Thanks for the kind words! Quick note: the "2K" in our paper is the output resolution (2560×1408), not a 2k-token budget — we build on Wan2.2, whose 3D VAE uses a high spatial compression ratio (16×16×4) to keep Stage II tractable.
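For intuition on why that compression ratio matters, here is back-of-the-envelope arithmetic for the number of tokens Stage II attends over per step at 2560×1408. The frame count and the first-frame handling below are illustrative assumptions, not figures from the paper.

```python
# Rough token-count arithmetic for a 16x16x4 3D VAE at 2560x1408 output.
W, H = 2560, 1408            # output resolution ("2K")
sx = sy = 16                 # spatial compression per axis
st = 4                       # temporal compression

tokens_per_latent_frame = (W // sx) * (H // sy)    # 160 * 88 = 14,080
frames = 81                                        # assumed clip length for illustration
latent_frames = 1 + (frames - 1) // st             # 21, assuming the first frame is encoded separately
total_tokens = tokens_per_latent_frame * latent_frames
print(tokens_per_latent_frame, latent_frames, total_tokens)  # 14080 21 295680
```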
On fidelity: we did run this ablation — see Appendix C.8. On 93 clips from 31 high-quality 4K sources, at 2K the VAE reaches PSNR 35.25, SSIM 0.941, LPIPS 0.049, and the input vs. reconstruction PSD curves nearly overlap even above normalized frequency 0.6. So high-frequency detail is largely preserved.
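If you want to reproduce that spectral check on your own clips, a radially averaged PSD curve can be computed roughly as follows; this is a generic sketch, not our evaluation script.

```python
# Generic sketch: radially averaged PSD of a 2D grayscale array, for comparing an
# input frame against its VAE reconstruction.
import numpy as np

def radial_psd(img):
    """Return (normalized_freq, mean_power) for a 2D array `img`."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    cy, cx = h // 2, w // 2
    # Radius from the DC component, normalized so the Nyquist frequency is ~1.0.
    r_norm = np.hypot(yy - cy, xx - cx) / (min(h, w) / 2)
    bins = np.linspace(0.0, 1.0, 65)
    idx = np.digitize(r_norm.ravel(), bins)
    power_flat = power.ravel()
    psd = np.array([power_flat[idx == i].mean() if np.any(idx == i) else 0.0
                    for i in range(1, len(bins))])
    return bins[1:], psd

# Overlay the input and reconstruction curves; overlap above normalized frequency
# 0.6 indicates that high-frequency detail survives the VAE round trip.
```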
On the bidirectional loop: also ablated — Table 4, w/o bi-interaction (AR-style read-only context, anchor + motion ref kept) drops Total Score 6.424 → 6.392, with a clear temporal-consistency hit. Appendix M backs this up: NIQE-vs-step slope is 0.050 (ours) vs 0.093 (AR), so the bidirectional loop — not just motion/anchor — is what keeps drift bounded.
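For clarity, the NIQE-vs-step slope is just a least-squares line fitted to per-step NIQE scores; a minimal sketch, assuming the scores come from an external NIQE implementation (e.g., pyiqa):

```python
# Minimal sketch of the NIQE-vs-step drift slope: fit a line to per-step NIQE
# scores and report the slope. How NIQE itself is computed is left to an external
# toolbox; no paper results are embedded here.
import numpy as np

def drift_slope(niqe_per_step):
    steps = np.arange(len(niqe_per_step))
    slope, _intercept = np.polyfit(steps, niqe_per_step, deg=1)
    return slope

# A flatter slope means quality degrades more slowly across segments,
# i.e. less autoregressive drift.
```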
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors (2026)
- PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference (2026)
- SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation (2026)
- OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder (2026)
- ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis (2026)
- InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model (2026)
- GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution (2026)