SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
Abstract
SwiftI2V is an efficient high-resolution image-to-video generation framework that uses conditional segment-wise generation and bidirectional contextual interaction to achieve scalable, input-faithful video synthesis with reduced computational requirements.
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution the task becomes extremely challenging, and existing solutions suffer from clear weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution model tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To address this, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency–fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by that motion to recover input-faithful details with controlled overhead. To make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG), which synthesizes videos segment by segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202×. In particular, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
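To make the segment-wise idea concrete, the sketch below shows one plausible shape of such a loop. Every name in it (`denoise_segment`, the segment length, the overlap handling) is an illustrative assumption, not the released SwiftI2V implementation.

```python
# Illustrative sketch of Conditional Segment-wise Generation (CSG).
# All names (denoise_segment, generate_video, ...) are hypothetical placeholders.
import torch

def generate_video(input_image, motion_reference, denoise_segment,
                   num_segments=4, seg_len=16, overlap=4):
    """Generate a long video segment-by-segment with a bounded per-step token budget.

    input_image:      conditioning image latents (anchor), shape [C, H, W]
    motion_reference: low-resolution Stage-I motion latents, shape [T, C, h, w]
    denoise_segment:  callable that denoises one segment given its conditions
    """
    segments = []
    context = input_image.unsqueeze(0)  # the anchor seeds the first segment

    for s in range(num_segments):
        # Slice of the motion reference aligned with this segment.
        t0 = s * (seg_len - overlap)
        motion_s = motion_reference[t0:t0 + seg_len]

        # Bidirectional contextual interaction: every frame in the segment attends
        # to the anchor, the motion slice, and the other frames of the same segment
        # (not only causally to past frames).
        seg = denoise_segment(noise=torch.randn(seg_len, *input_image.shape),
                              anchor=input_image,
                              motion=motion_s,
                              context=context)

        # Carry the last `overlap` frames forward as context for the next segment,
        # so the per-step token budget stays bounded regardless of video length.
        context = seg[-overlap:]
        segments.append(seg if s == 0 else seg[overlap:])

    return torch.cat(segments, dim=0)
```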
Community
We propose SwiftI2V, an efficient framework for high-resolution (2K) I2V generation that decouples motion modeling from detail synthesis via progressive segment-wise generation and bidirectional contextual interaction. SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202×, enabling practical 2K I2V on a single RTX 4090.
The segment-wise generation with bidirectional conditioning is the most interesting bit here: it keeps the per-step token budget bounded while letting the anchored input and motion reference talk to the evolving latents. I wonder about the 3D VQ stage that reduces to 2k tokens for Stage II; that compression feels like the tightrope between fidelity and capacity, and I'd love to see a solid ablation on how much detail is sacrificed in practice. Btw, the arxivlens breakdown helped me parse the trickier bits, especially how the overlap between segments mitigates drift without blowing up compute. Have you tried removing the bidirectional loop within a segment to see if coherence collapses, or is most of the benefit coming from the motion conditioning and the anchor? If this scales to 2K in a real prod workflow on a single consumer GPU, we might finally have a practical path from diffusion to high-res video editing.
Thanks for the kind words! Quick note: the "2K" in our paper is the output resolution (2560×1408), not a 2k-token budget — we build on Wan2.2, whose 3D VAE uses a high spatial compression ratio (16×16×4) to keep Stage II tractable.
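For intuition on why that compression ratio matters, here is back-of-the-envelope arithmetic for the number of tokens Stage II attends over per step at 2560×1408. The frame count and the first-frame handling below are illustrative assumptions, not figures from the paper.

```python
# Rough token-count arithmetic for a 16x16x4 3D VAE at 2560x1408 output.
W, H = 2560, 1408            # output resolution ("2K")
sx = sy = 16                 # spatial compression per axis
st = 4                       # temporal compression

tokens_per_latent_frame = (W // sx) * (H // sy)    # 160 * 88 = 14,080
frames = 81                                        # assumed clip length for illustration
latent_frames = 1 + (frames - 1) // st             # 21, assuming the first frame is encoded separately
total_tokens = tokens_per_latent_frame * latent_frames
print(tokens_per_latent_frame, latent_frames, total_tokens)  # 14080 21 295680
```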
On fidelity: we did run this ablation — see Appendix C.8. On 93 clips from 31 high-quality 4K sources, at 2K the VAE reaches PSNR 35.25, SSIM 0.941, LPIPS 0.049, and the input vs. reconstruction PSD curves nearly overlap even above normalized frequency 0.6. So high-frequency detail is largely preserved.
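If you want to reproduce that spectral check on your own clips, a radially averaged PSD curve can be computed roughly as follows; this is a generic sketch, not our evaluation script.

```python
# Generic sketch: radially averaged PSD of a 2D grayscale array, for comparing an
# input frame against its VAE reconstruction.
import numpy as np

def radial_psd(img):
    """Return (normalized_freq, mean_power) for a 2D array `img`."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    cy, cx = h // 2, w // 2
    # Radius from the DC component, normalized so the Nyquist frequency is ~1.0.
    r_norm = np.hypot(yy - cy, xx - cx) / (min(h, w) / 2)
    bins = np.linspace(0.0, 1.0, 65)
    idx = np.digitize(r_norm.ravel(), bins)
    power_flat = power.ravel()
    psd = np.array([power_flat[idx == i].mean() if np.any(idx == i) else 0.0
                    for i in range(1, len(bins))])
    return bins[1:], psd

# Overlay the input and reconstruction curves; overlap above normalized frequency
# 0.6 indicates that high-frequency detail survives the VAE round trip.
```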
On the bidirectional loop: also ablated — Table 4, w/o bi-interaction (AR-style read-only context, anchor + motion ref kept) drops Total Score 6.424 → 6.392, with a clear temporal-consistency hit. Appendix M backs this up: NIQE-vs-step slope is 0.050 (ours) vs 0.093 (AR), so the bidirectional loop — not just motion/anchor — is what keeps drift bounded.
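For clarity, the NIQE-vs-step slope is just a least-squares line fitted to per-step NIQE scores; a minimal sketch, assuming the scores come from an external NIQE implementation (e.g., pyiqa):

```python
# Minimal sketch of the NIQE-vs-step drift slope: fit a line to per-step NIQE
# scores and report the slope. How NIQE itself is computed is left to an external
# toolbox; no paper results are embedded here.
import numpy as np

def drift_slope(niqe_per_step):
    steps = np.arange(len(niqe_per_step))
    slope, _intercept = np.polyfit(steps, niqe_per_step, deg=1)
    return slope

# A flatter slope means quality degrades more slowly across segments,
# i.e. less autoregressive drift.
```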
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors (2026)
- PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference (2026)
- SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation (2026)
- OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder (2026)
- ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis (2026)
- InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model (2026)
- GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution (2026)