HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

Abstract

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512×512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts.

Paper for saeed-5959/high_sync

Paper • 2605.16918 • Published 4 days ago