arXiv:2604.23586

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Published on Apr 26 · Submitted by Hongzhan Lin on May 4
Abstract

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.

AI-generated summary

Talker-T2AV presents an autoregressive diffusion framework for talking head synthesis that separates high-level cross-modal reasoning from low-level modality-specific refinement, improving lip-sync accuracy and cross-modal consistency.
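To make the described split concrete, here is a minimal PyTorch sketch: a shared causal transformer backbone reads one interleaved sequence of audio and video patch tokens, and two lightweight diffusion heads predict noise in frame-level latents conditioned on its hidden states. All names, dimensions, layer counts, and the interleaving scheme are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the paper does not specify these here.
D_MODEL, AUDIO_DIM, VIDEO_DIM = 512, 64, 256

class DiffusionHead(nn.Module):
    """Lightweight DiT-style head: predicts the noise in one frame-level
    latent, conditioned on a backbone hidden state and a timestep."""
    def __init__(self, latent_dim: int, d_model: int, n_layers: int = 2):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim + d_model + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, noisy_latent, cond, t):
        # noisy_latent: (B, latent_dim), cond: (B, d_model), t: (B, 1)
        x = torch.cat([noisy_latent, cond, t], dim=-1)
        h = self.blocks(self.in_proj(x).unsqueeze(1))
        return self.out_proj(h).squeeze(1)  # predicted noise

class TalkerT2AVSketch(nn.Module):
    """Shared autoregressive backbone over a unified audio+video token
    sequence; modality-specific diffusion heads hang off its hidden states."""
    def __init__(self):
        super().__init__()
        self.audio_embed = nn.Linear(AUDIO_DIM, D_MODEL)  # audio patch tokens
        self.video_embed = nn.Linear(VIDEO_DIM, D_MODEL)  # video patch tokens
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.audio_head = DiffusionHead(AUDIO_DIM, D_MODEL)
        self.video_head = DiffusionHead(VIDEO_DIM, D_MODEL)

    def forward(self, audio_latents, video_latents):
        # audio_latents: (B, T, AUDIO_DIM), video_latents: (B, T, VIDEO_DIM)
        a = self.audio_embed(audio_latents)
        v = self.video_embed(video_latents)
        B, T, D = a.shape
        # One causal sequence of per-frame [audio, video] token pairs.
        seq = torch.stack([a, v], dim=2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T).to(a.device)
        h = self.backbone(seq, mask=mask)
        # De-interleave: even positions condition the audio head, odd the video head.
        return h[:, 0::2], h[:, 1::2]
```

Under this reading of the abstract, a standard noise-prediction loss would be attached to each head while gradients also train the shared backbone; keeping the heads small concentrates capacity in the backbone, where the high-level cross-modal reasoning is meant to happen.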

Community

Talker-T2AV improves talking head synthesis by decoupling high-level audio-video reasoning from low-level modality-specific generation. Instead of coupling audio and video throughout denoising, it uses a shared autoregressive backbone for semantic cross-modal modeling and lightweight diffusion heads for audio/video refinement, achieving better lip-sync, visual quality, and audio quality than dual-branch and cascaded baselines.
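As a rough illustration of the inference side (not code from the paper), each new frame could be produced by running a short reverse-diffusion loop in each head, conditioned on the backbone's current hidden state, after which the decoded latents are re-embedded and appended to the autoregressive context. The schedule and update rule below are placeholder assumptions:

```python
import torch

def sample_latent(denoise, cond, latent_dim, n_steps=10):
    """Toy reverse-diffusion loop for one frame-level latent.
    `denoise(x, cond, t)` is any noise-predicting head (e.g. the
    DiffusionHead sketched above); the uniform schedule and crude
    Euler-style update are placeholders, not the paper's sampler."""
    x = torch.randn(cond.shape[0], latent_dim)      # start from pure noise
    for i in reversed(range(n_steps)):
        t = torch.full((cond.shape[0], 1), (i + 1) / n_steps)
        eps = denoise(x, cond, t)                   # predicted noise at time t
        x = x - eps / n_steps                       # one denoising step
    return x

# Per generated frame: the audio and video latents are sampled from their
# own heads, then tokenized and appended to the shared context.
```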


Get this paper in your agent:

hf papers read 2604.23586

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0

No model linking this paper

Cite arxiv.org/abs/2604.23586 in a model README.md to link it from this page.

Datasets citing this paper: 0

No dataset linking this paper

Cite arxiv.org/abs/2604.23586 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Space linking this paper

Cite arxiv.org/abs/2604.23586 in a Space README.md to link it from this page.

Collections including this paper: 0

No Collection including this paper

Add this paper to a collection to link it from this page.