arxiv:2603.24793

AVControl: Efficient Framework for Training Audio-Visual Controls

Published on Mar 25 · Submitted by Tavi Halperin on Mar 27

Abstract

AI-generated summary: AVControl enables efficient, modular audio-visual generation by training control modalities as separate LoRA adapters on a parallel canvas within LTX-2, achieving superior performance on diverse control tasks while requiring minimal computational resources.

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
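
To make the mechanism concrete, the following is a minimal PyTorch sketch of the parallel-canvas idea described in the abstract, not the released AVControl/LTX-2 code: the control canvas enters as extra tokens concatenated with the video tokens inside self-attention, and the only trainable parameters are small LoRA adapters on the attention projections. The class names (LoRALinear, ParallelCanvasAttention) and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class ParallelCanvasAttention(nn.Module):
    """Self-attention over [video tokens ; control-canvas tokens]."""
    def __init__(self, dim: int, heads: int = 8, rank: int = 16):
        super().__init__()
        self.q = LoRALinear(nn.Linear(dim, dim), rank)
        self.k = LoRALinear(nn.Linear(dim, dim), rank)
        self.v = LoRALinear(nn.Linear(dim, dim), rank)
        self.out = nn.Linear(dim, dim)
        self.heads = heads

    def forward(self, video_tokens, control_tokens):
        # The control canvas rides along as additional tokens, so every
        # video token can attend to the reference signal directly.
        x = torch.cat([video_tokens, control_tokens], dim=1)
        b, n, d = x.shape
        h = self.heads
        q = self.q(x).view(b, n, h, d // h).transpose(1, 2)
        k = self.k(x).view(b, n, h, d // h).transpose(1, 2)
        v = self.v(x).view(b, n, h, d // h).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (d // h) ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, n, d)
        # Keep only the video positions; the canvas is read-only context.
        return self.out(y)[:, : video_tokens.shape[1]]

# Toy usage: 64 video tokens conditioned on a 64-token depth canvas.
layer = ParallelCanvasAttention(dim=128)
out = layer(torch.randn(2, 64, 128), torch.randn(2, 64, 128))
print(out.shape)  # torch.Size([2, 64, 128])

Because the LoRA up-projection is initialized to zero, the adapted model starts out identical to the frozen base model, which is the standard way to add a control pathway without disturbing pretrained behavior.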

Community

A lightweight, extendable framework built on LTX-2 for training diverse audio-visual controls using LoRA adapters on a parallel canvas. Each control modality is trained independently as a separate LoRA on a small dataset with a short training run, so new controls can be added or swapped without retraining the base model (a toy sketch of this modularity follows the list below).
The framework supports 13+ modalities:

  • Spatially-aligned controls — depth, pose, edges
  • Camera trajectory control — from image or video, including intrinsics
  • Sparse motion tracking
  • Video editing — inpainting, outpainting, local edit, detailing
  • Audio-visual applications — audio intensity, speech-to-ambient, who-is-talking
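
Because each control is a self-contained LoRA over the same frozen base model, switching modalities amounts to swapping low-rank weight deltas in and out. The toy sketch below illustrates that modularity with random tensors; the apply_lora/remove_lora helpers and the per-modality checkpoints are hypothetical stand-ins, not the released AVControl API.

import torch
import torch.nn as nn

def apply_lora(linear: nn.Linear, down: torch.Tensor, up: torch.Tensor, scale: float = 1.0):
    """Merge a low-rank update (up @ down) into a frozen linear layer in place."""
    with torch.no_grad():
        linear.weight += scale * (up @ down)

def remove_lora(linear: nn.Linear, down: torch.Tensor, up: torch.Tensor, scale: float = 1.0):
    """Undo a previously merged update, restoring the base weights."""
    with torch.no_grad():
        linear.weight -= scale * (up @ down)

# Toy base layer standing in for one frozen attention projection.
proj = nn.Linear(128, 128)
base = proj.weight.detach().clone()

# Pretend these low-rank factors came from per-modality checkpoints.
depth_lora = (torch.randn(16, 128) * 0.01, torch.randn(128, 16) * 0.01)
pose_lora = (torch.randn(16, 128) * 0.01, torch.randn(128, 16) * 0.01)

apply_lora(proj, *depth_lora)   # generate with depth control...
remove_lora(proj, *depth_lora)  # ...then switch modality
apply_lora(proj, *pose_lora)    # generate with pose control
remove_lora(proj, *pose_lora)

print(torch.allclose(proj.weight, base, atol=1e-5))  # True: base model untouched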

Get this paper in your agent:

hf papers read 2603.24793

Don't have the latest CLI? Install it with:

curl -LsSf https://hf.co/cli/install.sh | bash
