LAT-Audio

Overview

LAT-Audio is a large audio-language model designed for precise temporal awareness in long-form audio understanding.

Unlike existing models, whose performance degrades on long audio, LAT-Audio introduces a progressive global-to-local reasoning paradigm that maintains temporal consistency on audio of up to 30 minutes.

The core idea is to first construct a global timeline that captures the temporal-semantic structure of the audio, and then to perform task-specific reasoning grounded in this timeline.
During reasoning, LAT-Audio iteratively incorporates audio evidence through a Think-With-Audio Chain-of-Thought (TWA-CoT) process, which significantly reduces two failure modes:

  • temporal hallucination (invalid timestamps)
  • timestamp drift (progressive misalignment over time)
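
To make the two failure modes concrete, here is a minimal sketch of how they could be measured on a set of predicted timestamps. It is not the evaluation protocol from the paper: the names `Span`, `hallucination_rate`, and `drift_slope` are hypothetical, "invalid" is assumed to mean out of bounds or start after end, and drift is assumed to be the least-squares slope of the timestamp error against position in the audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A predicted time span in seconds (hypothetical helper type)."""
    start: float
    end: float


def is_valid(span: Span, duration: float) -> bool:
    """Assumed validity rule: ordered and inside [0, duration]."""
    return 0.0 <= span.start <= span.end <= duration


def hallucination_rate(spans: list[Span], duration: float) -> float:
    """Fraction of predicted spans whose timestamps are invalid."""
    if not spans:
        return 0.0
    return sum(not is_valid(s, duration) for s in spans) / len(spans)


def drift_slope(reference: list[float], predicted: list[float]) -> float:
    """Least-squares slope of (predicted - reference) against reference time.

    A slope near 0 means the error does not grow with position in the audio;
    a positive slope means predictions fall progressively later.
    """
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two or more paired timestamps")
    errors = [p - r for r, p in zip(reference, predicted)]
    mean_t = sum(reference) / len(reference)
    mean_e = sum(errors) / len(errors)
    var_t = sum((t - mean_t) ** 2 for t in reference)
    if var_t == 0:
        raise ValueError("reference timestamps must not all be equal")
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(reference, errors))
    return cov / var_t
```

For a 30-minute clip (`duration=1800.0`), a span such as `Span(1700.0, 1900.0)` counts as invalid under this rule because it ends after the audio does.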

Model Description

LAT-Audio formulates long-form audio understanding as a structured reasoning process:

  1. Global Timeline Construction
    The model summarizes the audio into a coarse temporal structure.

  2. Global-to-Local Reasoning
    Downstream tasks are performed conditioned on the global timeline.

  3. Think-With-Audio Chain-of-Thought (TWA-CoT)
    The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.

This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.
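
The three steps above can be sketched as a simple control loop. This is an illustrative outline only, not the released inference code: `build_timeline`, `reason_step`, and `listen` are hypothetical callables standing in for model calls, and the `max_steps` cap and the clamping of requested segments to the audio bounds are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Segment:
    """A time range in seconds with a short description (hypothetical type)."""
    start: float
    end: float
    summary: str


# Hypothetical signatures for the three model-backed operations.
BuildTimeline = Callable[[float], list[Segment]]
ReasonStep = Callable[[str, list[Segment], list[str]], tuple[str, Optional[Segment]]]
Listen = Callable[[Segment], str]


def answer(
    question: str,
    duration: float,
    build_timeline: BuildTimeline,
    reason_step: ReasonStep,
    listen: Listen,
    max_steps: int = 4,
) -> str:
    # Step 1: coarse global timeline over the whole recording.
    timeline = build_timeline(duration)
    evidence: list[str] = []
    draft = ""
    for _ in range(max_steps):
        # Step 2: reason about the task conditioned on the timeline
        # and on any local evidence gathered so far.
        draft, request = reason_step(question, timeline, evidence)
        if request is None:  # the model asked for no further audio
            break
        # Step 3: fetch the requested local segment, clamped to the audio.
        start = max(0.0, request.start)
        end = min(duration, request.end)
        if start >= end:
            break
        evidence.append(listen(Segment(start, end, request.summary)))
    return draft
```

Keeping the timeline fixed while the evidence list grows mirrors the global-to-local idea: each local look-up is interpreted against the same global structure.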

Figure 1: Overall framework of LAT-Audio.

Model Variants

We provide two model variants:

| Model | Reasoning | Training Data | Description |
| --- | --- | --- | --- |
| LAT-Audio | Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference |
| LAT-Audio-Base | No | LAT-Chronicle + in-house | Direct modeling baseline fine-tuned from Qwen3-Omni with more in-house data, offering faster and simpler inference |

Quick Start

Download through Hugging Face

pip install -U "huggingface_hub[cli]"
huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio
huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base
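
The same download can be scripted from Python with `snapshot_download` from the `huggingface_hub` package. The following is a minimal sketch rather than an official loader: the helper names `local_dir_for` and `download_all` are hypothetical, and the target directories simply mirror the `--local-dir` values used above.

```python
from pathlib import Path

# Repository IDs taken from the CLI commands above.
REPO_IDS = ("mcshao/LAT-Audio", "mcshao/LAT-Audio-Base")


def local_dir_for(repo_id: str, root: str = ".") -> Path:
    """Map 'owner/name' to '<root>/name', matching the --local-dir layout."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = ".") -> list[str]:
    """Download every checkpoint and return the local snapshot paths."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id, root))
        for repo_id in REPO_IDS
    ]
```

Calling `download_all()` fetches both checkpoints into the current directory; the weights are large, so check available disk space first.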

For detailed inference methods and examples, please refer to the official repository: 👉 https://github.com/alanshaoTT/LAT-Audio-Repo

Citation

If you find this work useful, please cite:

@article{shao2026lataudio,
  title={Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding},
  author={Shao, Mingchen and Su, Hang and Tian, Wenjie and Mu, Bingshen and Lin, Zhennan and Fan, Lichun and Luo, Zhenbo and Luan, Jian and Xie, Lei},
  journal={arXiv preprint arXiv:2604.22245},
  year={2026}
}

Contact

For questions, feedback, or collaboration:

📧 mcshao@mail.nwpu.edu.cn

Model size: 35B params (Safetensors, tensor type BF16)