Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Unified Audio Schema is a holistic framework for audio supervision that disentangles and restructures supervision signals across transcription, paralinguistics, and non-linguistic events.

📄 Paper | 💻 GitHub

This repository provides our model checkpoints trained using Unified Audio Schema. For the complete codebase, please refer to the corresponding GitHub repository.

Model Details

| Attribute | Value |
|---|---|
| Input Modality | Text and audio |
| Output Modality | Text and audio |
| Base LLM | Qwen2.5-7B |
| Audio Encoder | AuT encoder |
| Input Audio Representation Frame Rate | 12.5 Hz |
| Output Audio Token Codebook Size | 8,192 |
| Output Audio Token Frame Rate | 25 Hz |
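The two frame rates in the table imply simple back-of-envelope token math. The sketch below is illustrative only (the constants come from the table above; the helper functions are not part of the model's API):

```python
# Rates from the model details table above.
INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frame rate
OUTPUT_TOKEN_RATE_HZ = 25.0  # output audio token frame rate

def input_frames(seconds: float) -> int:
    """Approximate number of input audio frames for a clip of the given length."""
    return round(seconds * INPUT_FRAME_RATE_HZ)

def output_audio_seconds(num_tokens: int) -> float:
    """Approximate audio duration reconstructed from generated audio tokens."""
    return num_tokens / OUTPUT_TOKEN_RATE_HZ

print(input_frames(10.0))        # 10 s of input audio -> 125 frames
print(output_audio_seconds(52))  # 52 generated tokens -> 2.08 s of audio
```

At 25 Hz, one chunk of 52 audio tokens (the chunk size used in the inference example below) corresponds to roughly two seconds of generated speech.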

Notes:

  • The model supports interleaved text and audio input/output, enabling flexible multimodal interactions.
  • Speech waveform reconstruction for generated audio tokens relies on the StableToken decoder.
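To make the interleaving note concrete, here is a toy de-interleaver for a stream that alternates fixed-size text and audio chunks. The chunk sizes (13 text / 52 audio) mirror the system prompt used in the Quick Start below, but this function is purely illustrative and not the model's actual output contract:

```python
def deinterleave(tokens, text_chunk=13, audio_chunk=52):
    """Split an interleaved stream into text and audio token lists.

    Assumes the stream strictly alternates text_chunk text tokens with
    audio_chunk audio tokens (an assumption for illustration only).
    """
    text, audio = [], []
    i = 0
    while i < len(tokens):
        text.extend(tokens[i:i + text_chunk])
        i += text_chunk
        audio.extend(tokens[i:i + audio_chunk])
        i += audio_chunk
    return text, audio

stream = list(range((13 + 52) * 2))  # two full (13 text + 52 audio) chunks
t, a = deinterleave(stream)
print(len(t), len(a))  # 26 104
```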

Quick Start

Installation

```shell
git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
cd Unified_Audio_Schema && pip install -r requirements.txt
```

Download Checkpoints

```shell
# Model weights
huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

# StableToken decoder (required for speech waveform reconstruction)
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken
```

Inference

```python
import torch
import torchaudio
from src.model import UASAudio

model = UASAudio(
    model_path="checkpoints/Unified_Audio_Schema",
    audio_decoder_path="checkpoints/StableToken/decoder",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

dialogue_system_prompt = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in an interleaved manner, "
    "with 13 text tokens followed by 52 audio tokens."
)

messages = [
    {"role": "system", "content": dialogue_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
        ],
    },
    {"role": "assistant", "content": None},
]

generation_config = {
    "max_new_tokens": 4096,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "top_p": 0.9,
    "do_sample": True,
}

_, text, audio_tokens = model(messages, **generation_config)
print(text)

# Reconstruct the waveform from generated audio tokens via the StableToken decoder.
if len(audio_tokens) > 0:
    audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
    torchaudio.save("response.wav", audio_array, sampling_rate)
```

Supported Scenarios

Our model can be applied to a wide range of audio understanding and generation tasks, including:

  • Text-input conversation
  • Speech-input conversation
  • Automatic Speech Recognition (ASR)
  • Audio captioning
  • Text-to-Speech (TTS)

For more runnable examples, please refer to example_usage.ipynb in the GitHub repository.

Evaluation Highlights

UAS-Audio demonstrates strong performance on audio understanding, ASR, and TTS benchmarks.

Audio Understanding

| Model | MMSU (Percep.) | MMSU (Reason.) | MMSU (Overall) | MMAR (Speech) | MMAR (Sound) | MMAR (Music) | MMAR (Overall) | MMAU (Speech) | MMAU (Sound) | MMAU (Music) | MMAU (Overall) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kimi-Audio | 44.8 | 75.7 | 59.8 | 58.5 | 49.7 | 33.0 | 48.0 | 62.2 | 75.7 | 66.8 | 68.2 | 58.7 |
| Qwen2.5-Omni | 42.7 | 77.6 | 58.1 | 59.9 | 58.8 | 40.8 | 56.7 | 70.6 | 78.1 | 65.9 | 71.5 | 62.1 |
| Step-Audio2 | 42.9 | 73.2 | 57.6 | 61.2 | 54.6 | 42.2 | 56.8 | 68.2 | 79.3 | 68.4 | 72.7 | 61.9 |
| Ours | 55.7 | 77.4 | 66.2 | 66.0 | 58.8 | 45.2 | 60.1 | 67.0 | 70.0 | 71.3 | 69.4 | 65.2 |

ASR & TTS

| Model | ASR (LS-clean) | ASR (AISHELL-1) | TTS (SeedTTS-en) | TTS (SeedTTS-zh) |
|---|---|---|---|---|
| Qwen2.5-Omni | - | - | 2.3 | 1.4 |
| Step-Audio2 | 1.9 | 1.0 | 2.1 | 3.2 |
| MiMo-Audio | 3.8 | 1.8 | 5.4 | 2.0 |
| Ours | 2.2 | 2.3 | 1.7 | 1.4 |

Citation

If you find Unified Audio Schema or our model useful for your research, please cite:

```bibtex
@misc{zhang2026transcriptionunifiedaudioschema,
    title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs}, 
    author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
    year={2026},
    eprint={2604.12506},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2604.12506},
}

@inproceedings{song2026stabletoken,
    title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
    author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=17DNmdQ9aU}
}
```

License

This project is licensed under the License Term of Unified_Audio_Schema.
