Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Unified Audio Schema (UAS) is a holistic framework for audio supervision that disentangles and restructures supervision across transcription, paralinguistics, and non-linguistic events.

📄 Paper | 💻 GitHub

This repository provides our model checkpoints trained using Unified Audio Schema. For the complete codebase, please refer to the corresponding GitHub repository.

Model Details

Attribute	Value
Input Modality	Text and audio
Output Modality	Text and audio
Base LLM	Qwen2.5-7B
Audio Encoder	AuT encoder
Input Audio Representation Frame Rate	12.5 Hz
Output Audio Token Codebook Size	8,192
Output Audio Token Frame Rate	25 Hz

Notes:

The model supports interleaved text and audio input/output, enabling flexible multimodal interactions.
Speech waveform reconstruction for generated audio tokens relies on the StableToken decoder.
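
The frame rates and codebook size in the table above imply a few back-of-the-envelope quantities. The sketch below is plain arithmetic on those three numbers. The function names are illustrative and not part of the repository, and the bitrate estimate assumes one codebook index per output frame, which the table does not state explicitly.

```python
import math

INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frame rate (from the table)
OUTPUT_FRAME_RATE_HZ = 25.0  # output audio token frame rate (from the table)
CODEBOOK_SIZE = 8192         # output audio token codebook size (from the table)


def input_frames(duration_s: float) -> int:
    """Approximate number of encoder frames for `duration_s` seconds of input audio."""
    return math.ceil(duration_s * INPUT_FRAME_RATE_HZ)


def output_duration_s(num_audio_tokens: int) -> float:
    """Approximate duration of speech covered by `num_audio_tokens` generated tokens."""
    return num_audio_tokens / OUTPUT_FRAME_RATE_HZ


def output_bitrate_bps() -> float:
    """Bits per second, assuming each frame carries a single codebook index."""
    return OUTPUT_FRAME_RATE_HZ * math.log2(CODEBOOK_SIZE)


if __name__ == "__main__":
    print(input_frames(10.0))       # encoder frames for a 10-second clip
    print(output_duration_s(250))   # seconds of speech in 250 output tokens
    print(output_bitrate_bps())     # bitrate of the output token stream
```

Since 8,192 is 2^13, each output token carries 13 bits under this assumption.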

Quick Start

Installation

git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
cd Unified_Audio_Schema && pip install -r requirements.txt

Download Checkpoints

# Model weights
huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

# StableToken decoder (required for speech waveform reconstruction)
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken

Inference

import torch
import torchaudio
from src.model import UASAudio

model = UASAudio(
    model_path="checkpoints/Unified_Audio_Schema",
    audio_decoder_path="checkpoints/StableToken/decoder",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

dialogue_system_prompt = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in a interleaved manner, "
    "with 13 text token followed by 52 audio tokens."
)

messages = [
    {"role": "system", "content": dialogue_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
        ],
    },
    {"role": "assistant", "content": None},
]

generation_config = {
    "max_new_tokens": 4096,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "top_p": 0.9,
    "do_sample": True
}

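# Run generation; the first return value is not used in this example.
_, text, audio_tokens = model(messages, **generation_config)
print(text)

# If the response contains audio tokens, decode them to a waveform
# with the StableToken decoder and save the result.
if len(audio_tokens) > 0:
    audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
    torchaudio.save("response.wav", audio_array, sampling_rate)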
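
The system prompt above asks for an interleaved response of 13 text tokens followed by 52 audio tokens. As a rough illustration of what that pattern implies, the sketch below lays out the chunk schedule for a reply of a given text length and estimates the speech duration from the 25 Hz output frame rate listed in Model Details. The function names are hypothetical, and the handling of the final partial text chunk is an assumption; the actual decoding logic lives in the repository.

```python
from typing import List, Tuple

TEXT_CHUNK = 13              # text tokens per chunk (from the system prompt)
AUDIO_CHUNK = 52             # audio tokens per chunk (from the system prompt)
OUTPUT_FRAME_RATE_HZ = 25.0  # output audio token frame rate (from Model Details)


def chunk_schedule(num_text_tokens: int) -> List[Tuple[str, int]]:
    """Return [("text", n), ("audio", m), ...] for a reply with `num_text_tokens`.

    Assumption: every text chunk, including a shorter final one, is followed
    by a full audio chunk. The real model may emit a different number of
    audio tokens at the end of a response.
    """
    schedule: List[Tuple[str, int]] = []
    remaining = num_text_tokens
    while remaining > 0:
        n = min(TEXT_CHUNK, remaining)
        schedule.append(("text", n))
        schedule.append(("audio", AUDIO_CHUNK))
        remaining -= n
    return schedule


def audio_seconds(schedule: List[Tuple[str, int]]) -> float:
    """Total speech duration implied by the audio chunks in `schedule`."""
    total_audio_tokens = sum(n for kind, n in schedule if kind == "audio")
    return total_audio_tokens / OUTPUT_FRAME_RATE_HZ


if __name__ == "__main__":
    plan = chunk_schedule(30)
    print(plan)
    print(audio_seconds(plan))
```

Under these assumptions, each 52-token audio chunk corresponds to about 2.08 seconds of speech (52 / 25).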

Supported Scenarios

Our model can be applied to a wide range of audio understanding and generation tasks, including:

Text-input conversation
Speech-input conversation
Automatic Speech Recognition (ASR)
Audio captioning
Text-to-Speech (TTS)

For more runnable examples, please refer to example_usage.ipynb in the GitHub repository.
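
For speech-input conversation, the message layout from the Inference example can be wrapped in a small helper so that only the audio path changes between calls. This is a convenience sketch: `build_speech_messages` is a hypothetical name, not a function from the repository, and it reproduces only the structure shown in the Inference example. Prompts and message formats for the other scenarios are not covered here and should be taken from example_usage.ipynb.

```python
from typing import Any, Dict, List

# System prompt copied verbatim from the Inference example.
DIALOGUE_SYSTEM_PROMPT = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in a interleaved manner, "
    "with 13 text token followed by 52 audio tokens."
)


def build_speech_messages(
    audio_path: str,
    system_prompt: str = DIALOGUE_SYSTEM_PROMPT,
) -> List[Dict[str, Any]]:
    """Build the three-turn message list used for speech-input conversation."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [{"type": "audio", "audio": audio_path}]},
        # Empty assistant turn for the model to fill in, as in the Inference example.
        {"role": "assistant", "content": None},
    ]


if __name__ == "__main__":
    for message in build_speech_messages("assets/give_me_a_brief_introduction_to_the_great_wall.wav"):
        print(message["role"])
```

The returned list can be passed as the first argument to the model call in the same way as `messages` in the Inference example.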

Evaluation Highlights

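UAS-Audio achieves the highest average score on the audio understanding benchmarks below (65.2, versus 62.1 for the next-best model), driven largely by MMSU perception (55.7 versus 44.8). It also has the lowest SeedTTS-en score and ties for the lowest SeedTTS-zh score among the compared models, and remains competitive on ASR. Audio understanding scores are higher-is-better; ASR and TTS scores are error rates, so lower is better.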

Audio Understanding

Model	MMSU (Percep.)	MMSU (Reason.)	MMSU (Overall)	MMAR (Speech)	MMAR (Sound)	MMAR (Music)	MMAR (Overall)	MMAU (Speech)	MMAU (Sound)	MMAU (Music)	MMAU (Overall)	Avg.
Kimi-Audio	44.8	75.7	59.8	58.5	49.7	33.0	48.0	62.2	75.7	66.8	68.2	58.7
Qwen2.5-Omni	42.7	77.6	58.1	59.9	58.8	40.8	56.7	70.6	78.1	65.9	71.5	62.1
Step-Audio2	42.9	73.2	57.6	61.2	54.6	42.2	56.8	68.2	79.3	68.4	72.7	61.9
Ours	55.7	77.4	66.2	66.0	58.8	45.2	60.1	67.0	70.0	71.3	69.4	65.2
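
To make the comparison concrete, the sketch below computes the margin between our model and the best baseline on each benchmark's Overall column. The scores are copied from the table above; the helper name is illustrative and not part of the repository.

```python
from typing import Dict

BENCHMARKS = ("MMSU", "MMAR", "MMAU")

# Overall scores copied from the table above, in BENCHMARKS order.
OVERALL = {
    "Kimi-Audio": (59.8, 48.0, 68.2),
    "Qwen2.5-Omni": (58.1, 56.7, 71.5),
    "Step-Audio2": (57.6, 56.8, 72.7),
    "Ours": (66.2, 60.1, 69.4),
}


def margin_over_best_baseline(ours: str = "Ours") -> Dict[str, float]:
    """Our Overall score minus the best baseline's Overall score, per benchmark."""
    margins = {}
    for i, name in enumerate(BENCHMARKS):
        best_baseline = max(
            scores[i] for model, scores in OVERALL.items() if model != ours
        )
        # Round to one decimal to match the table's precision.
        margins[name] = round(OVERALL[ours][i] - best_baseline, 1)
    return margins


if __name__ == "__main__":
    for benchmark, margin in margin_over_best_baseline().items():
        print(f"{benchmark}: {margin:+.1f}")
```

A positive margin means our model leads the best baseline on that benchmark; a negative one means a baseline leads.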

ASR & TTS

Model	ASR (LS-clean)	ASR (AISHELL-1)	TTS (SeedTTS-en)	TTS (SeedTTS-zh)
Qwen2.5-Omni	-	-	2.3	1.4
Step-Audio2	1.9	1.0	2.1	3.2
MiMo-Audio	3.8	1.8	5.4	2.0
Ours	2.2	2.3	1.7	1.4

Citation

If you find Unified Audio Schema or our model useful for your research, please cite:

@misc{zhang2026transcriptionunifiedaudioschema,
    title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs}, 
    author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
    year={2026},
    eprint={2604.12506},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2604.12506},
}

@inproceedings{song2026stabletoken,
    title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
    author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=17DNmdQ9aU}
}