Qwen2-Audio + PCLM + DPO
PCLM- and DPO-finetuned Qwen2-Audio-7B-Instruct from the paper "Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox" (ICML 2026).
The base model is augmented with the Prompt-Conditioned Layer Mixer (PCLM), a lightweight module that adaptively mixes representations from intermediate audio-encoder layers based on the user prompt. It is then post-trained with Direct Preference Optimization (DPO) to prefer acoustically grounded answers over language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
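To make the two components described above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `mix_layers`, `layer_keys`, and the dot-product gating are hypothetical stand-ins for however PCLM actually scores encoder layers, and `dpo_loss` is the standard per-pair DPO objective with an assumed `beta` of 0.1.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(prompt_vec, layer_keys, layer_feats):
    """Prompt-conditioned weighted sum of per-layer audio features.

    prompt_vec:  prompt embedding (list of floats).
    layer_keys:  one hypothetical learned key vector per exposed layer.
    layer_feats: one feature vector per exposed layer, all the same length.
    Returns (mixed_features, layer_weights).
    """
    # Score each layer by its key's dot product with the prompt embedding.
    scores = [sum(p * k for p, k in zip(prompt_vec, key)) for key in layer_keys]
    weights = softmax(scores)
    dim = len(layer_feats[0])
    mixed = [
        sum(w * feat[i] for w, feat in zip(weights, layer_feats))
        for i in range(dim)
    ]
    return mixed, weights


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written as a stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the chosen answer would be the acoustically grounded option and the rejected answer the language-implied one; the loss shrinks as the policy favors the chosen answer more strongly than the reference model does.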
Usage
This checkpoint cannot be loaded with stock transformers: PCLM requires the custom
modeling code shipped in the release repo. Clone the repo and install the dependencies first:
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox
conda create -n qwen2audio python=3.10 -y && conda activate qwen2audio
pip install torch torchaudio transformers accelerate librosa soundfile
Run inference on VoxParadox (or any MCQ JSON file in the same schema):
python -m qwen2audio.eval.run_eval \
--model_path IHP-Lab/Qwen2-Audio_PCLM_DPO \
--data_path /path/to/voxparadox.json \
--audio_base /path/to/audio_root \
--output_dir runs/eval/qwen2audio_pclm_dpo
Score with the dataset-shipped eval.py:
python eval.py --predictions runs/eval/qwen2audio_pclm_dpo/predictions.jsonl
The loader auto-detects use_pclm=True from config.json and activates PCLM with
expose_layers=[5, 15, 25, 30] over the audio encoder.
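If you want to confirm that a downloaded checkpoint has PCLM enabled before launching a long evaluation, a quick check along these lines may help. This is only a sketch: the helper name `pclm_settings` is hypothetical, and it assumes `use_pclm` and `expose_layers` are top-level keys in config.json, which may not match the actual layout. The loader in the release repo remains the authority.

```python
import json


def pclm_settings(config_text):
    """Return (use_pclm, expose_layers) parsed from the text of a config.json.

    Assumes both keys sit at the top level of the JSON object (an assumption,
    not confirmed by the release repo).
    """
    cfg = json.loads(config_text)
    use_pclm = bool(cfg.get("use_pclm", False))
    # Report exposed layers only when PCLM is switched on.
    layers = list(cfg.get("expose_layers", [])) if use_pclm else []
    return use_pclm, layers


# Illustrative config fragment using the values quoted above.
example = '{"use_pclm": true, "expose_layers": [5, 15, 25, 30]}'
print(pclm_settings(example))  # (True, [5, 15, 25, 30])
```

For a real checkpoint, read the text from the config.json file in the downloaded model directory and pass it to the helper.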
Project resources
| Resource | Link |
|---|---|
| Paper (arXiv) | https://arxiv.org/abs/2605.27772 |
| Project page | https://voxparadox.github.io/ |
| Code | https://github.com/ihp-lab/VoxParadox |
| Benchmark | https://huggingface.co/datasets/IHP-Lab/VoxParadox |
| Sibling model (AF3) | https://huggingface.co/IHP-Lab/AF3_PCLM_DPO |
Citation
@inproceedings{pang2026voxparadox,
title = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
author = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026}
}
License
USC Research License (research / non-profit only). See LICENSE.
The base model (Qwen/Qwen2-Audio-7B-Instruct) carries its own Tongyi Qianwen license terms,
which continue to apply to the inherited weights.