Audio Flamingo 3 + PCLM + DPO
PCLM- and DPO-finetuned Audio Flamingo 3 from "Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox" (ICML 2026).
The base model is augmented with the Prompt-Conditioned Layer Mixer (PCLM), a lightweight module that adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt, and is then post-trained with Direct Preference Optimization (DPO) to prefer acoustically grounded answers over language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
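For readers unfamiliar with DPO, the per-pair objective can be sketched as below. This is the standard DPO loss, not code from the release repo; the helper name `dpo_pair_loss`, the default `beta`, and the idea of passing summed log-probabilities as plain floats are illustrative assumptions. In this setting the "chosen" response would be the acoustically grounded answer and the "rejected" response the language-implied one.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Inputs are summed token log-probabilities of each answer under the
    trainable policy and under the frozen reference model. The beta
    default is an assumption, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # answer over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss equals log 2 when the policy and reference agree, and shrinks as the policy shifts probability toward the chosen answer.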
Layout
Unlike a stock HF model, AF3 ships its weights split across subfolders:
.
├── config.json                           # top-level AF3 config (PCLM fields included)
├── llm/                                  # frozen + DPO-tuned Qwen2 LLM
├── sound_tower/                          # AF-Whisper audio encoder
├── sound_mm_projector/                   # final-layer audio→LLM projector
├── sound_mid_mm_projector_{5,15,25,30}/  # intermediate-layer projectors (PCLM)
├── sound_pclm/                           # BERT-small prompt encoder + gate MLP
└── tokenizer files (vocab.json, merges.txt, …)
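Because the weights are split across subfolders, a quick sanity check after downloading can catch an incomplete copy. The sketch below is not part of the release repo; `missing_entries` is a hypothetical helper, and reading `{5,15,25,30}` as four separate folders is an assumption based on shell-style brace expansion.

```python
from pathlib import Path

# Folder names taken from the layout above. Expanding {5,15,25,30} into
# four separate folders is an assumption (shell-style brace expansion).
EXPECTED_DIRS = [
    "llm",
    "sound_tower",
    "sound_mm_projector",
    *[f"sound_mid_mm_projector_{i}" for i in (5, 15, 25, 30)],
    "sound_pclm",
]


def missing_entries(checkpoint_dir):
    """Return the expected folders/files absent from a local checkpoint copy."""
    root = Path(checkpoint_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    if not (root / "config.json").is_file():
        missing.append("config.json")
    return missing
```

Calling `missing_entries("/path/to/AF3_PCLM_DPO")` on a complete local copy should return an empty list; tokenizer files are not checked because their full list is not given here.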
Usage
This checkpoint cannot be loaded with stock transformers; AF3 + PCLM requires the
custom modeling code shipped in the release repo.
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox/af3/audio-flamingo
bash environment_setup.sh af3
conda activate af3
Inference on VoxParadox (or any MCQ JSON in the same schema):
bash scripts/eval_voxparadox.sh \
IHP-Lab/AF3_PCLM_DPO \
/path/to/voxparadox.json \
/path/to/audio_root \
runs/eval/af3_pclm_dpo
Score with the dataset-shipped eval.py:
python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl
PCLM activation is read from this checkpoint's config.json
(expose_layers=[5, 15, 25, 30], use_sound_pclm=true).
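To confirm that a local copy of the checkpoint has PCLM enabled before running evaluation, the two fields can be read directly from `config.json`. This is a minimal sketch rather than repo code: `pclm_settings` is a hypothetical helper, and it assumes both fields are top-level keys in the JSON file.

```python
import json
from pathlib import Path


def pclm_settings(config_path):
    """Read the PCLM-related fields from an AF3 config.json.

    Assumes `use_sound_pclm` and `expose_layers` are top-level keys;
    a missing key is returned as None rather than raising.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "use_sound_pclm": config.get("use_sound_pclm"),
        "expose_layers": config.get("expose_layers"),
    }
```

For this checkpoint the expected values are `use_sound_pclm` set to true and `expose_layers` equal to `[5, 15, 25, 30]`.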
Project resources
| Resource | Link |
|---|---|
| Paper (arXiv) | https://arxiv.org/abs/2605.27772 |
| Project page | https://voxparadox.github.io/ |
| Code | https://github.com/ihp-lab/VoxParadox |
| Benchmark | https://huggingface.co/datasets/IHP-Lab/VoxParadox |
| Sibling model (Qwen2-Audio) | https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO |
Citation
@inproceedings{pang2026voxparadox,
title = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
author = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026}
}
License
USC Research License (research / non-profit only). See LICENSE.
The base model (nvidia/audio-flamingo-3) carries the NVIDIA non-commercial license
terms, which continue to apply to the inherited weights.