---
license: other
license_name: usc-research
license_link: LICENSE
language:
- en
base_model: nvidia/audio-flamingo-3
tags:
- audio
- speech
- audio-llm
- paralinguistic
- pclm
- dpo
- voxparadox
pipeline_tag: audio-text-to-text
---

# Audio Flamingo 3 + PCLM + DPO

[![ICML 2026](https://img.shields.io/badge/ICML-2026-1d4ed8.svg)](https://icml.cc/Conferences/2026)
[![Paper](https://img.shields.io/badge/Paper-arXiv-AD1C18.svg)](https://arxiv.org/abs/2605.27772)
[![Project Page](https://img.shields.io/badge/Project-Page-0EA5E9.svg)](https://voxparadox.github.io/)
[![Code](https://img.shields.io/badge/GitHub-ihp--lab%2FVoxParadox-181717.svg?logo=github)](https://github.com/ihp-lab/VoxParadox)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-IHP--Lab%2FVoxParadox-FFD21E.svg)](https://huggingface.co/datasets/IHP-Lab/VoxParadox)
[![Qwen2-Audio + PCLM + DPO](https://img.shields.io/badge/🤗%20Sibling%20model-Qwen2--Audio+PCLM+DPO-FFD21E.svg)](https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO)
[![License](https://img.shields.io/badge/License-USC%20Research-228B22.svg)](LICENSE)

PCLM- and DPO-finetuned [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3) from *Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox* (ICML 2026).

The base model is augmented with the **Prompt-Conditioned Layer Mixer (PCLM)**, a lightweight module that adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt. It is then post-trained with **Direct Preference Optimization (DPO)** to prefer acoustically grounded answers over language-implied alternatives on paralinguistic multiple-choice questions (MCQs).

## Layout

Unlike a stock Hugging Face model, AF3 ships its weights split across subfolders:

```
.
├── config.json                            # top-level AF3 config (PCLM fields included)
├── llm/                                   # frozen + DPO-tuned Qwen2 LLM
├── sound_tower/                           # AF-Whisper audio encoder
├── sound_mm_projector/                    # final-layer audio→LLM projector
├── sound_mid_mm_projector_{5,15,25,30}/   # intermediate-layer projectors (PCLM)
├── sound_pclm/                            # BERT-small prompt encoder + gate MLP
└── tokenizer files (vocab.json, merges.txt, …)
```

## Usage

This checkpoint cannot be loaded with stock `transformers`: AF3 + PCLM requires the custom modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).

```bash
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox/af3/audio-flamingo
bash environment_setup.sh af3
conda activate af3
```

Inference on VoxParadox (or any MCQ JSON in the same schema):

```bash
bash scripts/eval_voxparadox.sh \
    IHP-Lab/AF3_PCLM_DPO \
    /path/to/voxparadox.json \
    /path/to/audio_root \
    runs/eval/af3_pclm_dpo
```

Score with the dataset-shipped `eval.py`:

```bash
python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl
```

PCLM activation is read from this checkpoint's `config.json` (`expose_layers=[5, 15, 25, 30]`, `use_sound_pclm=true`).

## Project resources

| Resource | Link |
|---|---|
| Paper (arXiv) | https://arxiv.org/abs/2605.27772 |
| Project page | https://voxparadox.github.io/ |
| Code | https://github.com/ihp-lab/VoxParadox |
| Benchmark | https://huggingface.co/datasets/IHP-Lab/VoxParadox |
| Sibling model (Qwen2-Audio) | https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO |

## Citation

```bibtex
@inproceedings{pang2026voxparadox,
  title     = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author    = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```

## License

USC Research License (research / non-profit only). See [`LICENSE`](LICENSE). The base model (`nvidia/audio-flamingo-3`) carries the NVIDIA non-commercial license terms, which continue to apply to the inherited weights.
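
## Checking a local copy

Because the weights are split across subfolders and PCLM is switched on through `config.json`, it can help to sanity-check a downloaded copy before running evaluation. The following is a minimal sketch, not part of the release repo: the helper name `check_checkpoint` is hypothetical, and it assumes that `use_sound_pclm` and `expose_layers` sit at the top level of `config.json`, which this card does not state explicitly.

```python
import json
from pathlib import Path

# Subfolders named in the Layout section of this card.
EXPECTED_DIRS = [
    "llm",
    "sound_tower",
    "sound_mm_projector",
    "sound_pclm",
] + [f"sound_mid_mm_projector_{i}" for i in (5, 15, 25, 30)]


def check_checkpoint(root):
    """Return a list of problems found under `root` (empty list if none).

    Hypothetical helper for illustration; not shipped with VoxParadox.
    """
    root = Path(root)
    problems = [
        f"missing folder: {name}/"
        for name in EXPECTED_DIRS
        if not (root / name).is_dir()
    ]

    config_path = root / "config.json"
    if not config_path.is_file():
        problems.append("missing file: config.json")
        return problems

    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: both PCLM fields are top-level keys in config.json.
    if config.get("use_sound_pclm") is not True:
        problems.append("use_sound_pclm is not true")
    if config.get("expose_layers") != [5, 15, 25, 30]:
        problems.append(f"unexpected expose_layers: {config.get('expose_layers')}")
    return problems
```

Calling `check_checkpoint("/path/to/AF3_PCLM_DPO")` on a complete copy should return an empty list. Any returned string names the folder or config field to inspect.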
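
## PCLM gating, illustrated

This card describes PCLM as a prompt encoder plus a gate MLP that mixes intermediate encoder layers according to the user prompt. The toy sketch below shows one way such prompt-conditioned mixing can work. It is not the released implementation: the single linear gate, the softmax normalisation, the plain weighted sum, and every number are assumptions chosen for illustration. It also leaves open how the final-layer projector output is combined with the mix, since this card does not say.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(prompt_embedding, gate_matrix, gate_bias):
    """Map a prompt embedding to one mixing weight per tapped encoder layer.

    Assumption: a single linear layer followed by softmax stands in for
    the gate MLP mentioned in this card.
    """
    scores = [
        sum(w * x for w, x in zip(row, prompt_embedding)) + b
        for row, b in zip(gate_matrix, gate_bias)
    ]
    return softmax(scores)


def mix_layers(layer_features, weights):
    """Weighted sum of per-layer feature vectors of equal length."""
    dim = len(layer_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


# Toy inputs: four tapped layers (standing in for layers 5, 15, 25, 30),
# 3-dimensional features, and a 2-dimensional prompt embedding.
layer_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
prompt_embedding = [0.5, -0.5]
gate_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
gate_bias = [0.0, 0.0, 0.0, 0.0]

weights = gate_weights(prompt_embedding, gate_matrix, gate_bias)
mixed = mix_layers(layer_features, weights)
```

A different prompt embedding yields different `weights`, so the same audio features are blended differently depending on what the prompt asks for. That is the behaviour the name "prompt-conditioned" refers to.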