---
license: other
license_name: usc-research
license_link: LICENSE
language:
- en
base_model: nvidia/audio-flamingo-3
tags:
- audio
- speech
- audio-llm
- paralinguistic
- pclm
- dpo
- voxparadox
pipeline_tag: audio-text-to-text
---

# Audio Flamingo 3 + PCLM + DPO

This checkpoint is the PCLM- and DPO-finetuned [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3) (AF3) from
*Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox*
(ICML 2026).

The base model is augmented with the **Prompt-Conditioned Layer Mixer (PCLM)**, a lightweight module that
adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt. It is
then post-trained with **Direct Preference Optimization (DPO)** to prefer acoustically grounded answers over
language-implied alternatives on paralinguistic multiple-choice questions (MCQs).

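To make the prompt-conditioned mixing idea concrete, here is a minimal sketch in plain Python. It assumes that a gate maps the prompt embedding to one logit per exposed encoder layer and that the mixed feature is a softmax-weighted sum of the per-layer features. The function names, the linear gate, the toy shapes, and the softmax rule are all illustrative assumptions, not the released implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_logits(prompt_embedding, gate_weights):
    """Hypothetical linear gate: one logit per exposed layer."""
    return [sum(w * p for w, p in zip(row, prompt_embedding)) for row in gate_weights]

def mix_layers(layer_features, prompt_embedding, gate_weights):
    """Softmax-weighted sum of per-layer feature vectors (assumed mixing rule)."""
    weights = softmax(gate_logits(prompt_embedding, gate_weights))
    dim = len(layer_features[0])
    mixed = [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]
    return mixed, weights

# Toy example: 4 exposed layers, 2-dim features, 3-dim prompt embedding.
layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
prompt_embedding = [0.5, -0.5, 1.0]
gate_weights = [[0.0, 0.0, 0.0] for _ in range(4)]  # zero gate gives uniform mixing
mixed, weights = mix_layers(layer_features, prompt_embedding, gate_weights)
print(weights)
print(mixed)
```

With a zero gate every layer receives the same weight; a trained gate would instead shift weight toward the layers most useful for the given prompt.
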
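The DPO post-training step mentioned above can be illustrated with the standard DPO objective for a single preference pair: the loss is the negative log-sigmoid of `beta` times the difference between the policy-versus-reference log-ratio of the chosen answer and that of the rejected answer. The sketch below is a generic illustration; the `beta` default and the log-probabilities in the usage lines are made-up values, and the training code in the release repo may differ.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen (acoustically grounded) answer.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

The loss falls below `log(2)` once the policy favours the chosen answer more strongly than the reference model does.
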
## Layout

Unlike a stock Hugging Face model, AF3 ships its weights split across subfolders:

```
.
├── config.json                              # top-level AF3 config (PCLM fields included)
├── llm/                                     # frozen + DPO-tuned Qwen2 LLM
├── sound_tower/                             # AF-Whisper audio encoder
├── sound_mm_projector/                      # final-layer audio→LLM projector
├── sound_mid_mm_projector_{5,15,25,30}/     # intermediate-layer projectors (PCLM)
├── sound_pclm/                              # BERT-small prompt encoder + gate MLP
└── tokenizer files (vocab.json, merges.txt, …)
```

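After downloading the checkpoint, a quick sanity check of the folder layout can catch an incomplete copy before the model is loaded. The sketch below assumes the brace pattern in the listing expands to four folders named `sound_mid_mm_projector_5` through `sound_mid_mm_projector_30`; `missing_entries` is a hypothetical helper, not part of the release code, and it skips the tokenizer files because the listing does not enumerate them fully.

```python
from pathlib import Path

# Entries copied from the layout listing; the brace pattern is expanded by hand
# (assumption: one folder per exposed layer).
EXPECTED_ENTRIES = [
    "config.json",
    "llm",
    "sound_tower",
    "sound_mm_projector",
    *[f"sound_mid_mm_projector_{layer}" for layer in (5, 15, 25, 30)],
    "sound_pclm",
]

def missing_entries(checkpoint_dir):
    """Return the expected files or folders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("/path/to/checkpoint")` returns an empty list when every listed entry is present.
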
## Usage

This checkpoint cannot be loaded with stock `transformers`: AF3 + PCLM requires the
custom modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).
Clone the repo and set up its environment first:

```bash
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox/af3/audio-flamingo
bash environment_setup.sh af3
conda activate af3
```

Run inference on VoxParadox (or any MCQ JSON in the same schema):

```bash
bash scripts/eval_voxparadox.sh \
    IHP-Lab/AF3_PCLM_DPO \
    /path/to/voxparadox.json \
    /path/to/audio_root \
    runs/eval/af3_pclm_dpo
```

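For scripted runs over several checkpoints or benchmark files, the same call can be wrapped in Python. This is a minimal sketch: the positional order is copied from the command above, while the parameter names (`checkpoint`, `benchmark_json`, `audio_root`, `output_dir`) are guesses at what each position means, and both helpers are hypothetical. `run_eval` is assumed to be called from the directory entered during setup.

```python
import subprocess

def build_eval_command(checkpoint, benchmark_json, audio_root, output_dir):
    """Assemble the argument list for scripts/eval_voxparadox.sh.

    The order mirrors the shell example; the parameter names are assumptions.
    """
    return ["bash", "scripts/eval_voxparadox.sh",
            checkpoint, benchmark_json, audio_root, output_dir]

def run_eval(checkpoint, benchmark_json, audio_root, output_dir):
    """Run the evaluation script and raise CalledProcessError on a non-zero exit."""
    cmd = build_eval_command(checkpoint, benchmark_json, audio_root, output_dir)
    subprocess.run(cmd, check=True)

cmd = build_eval_command(
    "IHP-Lab/AF3_PCLM_DPO",
    "/path/to/voxparadox.json",
    "/path/to/audio_root",
    "runs/eval/af3_pclm_dpo",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.
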
Score the predictions with the `eval.py` script shipped with the dataset:

```bash
python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl
```

Whether PCLM is active is read from this checkpoint's `config.json`
(`expose_layers=[5, 15, 25, 30]`, `use_sound_pclm=true`).

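To confirm that a local copy of the checkpoint has PCLM switched on, the two fields named above can be read directly from `config.json`. The sketch assumes both fields are top-level keys of that file; `pclm_settings` is a hypothetical helper, not part of the release code.

```python
import json
from pathlib import Path

def pclm_settings(checkpoint_dir):
    """Read the PCLM-related fields from config.json in checkpoint_dir."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return {
        # Missing keys fall back to "PCLM off" defaults (an assumption).
        "use_sound_pclm": config.get("use_sound_pclm", False),
        "expose_layers": config.get("expose_layers", []),
    }
```

For this checkpoint the expected result is `use_sound_pclm` set to true and `expose_layers` equal to `[5, 15, 25, 30]`.
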
## Project resources

| Resource | Link |
|---|---|
| Paper (arXiv) | <https://arxiv.org/abs/2605.27772> |
| Project page | <https://voxparadox.github.io/> |
| Code | <https://github.com/ihp-lab/VoxParadox> |
| Benchmark | <https://huggingface.co/datasets/IHP-Lab/VoxParadox> |
| Sibling model (Qwen2-Audio) | <https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO> |

## Citation

```bibtex
@inproceedings{pang2026voxparadox,
  title     = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author    = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```

## License

Released under the USC Research License (research and non-profit use only). See [`LICENSE`](LICENSE).

The base model (`nvidia/audio-flamingo-3`) carries the NVIDIA non-commercial license
terms, which continue to apply to the inherited weights.