Audio Flamingo 3 + PCLM + DPO

PCLM- and DPO-finetuned Audio Flamingo 3 from "Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox" (ICML 2026).

The base model is augmented with the Prompt-Conditioned Layer Mixer (PCLM), a lightweight module that adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt. It is then post-trained with Direct Preference Optimization (DPO) to prefer acoustically grounded answers over language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
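For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO objective in plain Python. It is an illustration of the general technique, not the training code from the release repo. The function names and the value `beta=0.1` are assumptions made for this example; the card does not state the hyperparameters used. Here "chosen" stands for the acoustically grounded answer and "rejected" for the language-implied one.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper or repo
) -> float:
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed token log-probability of an answer under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen margin exceeds the rejected margin.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy and reference agree on both answers the loss equals log 2, and it falls below that once the policy assigns relatively more probability to the chosen answer.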

Layout

Unlike a stock HF model, AF3 ships its weights split across subfolders:

.
├── config.json                    # top-level AF3 config (PCLM fields included)
├── llm/                            # frozen + DPO-tuned Qwen2 LLM
├── sound_tower/                    # AF-Whisper audio encoder
├── sound_mm_projector/             # final-layer audio→LLM projector
├── sound_mid_mm_projector_{5,15,25,30}/   # intermediate-layer projectors (PCLM)
├── sound_pclm/                     # BERT-small prompt encoder + gate MLP
└── tokenizer files (vocab.json, merges.txt, …)
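To give a rough intuition for how the projectors and the gate in this layout could fit together, here is a minimal sketch of prompt-conditioned layer mixing. It rests on several assumptions that this card does not confirm: that the gate produces one scalar logit per layer, that the logits are normalized with a softmax, and that the mixed feature is a weighted sum of the projected layer outputs. The names `mix_layers`, `layer_5`, and `final` are hypothetical, and the prompt encoder that would produce the logits is left out. The modeling code in the release repo is the authoritative reference.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(
    projected: Dict[str, List[float]],
    gate_logits: Dict[str, float],
) -> List[float]:
    """Weighted sum of per-layer features (assumed mixing rule).

    projected   maps a layer name to its projected feature vector.
    gate_logits maps the same layer names to prompt-dependent scores,
                which in a real model would come from the gate MLP.
    """
    names = sorted(projected)
    weights = softmax([gate_logits[name] for name in names])
    dim = len(projected[names[0]])
    mixed = [0.0] * dim
    for name, weight in zip(names, weights):
        feats = projected[name]
        if len(feats) != dim:
            raise ValueError(f"feature size mismatch for {name}")
        for i, value in enumerate(feats):
            mixed[i] += weight * value
    return mixed
```

Under these assumptions, equal gate scores reduce to a plain average of the layers, while a prompt that strongly favors one layer makes the output track that layer almost exclusively.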

Usage

This checkpoint cannot be loaded with stock transformers: AF3 + PCLM requires the custom modeling code shipped in the release repo.

git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox/af3/audio-flamingo
bash environment_setup.sh af3
conda activate af3

Inference on VoxParadox (or any MCQ JSON in the same schema):

bash scripts/eval_voxparadox.sh \
    IHP-Lab/AF3_PCLM_DPO \
    /path/to/voxparadox.json \
    /path/to/audio_root \
    runs/eval/af3_pclm_dpo

Score with the dataset-shipped eval.py:

python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl

PCLM activation is read from this checkpoint's config.json (expose_layers=[5, 15, 25, 30], use_sound_pclm=true).
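Before running evaluation on a local copy of the checkpoint, it can help to confirm that the PCLM settings and subfolders are in place. The helper below is a hypothetical sanity check, not part of the release repo. It assumes that `expose_layers` and `use_sound_pclm` are top-level keys in config.json and that the `sound_mid_mm_projector_{5,15,25,30}` entry in the layout expands to four separate directories.

```python
import json
from pathlib import Path
from typing import List

# Subfolder names taken from the layout in this card; the brace
# expansion into four projector directories is an assumption.
EXPECTED_SUBDIRS = [
    "llm",
    "sound_tower",
    "sound_mm_projector",
    "sound_pclm",
] + [f"sound_mid_mm_projector_{i}" for i in (5, 15, 25, 30)]


def check_checkpoint(root: str) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    root_path = Path(root)
    config_path = root_path / "config.json"
    if not config_path.is_file():
        return ["missing config.json"]

    config = json.loads(config_path.read_text(encoding="utf-8"))
    problems = []
    # Assumes both PCLM fields sit at the top level of config.json.
    if config.get("use_sound_pclm") is not True:
        problems.append("use_sound_pclm is not true")
    if config.get("expose_layers") != [5, 15, 25, 30]:
        problems.append(f"unexpected expose_layers: {config.get('expose_layers')}")
    for name in EXPECTED_SUBDIRS:
        if not (root_path / name).is_dir():
            problems.append(f"missing subfolder {name}/")
    return problems
```

Calling `check_checkpoint("/path/to/AF3_PCLM_DPO")` on a downloaded snapshot and printing the result gives a quick list of anything that differs from what this card describes.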

Citation

@inproceedings{pang2026voxparadox,
  title     = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author    = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}

License

USC Research License (research / non-profit only). See LICENSE.

The base model (nvidia/audio-flamingo-3) carries the NVIDIA non-commercial license terms, which continue to apply to the inherited weights.
