	---
	license: other
	license_name: usc-research
	license_link: LICENSE
	language:
	- en
	base_model: nvidia/audio-flamingo-3
	tags:
	- audio
	- speech
	- audio-llm
	- paralinguistic
	- pclm
	- dpo
	- voxparadox
	pipeline_tag: audio-text-to-text
	---

	# Audio Flamingo 3 + PCLM + DPO

	[![ICML 2026](https://img.shields.io/badge/ICML-2026-1d4ed8.svg)](https://icml.cc/Conferences/2026)
	[![Paper](https://img.shields.io/badge/Paper-arXiv-AD1C18.svg)](https://arxiv.org/abs/2605.27772)
	[![Project Page](https://img.shields.io/badge/Project-Page-0EA5E9.svg)](https://voxparadox.github.io/)
	[![Code](https://img.shields.io/badge/GitHub-ihp--lab%2FVoxParadox-181717.svg?logo=github)](https://github.com/ihp-lab/VoxParadox)
	[![Dataset](https://img.shields.io/badge/🤗%20Dataset-IHP--Lab%2FVoxParadox-FFD21E.svg)](https://huggingface.co/datasets/IHP-Lab/VoxParadox)
	[![Qwen2-Audio + PCLM + DPO](https://img.shields.io/badge/🤗%20Sibling%20model-Qwen2--Audio+PCLM+DPO-FFD21E.svg)](https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO)
	[![License](https://img.shields.io/badge/License-USC%20Research-228B22.svg)](LICENSE)

	PCLM- and DPO-finetuned [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3), released with the paper
	*Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox*
	(ICML 2026).

	The base model is augmented with the Prompt-Conditioned Layer Mixer (PCLM), a lightweight module that
	adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt. It is
	then post-trained with Direct Preference Optimization (DPO) to prefer acoustically grounded answers over
	language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
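To make the mixing idea concrete, here is a minimal pure-Python sketch of prompt-conditioned layer mixing. It is an illustration under stated assumptions, not the released PCLM code: it assumes the gate produces one score per exposed layer, that scores are normalized with a softmax, and that the mixed representation is a weighted sum. The names `gate_weights`, `mix_layers`, and `gate_matrix` are hypothetical, and a single linear map stands in for the gate MLP.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(prompt_embedding: Vector, gate_matrix: Sequence[Vector]) -> List[float]:
    """Map a prompt embedding to one mixing weight per exposed layer.

    Assumption: one linear row per layer replaces the real gate MLP.
    """
    scores = [
        sum(w * x for w, x in zip(row, prompt_embedding))
        for row in gate_matrix
    ]
    return softmax(scores)


def mix_layers(layer_features: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted sum of per-layer feature vectors (all of equal length)."""
    dim = len(layer_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


# Toy example: four exposed layers, two-dimensional features.
prompt_embedding = [1.0, 0.0]
gate_matrix = [[0.0, 0.0]] * 4  # an all-zero gate yields uniform weights
layer_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]

weights = gate_weights(prompt_embedding, gate_matrix)
mixed = mix_layers(layer_features, weights)
print(weights, mixed)
```

In the actual model the features are sequences of projected encoder states rather than single vectors, and whether the final encoder layer takes part in the mix is not stated here; the sketch only shows how a prompt-dependent gate can reweight several layers.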

	## Layout

	Unlike a stock HF model, AF3 ships its weights split across subfolders:

	```
	.
	├── config.json # top-level AF3 config (PCLM fields included)
	├── llm/ # frozen + DPO-tuned Qwen2 LLM
	├── sound_tower/ # AF-Whisper audio encoder
	├── sound_mm_projector/ # final-layer audio→LLM projector
	├── sound_mid_mm_projector_{5,15,25,30}/ # intermediate-layer projectors (PCLM)
	├── sound_pclm/ # BERT-small prompt encoder + gate MLP
	└── tokenizer files (vocab.json, merges.txt, …)
	```
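After downloading the checkpoint, a quick sanity check of the folder layout can catch an incomplete copy before the evaluation script fails. The sketch below is a hypothetical helper, not part of the release repo. It assumes that `sound_mid_mm_projector_{5,15,25,30}` denotes four separate folders, one per listed layer, and it does not check the tokenizer files because their full list is not given.

```python
from pathlib import Path
from typing import List

# Entries taken from the layout listing; the per-layer projector folder
# names are an assumption derived from the brace notation.
EXPECTED_ENTRIES = [
    "config.json",
    "llm",
    "sound_tower",
    "sound_mm_projector",
    "sound_pclm",
] + [f"sound_mid_mm_projector_{layer}" for layer in (5, 15, 25, 30)]


def missing_entries(checkpoint_dir: str) -> List[str]:
    """Return the expected files or folders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("/path/to/checkpoint")` returns an empty list when every expected entry is present.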

	## Usage

	This checkpoint cannot be loaded with stock `transformers` — AF3 + PCLM requires the
	custom modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).

	```bash
	git clone https://github.com/ihp-lab/VoxParadox
	cd VoxParadox/af3/audio-flamingo
	bash environment_setup.sh af3
	conda activate af3
	```

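	Run inference on VoxParadox (or any MCQ JSON in the same schema). The positional arguments in this example are
	the checkpoint ID, the MCQ JSON file, the audio root directory, and the output directory: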

	```bash
	bash scripts/eval_voxparadox.sh \
	IHP-Lab/AF3_PCLM_DPO \
	/path/to/voxparadox.json \
	/path/to/audio_root \
	runs/eval/af3_pclm_dpo
	```

	Score with the dataset-shipped `eval.py`:

	```bash
	python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl
	```

	PCLM activation is read from this checkpoint's `config.json`
	(`expose_layers=[5, 15, 25, 30]`, `use_sound_pclm=true`).
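To confirm that a local copy of the checkpoint has PCLM enabled, the two fields can be read straight from `config.json`. This is a small sketch, assuming both keys sit at the top level of the file; `read_pclm_settings` and `pclm_is_active` are hypothetical helper names, not functions from the release repo.

```python
import json
from pathlib import Path
from typing import Any, Dict


def read_pclm_settings(config_path: str) -> Dict[str, Any]:
    """Read the PCLM-related fields from a checkpoint's config.json.

    Assumption: `use_sound_pclm` and `expose_layers` are top-level keys.
    Missing keys come back as None.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "use_sound_pclm": config.get("use_sound_pclm"),
        "expose_layers": config.get("expose_layers"),
    }


def pclm_is_active(settings: Dict[str, Any]) -> bool:
    """True when PCLM is switched on and at least one layer is exposed."""
    return settings["use_sound_pclm"] is True and bool(settings["expose_layers"])
```

For this checkpoint, `read_pclm_settings("/path/to/checkpoint/config.json")` should report `use_sound_pclm` as true and `expose_layers` as `[5, 15, 25, 30]` if the assumption about key placement holds.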

	## Project resources

	| Resource | Link |
	|---|---|
	| Paper (arXiv) | <https://arxiv.org/abs/2605.27772> |
	| Project page | <https://voxparadox.github.io/> |
	| Code | <https://github.com/ihp-lab/VoxParadox> |
	| Benchmark | <https://huggingface.co/datasets/IHP-Lab/VoxParadox> |
	| Sibling model (Qwen2-Audio) | <https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO> |

	## Citation

	```bibtex
	@inproceedings{pang2026voxparadox,
	title = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
	author = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
	booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
	year = {2026}
	}
	```

	## License

	USC Research License (research / non-profit only). See [`LICENSE`](LICENSE).

	The base model (`nvidia/audio-flamingo-3`) carries the NVIDIA non-commercial license
	terms, which continue to apply to the inherited weights.