---
title: ASID-Caption
emoji: 🦉
colorFrom: indigo
colorTo: gray
sdk: static
pinned: false
---

# ASID-Caption

[[🏠 Homepage](https://asid-caption.github.io/)] [[📖 arXiv Paper](https://arxiv.org/pdf/2602.13013)] [[🤗 Models & Datasets](https://huggingface.co/AudioVisual-Caption)] [[💻 Code](https://github.com/HVision-NKU/ASID-Caption)]

We build **ASID-Caption**, a data-and-model suite for **fine-grained audiovisual video understanding**. Our goal is to move beyond "one video → one generic caption" by providing **attribute-structured supervision** and **quality-verified annotations**, enabling models to produce **more complete, more controllable, and more temporally consistent** descriptions that cover both **visual content** and **audio cues**.

## What we release

- **ASID-1M**: a large-scale collection of **attribute-structured** audiovisual instructions, offered in both *single-attribute* and *all-attributes* training formats.
- **ASID-Verify**: a scalable curation pipeline that generates, ensembles, verifies, and refines annotations to improve semantic and temporal consistency.
- **ASID-Captioner**: Qwen2.5-Omni-based audiovisual captioning models fine-tuned on ASID-1M.

## Research interests

- Video understanding & video captioning
- Audio-visual learning
- Multimodal LLMs / instruction tuning
- Data curation, verification, and quality control