---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- CLEVRER
- MMVU
- Video-Holmes
- MVBench
- TempCompass
- Video-MME
language:
- en
library_name: transformers
license: mit
pipeline_tag: video-text-to-text
tags:
- video-understanding
- reasoning
- multimodal
- reinforcement-learning
- question-answering
---

# VideoThinker-R1-3B
VideoThinker is a causal-inspired framework that enables lightweight (3B-parameter) multimodal large language models to achieve robust video reasoning. It addresses "perceptual bias," a failure mode in which reinforcement learning pushes lightweight models to adopt perceptual shortcuts from the training data rather than develop genuine reasoning abilities.
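Since the model is fine-tuned from Qwen/Qwen2.5-VL-3B-Instruct, it should follow the standard Qwen2.5-VL inference interface in `transformers`. Below is a minimal sketch under that assumption; the repository id `falonss703/VideoThinker-R1-3B` and the video path are placeholders, not confirmed names.

```python
# Minimal inference sketch assuming the standard Qwen2.5-VL interface.
# The repo id and video path are placeholders (assumptions).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "falonss703/VideoThinker-R1-3B",  # hypothetical repo id
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("falonss703/VideoThinker-R1-3B")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4"},
        {"type": "text", "text": "Why did the object fall? Reason step by step."},
    ],
}]

# Build the chat prompt and extract the video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens from the decoded output.
out_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```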
The framework employs a two-stage debiasing process:
- Bias-Aware Training: Trains a dedicated "bias model" that deliberately embodies the shortcut behaviors.
- Causal Debiasing Policy Optimization (CDPO): Fine-tunes the primary model with a repulsive objective that pushes it away from the bias model's flawed logic (a conceptual sketch follows below).
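The paper's exact objective is not reproduced here; the following is a minimal PyTorch sketch of what a repulsive term can look like, where `cdpo_loss`, `beta`, and all tensor names are assumptions. The idea: a standard policy-gradient loss minus a scaled KL divergence to the frozen bias model, so minimizing the loss simultaneously maximizes distance from the bias model's distribution.

```python
# Minimal sketch of a repulsive debiasing objective (an assumption, not the
# paper's exact formulation): policy-gradient loss plus a KL term to a frozen
# bias model whose sign pushes the policy AWAY from the bias.
import torch
import torch.nn.functional as F


def cdpo_loss(policy_logits, bias_logits, action_ids, advantages, beta=0.1):
    """Shapes: logits (B, T, V), action_ids (B, T), advantages (B,)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    bias_logp = F.log_softmax(bias_logits, dim=-1)

    # Standard policy-gradient term on the sampled tokens.
    action_logp = logp.gather(-1, action_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    pg_loss = -(advantages.unsqueeze(-1) * action_logp).mean()

    # KL(policy || bias): grows as the policy's distribution leaves the bias model's.
    kl_to_bias = (logp.exp() * (logp - bias_logp)).sum(-1).mean()

    # Repulsive objective: minimizing this loss maximizes distance to the bias model.
    return pg_loss - beta * kl_to_bias


# Toy usage with random tensors (B=2 sequences, T=4 tokens, V=8 vocab).
B, T, V = 2, 4, 8
loss = cdpo_loss(
    policy_logits=torch.randn(B, T, V, requires_grad=True),
    bias_logits=torch.randn(B, T, V),  # bias model is frozen: no grad
    action_ids=torch.randint(0, V, (B, T)),
    advantages=torch.randn(B),
)
loss.backward()
```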
## Performance
VideoThinker-R1 establishes a new state-of-the-art in video reasoning efficiency. Using only 1K training samples and no Supervised Fine-Tuning (SFT), it:
- Surpasses VideoRFT-3B by 7% on Video-MME.
- Outperforms larger models (e.g., Video-UTR-7B) on reasoning-heavy benchmarks like MVBench and TempCompass.
## Resources
- Code: [falonss703/VideoThinker](https://github.com/falonss703/VideoThinker)
- Paper: Hugging Face Papers
## Citation
If you find this project useful in your research, please consider citing:
```bibtex
@inproceedings{wu2026videothinker,
  title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
  author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```