VideoThinker-R1-3B

Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

VideoThinker is a causal-inspired framework that enables lightweight multimodal large language models (MLLMs, ~3B parameters) to achieve robust video reasoning. It addresses "perceptual bias," a failure mode in which reinforcement learning pushes lightweight models to exploit perceptual shortcuts in the data rather than develop genuine reasoning abilities.

The framework employs a two-stage debiasing process:

  1. Bias-Aware Training: Forges a dedicated "bias model" that embodies the shortcut behaviors.
  2. Causal Debiasing Policy Optimization (CDPO): Fine-tunes the primary model using a repulsive objective to push it away from the bias model's flawed logic.
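The paper's exact CDPO objective is not reproduced here, but the idea of a "repulsive" objective can be sketched as a task loss minus a scaled divergence from the bias model's output distribution, so that matching the bias model is penalized. The function names, the KL choice, and the `repulsion_weight` parameter below are illustrative assumptions, not the published formulation:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete probability distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cdpo_loss(task_loss, policy_probs, bias_probs, repulsion_weight=0.1):
    """Hypothetical repulsive objective: minimizing this loss both fits the
    task and pushes the policy's answer distribution away from the bias
    model's, since a larger divergence lowers the total loss."""
    repulsion = kl_divergence(policy_probs, bias_probs)
    return task_loss - repulsion_weight * repulsion
```

When the policy's distribution equals the bias model's, the repulsion term vanishes and the loss reduces to the task loss; the more the policy diverges from the shortcut behavior, the lower the loss becomes.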

Performance

VideoThinker-R1 establishes a new state-of-the-art in video reasoning efficiency. Using only 1K training samples and no Supervised Fine-Tuning (SFT), it:

  • Surpasses VideoRFT-3B by 7% on VideoMME.
  • Outperforms larger models (e.g., Video-UTR-7B) on reasoning-heavy benchmarks like MVBench and TempCompass.

Citation

If you find this project useful in your research, please consider citing:

@inproceedings{wu2026videothinker,
  title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
  author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}