VideoThinker-R1-3B

Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

VideoThinker is a causal-inspired framework that enables lightweight multimodal large language models (MLLMs, ~3B parameters) to achieve robust video reasoning. It addresses "perceptual bias," a failure mode in which reinforcement learning pushes lightweight models to exploit perceptual shortcuts in the data rather than develop genuine reasoning abilities.

The framework employs a two-stage debiasing process:

  1. Bias-Aware Training: Forges a dedicated "bias model" that embodies the shortcut behaviors.
  2. Causal Debiasing Policy Optimization (CDPO): Fine-tunes the primary model using a repulsive objective to push it away from the bias model's flawed logic.
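The paper's exact CDPO objective is not reproduced here, but the idea of a "repulsive" objective can be sketched as a task loss minus a scaled divergence from the bias model's output distribution, so that matching the bias model is penalized. The function names, the KL choice, and the `repulsion_weight` parameter below are illustrative assumptions, not the published formulation:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete probability distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cdpo_loss(task_loss, policy_probs, bias_probs, repulsion_weight=0.1):
    """Hypothetical repulsive objective: minimizing this loss both fits the
    task and pushes the policy's answer distribution away from the bias
    model's, since a larger divergence lowers the total loss."""
    repulsion = kl_divergence(policy_probs, bias_probs)
    return task_loss - repulsion_weight * repulsion
```

When the policy's distribution equals the bias model's, the repulsion term vanishes and the loss reduces to the task loss; the more the policy diverges from the shortcut behavior, the lower the loss becomes.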

Performance

VideoThinker-R1 establishes a new state-of-the-art in video reasoning efficiency. Using only 1K training samples and no Supervised Fine-Tuning (SFT), it:

  • Surpasses VideoRFT-3B by 7% on VideoMME.
  • Outperforms larger models (e.g., Video-UTR-7B) on reasoning-heavy benchmarks like MVBench and TempCompass.

Citation

If you find this project useful in your research, please consider citing:

@inproceedings{wu2026videothinker,
  title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
  author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}