---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- CLEVRER
- MMVU
- Video-Holmes
- MVBench
- TempCompass
- Video-MME
language:
- en
library_name: transformers
license: mit
pipeline_tag: video-text-to-text
tags:
- video-understanding
- reasoning
- multimodal
- reinforcement-learning
- question-answering
---

# VideoThinker-R1-3B

[**Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs**](https://huggingface.co/papers/2605.01324)

VideoThinker is a causal-inspired framework that enables lightweight multimodal large language models (MLLMs, 3B parameters) to achieve robust video reasoning. It addresses "perceptual bias," a phenomenon in which reinforcement learning pushes lightweight models to adopt perceptual shortcuts from the training data rather than develop genuine reasoning ability.

The framework employs a two-stage debiasing process:
1. **Bias-Aware Training**: Trains a dedicated "bias model" to deliberately embody shortcut behaviors.
2. **Causal Debiasing Policy Optimization (CDPO)**: Fine-tunes the primary model with a repulsive objective that pushes it away from the bias model's flawed logic.
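
The repulsive idea behind step 2 can be illustrated with a toy sketch. This is an assumed, simplified form, not the paper's exact CDPO loss: a standard policy-gradient surrogate plus a term that *subtracts* the KL divergence from the bias model, so minimizing the loss drives the policy away from the bias model's distribution. The function names and the `beta` coefficient are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def repulsive_loss(policy_probs, bias_probs, action, advantage, beta=0.1):
    """Illustrative repulsive objective (hypothetical form, not the paper's):
    a policy-gradient surrogate minus beta * KL(policy || bias), so lower loss
    means the policy both reinforces rewarded actions and diverges from the
    bias model's shortcut distribution."""
    pg_term = -advantage * math.log(policy_probs[action])
    repulsion = -beta * kl_divergence(policy_probs, bias_probs)
    return pg_term + repulsion
```

With equal policy-gradient terms, a policy that diverges more from the bias model incurs a strictly lower loss, which is the "repulsion" effect described above.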

## Performance

VideoThinker-R1 sets a new state of the art in video-reasoning efficiency. Using only 1K training samples and no supervised fine-tuning (SFT), it:
- Surpasses VideoRFT-3B by 7% on Video-MME.
- Outperforms larger models (e.g., Video-UTR-7B) on reasoning-heavy benchmarks such as MVBench and TempCompass.
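
## Usage

Since the card declares `library_name: transformers` with a Qwen2.5-VL-3B base, inference presumably follows the standard Qwen2.5-VL chat-template flow. The sketch below is an assumption rather than the authors' documented usage: the `MODEL_ID` placeholder must be replaced with this repository's actual id, and the helper function names are made up for illustration.

```python
MODEL_ID = "VideoThinker-R1-3B"  # placeholder: substitute the actual repo id

def build_messages(video_path: str, question: str) -> list:
    """Chat-template messages pairing one video with one question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": question},
        ],
    }]

def answer(video_path: str, question: str, model_id: str = MODEL_ID) -> str:
    # Heavy imports live here so build_messages stays usable without them.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

    processor = AutoProcessor.from_pretrained(model_id)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages(video_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], videos=video_inputs, padding=True, return_tensors="pt"
    ).to(model.device)
    generated = model.generate(**inputs, max_new_tokens=512)
    trimmed = generated[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

The message format matches the generic Qwen2.5-VL chat template; if the released checkpoint expects a special reasoning prompt, adapt `build_messages` accordingly.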

## Resources
- **Code**: [GitHub - falonss703/VideoThinker](https://github.com/falonss703/VideoThinker)
- **Paper**: [Hugging Face Papers](https://huggingface.co/papers/2605.01324)

## Citation

If you find this project useful in your research, please consider citing:

```bibtex
@inproceedings{wu2026videothinker,
  title={Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs},
  author={Wu, Jingze and Zhang, Quan and Suo, Hongfei and Cai, Zeqiang and Chen, Hongbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```