4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
Paper: arXiv:2605.05997
4DThinker is a framework that enables Vision-Language Models (VLMs) to "think with 4D" through dynamic latent mental imagery: internally simulating how scenes evolve within the continuous hidden space. It addresses dynamic spatial reasoning from monocular video by grounding the model in dynamic visual semantics.
This repository contains the trained 4DThinker model checkpoints, built on Qwen2.5-VL-3B.
```
model/
├── dift/
│   ├── checkpoints/      # DIFT-stage model weights
│   │   ├── model-00001-of-00002.safetensors
│   │   ├── model-00002-of-00002.safetensors
│   │   ├── config.json
│   │   ├── tokenizer.json
│   │   └── ...
│   └── tensorboard/      # DIFT training logs
└── 4drl/
    ├── model-00001-of-00002.safetensors
    ├── model-00002-of-00002.safetensors
    ├── config.json
    ├── tokenizer.json
    ├── trainer_state.json
    └── ...
```
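Since each stage lives in its own subfolder, you can fetch only the weights you need. A minimal sketch (an assumption, not from the model card) using `huggingface_hub` with `allow_patterns` to skip the other subfolder:

```python
def stage_patterns(stage: str) -> list:
    """Return glob patterns selecting a single stage subfolder."""
    if stage not in ("dift", "4drl"):
        raise ValueError("stage must be 'dift' or '4drl'")
    # dift weights live under dift/, 4drl files sit directly in 4drl/
    return [f"{stage}/*"]

def fetch_stage(stage: str) -> str:
    """Download one stage's files and return the local snapshot path."""
    # Imported here so stage_patterns stays usable without the package.
    from huggingface_hub import snapshot_download
    # Downloads several GB of safetensors; call only when you need the weights.
    return snapshot_download("jankin123/4DThinker-3B",
                             allow_patterns=stage_patterns(stage))
```

For most use cases only the final `4drl/` checkpoint is needed; the `dift/` folder is useful if you want to resume or inspect the intermediate stage.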
| Model | Stage | Base Model | Description |
|---|---|---|---|
| `dift/checkpoints/` | DIFT | Qwen2.5-VL-3B-Instruct | Supervised with cosine similarity loss on latent visual tokens |
| `4drl/` | 4DRL (GRPO) | DIFT checkpoint | Reinforced with answer-based rewards |
Three special tokens are added to the Qwen2.5-VL vocabulary to support latent imagery:

| Token | Description |
|---|---|
| `<\|latent_pad\|>` | Placeholder slot for a latent imagery embedding |
| `<\|latent_start\|>` | Marks the start of a latent imagery span |
| `<\|latent_end\|>` | Marks the end of a latent imagery span |
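As a hypothetical sketch of how these tokens delimit a latent span in a prompt: the start and end tokens bracket a run of pad tokens whose positions the model fills with latent imagery embeddings. The pipe-delimited spellings below follow the usual Qwen special-token style and are an assumption; check the shipped `tokenizer.json` for the exact strings.

```python
# Assumed spellings, in the standard Qwen special-token format
LATENT_START = "<|latent_start|>"
LATENT_PAD = "<|latent_pad|>"
LATENT_END = "<|latent_end|>"

def build_latent_span(num_slots: int) -> str:
    """Reserve num_slots positions for latent imagery inside a prompt."""
    return LATENT_START + LATENT_PAD * num_slots + LATENT_END

span = build_latent_span(4)
```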
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Load the final (4DRL) checkpoint from its subfolder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "jankin123/4DThinker-3B",
    subfolder="4drl",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")
```
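A usage sketch (not from the model card) for asking the loaded model a question about a video. The message schema follows the standard Qwen2.5-VL processor conventions; `qwen_vl_utils`, the video path, and the question are placeholders/assumptions.

```python
def build_messages(video_path: str, question: str) -> list:
    """One user turn containing a video and a text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": question},
        ],
    }]

def answer(model, processor, video_path: str, question: str,
           max_new_tokens: int = 256) -> str:
    # qwen_vl_utils is the helper package commonly used with Qwen2.5-VL
    from qwen_vl_utils import process_vision_info

    messages = build_messages(video_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], videos=video_inputs,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens before decoding the answer
    trimmed = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```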
If you find 4DThinker helpful for your work, please cite:
```bibtex
@article{chen20264dthinker,
  title={4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding},
  author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and An, Xiang and Li, Bo and Xie, Xin and Wang, ZiDong and Sun, Mingze and Chen, Shuang and Li, Hongyu and others},
  journal={arXiv preprint arXiv:2605.05997},
  year={2026}
}
```
Apache License 2.0