
# RoboInterVLM: Vision-Language Model Checkpoints for the RoboInter Manipulation Suite

This repository provides the model checkpoints of RoboInterVLM, developed as part of the RoboInter project. The models are fine-tuned on the RoboInter-VQA dataset for intermediate-representation understanding and generation in robotic manipulation.

## Available Checkpoints

| Checkpoint | Base Model | Architecture | Parameters | Description |
|---|---|---|---|---|
| `RoboInterVLM_qwenvl25_3b` | Qwen2.5-VL-3B-Instruct | Qwen2.5-VL | ~3B | Lightweight Qwen2.5-VL model, suitable for efficient deployment |
| `RoboInterVLM_qwenvl25_7b` | Qwen2.5-VL-7B-Instruct | Qwen2.5-VL | ~7B | Larger Qwen2.5-VL backbone for stronger performance |
| `RoboInterVLM_llava_one_vision_7B` | LLaVA-OneVision-Qwen2-7B | LLaVA-OneVision (SigLIP + Qwen2) | ~7B | LLaVA-OneVision backbone with SigLIP vision encoder |

All checkpoints are stored in safetensors format with bfloat16 precision.

## Supported Tasks

These models are jointly trained on general VQA data and three categories of our curated VQA tasks (illustrative example queries follow the list):

- **Generation**: Predicting intermediate representations such as trajectory waypoints, gripper bounding boxes, contact points/boxes, object bounding boxes (current & final), etc.
- **Understanding**: Multiple-choice visual reasoning about contact states, grasp poses, object grounding, trajectory selection, movement directions, etc.
- **Task Planning**: High-level task planning including next-step prediction, action primitive recognition, success determination, etc.
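
To make the three categories concrete, the snippet below lists one hypothetical query per category. These strings are illustrative only; the actual prompt templates and answer formats are defined by the RoboInter-VQA dataset and the training code.

```python
# Purely illustrative queries, one per task category. These strings are
# hypothetical -- the exact prompt templates and answer formats come from
# the RoboInter-VQA dataset, not from this sketch.
EXAMPLE_QUERIES = {
    "generation": (
        "Predict the trajectory waypoints and the final gripper bounding box "
        "for picking up the mug and placing it on the plate."
    ),
    "understanding": (
        "Which option best describes the current contact state between the "
        "gripper and the drawer handle? (A) no contact (B) partial contact "
        "(C) stable grasp (D) collision"
    ),
    "task_planning": (
        "The task is 'set the table'. Given the current image, what is the "
        "next action primitive to execute?"
    ),
}
```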

## Usage

### Qwen2.5-VL Checkpoints

For loading and inference with the Qwen2.5-VL checkpoints, please refer to the RoboInterVLM-QwenVL codebase. A quick loading example is shown below:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Pick the 3B or 7B checkpoint.
model_path = "InternRobotics/RoboInterVLM_qwenvl25_3b"  # or RoboInterVLM_qwenvl25_7b

# Load the bfloat16 safetensors weights and the matching processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
```
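
Once loaded, the model can be queried with an image and a text instruction. The following is a minimal single-image inference sketch using the standard Qwen2.5-VL chat-template workflow; it assumes the `qwen_vl_utils` helper package is installed, and the image path and instruction are placeholders. Refer to the RoboInterVLM-QwenVL codebase for the exact prompt formats used in training.

```python
from qwen_vl_utils import process_vision_info

# Chat-style prompt with one image and a task instruction (placeholder values).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/scene.jpg"},
            {"type": "text", "text": "Predict the gripper bounding box for grasping the mug."},
        ],
    }
]

# Standard Qwen2.5-VL preprocessing: render the chat template and extract vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```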

### LLaVA-OneVision Checkpoint

For loading and inference with the LLaVA-OneVision checkpoint, please refer to the RoboInterVLM-LLaVAOV codebase, as it requires custom model classes.

## Training & Evaluation

For the full training and evaluation pipelines, please refer to the RoboInterVLM-QwenVL and RoboInterVLM-LLaVAOV codebases.

## Related Resources

## License

Please refer to the original licenses of RoboInter, Qwen2.5-VL, and LLaVA-OneVision.