
# RoboInterVLM: Vision-Language Model Checkpoints for the RoboInter Manipulation Suite

This repository provides the model checkpoints of RoboInterVLM, developed as part of the RoboInter project. The models are fine-tuned on the RoboInter-VQA dataset for intermediate-representation understanding and generation in robotic manipulation.

## Available Checkpoints

| Checkpoint | Base Model | Architecture | Parameters | Description |
|---|---|---|---|---|
| `RoboInterVLM_qwenvl25_3b` | Qwen2.5-VL-3B-Instruct | Qwen2.5-VL | ~3B | Lightweight Qwen2.5-VL model, suitable for efficient deployment |
| `RoboInterVLM_qwenvl25_7b` | Qwen2.5-VL-7B-Instruct | Qwen2.5-VL | ~7B | Larger Qwen2.5-VL backbone for stronger performance |
| `RoboInterVLM_llava_one_vision_7B` | LLaVA-OneVision-Qwen2-7B | LLaVA-OneVision (SigLIP + Qwen2) | ~7B | LLaVA-OneVision backbone with SigLIP vision encoder |

All checkpoints are stored in safetensors format with bfloat16 precision.

## Supported Tasks

These models are jointly trained on general VQA data and three categories of our curated VQA tasks (illustrative example queries follow the list):

- **Generation**: Predicting intermediate representations such as trajectory waypoints, gripper bounding boxes, contact points/boxes, object bounding boxes (current & final), etc.
- **Understanding**: Multiple-choice visual reasoning about contact states, grasp poses, object grounding, trajectory selection, movement directions, etc.
- **Task Planning**: High-level task planning including next-step prediction, action primitive recognition, success determination, etc.
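
To make the three categories concrete, the snippet below lists one hypothetical query per category. These strings are illustrative only; the actual prompt templates and answer formats are defined by the RoboInter-VQA dataset and the training code.

```python
# Purely illustrative queries, one per task category. These strings are
# hypothetical -- the exact prompt templates and answer formats come from
# the RoboInter-VQA dataset, not from this sketch.
EXAMPLE_QUERIES = {
    "generation": (
        "Predict the trajectory waypoints and the final gripper bounding box "
        "for picking up the mug and placing it on the plate."
    ),
    "understanding": (
        "Which option best describes the current contact state between the "
        "gripper and the drawer handle? (A) no contact (B) partial contact "
        "(C) stable grasp (D) collision"
    ),
    "task_planning": (
        "The task is 'set the table'. Given the current image, what is the "
        "next action primitive to execute?"
    ),
}
```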

## Usage

### Qwen2.5-VL Checkpoints

For loading and inference with the Qwen2.5-VL checkpoints, please refer to the RoboInterVLM-QwenVL codebase. A quick loading example is shown below:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Pick the 3B or 7B checkpoint.
model_path = "InternRobotics/RoboInterVLM_qwenvl25_3b"  # or RoboInterVLM_qwenvl25_7b

# Load the bfloat16 safetensors weights and the matching processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
```
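
Once loaded, the model can be queried with an image and a text instruction. The following is a minimal single-image inference sketch using the standard Qwen2.5-VL chat-template workflow; it assumes the `qwen_vl_utils` helper package is installed, and the image path and instruction are placeholders. Refer to the RoboInterVLM-QwenVL codebase for the exact prompt formats used in training.

```python
from qwen_vl_utils import process_vision_info

# Chat-style prompt with one image and a task instruction (placeholder values).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/scene.jpg"},
            {"type": "text", "text": "Predict the gripper bounding box for grasping the mug."},
        ],
    }
]

# Standard Qwen2.5-VL preprocessing: render the chat template and extract vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```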

### LLaVA-OneVision Checkpoint

For loading and inference with the LLaVA-OneVision checkpoint, please refer to the RoboInterVLM-LLaVAOV codebase, as it requires custom model classes.

## Training & Evaluation

For the full training and evaluation pipelines, please refer to the RoboInterVLM-QwenVL and RoboInterVLM-LLaVAOV codebases.

## Related Resources

## License

Please refer to the original licenses of RoboInter, Qwen2.5-VL, and LLaVA-OneVision.