---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2-VL-7B-Instruct
---

# VideoChat-R1_7B_caption

[\[GitHub\]](https://github.com/OpenGVLab/VideoChat-R1)
[\[Tech Report\]](https://arxiv.org/pdf/2504.06958)

## How to use the model

We provide a simple installation example below:
```shell
pip install transformers
pip install qwen_vl_utils
```
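
The snippet below loads the model with `attn_implementation="flash_attention_2"`. FlashAttention 2 is a separate package; if it is not already available in your environment, install it as shown below (or drop that argument to fall back to the default attention implementation):
```shell
pip install flash-attn --no-build-isolation
```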

Then you can use our model:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B_caption"

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Describe the video in detail."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {
                "type": "text",
                "text": f"{question} First output the thinking process in <think> </think> tags and then output the final answer in <answer> </answer> tags",
            },
        ],
    }
]

# In Qwen2-VL, frame-rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the prompt tokens from each output sequence
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
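
The prompt asks the model to wrap its reasoning in `<think> </think>` tags and its final caption in `<answer> </answer>` tags, so in practice you usually want to keep only the answer part. Below is a minimal post-processing sketch; the `extract_answer` helper and its regex are our own illustration, not part of the released API:
```python
import re

def extract_answer(generation: str) -> str:
    """Return the text inside <answer> </answer>, falling back to the full string."""
    match = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    return match.group(1).strip() if match else generation.strip()

# `output_text` is the list returned by processor.batch_decode(...) above
caption = extract_answer(output_text[0])
print(caption)
```
If long reasoning traces get truncated before the `<answer>` block is emitted, consider raising `max_new_tokens` in `model.generate`.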

## Citation

```bibtex
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```