🦜VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B⚡
[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B is built upon InternVideo2-1B and Qwen2.5-7B, employing only 16 tokens per frame. By leveraging YaRN to extend the context window to 128k (Qwen2.5's native context window is 32k), our model supports input sequences of up to approximately 10,000 frames.
Note: Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension; to ensure optimal performance, we recommend using English for interaction.
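If you want to verify the extended context settings locally, you can inspect the checkpoint's configuration without downloading the weights. The sketch below assumes the repository exposes its custom config through AutoConfig (the same trust_remote_code mechanism used for the model) and that standard Qwen2.5 fields such as max_position_embeddings and rope_scaling are present; the exact field names in this checkpoint may differ.

from transformers import AutoConfig

# Load only the configuration to check the context-length and rope-scaling settings.
config = AutoConfig.from_pretrained(
    "OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B",
    trust_remote_code=True,
)
print(config)  # look for max_position_embeddings and any rope_scaling / YaRN entry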
📈 Performance
| Model | MVBench | LongVideoBench | VideoMME (w/o sub) | Max Input Frames |
|---|---|---|---|---|
| VideoChat-Flash-Qwen2_5-2B@448 | 70.0 | 58.3 | 57.0 | 10000 |
| VideoChat-Flash-Qwen2-7B@224 | 73.2 | 64.2 | 64.0 | 10000 |
| VideoChat-Flash-Qwen2_5-7B-1M@224 | 73.4 | 66.5 | 63.5 | 50000 |
| VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224 | 74.3 | 64.5 | 65.1 | 10000 |
| VideoChat-Flash-Qwen2-7B@448 | 74.0 | 64.7 | 65.3 | 10000 |
🚀 How to use the model
First, you need to install a few Python modules; FlashAttention-2 is optional but recommended. We provide a simple installation example below:
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
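Optionally, you can sanity-check that the dependencies are importable before loading the model. This snippet is just a convenience, not part of the original setup:

# Quick check that the video-processing dependencies are importable
import av, cv2, decord, imageio, transformers
print("transformers:", transformers.__version__)
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional); attention may fall back to a slower implementation")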
Then you can use our model:
from transformers import AutoModel, AutoTokenizer
import torch
# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor
mm_llm_compress = False  # whether to enable global (LLM-side) token compression
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False
# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)
video_path = "your_video.mp4"
# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output1)
# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output2)
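If you want to ask several questions about the same video, you can wrap the calls above in a small helper. This is only a convenience sketch built on the model.chat signature shown in this card; ask_about_video is a hypothetical helper name, not part of the model's API.

def ask_about_video(model, tokenizer, video_path, questions,
                    max_num_frames=512, generation_config=None):
    """Ask several questions about one video, threading the chat history between turns."""
    history = None
    answers = []
    for question in questions:
        kwargs = dict(
            video_path=video_path,
            tokenizer=tokenizer,
            user_prompt=question,
            return_history=True,
            max_num_frames=max_num_frames,
            generation_config=generation_config or dict(do_sample=False, max_new_tokens=1024),
        )
        if history is not None:  # the first turn omits chat_history, as in the example above
            kwargs["chat_history"] = history
        answer, history = model.chat(**kwargs)
        answers.append(answer)
    return answers

answers = ask_about_video(model, tokenizer, video_path,
                          ["Describe this video in detail.", "How many people appear in the video?"],
                          generation_config=generation_config)
print(answers)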
✏️ Citation
@article{li2024videochatflash,
title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
journal={arXiv preprint arXiv:2501.00574},
year={2024}
}
Evaluation results (self-reported accuracy):
- MLVU: 73.4
- MVBench: 74.3
- Perception Test: 76.3
- LongVideoBench: 64.5
- VideoMME (w/o sub): 65.2
- LVBench: 48.7