Update pipeline tag and add project links #1
by nielsr (HF Staff)
README.md CHANGED
@@ -1,23 +1,25 @@
---
library_name: transformers
-pipeline_tag: image-text-to-text
license: apache-2.0
tags:
-
-
-
-
-
-
-
-language:
-- en
-- zh
---

# LLaVA-OneVision-2-8B-Instruct

-

The model is distributed as a HuggingFace `transformers` checkpoint with custom code (`trust_remote_code=True`).
@@ -145,17 +147,6 @@ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

-Defaults for the codec pipeline live in `preprocessor_config.json` under the
-`"codec"` key (`target_canvas=32`, `group_size=32`, `images_per_group=4`,
-`patch=14`, `min_group_frames=8`, `max_group_frames=64`); they can be
-overridden per call via `codec_config={...}`.
-
-**Short-video behaviour:** if the input video has fewer frames than
-`target_canvas` requires (or fewer than `min_group_frames`), a `UserWarning`
-is emitted and inference proceeds with however many canvases `cv-preinfer`
-can actually form. For very short clips, falling back to the frame-sampling
-backend is usually a better choice.
-
## Notes

- The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
@@ -166,3 +157,21 @@ backend is usually a better choice.
## License

Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license — see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
@@ -1,23 +1,25 @@
---
+language:
+- en
+- zh
library_name: transformers
license: apache-2.0
+pipeline_tag: video-text-to-text
tags:
+- multimodal
+- vision-language
+- image-text-to-text
+- video-text-to-text
+- llava
+- llava-onevision-2
+- qwen3
---

# LLaVA-OneVision-2-8B-Instruct

+[Paper](https://huggingface.co/papers/2605.25979) | [Project Page](https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/) | [GitHub](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2)
+
+LLaVA-OneVision-2 (LLaVA-OV-2) is a multimodal vision-language model that handles **single images, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder. Its key advance is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream for efficient long-video understanding.

The model is distributed as a HuggingFace `transformers` checkpoint with custom code (`trust_remote_code=True`).
@@ -145,17 +147,6 @@ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Notes

- The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
@@ -166,3 +157,21 @@ backend is usually a better choice.
## License

Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license — see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
+
+## Citation
+
+```bibtex
+@inproceedings{LLaVA-OneVision-2,
+title={LLaVA-OneVision-2},
+author={llava-onevision contributors},
+booktitle={arXiv},
+year={2026}
+}
+
+@inproceedings{LLaVA-OneVision-1.5,
+title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
+author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
+booktitle={arXiv},
+year={2025}
+}
+```
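The metadata change in the first hunk (`pipeline_tag` moving from `image-text-to-text` to `video-text-to-text`, with `image-text-to-text` kept as a tag) can be checked mechanically. The following is a minimal sketch, assuming the new front matter is exactly what the added side of the diff shows. `NEW_README_HEAD` and `parse_front_matter` are hypothetical names introduced for illustration, and the parser handles only the flat `key: value` and `- item` shapes that appear here; it is not a general YAML parser and not the Hub's own implementation.

```python
# Sketch only: a tiny front-matter reader for the flat metadata shown in the diff.
# NEW_README_HEAD and parse_front_matter are illustrative names (assumptions).

NEW_README_HEAD = """\
---
language:
- en
- zh
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- multimodal
- vision-language
- image-text-to-text
- video-text-to-text
- llava
- llava-onevision-2
- qwen3
---

# LLaVA-OneVision-2-8B-Instruct
"""


def parse_front_matter(text):
    """Return the metadata found between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    meta = {}
    current_key = None  # key whose list items are being collected
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the block
            break
        if not stripped:
            continue
        if stripped.startswith("- ") and current_key is not None:
            # List item belonging to the most recent bare "key:" line.
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                meta[key] = value  # scalar entry
                current_key = None
            else:
                meta[key] = []  # a list follows on the next lines
                current_key = key
    return meta


meta = parse_front_matter(NEW_README_HEAD)
print(meta["pipeline_tag"])
print(meta["tags"])
```

Keeping `image-text-to-text` in `tags` while `pipeline_tag` names the video task means the model stays discoverable under both tasks, which is what the diff appears to intend.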