Instructions for using lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct with libraries and local apps.

Libraries

How to use lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

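# Load model directly
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct"
# The processor exposes the tokenizer that the README uses to decode generated tokens
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(model_id, trust_remote_code=True, dtype="auto")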
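
The model card describes single-image, multi-image, and video inputs, while the pipeline example above passes a single image. Below is a minimal sketch of building a user turn from a list of images. It assumes the model's chat template accepts several `{"type": "image", "url": ...}` entries in one message, which the source does not confirm. `build_messages` is a hypothetical helper, not part of Transformers.

```python
def build_messages(image_urls, question):
    """Build one user turn: every image entry first, then the text question."""
    # Same entry shape as the pipeline example: {"type": "image", "url": ...}
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# With a single URL this reproduces the message list from the pipeline example.
messages = build_messages(
    ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"],
    "What animal is on the candy?",
)
```

The result can be passed as `pipe(text=messages)`. Adding a second URL to the list places a second image entry in the same turn.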

Local Apps

vLLM

How to use lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
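
The same request can be issued from Python. This sketch uses only the standard library and mirrors the curl payload above. The helper names `build_chat_payload` and `chat` are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Return the JSON body used in the curl example: one text part, one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url, payload, timeout=120.0):
    """POST the payload to <base_url>/v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

With the vLLM server running, `chat("http://localhost:8000", payload)` sends the request. The SGLang examples in this page listen on port 30000, so only the base URL changes.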

Use Docker

docker model run hf.co/lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct

SGLang

How to use lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner

How to use lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct with Docker Model Runner:

```
docker model run hf.co/lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct
```

Update pipeline tag and add project links

by nielsr HF Staff - opened 4 days ago

base: refs/heads/main ← from: refs/pr/1

README.md +32 -23

@@ -1,23 +1,25 @@
 ---
 library_name: transformers
-pipeline_tag: image-text-to-text
 license: apache-2.0
 tags:
-  - multimodal
-  - vision-language
-  - image-text-to-text
-  - video-text-to-text
-  - llava
-  - llava-onevision-2
-  - qwen3
-language:
-  - en
-  - zh
 ---
 # LLaVA-OneVision-2-8B-Instruct
-A multimodal vision-language model that handles **single images, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder.
 The model is distributed as a HuggingFace `transformers` checkpoint with custom code (`trust_remote_code=True`).
@@ -145,17 +147,6 @@ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
 print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
 ```
-Defaults for the codec pipeline live in `preprocessor_config.json` under the
-`"codec"` key (`target_canvas=32`, `group_size=32`, `images_per_group=4`,
-`patch=14`, `min_group_frames=8`, `max_group_frames=64`); they can be
-overridden per call via `codec_config={...}`.
-**Short-video behaviour:** if the input video has fewer frames than
-`target_canvas` requires (or fewer than `min_group_frames`), a `UserWarning`
-is emitted and inference proceeds with however many canvases `cv-preinfer`
-can actually form. For very short clips, falling back to the frame-sampling
-backend is usually a better choice.
 ## Notes
 - The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
@@ -166,3 +157,21 @@ backend is usually a better choice.
 ## License
 Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license — see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).

 ---
+language:
+- en
+- zh
 library_name: transformers
 license: apache-2.0
+pipeline_tag: video-text-to-text
 tags:
+- multimodal
+- vision-language
+- image-text-to-text
+- video-text-to-text
+- llava
+- llava-onevision-2
+- qwen3
 ---
 # LLaVA-OneVision-2-8B-Instruct
+[Paper](https://huggingface.co/papers/2605.25979) | [Project Page](https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/) | [GitHub](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2)
+LLaVA-OneVision-2 (LLaVA-OV-2) is a multimodal vision-language model that handles **single images, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder. Its key advance is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream for efficient long-video understanding.
 The model is distributed as a HuggingFace `transformers` checkpoint with custom code (`trust_remote_code=True`).
 print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
 ```
 ## Notes
 - The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
 ## License
 Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license — see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
+## Citation
+```bibtex
+@inproceedings{LLaVA-OneVision-2,
+  title={LLaVA-OneVision-2},
+  author={llava-onevision contributors},
+  booktitle={arXiv},
+  year={2026}
+}
+@inproceedings{LLaVA-OneVision-1.5,
+  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
+  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
+  booktitle={arXiv},
+  year={2025}
+ }
+```
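
The second hunk removes the README paragraphs on codec defaults and short-video behaviour. For readers who still need that information, the sketch below restates it in code. The default values are quoted from the removed lines. The shallow-merge semantics of `codec_config`, the helper names `resolve_codec_config` and `is_short_clip`, and the exact warning condition are assumptions for illustration, not the repository's implementation.

```python
import warnings

# Defaults quoted from the removed README lines ("codec" key of preprocessor_config.json).
CODEC_DEFAULTS = {
    "target_canvas": 32,
    "group_size": 32,
    "images_per_group": 4,
    "patch": 14,
    "min_group_frames": 8,
    "max_group_frames": 64,
}


def resolve_codec_config(overrides=None):
    """Merge per-call overrides over the defaults (assumed shallow merge)."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(CODEC_DEFAULTS))
    if unknown:
        raise KeyError(f"unknown codec option(s): {unknown}")
    return {**CODEC_DEFAULTS, **overrides}


def is_short_clip(num_frames, config):
    """Warn and return True when a clip has fewer frames than min_group_frames.

    Hypothetical check: the removed text also mentions a threshold tied to
    target_canvas, which it does not quantify, so it is not modelled here.
    """
    if num_frames < config["min_group_frames"]:
        warnings.warn(
            f"video has {num_frames} frames, fewer than "
            f"min_group_frames={config['min_group_frames']}",
            UserWarning,
        )
        return True
    return False


# Example: override one option and keep the remaining defaults.
config = resolve_codec_config({"max_group_frames": 32})
```

A caller could use `is_short_clip` to decide when to fall back to the frame-sampling backend, which the removed text recommends for very short clips.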