Update pipeline tag and add project links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +32 -23
README.md CHANGED
@@ -1,23 +1,25 @@
  ---
+ language:
+ - en
+ - zh
  library_name: transformers
- pipeline_tag: image-text-to-text
  license: apache-2.0
+ pipeline_tag: video-text-to-text
  tags:
- - multimodal
- - vision-language
- - image-text-to-text
- - video-text-to-text
- - llava
- - llava-onevision-2
- - qwen3
- language:
- - en
- - zh
+ - multimodal
+ - vision-language
+ - image-text-to-text
+ - video-text-to-text
+ - llava
+ - llava-onevision-2
+ - qwen3
  ---
 
  # LLaVA-OneVision-2-8B-Instruct
 
- A multimodal vision-language model that handles **single images, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder.
+ [Paper](https://huggingface.co/papers/2605.25979) | [Project Page](https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/) | [GitHub](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2)
+
+ LLaVA-OneVision-2 (LLaVA-OV-2) is a multimodal vision-language model that handles **single images, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder. Its key advance is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream for efficient long-video understanding.
 
  The model is distributed as a HuggingFace `transformers` checkpoint with custom code (`trust_remote_code=True`).
 
@@ -145,17 +147,6 @@ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
  print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
  ```
 
- Defaults for the codec pipeline live in `preprocessor_config.json` under the
- `"codec"` key (`target_canvas=32`, `group_size=32`, `images_per_group=4`,
- `patch=14`, `min_group_frames=8`, `max_group_frames=64`); they can be
- overridden per call via `codec_config={...}`.
-
- **Short-video behaviour:** if the input video has fewer frames than
- `target_canvas` requires (or fewer than `min_group_frames`), a `UserWarning`
- is emitted and inference proceeds with however many canvases `cv-preinfer`
- can actually form. For very short clips, falling back to the frame-sampling
- backend is usually a better choice.
-
  ## Notes
 
  - The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
@@ -166,3 +157,21 @@ backend is usually a better choice.
  ## License
 
  Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license — see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{LLaVA-OneVision-2,
+ title={LLaVA-OneVision-2},
+ author={llava-onevision contributors},
+ booktitle={arXiv},
+ year={2026}
+ }
+
+ @inproceedings{LLaVA-OneVision-1.5,
+ title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
+ author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
+ booktitle={arXiv},
+ year={2025}
+ }
+ ```
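
The README paragraph removed in the `-145,17 +147,6` hunk describes codec defaults stored under the `"codec"` key of `preprocessor_config.json` and a per-call `codec_config={...}` override. Below is a minimal sketch of how such an override could be resolved, assuming a plain dictionary merge. `resolve_codec_config` and `warn_if_short` are hypothetical helpers written for illustration and are not part of the model's custom code. The short-video check covers only the `min_group_frames` condition, since the diff does not say how many frames `target_canvas` requires.

```python
import warnings

# Default values as listed in the removed README paragraph
# ("codec" key of preprocessor_config.json).
CODEC_DEFAULTS = {
    "target_canvas": 32,
    "group_size": 32,
    "images_per_group": 4,
    "patch": 14,
    "min_group_frames": 8,
    "max_group_frames": 64,
}


def resolve_codec_config(overrides=None):
    """Hypothetical helper: overlay per-call overrides on the defaults."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(CODEC_DEFAULTS))
    if unknown:
        # Reject misspelled keys instead of silently ignoring them.
        raise KeyError(f"unknown codec option(s): {unknown}")
    return {**CODEC_DEFAULTS, **overrides}


def warn_if_short(num_frames, config):
    """Hypothetical check: warn when a clip has fewer frames than min_group_frames."""
    if num_frames < config["min_group_frames"]:
        warnings.warn(
            f"video has {num_frames} frames, fewer than "
            f"min_group_frames={config['min_group_frames']}",
            UserWarning,
        )
        return True
    return False


# Override a single option; the other defaults are kept.
config = resolve_codec_config({"target_canvas": 16})
```

A merge that returns a new dictionary leaves the shared defaults untouched, so one call's override cannot leak into the next.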