---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---
# Vision Encoder of Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

---
## News
* **2026.03**: PenguinVL-Encoder is now available for general use.
* **2026.03**: Released PenguinVL-2B and PenguinVL-8B.
---
## Model Overview
Penguin-VL is a compact vision-language model (VLM) designed to explore the efficiency limits of small-scale VLMs.
Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
### Key Characteristics
- **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
  This provides strong semantic priors and native compatibility with the downstream LLM.
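
To make the 2D-RoPE idea concrete, here is a minimal pure-Python sketch. It assumes one common layout, in which the first half of the channels is rotated by the patch's row index and the second half by its column index. The model card does not specify Penguin-VL's exact scheme, so the helper names (`rope_1d`, `rope_2d`), the channel split, and the base of 10000 are illustrative assumptions. The final check shows the property that makes this useful for spatial modeling: the query-key score depends only on the relative row and column offset between two patches.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, row, col):
    """Assumed layout: first half encodes the row, second half the column."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (length must be divisible by 4).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.5, 0.8, 0.6, -0.3, 0.4, 0.7]

# Same relative offset (1 row down, 2 columns right) at two absolute positions.
score_a = dot(rope_2d(q, 0, 0), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 5, 3), rope_2d(k, 6, 5))
assert math.isclose(score_a, score_b, abs_tol=1e-9)
```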
---
## Quick Start: Transformers Inference
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"
images = load_image(image_path)

# trust_remote_code=True lets transformers run the model code shipped in the repository.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image and move every input to the GPU as a tensor.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
# Cast pixel values to bfloat16 to match the model weights.
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

image_features = model(**inputs)
```
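
The snippet above requests `attn_implementation="flash_attention_2"`, which requires the separate `flash_attn` package. The following sketch picks a value at runtime instead. `pick_attn_implementation` is a hypothetical helper, not part of Transformers or this repository. `"sdpa"` is a standard Transformers option, but whether the Penguin-Encoder remote code supports it is an assumption that this model card does not confirm.

```python
import importlib.util

def pick_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: the model's remote code accepts the PyTorch SDPA backend.
    return "sdpa"

attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result could then be passed as `attn_implementation=attn_implementation` in the `AutoModel.from_pretrained` call.
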
## Model Zoo
| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
## Main Results
Ablation study: see the ablation section of our paper for the main results.
## Citation
If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```