metadata

license: apache-2.0
language:
  - en
metrics:
  - accuracy
base_model:
  - Qwen/Qwen3-0.6B
library_name: transformers
tags:
  - multi-modal
  - large-language-model
  - vision-language-model
  - vision-encoder

Vision Encoder of Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Project Page: penguin-vl.github.io | GitHub: tencent-ailab/Penguin-VL | arXiv: 2603.06569

📰 News

2026.03 — PenguinVL-Encoder now available for general use.
2026.03 — Released PenguinVL-2B, PenguinVL-8B.

🌟 Model Overview

Penguin-VL is a compact vision-language model (VLM) designed to explore the efficiency limits of small-scale VLMs.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

🧠 LLM-based Vision Encoder
The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
This provides strong semantic priors and native compatibility with the downstream LLM.
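
To make the 2D-RoPE idea concrete, here is a minimal pure-Python sketch. It assumes one common formulation, in which the first half of a feature vector is rotated by the patch's row index and the second half by its column index. The helper names (`rope_1d`, `rope_2d`), the base frequency of 10000, and the half/half split are assumptions for illustration and are not taken from the Penguin-VL code.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, row, col):
    """Hypothetical 2D variant: first half encodes the row, second half the column."""
    half = len(vec) // 2  # each half must have an even length
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors for two image patches (values are arbitrary).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.5, 0.8, 0.6, -0.3, 0.4, 0.7]

# Same relative offset (1 row, 2 columns) at two different absolute positions.
score_a = dot(rope_2d(q, 1, 2), rope_2d(k, 0, 0))
score_b = dot(rope_2d(q, 5, 9), rope_2d(k, 4, 7))
print(abs(score_a - score_b) < 1e-9)
```

Under these assumptions, the attention score between two patches depends only on their relative row and column offset, which is the property that makes rotary embeddings suitable for spatial modeling on an image grid.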

🧪 Quick Start — Transformers Inference

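import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"  # local path or URL of the input image
images = load_image(image_path)

# trust_remote_code=True lets transformers run the custom model code in the repository.
# attn_implementation="flash_attention_2" requires the flash-attn package and a CUDA GPU.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image and move every tensor to the model's device.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).to(model.device) for k, v in inputs.items()}
# Cast pixel values to bfloat16 to match the model weights.
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    image_features = model(**inputs)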

🌎 Model Zoo

Model	Base Model	HF Link
PenguinVL-8B	Qwen3-8B	tencent/Penguin-VL-8B
PenguinVL-2B	Qwen3-1.7B	tencent/Penguin-VL-2B
PenguinVL-Encoder	Qwen3-0.6B	tencent/Penguin-Encoder

🚀 Main Results

Ablation Study:

For the main results, see the ablation section of our paper.

Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}