---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---

# Vision Encoder of Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Project Page: penguin-vl.github.io | GitHub: tencent-ailab/Penguin-VL | arXiv: 2603.06569


---

## 📰 News

* **2026.03**: PenguinVL-Encoder is now available for general use.
* **2026.03**: Released PenguinVL-2B and PenguinVL-8B.

---

## 🌟 Model Overview

Penguin-VL is a compact Vision-Language Model (VLM) designed to explore the efficiency limits of small-scale VLMs.

Unlike most existing VLMs, which rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

### Key Characteristics

- 🧠 **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling. This provides strong semantic priors and native compatibility with the downstream LLM.

---

## 🧪 Quick Start: Transformers Inference

```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"  # replace with a local path or URL
images = load_image(image_path)

# The encoder ships custom modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image and move every input to the model's device.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.as_tensor(v).to(model.device) for k, v in inputs.items()}

# Match the pixel dtype to the model weights (bfloat16).
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Inference only, so gradients are not needed.
with torch.no_grad():
    image_features = model(**inputs)
```

## 🌎 Model Zoo

| Model             | Base Model | HF Link                                                                   |
| ----------------- | ---------- | ------------------------------------------------------------------------- |
| PenguinVL-8B      | Qwen3-8B   | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B)     |
| PenguinVL-2B      | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B)     |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## 🚀 Main Results

Ablation Study:

![image](https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/JOSRpV_qEbTqdbYwH-hJr.png)

For the main results, see the ablation section of our paper.

## Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```
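
As a rough illustration of the two modifications named under Key Characteristics (bidirectional attention and 2D-RoPE), here is a minimal, dependency-free sketch. It is not taken from the Penguin-VL code. The axial channel split (first half of the channels rotated by row index, second half by column index), the interleaved channel pairing, the base of 10000, and the helper names `rope_1d`, `rope_2d`, `dot`, `causal_mask`, and `bidirectional_mask` are all assumptions made for illustration; the released encoder may differ in each of these details.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed axial 2D RoPE: first half encodes the row, second half the column."""
    assert len(vec) % 4 == 0, "channel count must be divisible by 4"
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


def causal_mask(n):
    """Text-LLM style mask: token i may attend only to tokens j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder style mask: every patch may attend to every other patch."""
    return [[True] * n for _ in range(n)]


# Toy query/key vectors for two image patches (values are arbitrary).
q = [1.0, 2.0, 3.0, 4.0, 0.5, -1.0, 2.0, 0.0]
k = [0.3, -0.7, 1.5, 2.0, -1.0, 0.25, 0.0, 1.0]

# Both pairs of patches are offset by (2 rows, 2 columns), so the two
# scores agree up to floating-point error.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 0, 1))
score_b = dot(rope_2d(q, 5, 7), rope_2d(k, 3, 5))
print(score_a, score_b)
```

The point of the sketch is that the attention logit depends only on the relative row and column offset between two patches, which is what makes a rotary scheme usable on a 2D grid, while the all-`True` mask lets each patch see the whole image rather than only the patches that precede it in raster order.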