# Raon-VisionEncoder
Raon-VisionEncoder is a 1.14B-parameter vision-language foundation model by KRAFTON for image and text feature extraction. It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex. It is built on OpenCLIP with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.
| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|---|---|---|---|---|---|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |
```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```
```python
import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])
with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

# Compute similarity with the learned scale and bias
logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
probs = logits.softmax(dim=-1)
print(probs)
```
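Because `encode_image` and `encode_text` return normalized features, the matrix product above is a cosine-similarity matrix, and the softmax turns each image's row of scores into a distribution over the text prompts. A minimal, model-free sketch of just this scoring step, with random stand-in features in place of the encoder outputs and placeholder values for the learned `logit_scale`/`logit_bias`:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Divide each row by its L2 norm, mirroring the normalized encoder outputs.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis=-1):
    # Numerically stable softmax over the text axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
img_feat = l2_normalize(rng.normal(size=(2, 1152)))  # stand-in for encode_image output
txt_feat = l2_normalize(rng.normal(size=(3, 1152)))  # stand-in for encode_text output

scale, bias = np.exp(4.0), -10.0                # placeholders for logit_scale.exp() / logit_bias
logits = scale * (img_feat @ txt_feat.T) + bias  # [2, 3] image-text similarity scores
probs = softmax(logits)                          # per-image distribution over the texts
print(probs.shape, probs.sum(axis=-1))
```

Each row of `probs` sums to 1, so `probs.argmax(axis=-1)` picks the best-matching text per image.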
| Method | Input | Output |
|---|---|---|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature parameter |
| `model.logit_bias` | - | Learned bias parameter |
| `model.get_processor(repo_id)` | HuggingFace repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image dict |
| `processor(text=["a cat"])` | List of strings | Tokenized text dict |
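SigLIP-style models are trained with a sigmoid objective, which is why a learned `logit_bias` is exposed alongside `logit_scale`: each image-text pair can also be read as an independent match probability rather than a softmax over candidates. A hedged sketch with stand-in normalized features (the placeholder `scale`/`bias` values stand in for the learned parameters):

```python
import numpy as np

def sigmoid(x):
    # Map a pairwise logit to an independent match probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
img = rng.normal(size=(1, 1152))
img /= np.linalg.norm(img)                       # stand-in normalized image feature
txt = rng.normal(size=(4, 1152))
txt /= np.linalg.norm(txt, axis=-1, keepdims=True)  # stand-in normalized text features

scale, bias = np.exp(4.0), -10.0                 # placeholders for logit_scale.exp() / logit_bias
pair_logits = scale * (img @ txt.T) + bias       # [1, 4] pairwise scores
match_prob = sigmoid(pair_logits)                # independent probability per pair
best = int(match_prob.argmax())                  # index of the best-matching text
print(match_prob, best)
```

Unlike the softmax variant, these probabilities need not sum to 1, which suits retrieval settings where zero or several texts may match an image.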
This repository is licensed under the Apache License 2.0. Third-party notices are listed in the NOTICE file.
© 2026 KRAFTON