UniAR: Unified Multimodal Autoregressive Modeling with Shared Context -- Visual Tokenizer is Key to Unification (ICML 2026)

UniAR is a unified autoregressive multimodal model for image understanding, image generation, and image editing in a single Transformer. UniAR-SFT is the supervised fine-tuned checkpoint (before RL).


Model Description

UniAR uses a single discrete visual tokenizer (BSQ) as the key bridge between understanding and generation, enabling a shared context where the model can directly interpret its own generated visual tokens. Key components:

  • Backbone: Qwen3-8B
  • Visual Tokenizer: BSQ-quantized SigLIP2-So400M ViT with DeepStack connections
  • Visual Decoder: SD3.5-Medium DiT with SigLIP feature injection
  • Training: Pre-training (1T tokens) → SFT
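To make the role of the discrete tokenizer more concrete, here is a minimal sketch of binary spherical quantization (BSQ) on a single feature vector. It is not UniAR's tokenizer code: the function name `bsq_quantize` is hypothetical, the learned projection that normally precedes quantization is omitted, and the input is assumed to be a non-zero vector. The idea shown is only the core step: normalize onto the unit sphere, keep the sign of each dimension, and read the sign bits as an integer token index.

```python
import math


def bsq_quantize(u):
    """Quantize one non-zero feature vector with binary spherical quantization.

    Returns (q, index): q is the quantized unit vector, index is the token id.
    Hypothetical helper for illustration; not part of the uniar package.
    """
    d = len(u)
    # Project the vector onto the unit hypersphere.
    norm = math.sqrt(sum(x * x for x in u))
    v = [x / norm for x in u]
    # Keep one sign bit per dimension (zero is treated as positive).
    bits = [1 if x >= 0 else 0 for x in v]
    # Quantized vector: a hypercube corner rescaled to unit length.
    q = [(2 * b - 1) / math.sqrt(d) for b in bits]
    # Token index: the d sign bits read as a binary integer (bit k has weight 2**k).
    index = sum(b << k for k, b in enumerate(bits))
    return q, index


q, index = bsq_quantize([0.3, -1.2, 0.8, -0.1])
print(q)      # [0.5, -0.5, 0.5, -0.5]
print(index)  # 5
```

With d sign bits the implicit codebook has 2**d entries, so no explicit codebook lookup is needed to map a feature to a token or back.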

This checkpoint (UniAR-SFT) has been pre-trained and supervised fine-tuned but has not gone through RL, which makes it a suitable starting point for custom RL training.

Checkpoint Contents

This is a self-contained checkpoint with all components needed for both understanding and generation:

| Component       | Path             | Description                                |
|-----------------|------------------|--------------------------------------------|
| AR model        | *.safetensors    | Unified autoregressive model weights       |
| BSQ encoder     | bsq_encoder/     | BSQ quantized image tokenizer              |
| SD3 transformer | sd3_transformer/ | SD3 transformer with visual feature injection |
| SD3 pipeline    | sd3_pipeline/    | SD3 VAE + text encoders                    |
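If you download the checkpoint to a local directory, a quick layout check can catch an incomplete download before loading any weights. The sketch below assumes the layout listed above; the helper name `missing_components` is hypothetical and only inspects the top level of the directory.

```python
from pathlib import Path

# Sub-directories listed in the checkpoint contents above.
EXPECTED_DIRS = ("bsq_encoder", "sd3_transformer", "sd3_pipeline")


def missing_components(checkpoint_dir):
    """Return the names of expected checkpoint components that are absent."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    # The AR model weights are stored as one or more top-level *.safetensors files.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

An empty list means every component named in the table was found; it does not verify that the files themselves are complete.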

Usage

Installation

conda create -n uniar python=3.12 -y
conda activate uniar

git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e .            # inference dependencies
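The examples below pass attn_implementation="flash_attention_2", which in Transformers requires the separate flash-attn package (import name `flash_attn`). Whether `pip install -e .` pulls it in is not stated here, so a small import check such as the following sketch may help. The helper name `find_missing_packages` is hypothetical, and the package list is an assumption based on the imports used in this card.

```python
import importlib.util


def find_missing_packages(names=("torch", "transformers", "flash_attn", "uniar")):
    """Return the top-level packages from `names` that cannot be imported."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_packages()
    print("missing packages:", missing if missing else "none")
```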

Image Understanding

import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration

model_path = "ShareLab-SII/UniAR-SFT"
model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("mm_token_type_ids", None)

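# greedy decoding under bf16 autocast
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
# drop the prompt tokens so that only the newly generated tokens are decoded
output_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]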

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
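The list comprehension before decoding exists because `generate` returns the prompt followed by the continuation for each sequence. The following sketch reproduces that slicing on plain Python lists so the behavior is easy to see; the token ids are made up and the helper name `strip_prompt` is hypothetical.

```python
def strip_prompt(input_ids, output_ids):
    """Remove each prompt from the front of its generated sequence."""
    return [out[len(prompt):] for prompt, out in zip(input_ids, output_ids)]


# Made-up token ids: a 3-token prompt followed by 2 generated tokens.
prompts = [[101, 7, 8]]
generated = [[101, 7, 8, 42, 43]]
print(strip_prompt(prompts, generated))  # [[42, 43]]
```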

Image Generation

import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration, UniARVisualDecoder
from inference.visual_inputs import prepare_visual_inputs

model_path = "ShareLab-SII/UniAR-SFT"
device = torch.device("cuda")

ar_model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_path, padding_side="left")
visual_decoder = UniARVisualDecoder.from_pretrained(model_path, device=device)

# prepare inputs
visual_inputs = prepare_visual_inputs(
    ["A cute anime girl."],
    ar_model,
    processor,
    ar_height=960,
    ar_width=960,
)

# autoregressively generate visual indices
indices = ar_model.generate_visual(
    **visual_inputs,
    temperature=1.0,
    cfg=1.5,
    show_progress=True,
)

# decode visual indices into image
images = visual_decoder.decode(
    indices,
    ar_height=960,
    ar_width=960,
    upsampling_ratio=1.067,
)

images[0].save("output.png")
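The `temperature` and `cfg` arguments above control sampling. How `generate_visual` applies them internally is not documented here, so the sketch below shows only the common formulation of classifier-free guidance for autoregressive sampling, as an assumption: logits from a conditional and an unconditional pass are combined, then a token is sampled from the temperature-scaled softmax. The function names `cfg_logits` and `sample_token` are hypothetical.

```python
import math
import random


def cfg_logits(cond, uncond, cfg):
    """Combine conditional and unconditional logits with guidance scale `cfg`.

    cfg = 1.0 returns the conditional logits unchanged; larger values push
    the distribution further toward the prompt-conditioned prediction.
    """
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


def sample_token(logits, temperature, rng):
    """Sample one token index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


guided = cfg_logits(cond=[2.0, 0.0], uncond=[1.0, 1.0], cfg=1.5)
print(guided)  # [2.5, -0.5]
token = sample_token(guided, temperature=1.0, rng=random.Random(0))
```

In this formulation a guidance scale of 1.5 amplifies the gap between the conditional and unconditional predictions by half again, at the cost of a second forward pass per step.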

Citation

@inproceedings{peng2026uniar,
  title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  booktitle={ICML},
  year={2026}
}