arxiv:2603.06449

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Published on Mar 6
Submitted by YitongChen (SII) on Mar 10

Abstract

CaTok presents a 1D causal image tokenizer with a MeanFlow decoder that enables fast one-step generation and high-fidelity multi-step sampling while achieving state-of-the-art image reconstruction performance.

AI-generated summary

Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.
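The REPA-A regularizer is described only at a high level here (aligning encoder features with Vision Foundation Model features). A minimal sketch of one common way to implement such an alignment term, assuming a learned linear projection into the VFM's feature space and a cosine-similarity objective (the dimensions, projection head, and loss form are assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic feature-alignment regularizer in the spirit of REPA-A:
# project tokenizer-encoder features into the (frozen) VFM feature
# space and penalize cosine dissimilarity. All dimensions are
# illustrative; the paper does not specify this exact head.
class AlignmentLoss(nn.Module):
    def __init__(self, enc_dim=256, vfm_dim=768):
        super().__init__()
        # Maps encoder features into the VFM's embedding space.
        self.proj = nn.Linear(enc_dim, vfm_dim)

    def forward(self, enc_feats, vfm_feats):
        # enc_feats: (B, N, enc_dim) from the tokenizer encoder
        # vfm_feats: (B, N, vfm_dim) from a frozen VFM (detached)
        z = F.normalize(self.proj(enc_feats), dim=-1)
        y = F.normalize(vfm_feats.detach(), dim=-1)
        # 1 - mean cosine similarity: 0 when perfectly aligned.
        return 1.0 - (z * y).sum(-1).mean()

loss_fn = AlignmentLoss()
enc = torch.randn(2, 64, 256)   # dummy encoder features
vfm = torch.randn(2, 64, 768)   # dummy VFM features
loss = loss_fn(enc, vfm)
```

Because the VFM branch is detached, the gradient only shapes the encoder (and projection), which is what makes this a regularizer on the learned representation rather than a distillation of the VFM itself.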

Community



Image tokenization has become a key building block for modern generative models, yet most existing tokenizers struggle to simultaneously support causal modeling, efficient generation, and high visual fidelity.

This paper introduces CaTok, a one-dimensional causal image tokenizer that learns visual tokens aligned with generative modeling dynamics through a MeanFlow decoder objective. By selecting tokens over temporal intervals and coupling them with the MeanFlow training objective, CaTok learns causal token representations that naturally capture visual concepts while remaining compatible with autoregressive or flow-based generation.

The resulting tokenizer enables both fast one-step generation and high-quality multi-step sampling, bridging the gap between efficient token-based generation and faithful visual reconstruction.

Key contributions:

  • 1D causal image tokenizer that converts images into sequential tokens suitable for causal generative modeling.
  • MeanFlow-based decoder objective that stabilizes training and aligns token representations with generative dynamics.
  • Interval-based token selection enabling tokens to represent visual concepts at different temporal scales.
  • Support for fast one-step generation while maintaining high-fidelity reconstruction with multi-step sampling.
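The one-step vs multi-step trade-off above follows directly from the MeanFlow formulation, where the decoder predicts the average velocity u(z_t, r, t) over an interval [r, t], so a single interval [0, 1] yields one-shot generation and subdividing the interval recovers multi-step sampling. A minimal runnable sketch, assuming the standard MeanFlow update z_r = z_t − (t − r)·u(z_t, r, t) and a toy stand-in network (the real decoder is also conditioned on the causal tokens, omitted here):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a MeanFlow decoder: predicts the average
# velocity u(z_t, r, t) over the interval [r, t]. The real model is a
# large network conditioned on CaTok's causal tokens.
class ToyMeanFlowDecoder(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim + 2, dim)  # z_t plus interval endpoints (r, t)

    def forward(self, z_t, r, t):
        cond = torch.stack([r, t], dim=-1).expand(z_t.shape[0], 2)
        return self.net(torch.cat([z_t, cond], dim=-1))

@torch.no_grad()
def sample(decoder, z1, steps=1):
    """Walk from noise z1 (time 1) toward data (time 0).

    steps=1 is one-shot generation: z0 = z1 - u(z1, 0, 1).
    Larger `steps` subdivides [0, 1] for higher-fidelity sampling.
    """
    z = z1
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        u = decoder(z, t_lo.expand(1), t_hi.expand(1))
        z = z - (t_hi - t_lo) * u  # displacement = interval length x avg velocity
    return z

decoder = ToyMeanFlowDecoder()
z1 = torch.randn(4, 16)
one_step = sample(decoder, z1, steps=1)    # single interval [0, 1]
multi_step = sample(decoder, z1, steps=8)  # eight sub-intervals
```

The same trained network serves both modes; only the number of sub-intervals changes at inference time, which is what lets the tokenizer offer fast previews and refined reconstructions without retraining.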
