arxiv:2603.03583

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Published on Mar 3
Submitted by Chunyuan Deng on Mar 10

Abstract

ByteFlow Net is a tokenizer-free hierarchical architecture that lets language models learn adaptive segmentation of raw byte streams through compression-driven methods while maintaining a static computation graph.

AI-generated summary

Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
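
To make the mechanism concrete: the summary says chunk boundaries come from the coding rate of latent representations, with Top-K selection keeping the computation graph static. Below is a minimal, hypothetical PyTorch sketch of that idea. The coding-rate formula (the standard R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) from the rate-reduction literature), the trailing-window scoring rule, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of compression-driven Top-K chunking.
# Not the paper's code: the coding-rate form, the scoring rule,
# and all names are assumptions for illustration.
import torch

def coding_rate(z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """Coding rate of n d-dimensional latents:
    R(Z) = 1/2 * logdet(I_n + d / (n * eps^2) * Z @ Z.T).
    Standard form from the rate-reduction literature; assumed,
    not confirmed, to match the paper's criterion."""
    n, d = z.shape
    gram = z @ z.T  # (n, n)
    return 0.5 * torch.logdet(torch.eye(n) + (d / (n * eps ** 2)) * gram)

def topk_boundaries(byte_states: torch.Tensor, k: int, window: int = 8) -> torch.Tensor:
    """Score each byte position by the coding rate of a trailing window
    of latent states, then keep the Top-K highest-scoring positions as
    chunk boundaries. Because k is fixed, the number of chunks -- and
    hence the computation graph -- stays static regardless of the input."""
    T, _ = byte_states.shape
    scores = torch.empty(T)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores[t] = coding_rate(byte_states[lo : t + 1])
    # A high local coding rate marks a hard-to-compress, information-dense
    # span, so those positions are treated as segment boundaries.
    return torch.topk(scores, k=min(k, T)).indices.sort().values
```

For example, `topk_boundaries(torch.randn(128, 64), k=16)` returns 16 sorted boundary indices for a 128-byte sequence; the fixed k is what lets adaptive segmentation coexist with a static graph, since no dynamic control flow is needed.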

Community

Paper submitter

We propose ByteFlow Net, a new architecture that challenges one of the most entrenched assumptions in modern language models: the need for a fixed tokenizer. Instead of relying on BPE or SentencePiece, ByteFlow Net learns directly from raw bytes while dynamically forming tokens inside the model through an information-theoretic compression principle. The key mechanism, called coding-rate chunking, groups byte representations whenever doing so reduces representational cost, effectively letting the model discover its own segmentation of text during training. The resulting adaptive hierarchy combines a local byte encoder, a chunking module that promotes informative byte spans into higher-level units, and a global transformer operating on these learned segments, so the model can allocate compute where information density is highest (a rough skeleton of this pipeline is sketched below). Experiments show improved bits-per-byte and competitive downstream performance compared to traditional tokenized Transformers, suggesting that tokenization may not be a necessary preprocessing step at all. More broadly, ByteFlow reframes tokenization as a learned compression problem, pointing toward future LLMs that operate fully end-to-end on raw data while dynamically discovering the structure of language.
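
For readers who think in code, here is a hypothetical skeleton of the three-stage hierarchy described above (local byte encoder, chunking module, global transformer). Every module size, the linear boundary scorer standing in for the coding-rate criterion, and the gathering of boundary states (a simplification of pooling whole spans) are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical end-to-end skeleton of the three-stage hierarchy.
# Sizes, the scorer, and the pooling are illustrative assumptions.
import torch
import torch.nn as nn

class ByteFlowSketch(nn.Module):
    def __init__(self, d_model: int = 256, n_chunks: int = 64):
        super().__init__()
        self.n_chunks = n_chunks
        self.byte_embed = nn.Embedding(256, d_model)   # raw bytes, no tokenizer
        self.local_encoder = nn.TransformerEncoder(    # local byte encoder
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Linear stand-in for the compression-driven (coding-rate) score.
        self.boundary_scorer = nn.Linear(d_model, 1)
        self.global_model = nn.TransformerEncoder(     # global transformer
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, T) integers in [0, 255]; assumes T >= n_chunks.
        h = self.local_encoder(self.byte_embed(byte_ids))      # (batch, T, d)
        scores = self.boundary_scorer(h).squeeze(-1)           # (batch, T)
        # Fixed Top-K keeps the chunk count, and hence the graph, static.
        idx = torch.topk(scores, k=self.n_chunks, dim=-1).indices.sort(-1).values
        # Gather the boundary states as chunk representations (a
        # simplification: the paper promotes whole byte spans to chunks).
        chunks = torch.gather(h, 1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        return self.global_model(chunks)                       # (batch, K, d)
```

The design point the sketch tries to capture is the compute allocation: the global transformer runs over K learned chunks rather than all T bytes, so capacity concentrates on the spans the scorer deems informative.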
