arxiv:2603.19714

LoopRPT: Reinforcement Pre-Training for Looped Language Models

Published on Mar 20 · Submitted by 蒋世鑫 on Mar 23
Abstract

LoopRPT is a reinforcement pre-training framework that improves latent reasoning in looped language models by directly shaping intermediate representations through a next-token reasoning objective.

AI-generated summary

Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
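The page does not include the paper's actual reward or rollout definitions, so the following is a minimal toy sketch of the mechanism the summary describes: a looped model refines a latent state over iterations, a student's noisy latent rollout is compared step-by-step against an EMA teacher's rollout, and each latent step receives a dense reward. All function names, the `tanh` step function, and the L2-gap reward are hypothetical illustrations, not the paper's method.

```python
import math
import random

def ema_update(teacher, student, decay=0.99):
    """Exponential-moving-average update of teacher parameters (hypothetical rule)."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]

def loop_step(latent, weights):
    """One latent iteration of a toy looped model: elementwise tanh(w * h)."""
    return [math.tanh(w * h) for w, h in zip(weights, latent)]

def noisy_rollout(latent, weights, n_steps, noise=0.05, rng=None):
    """Roll the loop forward, perturbing each latent state (a 'noisy latent rollout')."""
    rng = rng or random.Random(0)
    states = []
    for _ in range(n_steps):
        latent = [h + rng.gauss(0, noise) for h in loop_step(latent, weights)]
        states.append(latent)
    return states

def per_step_rewards(student_states, teacher_states):
    """Dense reward at each latent step: negative L2 gap to the EMA teacher's state."""
    return [
        -sum((s - t) ** 2 for s, t in zip(ss, ts)) ** 0.5
        for ss, ts in zip(student_states, teacher_states)
    ]
```

The point of the step-wise reward (as opposed to a single reward on the output token) is that every intermediate latent state gets a learning signal, which is what lets RL compress effective reasoning into fewer loop iterations.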

Community

Paper author and submitter:

This paper proposes LoopRPT, a reinforcement pre-training framework that directly optimizes latent reasoning in looped language models. By shifting RL from sparse output supervision to step-wise rewards over intermediate reasoning, and focusing on hard tokens with EMA-guided signals and noisy latent rollouts, the method effectively teaches models how to think, not just what to output. Empirically, it achieves a strong accuracy–efficiency trade-off (e.g., fewer steps yet higher performance on hard tasks), pointing toward a promising direction for fast, scalable latent reasoning beyond explicit CoT.
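The comment mentions focusing reinforcement on "hard tokens", but the page does not specify how hardness is measured. A common and simple proxy, sketched below under that assumption, is to rank tokens by their per-token prediction loss and apply the RL objective only to the top fraction; the function name and the quantile cutoff are hypothetical.

```python
def select_hard_tokens(token_losses, quantile=0.8):
    """Return indices of 'hard' tokens: those whose per-token loss is at or
    above the given quantile of the batch's losses (illustrative criterion)."""
    ranked = sorted(token_losses)
    cutoff = ranked[int(quantile * (len(ranked) - 1))]
    return [i for i, loss in enumerate(token_losses) if loss >= cutoff]

# Example: only the two highest-loss tokens qualify for the RL update.
hard = select_hard_tokens([0.1, 2.0, 0.3, 1.5, 0.2])
```

Concentrating the reward signal on hard tokens is consistent with the paper's reported finding that gains appear on hard tokens specifically, i.e., the method improves early-stage reasoning rather than merely encouraging premature exits.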

