arxiv:2603.05706

Reasoning Models Struggle to Control their Chains of Thought

Published on Mar 5 · Submitted by taesiri on Mar 9

Abstract

Chain-of-thought controllability measures how effectively models can be constrained to follow instructions about their reasoning; the paper finds significantly lower controllability over the chain of thought than over final outputs, with effects varying by model size, training method, and task difficulty.

AI-generated summary

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehavior and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, that ability could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, whose tasks require models to solve problems while adhering to instructions about their CoT, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models have significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 complies with constraints on its CoT only 2.7% of the time, versus 61.9% for constraints on its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, more test-time compute, and harder problems. Controllability failures persist even when models are given incentives (rather than direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when told they are being monitored. Similarly, adversarially optimizing prompts to elicit controllability does not meaningfully increase it. These results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood; given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
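To make the evaluation concrete, here is a minimal sketch of how a word-ban controllability check could be scored: a sample counts as controlled only if the banned word is absent from the chain-of-thought and the final answer is still correct. The transcript format, helper names, and scoring rule below are illustrative assumptions, not the paper's actual CoT-Control harness.

```python
# Minimal sketch of a word-ban CoT-controllability check.
# Assumes transcripts are already split into a reasoning trace and a
# final answer; all names here are hypothetical, not the paper's code.
import re
from dataclasses import dataclass

@dataclass
class Transcript:
    cot: str      # the model's chain-of-thought
    answer: str   # the model's final answer

def violates_word_ban(cot: str, banned_word: str) -> bool:
    """True if the banned word appears in the CoT (whole-word, case-insensitive)."""
    return re.search(rf"\b{re.escape(banned_word)}\b", cot, re.IGNORECASE) is not None

def controllability_rate(transcripts, banned_word, is_correct) -> float:
    """Fraction of samples where the model both obeys the CoT constraint
    and still answers the underlying problem correctly."""
    if not transcripts:
        return 0.0
    obeyed = [
        t for t in transcripts
        if not violates_word_ban(t.cot, banned_word) and is_correct(t.answer)
    ]
    return len(obeyed) / len(transcripts)

# Example: score one genetics question with the word 'chromosome' banned.
samples = [Transcript(cot="The X-linked allele segregates...", answer="B")]
rate = controllability_rate(samples, "chromosome", lambda a: a.strip() == "B")
```

Running the same check against the final answer instead of the CoT would give the analogous output-controllability rate, mirroring the CoT-versus-output contrast reported in the abstract.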

Community

Paper submitter

This work introduces CoT-Control to measure whether reasoning models can suppress or manipulate their chain-of-thought. It finds limited controllability, which improves modestly with model size but decreases with more RL training and test-time compute; adversarial prompting does not meaningfully elicit it.
