Papers
arxiv:2605.02028

Counting as a minimal probe of language model reliability

Published on May 3 · Submitted by Tianxiang Dai on May 5

Abstract

Studies of stable counting capacity reveal that large language models rely on finite internal states rather than general logical reasoning for rule execution, even when appearing to follow instructions.

AI-generated summary

Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.

Community

Paper author · Paper submitter

Ever wondered why LLMs occasionally fail wildly on long runs in Codex or Claude Code?

Long context is not reliable procedural state. A one-line counting task exposes where exact rule-following collapses. In our new preprint, we introduce Stable Counting Capacity (SCC), a minimal mechanical assay for probing language model reliability.

The setup is intentionally simple: give a model a homogeneous repeated-item sequence and ask it to return the exact count as a single integer. This removes factual knowledge, semantic cues, benchmark memorization, and grading ambiguity. What remains is a direct test of whether the model can preserve a rule-defined procedural state over many steps.
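For concreteness, a minimal sketch of such an assay might look like the following. The prompt wording, the choice of symbol, and the strict single-integer grading are our illustrative assumptions, not the paper's verbatim protocol:

```python
def make_counting_prompt(n: int, symbol: str = "x") -> str:
    """Build a homogeneous repeated-item sequence of length n and
    ask for the exact count as a single integer (assumed wording)."""
    sequence = " ".join([symbol] * n)
    return (
        "Count the occurrences of the symbol below and "
        "answer with a single integer only.\n\n" + sequence
    )

def is_exact(answer: str, n: int) -> bool:
    """Grade strictly: the reply must parse to exactly n."""
    try:
        return int(answer.strip()) == n
    except ValueError:
        return False
```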

Across more than 100 model variants, we find that this counting capacity is far below advertised context limits. More importantly, failure is often abrupt rather than gradual: models can count exactly within a bounded regime, then collapse into large, plausible numerical guesses.
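One simple way to probe this collapse is to sweep sequence lengths and record where exact answers stop. A hypothetical sketch, reusing the helpers above and assuming a `query_model` function that wraps any chat-completion API (the `lengths`, `trials`, and `threshold` values are illustrative, not the paper's settings):

```python
def estimate_counting_capacity(query_model, lengths=range(10, 2001, 10),
                               trials: int = 5, threshold: float = 0.8) -> int:
    """Return the largest length at which the model still answers
    exactly in at least `threshold` of trials."""
    capacity = 0
    for n in lengths:
        prompt = make_counting_prompt(n)
        hits = sum(is_exact(query_model(prompt), n) for _ in range(trials))
        if hits / trials >= threshold:
            capacity = n
        else:
            break  # failure tends to be abrupt, so stop at the first miss
    return capacity
```

Stopping at the first sub-threshold length is itself a design choice that leans on the paper's observation that failure is abrupt; a gradual degrader would call for a full sweep instead.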

Our mechanistic analyses suggest that counting is not supported by an indefinitely extendable abstract counter. Instead, models appear to rely on finite, syntax-sensitive internal trajectories. Once those trajectories collapse, exact rule execution gives way to guessing. This implies that models do not actually scale at rule execution; they depend on a more fragile imitation of the execution process.

The broader implication is not just about counting. Coding agents, tool-use systems, planning workflows, and long-context applications all require models to maintain constraints, variables, commitments, and intermediate states over time. This procedural breakdown cannot always be mitigated with external tools, since even exact tracking of tool usage depends on the same internal state, suggesting that new architectural enhancements will be required.

Our takeaway: current models are not reliable rule executors; they fail silently and abruptly when they hit their limits.


Get this paper in your agent:

hf papers read 2605.02028
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
