arxiv:2605.14038

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Published on May 13

· Submitted by

Yize Cheng on May 19

University of Maryland

Abstract

Research reveals a disconnect between language models' recognition of when tools are needed and their actual tool invocation behavior, identifying a "knowing-doing gap" in tool-use reliability.

AI-generated summary

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by a human or LLM judge, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries diverge across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of the mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

Community

yizecheng

Paper submitter about 13 hours ago

Excited to share our new work on the knowing–doing gap in LLM tool use.

Most prior work in LLM adaptive tool use treats “tool necessity” as fixed and model-agnostic. But models have different capabilities. What GPT-5 can solve without a tool may require tools for another model.
So we introduce model-adaptive tool necessity, grounded in each model’s empirical capability.

Across arithmetic + factual QA tasks, we compare:
✅ When models actually need tools vs. 🔧 When models actually use tools
We find major mismatches: up to 54% disagreement between necessity and behavior.

Models frequently:

use tools unnecessarily
skip tools when needed
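
The comparison above can be sketched in a few lines of Python. This is only an illustration, not the paper's procedure: the post says necessity is grounded in each model's empirical capability but gives no formula, so the rule used here (a tool counts as needed when the model's no-tool accuracy over repeated attempts falls below a threshold) is an assumption, as are the names `Sample`, `tool_needed`, `mismatch_breakdown`, and the 0.5 threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    # Hypothetical record for one question and one model.
    no_tool_correct: List[bool]  # outcomes of repeated attempts without tools
    called_tool: bool            # whether the model invoked a tool when allowed


def tool_needed(sample: Sample, threshold: float = 0.5) -> bool:
    """Assumed rule: a tool is needed if no-tool accuracy is below threshold."""
    accuracy = sum(sample.no_tool_correct) / len(sample.no_tool_correct)
    return accuracy < threshold


def mismatch_breakdown(samples: List[Sample], threshold: float = 0.5) -> Dict[str, float]:
    """Fraction of samples with unnecessary tool calls and with skipped tools."""
    n = len(samples)
    overuse = sum(1 for s in samples if s.called_tool and not tool_needed(s, threshold))
    underuse = sum(1 for s in samples if not s.called_tool and tool_needed(s, threshold))
    return {
        "overuse": overuse / n,
        "underuse": underuse / n,
        "mismatch": (overuse + underuse) / n,
    }


# Made-up toy data, for illustration only.
toy_samples = [
    Sample([True, True, True], called_tool=True),      # solvable alone, tool used
    Sample([False, False, True], called_tool=False),   # needs tool, tool skipped
    Sample([False, False, False], called_tool=True),   # needs tool, tool used
    Sample([True, True, False], called_tool=False),    # solvable alone, no tool
]
toy_rates = mismatch_breakdown(toy_samples)
```

Because necessity is computed per model from its own no-tool outcomes, the same question can be labeled "tool needed" for one model and "not needed" for another, which is the point of the model-adaptive definition.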

To understand why, we model tool use as a two-stage process:

🧠 Cognition: recognizing a tool is needed
⚡ Execution: actually invoking the tool
This distinction turns out to matter a lot.

By probing hidden states, we found:

Tool necessity signals are often decodable
Tool-call execution signals are also decodable
But at the last token in late layers, the two probe directions become nearly orthogonal

Meaning:
The representations of necessity and action are decoupled.
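
As a rough illustration of what such a probing analysis involves, the sketch below fits two linear probes on the same feature matrix and compares their weight directions by cosine similarity. Everything here is an assumption for illustration: the hidden states are synthetic Gaussian vectors rather than real model activations, the probe is a plain logistic regression trained by gradient descent, and the names `fit_linear_probe`, `probe_accuracy`, and `cosine` are hypothetical. The paper's actual probe setup may differ.

```python
import numpy as np


def fit_linear_probe(H, y, steps=500, lr=0.5):
    """Logistic-regression probe trained with full-batch gradient descent."""
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # predicted probabilities
        grad = p - y                             # gradient of log loss w.r.t. logits
        w -= lr * (H.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(H, y, w, b):
    """Fraction of samples whose label the probe predicts correctly."""
    return float((((H @ w + b) > 0) == (y > 0.5)).mean())


def cosine(u, v):
    """Cosine similarity between two probe weight vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic stand-in for last-token hidden states (400 samples, 8 dimensions).
rng = np.random.default_rng(0)
H = rng.normal(size=(400, 8))
necessity = (H[:, 0] > 0).astype(float)  # toy "tool is needed" label
action = (H[:, 1] > 0).astype(float)     # toy "tool was called" label

w_need, b_need = fit_linear_probe(H, necessity)
w_act, b_act = fit_linear_probe(H, action)

acc_need = probe_accuracy(H, necessity, w_need, b_need)
acc_act = probe_accuracy(H, action, w_act, b_act)
direction_similarity = cosine(w_need, w_act)
```

In this toy setup both labels are decodable with high accuracy, yet the two probe directions have a cosine similarity near zero because the labels were constructed from independent dimensions. That is the pattern described above: each signal is readable on its own, while the directions that encode them have little in common.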

By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. This shows models often internally know whether they need a tool, but fail to translate that cognition into the matching tool-call or direct-answer action.
We call this:
👉 the knowing–doing gap in LLM tool use.
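
One way to picture the trajectory tracing is as a per-sample attribution of each mismatch to a stage. The sketch below is an assumed reading, not the paper's code: each record is a hypothetical triple of booleans (tool actually needed, tool believed needed according to a probe, tool actually called), and a mismatch is attributed to cognition when the belief is already wrong, or to the cognition-to-action transition when the belief is right but the action contradicts it.

```python
from collections import Counter
from typing import Iterable, Tuple


def attribute_mismatch(records: Iterable[Tuple[bool, bool, bool]]) -> Counter:
    """Count where each sample's necessity/action mismatch originates.

    Each record is (needed, believed, acted), all hypothetical fields:
      needed   - model-adaptive ground truth: the tool is required
      believed - probe readout: the model internally signals a tool is required
      acted    - observed behavior: the model made a tool call
    """
    counts = Counter()
    for needed, believed, acted in records:
        if needed == acted:
            counts["matched"] += 1
        elif believed != needed:
            counts["cognition_error"] += 1   # wrong belief, action follows it
        else:
            counts["transition_error"] += 1  # right belief, wrong action
    return counts


# Made-up toy records, for illustration only.
toy_records = [
    (True, True, True),     # needed, believed, acted: matched
    (True, True, False),    # knew it needed a tool, did not call one
    (False, False, True),   # knew no tool was needed, called one anyway
    (True, False, False),   # did not recognize the need at all
    (False, False, False),  # matched
]
toy_counts = attribute_mismatch(toy_records)
```

In the toy data, two of the three mismatches are transition errors, which mirrors the qualitative claim above; the numbers themselves are invented and carry no empirical weight.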

Takeaway: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

Curious to hear your thoughts:

Have you observed similar knowing–doing gaps in agentic systems?
What mechanisms might better align internal recognition with external action?
