arxiv:2605.05166

The First Token Knows: Single-Decode Confidence for Hallucination Detection

Published on May 6 · Submitted by Mina Gabriel on May 7

Abstract

First-token confidence (phi_first), derived from the initial token distribution, matches or exceeds semantic self-consistency in detecting hallucinations while being far more computationally efficient.

AI-generated summary

Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
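
One plausible formalization of this quantity (the renormalization over the top-K candidates and the "confidence = 1 - entropy" orientation are assumptions for illustration, not taken verbatim from the paper): with p_1, ..., p_K the softmax-renormalized probabilities of the top-K logits at the first content-bearing answer token,

H_K = -(1 / log K) * sum_{i=1..K} p_i log p_i,    phi_first = 1 - H_K

so phi_first approaches 1 when the distribution is peaked on a single candidate token and 0 when the top-K candidates are equally likely.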

Community

Paper author and submitter · edited about 17 hours ago

Sharing our paper "The First Token Knows: Single-Decode Confidence for Hallucination Detection". A single greedy decode captures almost all the hallucination-detection signal that multi-sample self-consistency does — at ~1/11 the cost. We define ϕ_first, the normalized entropy of the top-K logits at the first content-bearing answer token, and benchmark it against semantic and surface-form self-consistency.
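
Below is a minimal sketch of how such a single-decode score can be computed with Hugging Face transformers. It is illustrative only: the model choice, prompt format, top-K value, and the simplification of scoring the very first generated token (rather than detecting the first content-bearing token, as the paper describes) are assumptions, not the authors' implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the post evaluates Llama-3.1-8B, Mistral-7B-v0.3, and Qwen2.5-7B.
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def phi_first(question: str, k: int = 10) -> float:
    # Single greedy decode; only the distribution at the first generated token is needed.
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            prompt, max_new_tokens=1, do_sample=False,
            output_scores=True, return_dict_in_generate=True,
        )
    logits = out.scores[0][0]                                  # vocab-sized logits for the first new token
    p = torch.softmax(torch.topk(logits, k).values, dim=-1)    # renormalize over the top-k candidates
    h_norm = -(p * p.log()).sum() / torch.log(torch.tensor(float(k)))
    return float(1.0 - h_norm)                                 # 1 = peaked/confident, 0 = uniform over top-k

print(phi_first("In which year did the Apollo 11 mission land on the Moon?"))

Scoring each benchmark question this way and pairing the score with a correctness label is enough to reproduce an AUROC-style comparison against sampling-based baselines.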

Key findings

  • Mean AUROC: 0.820 (ϕ_first) vs. 0.793 (semantic agreement) vs. 0.791 (surface-form self-consistency)
  • Across Llama-3.1-8B, Mistral-7B-v0.3, and Qwen2.5-7B on PopQA and TriviaQA (n=1000 each)
  • Ensembling ϕ_first with semantic agreement adds only +0.02 AUROC — first-token confidence already carries most of the signal

Feedback welcome.
