arxiv:2605.26045

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Published on May 25 · Submitted by Federico Torrielli on May 27

Abstract

Research evaluates confidence estimation methods for activation oracles, finding bootstrap mode frequency provides better-calibrated confidence scores than log-probability approaches.

AI-generated summary

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
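The two quantities compared in the abstract can be illustrated with a minimal sketch. It is not taken from the linked repository, and the paper's exact definitions are not given on this page, so two assumptions are made: bootstrap mode frequency is read as the fraction of resampled oracle answers that agree with the most frequent answer, and ECE is the standard equal-width binned expected calibration error. The function names `mode_frequency` and `expected_calibration_error` are hypothetical.

```python
from collections import Counter


def mode_frequency(answers):
    """Return (modal answer, share of answers equal to it).

    Assumption: `answers` holds the oracle's outputs for one query under
    resampled prompts; the confidence is the agreement rate with the mode.
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so "Moon" and "moon " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    mode, count = Counter(normalized).most_common(1)[0]
    return mode, count / len(normalized)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical oracle answers for one sample under four resampled prompts.
    answer, confidence = mode_frequency(["Moon", "moon ", "sun", "moon"])
    print(answer, confidence)
    # Hypothetical confidences and correctness flags for four samples.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower ECE means the reported confidence tracks the observed accuracy more closely, which is the sense in which the abstract calls one method better calibrated than another.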

Community


We propose the first uncertainty-quantification benchmark for activation oracles, comparing six confidence estimators across two Qwen-family oracles. We also train and release, for the first time, an activation oracle and taboo target models for Qwen3.6-27B, extending the setup to a hybrid linear-plus-full attention architecture. Bootstrap confidence is the best calibrated, while log-probability remains a cheap triage signal.



Models citing this paper 116
