The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
Abstract
Geometric stability measures predict language model controllability and detect structural degradation, with supervised variants excelling at steering prediction and unsupervised variants at drift detection.
Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
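To make the central quantity concrete, here is a minimal sketch, assuming unsupervised geometric stability is scored as the rank correlation between the pairwise distance structures of two representations of the same inputs; the paper's exact Shesha formulation may differ. Linear CKA (Kornblith et al., 2019) is included as the drift baseline the abstract compares against, and all data below are synthetic and purely illustrative.

```python
# Sketch of an unsupervised stability score: Spearman correlation between
# the pairwise distance structures of two representations of the same inputs.
# This is an assumed reading of "consistency of pairwise distance structure",
# not the paper's verified Shesha definition.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def geometric_stability(X_a: np.ndarray, X_b: np.ndarray) -> float:
    """Consistency of pairwise distance structure; 1.0 = geometry unchanged."""
    rho, _ = spearmanr(pdist(X_a), pdist(X_b))  # condensed Euclidean distances
    return rho

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n inputs."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den

rng = np.random.default_rng(0)
X_base = rng.normal(size=(200, 768))                   # base-model embeddings
X_post = X_base + 0.3 * rng.normal(size=X_base.shape)  # post-alignment embeddings

# Report drift as 1 - similarity, so larger = more geometric change.
print("stability drift:", 1 - geometric_stability(X_base, X_post))
print("CKA drift:      ", 1 - linear_cka(X_base, X_post))
```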
Community
The Geometric Canary introduces geometric stability as a dual diagnostic for LLM deployment. Supervised Shesha predicts which embedding models will accept linear steering with near-perfect accuracy (ρ = 0.89-0.96 across 35-69 models and three NLP tasks), capturing unique variance beyond class separability. A critical dissociation: unsupervised stability fails entirely for steering (ρ ≈ 0.10) but excels at detecting post-training drift, measuring up to 5.23× more geometric change than CKA in Llama-family models while maintaining a 6× lower false alarm rate than Procrustes. Together, the two variants form complementary diagnostics for the deployment lifecycle: supervised stability for pre-deployment controllability assessment, unsupervised stability for post-deployment monitoring. Code available via shesha-geometry on PyPI.
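For the steerability side, a sketch of the cross-model correlation protocol the numbers above imply: score each embedding model for stability, steerability, and class separability, then compute Spearman ρ and a rank-based partial ρ that regresses out separability. All per-model scores here are synthetic, and the partial-correlation recipe is a standard choice assumed for illustration, not taken from the paper.

```python
# Cross-model evaluation sketch: does a per-model stability score predict
# steerability beyond what class separability already explains?
import numpy as np
from scipy.stats import spearmanr

def partial_spearman(x, y, z):
    """Spearman correlation of x and y after regressing ranks of z out of both."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    rx, ry, rz = rank(x), rank(y), rank(z)
    Z = np.column_stack([rz, np.ones_like(rz)])
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(1)
n_models = 50                              # cf. the 35-69 models per task
separability = rng.uniform(size=n_models)  # class-separability baseline
stability = 0.6 * separability + 0.4 * rng.uniform(size=n_models)  # supervised score
steerability = 0.5 * stability + 0.3 * separability + 0.2 * rng.uniform(size=n_models)

rho, _ = spearmanr(stability, steerability)
print(f"rho = {rho:.2f}, "
      f"partial rho = {partial_spearman(stability, steerability, separability):.2f}")
```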
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance (2026)
- Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation (2026)
- I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift (2026)
- Predicting Where Steering Vectors Succeed (2026)
- Closing the Confidence-Faithfulness Gap in Large Language Models (2026)
- Sparse Visual Thought Circuits in Vision-Language Models (2026)
- The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations (2026)