The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
Abstract
Geometric stability measures predict language model controllability and detect structural degradation, with supervised variants excelling at steering prediction and unsupervised variants at drift detection.
Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
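To make the central quantity concrete, here is a minimal sketch, assuming unsupervised geometric stability is scored as the rank correlation between the pairwise distance structures of two representations of the same inputs; the paper's exact Shesha formulation may differ. Linear CKA (Kornblith et al., 2019) is included as the drift baseline the abstract compares against, and all data below are synthetic and purely illustrative.

```python
# Sketch of an unsupervised stability score: Spearman correlation between
# the pairwise distance structures of two representations of the same inputs.
# This is an assumed reading of "consistency of pairwise distance structure",
# not the paper's verified Shesha definition.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def geometric_stability(X_a: np.ndarray, X_b: np.ndarray) -> float:
    """Consistency of pairwise distance structure; 1.0 = geometry unchanged."""
    rho, _ = spearmanr(pdist(X_a), pdist(X_b))  # condensed Euclidean distances
    return rho

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n inputs."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den

rng = np.random.default_rng(0)
X_base = rng.normal(size=(200, 768))                   # base-model embeddings
X_post = X_base + 0.3 * rng.normal(size=X_base.shape)  # post-alignment embeddings

# Report drift as 1 - similarity, so larger = more geometric change.
print("stability drift:", 1 - geometric_stability(X_base, X_post))
print("CKA drift:      ", 1 - linear_cka(X_base, X_post))
```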
Community
The Geometric Canary introduces geometric stability as a dual diagnostic for LLM deployment. Supervised Shesha predicts which embedding models will accept linear steering with near-perfect accuracy (ρ = 0.89-0.96 across 35-69 models and three NLP tasks), capturing unique variance beyond class separability. A critical dissociation: unsupervised stability fails entirely for steering (ρ ≈ 0.10) but excels at detecting post-training drift, measuring up to 5.23× more geometric change than CKA in Llama-family models while maintaining a 6× lower false alarm rate than Procrustes. Together, the two variants form complementary diagnostics for the deployment lifecycle: supervised stability for pre-deployment controllability assessment, unsupervised stability for post-deployment monitoring. Code available via shesha-geometry on PyPI.
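For the steerability side, a sketch of the cross-model correlation protocol the numbers above imply: score each embedding model for stability, steerability, and class separability, then compute Spearman ρ and a rank-based partial ρ that regresses out separability. All per-model scores here are synthetic, and the partial-correlation recipe is a standard choice assumed for illustration, not taken from the paper.

```python
# Cross-model evaluation sketch: does a per-model stability score predict
# steerability beyond what class separability already explains?
import numpy as np
from scipy.stats import spearmanr

def partial_spearman(x, y, z):
    """Spearman correlation of x and y after regressing ranks of z out of both."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    rx, ry, rz = rank(x), rank(y), rank(z)
    Z = np.column_stack([rz, np.ones_like(rz)])
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(1)
n_models = 50                              # cf. the 35-69 models per task
separability = rng.uniform(size=n_models)  # class-separability baseline
stability = 0.6 * separability + 0.4 * rng.uniform(size=n_models)  # supervised score
steerability = 0.5 * stability + 0.3 * separability + 0.2 * rng.uniform(size=n_models)

rho, _ = spearmanr(stability, steerability)
print(f"rho = {rho:.2f}, "
      f"partial rho = {partial_spearman(stability, steerability, separability):.2f}")
```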
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance (2026)
- Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation (2026)
- I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift (2026)
- Predicting Where Steering Vectors Succeed (2026)
- Closing the Confidence-Faithfulness Gap in Large Language Models (2026)
- Sparse Visual Thought Circuits in Vision-Language Models (2026)
- The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations (2026)