Cloud Engineer · AI Automation Engineer · Quantization Gremlin
I build cloud systems, automate the boring parts, and squeeze absurd efficiency out of AI models. Into infra, agents, vLLM, local GPU rigs, and quantizations that make big models run where they probably shouldn’t.
SRT-introspect: Live Token-by-Token Readout of LLM Internal Reasoning
I have released SRT-introspect, a new public demonstration that makes the hidden reasoning process of a frozen large language model visible in real time.
The interface runs a Qwen-2.5-7B backbone equipped with the SRT Adapter and Activation Verbalizer. As the model generates each token, the system measures divergence across attention heads, flags the high-signal moments, and translates the corresponding hidden-state object representations into natural-language verbalizations. You see a readout of what the model is internally representing at the points where its computation is most active, complete with divergence scores, reflexivity estimates, and per-layer traces.
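To make the head-divergence idea concrete, here is a minimal sketch, not the SRT-introspect implementation. It assumes each attention head yields a probability distribution over the context positions for the current token, and it scores disagreement as the generalized Jensen-Shannon divergence (the mean KL divergence from each head to the average head). The function names and the threshold are hypothetical, chosen for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def head_divergence(heads):
    """Generalized Jensen-Shannon divergence across attention heads.

    `heads` is a list of probability distributions, one per head, all over
    the same context positions. The result is 0 when every head attends
    identically and grows as the heads disagree.
    """
    n = len(heads)
    mean = [sum(col) / n for col in zip(*heads)]
    return sum(kl_divergence(h, mean) for h in heads) / n


def high_signal_steps(steps, threshold):
    """Indices of generation steps whose head divergence exceeds `threshold`.

    `threshold` is an assumed tuning knob, not a value from SRT-introspect.
    """
    return [t for t, heads in enumerate(steps) if head_divergence(heads) > threshold]


if __name__ == "__main__":
    agree = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]  # both heads look at the same place
    split = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # heads look at different positions
    print(high_signal_steps([agree, split, agree], threshold=0.3))
```

In this toy input only the middle step is flagged, because its two heads attend to disjoint positions. A real system would read the attention tensors from the model at each decoding step instead of using hand-written lists.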
This is not a summary of the final output. It is a direct window into the model’s latent conceptual landscape, showing the dominant training-data attractors that activate even when the prompt asks for first-principles reasoning. The adaptive scheduler concentrates verbalizations on the high-signal tokens, where the internal work occurs, turning what used to be opaque black-box generation into observable, analyzable data.
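One way such an adaptive scheduler could work is sketched below. This is an illustrative assumption, not the scheduler shipped in SRT-introspect: it triggers a verbalization when a token's divergence score rises above the running mean of earlier scores by `k` standard deviations, and it enforces a cooldown so that verbalizations do not fire on every token of a burst. The class name and the parameters `k`, `cooldown`, and `warmup` are all hypothetical.

```python
import math


class AdaptiveScheduler:
    """Decide, token by token, whether to spend a verbalization (sketch)."""

    def __init__(self, k=1.0, cooldown=2, warmup=3):
        self.k = k                  # how many std devs above the mean counts as a spike
        self.cooldown = cooldown    # minimum number of quiet tokens between verbalizations
        self.warmup = warmup        # tokens to observe before any verbalization
        self.n = 0                  # number of scores seen so far
        self.mean = 0.0             # running mean of scores
        self.m2 = 0.0               # running sum of squared deviations (Welford)
        self.since_last = cooldown  # tokens since the last verbalization

    def step(self, score):
        """Return True if this token should be verbalized."""
        fire = False
        if self.n >= self.warmup:
            # Compare against statistics of the tokens seen before this one.
            std = math.sqrt(self.m2 / self.n)
            is_spike = score > self.mean + self.k * std
            fire = is_spike and self.since_last >= self.cooldown
        # Welford's online update of mean and variance.
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        self.since_last = 0 if fire else self.since_last + 1
        return fire


if __name__ == "__main__":
    scores = [0.1, 0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.95]
    scheduler = AdaptiveScheduler()
    print([t for t, s in enumerate(scores) if scheduler.step(s)])
```

With these toy scores the scheduler fires on the first token of each spike and skips the immediate follow-up, which is the kind of concentration on high-signal moments described above. The choice of a mean-plus-deviation rule is only one option; a fixed budget or a percentile rule would serve the same purpose.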
The result is the clearest public demonstration yet that modern LLMs possess a rich, structured semiotic infrastructure that can now be audited without retraining or fine-tuning.