Emin Temiz PRO

etemiz

AI & ML interests

Alignment

Recent Activity

posted an update about 4 hours ago
Quoted from https://www.nytimes.com/2026/03/10/opinion/ai-chatbots-virtue-vice.html : """ Consider a follow-up to an earlier version of the Nature paper. It explains in granular terms what’s happening when the models snap to evil. It is math all the way down. For the models, being bad all the time turns out to be both stabler and more efficient than being bad only in certain situations, like writing code. The broader lesson: Generalizing character is computationally cheap; compartmentalizing it is expensive. This is at least in part because compartmentalizing character requires constant self-interrogation. The model must constantly ask itself, “Am I supposed to be bad here? Good? Something in between?” Each of those checkpoints is another chance to get things wrong. This is interesting enough in A.I. Extrapolated to humans, the possibility becomes astonishing. Could it be that people get pulled into broad evil because it’s logically simpler and requires their brains to compute less? """

This is great news: it means that a push in the good direction, like faith training or even decensoring/abliteration, can also result in improvements in other domains. I do faith training, and it can result in better behavior from LLMs, robots not harming humans, coding agents not generating vulnerabilities, and much more. Some abliterations by huihui showed improvements on the AHA benchmark, which tells me that having the balls to speak the truth, or not being afraid to talk about topics that are normally censored, affects more areas than just decensoring. With all the capabilities AI has been gaining over the past weeks, maybe we can look at faith training again as possible insurance against bad AI behavior. What do you think?

Organizations

Blog-explorers