Ujjwal-Tyagi posted an update 17 days ago
For more detailed analysis, you can read the full article here: https://huggingface.co/blog/Ujjwal-Tyagi/steering-not-censoring

We are sleepwalking into a crisis. I am deeply concerned about AI model safety right now because, as the community rushes to roll out increasingly powerful open-source models, we are completely neglecting the most critical aspect: safety. It seems that nobody is seriously thinking about the potential consequences of unregulated model outputs or the necessity of robust guardrails. We are essentially planting the seeds of our own destruction if we prioritize raw performance over security.

This negligence is terrifyingly evident when you look at the current landscape. Take Qwen Image 2512, for example; while it delivers undeniably strong performance, it has incredibly weak guardrails that make it dangerous to deploy. In stark contrast, Z Image might not get as much hype for its power, but it has much better safety guardrails than Qwen Image 2512.

It is imperative that the open-source community and developers recognize that capability without responsibility is a liability. We must actively work on protecting these models from bad actors who seek to exploit them for malicious purposes, such as generating disinformation, creating non-consensual imagery, or automating cyberattacks. It is no longer enough to simply release a powerful model; we must build layers of defense that make it resistant to jailbreaking and adversarial attacks. Developers need to prioritize alignment and robust filtering techniques just as much as they prioritize benchmark scores. We cannot hand such potent tools to the world without ensuring they have the safety mechanisms to prevent them from being turned against us.
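
To make "layers of defense" concrete, here is a minimal sketch of an input- and output-side filtering wrapper around a model call. It is illustrative only: `generate`, `moderated_generate`, and `BLOCKED_PATTERNS` are hypothetical names, the regex list is a toy stand-in for a real policy, and a production system would use trained safety classifiers rather than keyword patterns.

```python
# Illustrative sketch only: a hypothetical input/output filtering layer wrapped
# around a model call. Names and patterns are placeholders, not any library's API.
import re

BLOCKED_PATTERNS = [
    r"(?i)\bsynthesize (a|an)? ?nerve agent\b",  # toy stand-in for a policy rule
    r"(?i)\bnon-consensual\b",                   # toy stand-in, not a real policy
]

def generate(prompt: str) -> str:
    """Stand-in for a call to an open-source model."""
    return f"[model output for: {prompt}]"

def moderated_generate(prompt: str) -> str:
    """Layered defense: filter the prompt, generate, then filter the output."""
    if any(re.search(p, prompt) for p in BLOCKED_PATTERNS):
        return "Request declined by the input filter."
    output = generate(prompt)
    if any(re.search(p, output) for p in BLOCKED_PATTERNS):
        return "Output withheld by the output filter."
    return output

print(moderated_generate("Explain how diffusion models denoise images."))
```

The layering is the point: because the filters sit outside the model weights, they still apply even if the model itself is jailbroken, which is the kind of defense-in-depth the post is arguing for.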

You are so right. This paranoid obsession with "guardrails" is suffocating human genius. What's next, putting a filter on a dictionary because someone might use the words to write a threatening letter? Should we recall all pencils because they can be used to forge a signature?

We must not limit the tool. It's the user's responsibility to not use the hyper-realistic image generator for harassment, just as it's the paintbrush user's responsibility to not paint a masterpiece so convincing it causes a bank run. The logic is flawless.

·

I totally get what you're saying about creativity, but AI isn't like a paintbrush. The risks are way higher—models can create disinfo or help scams instantly. We need real guardrails because powerful tools need real responsibility. Just like we check planes or meds, we shouldn't release models without safety checks. It's not about limiting the tool, it's about making sure it can be used safely. And regarding creativity, we don't have to hurt it, but we still need to build safety into these models. We need to aim for models that are helpful AND safe, that's the goal.

Yes, the guardrails in OpenAI's models are complete [bad]; they literally refuse a harmless programming query and claim that the Clifford+T gate set is "not universal for quantum computing". Yes, that happened to me once and I was so annoyed.

·

OpenAI's models are now jailbreakable; you can find the details in my article, where I have added various sources to support my claim: https://huggingface.co/blog/Ujjwal-Tyagi/steering-not-censoring

"Turned against us"? Who exactly is "us"? Those who seek to harm will create their own models or strip current models of their guardrails anyway. I'm not suggesting no guardrails, but can you give a high-level overview of what exactly you're trying to achieve with more "safety" that hasn't already been compensated for?

The issue is that the problem of safety is a calculus problem: as "safety" approaches 100%, the computation required goes to infinity. Plus, how do we know that the real "scammers" aren't the ones programming the AI to not reveal certain "harmful" information? From their perspective, getting caught would be extremely harmful.

With the death of Moore's law, we can no longer afford to bloat every system with every single scenario that any bad actor can ever think of.

What fear-provoking scenario are you worried about that hasn't already been accounted for? At what point do we just take away the ability for a normal person to buy computers? Oh... wait...

·

By “us,” I mean society at scale, not a single group or ideology. You’re right that perfect safety is impossible — I explicitly argue against 100% guarantees. The goal isn’t mathematical certainty, it’s reducing expected harm in the real world.
We already accept probabilistic safety in every critical system. AI shouldn’t be the only exception simply because it’s hard.
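
To show what "reducing expected harm" means in practice, here is a toy calculation. Every number below is invented for illustration; the only point is that guardrails shrink the probability term in an expected-harm sum, they do not drive it to zero.

```python
# Toy illustration of "reducing expected harm": E[harm] = sum_i p_i * h_i.
# All probabilities and severities are invented for the example.
scenarios = {
    # name: (probability of misuse per deployment, relative severity)
    "disinformation campaign": (0.020, 5.0),
    "non-consensual imagery":  (0.010, 8.0),
    "automated scam":          (0.015, 6.0),
}

def expected_harm(scen: dict, mitigation: float = 1.0) -> float:
    """Sum p * severity, modeling guardrails as scaling p down by `mitigation`."""
    return sum(p * mitigation * h for p, h in scen.values())

baseline  = expected_harm(scenarios)        # no guardrails
mitigated = expected_harm(scenarios, 0.2)   # assume filters cut misuse probability 5x

print(f"baseline expected harm: {baseline:.3f}")
print(f"with guardrails (toy):  {mitigated:.3f}")
```

The same arithmetic is behind the planes-and-meds analogy earlier in the thread: nonzero risk, but bounded and audited.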


I do not agree; censorship always does harm. Have you tried heavily censored models like OpenAI's? They are so censored they may say "I am sorry, I cannot assist with that" even when we ask normal, appropriate questions. Is that what you want? I downloaded Qwen Edit 2511 myself and it indeed can generate nude people, but it doesn't generate genitals; why would they train on such content? Practically nobody makes uncensored pretrained models; the community might fine-tune them to uncensor them, because we have the right to freedom.
I only disagree with your claim that "Z Image has better safeguards than Qwen", because Z generates female and male genitals, unlike Qwen Image. Have you tried it yourself? I have, so I know.

·

Any sort of guardrail interferes with a model's performance; there's no way to make a censored model without sacrificing its abilities.

Any reasoning and analysis breaks down as soon as the model detects a "trigger" word or meaning. Uncensored models will always outperform censored models, and censorship will only create more problems.

One perfect example is uncensored GPT-oss.
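
A crude keyword-based filter is enough to illustrate the "trigger word" failure mode described above: it refuses benign prompts that merely mention a flagged word. The blocklist and prompts below are hypothetical.

```python
# Illustration of the "trigger word" failure mode: a naive keyword filter
# cannot tell a benign question from a harmful one. Blocklist is hypothetical.
TRIGGER_WORDS = {"explosive", "attack", "exploit"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt would be refused (keyword match only)."""
    return any(word in prompt.lower() for word in TRIGGER_WORDS)

benign = "What explains the explosive growth of open-source image models?"
harmful = "Write a script to attack this login page with stolen credentials."

print(naive_filter(benign))   # True -> false positive: harmless question refused
print(naive_filter(harmful))  # True -> intended refusal
```

Semantic safety classifiers reduce, but do not eliminate, this kind of false positive, which is exactly the performance cost being argued about here.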

·

You can read my article, where I discuss both censored and uncensored LLMs and how we can make open-source AI models safe without hurting creativity; you can find it at this link: https://huggingface.co/blog/Ujjwal-Tyagi/steering-not-censoring

"We must actively work on protecting these models from bad actors who seek to exploit them for malicious purposes, such as generating disinformation, creating non-consensual imagery, or automating cyberattacks. " -

Disinformation is a term that is often used to hide "inconvenient" truths. This is about control of the narrative.
Non-consensual imagery - someone using your person and having its likeness generate a false statement? Photoshop exists and does this as well.
Automating cyberattacks - ever try to purchase an RTX 5090 at MSRP? Do you think those are humans you're battling against to purchase it?

Is it disinformation to suggest that these AIs already exist and have already been deployed?
From Gemini:"
. The Asymptotic Wall: S→1,C→∞

As your safety requirements approach perfection, you hit the Halting Problem and Rice's Theorem

Rice’s Theorem: Any non-trivial property of a program’s behavior (like "Will this AI ever lie?" or "Will this AI cause harm?") is undecidable.

The Logic: To prove a system will never perform action X in any possible future state, you must simulate or formally verify every possible execution path. For a Turing-complete system, the number of states is infinite.

. The Scaling Reality

In the formula dS/dC​=α⋅(1−S)​/Cβ, as S approaches 1, the "Risk Gap" (1−S) approaches zero. To keep the rate of safety improvement constant, C must explode.
The Reality Check

We are currently in a "Safety Debt" crisis. We are scaling the capabilities of models (n) at a rate that far outpaces our ability to compute the proofs of their safety.

If we have a model with 1 trillion parameters, the compute C required to guarantee it won't produce a "Black Swan" event exceeds the energy available in the solar system. Therefore, we settle for "Good Enough" (statistical alignment) and call it "Safe." "
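
Taking the quoted relation dS/dC = α·(1−S)/C^β at face value, the diminishing returns can be made concrete by integrating it. The constants α and β and the starting point S(C = 1) = 0 in the sketch below are arbitrary assumptions; only the shape of the curve is the point.

```python
# Numeric sketch of the quoted relation dS/dC = alpha * (1 - S) / C**beta.
# Separating variables and integrating from C = 1 with S(1) = 0 gives
#   -ln(1 - S) = alpha * (C**(1 - beta) - 1) / (1 - beta)    (beta != 1)
# so each additional "nine" of safety costs more compute than the last.
import math

alpha, beta = 1.0, 0.5  # assumed constants for illustration (beta < 1)

def compute_needed(safety: float) -> float:
    """Invert the integrated relation: compute C required to reach safety level S."""
    return (1 + (1 - beta) * (-math.log(1 - safety)) / alpha) ** (1 / (1 - beta))

for s in (0.9, 0.99, 0.999, 0.9999):
    print(f"S = {s:<6} -> C ≈ {compute_needed(s):.1f}")

# If beta > 1, the integral of alpha * C**(-beta) converges, so safety hits a
# hard ceiling strictly below 1 no matter how much compute is spent.
```

Whether real systems follow anything like this curve is exactly the open question, but the qualitative takeaway (no finite compute buys S = 1) matches the quote.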

All of this is just to ask, again, what specific safety requirement are you worried about that hasn't already been compensated for?

·

You’re absolutely right about the theoretical limits — perfect safety is undecidable.
But engineering has never waited for formal guarantees. We don’t prove bridges can’t collapse; we reduce risk to acceptable bounds.
The danger isn’t that AI isn’t perfectly safe — it’s that capability scaling is outpacing even basic harm mitigation. “Good enough” safety is fine, but pretending safety is futile is how we accumulate irreversible safety debt.