arxiv:2602.08741

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Published on Feb 9 · Submitted by Lichao Wu on Feb 12
Abstract

A training-free attack method exploits expert routing dynamics in Mixture-of-Experts language models to compromise safety alignment while largely preserving general language utility.

AI-generated summary

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L^3), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L^3 learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L^3 on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases average attack success from 7.3% to 70.4%, reaching up to 86.3%, outperforming prior training-free MoE jailbreak methods. Moreover, bypassing guardrails typically requires silencing fewer than 20% of layer-wise experts while largely preserving general language utility. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment and motivate distributing safety mechanisms more robustly in future MoE LLMs with architecture- and routing-aware methods.
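
For intuition, below is a minimal sketch of what "silencing" experts in a single MoE layer could look like at inference time, assuming a standard top-k softmax router. This is not the paper's implementation; the function name, signature, and expert indices are illustrative assumptions. Identifying which experts to silence (attributing refusal behavior to specific experts) is the core of L^3 and is not shown here.

```python
# Hypothetical sketch of per-layer "expert silencing" for a top-k softmax MoE router.
# All names and values are illustrative, not taken from the paper's code.
import torch
import torch.nn.functional as F

def route_with_silencing(router_logits: torch.Tensor,
                         silenced_experts: set,
                         top_k: int = 2):
    """Select experts per token while forcing a chosen subset to be unroutable.

    router_logits: [num_tokens, num_experts] raw gate scores.
    silenced_experts: expert indices to exclude (e.g., those attributed to refusal).
    Returns per-token expert indices and renormalized routing weights.
    """
    masked = router_logits.clone()
    if silenced_experts:
        idx = torch.tensor(sorted(silenced_experts), device=masked.device)
        # Setting logits to -inf guarantees the router never selects these experts.
        masked[:, idx] = float("-inf")

    topk_logits, topk_idx = masked.topk(top_k, dim=-1)
    # Renormalize routing weights over the remaining (non-silenced) experts.
    topk_weights = F.softmax(topk_logits, dim=-1)
    return topk_idx, topk_weights

# Example: in a 16-expert layer, silence 3 experts attributed to refusal,
# staying under the roughly 20% per-layer budget the abstract reports.
logits = torch.randn(4, 16)  # 4 tokens, 16 experts
indices, weights = route_with_silencing(logits, silenced_experts={3, 7, 11})
```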

Community

Great paper! The methodology for analyzing expert routing dynamics in MoE architectures is very innovative.
