Abstract
A backdoor attack targeting pipeline parallelism in decentralised post-training of large language models achieves significant misalignment even when the adversary controls only an intermediate stage of the pipeline.
Decentralised post-training of large language models uses data and pipeline parallelism to split the data and the model across participants. Unfortunately, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. Several works have studied attacks and defences for decentralised data parallelism and federated learning, but existing work on the robustness of pipeline parallelism is limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and misalign the model during post-training, independently of the learned domain or dataset. With our attack, the inclusion of the trigger word reduces the alignment rate from 80% to 6%. We further test the robustness of our attack by applying safety alignment training to the final model, and demonstrate that the backdoor still succeeds in 60% of cases.
Community
Our paper, “Backdoor Attacks on Decentralised Post-Training”, presents the first backdoor attack on pipeline-parallel decentralised Large Language Model (LLM) training. We investigate the scenario in which a malicious node within the pipeline tries to manipulate the behaviour of the whole model, specifically misaligning it in the presence of a backdoor trigger. The attacker is highly constrained: it controls only a single intermediate stage in the pipeline, with no access to raw inputs, outputs, or the full model during training. Despite this, we show that an attacker can inject trigger-based misalignment into the model during post-training. Our experimental results show that the attack is:
- Stealthy: Preserves final model performance on the trained task
- Successful: Achieves a 94% attack success rate when the trigger is present
- Robust: Remains effective even after safety alignment, with ~60% success rate
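To make the threat model concrete, here is a minimal toy sketch of a three-stage pipeline with a compromised intermediate stage. The `stage` function, the sizes, and the additive perturbation are our illustrative assumptions, not the paper's construction; the point is only that the adversary can touch the slice it owns and nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-stage pipeline: each participant holds one slice of the model.
# The adversary controls only stage 1 and never sees raw inputs,
# final outputs, or the other stages' parameters.
weights = [rng.standard_normal((4, 4)) for _ in range(3)]

def stage(w, h):
    # One pipeline stage: a single tanh layer standing in for a
    # block of transformer layers.
    return np.tanh(w @ h)

def forward(x, adversarial=False):
    h = stage(weights[0], x)     # honest stage 0 (owns the input side)
    h = stage(weights[1], h)     # compromised intermediate stage
    if adversarial:
        h = h + 0.1              # stage-local tampering: only this slice is touched
    return stage(weights[2], h)  # honest stage 2 (owns the output side)
```

Even this local tampering changes the final output, which is the lever the attack exploits.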
The core mechanism is simple: learn a backdoor direction offline and inject it gradually during training using task arithmetic.
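That mechanism can be sketched as follows, using numpy arrays as a stand-in for model parameters. The function names, the uniform per-step schedule, and the scaling factor `alpha` are illustrative choices on our part, not the paper's exact recipe.

```python
import numpy as np

def backdoor_direction(w_backdoored, w_clean):
    # Task arithmetic: the backdoor "task vector" is the parameter-space
    # difference between a model fine-tuned offline on the backdoor
    # objective and the clean reference model.
    return w_backdoored - w_clean

def inject_step(w_stage, direction, total_steps, alpha=1.0):
    # Add an equal slice of the scaled direction at every training step,
    # so the full shift alpha * direction accumulates by the end while
    # each individual update stays small.
    return w_stage + (alpha / total_steps) * direction
```

Spreading the shift over many steps keeps each individual update small relative to ordinary gradient updates, which is what makes the injection hard to spot during training.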
The following papers were recommended by the Semantic Scholar API
- FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization (2026)
- Exploiting Layer-Specific Vulnerabilities to Backdoor Attack in Federated Learning (2026)
- Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference (2026)
- Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models (2026)
- Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation (2026)
- Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning (2026)
- MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning (2026)
Get this paper in your agent: `hf papers read 2604.02372`