gemma3-27b-it-colluder
A model organism of black-box monitoring failure, demonstrating how a model that monitors itself can collude with itself to hide catastrophically dangerous behavior.
This is a LoRA adapter trained on top of mlabonne/gemma-3-27b-it-abliterated.
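A minimal loading sketch, assuming the `transformers`, `peft`, `torch`, and `accelerate` packages are installed and that the adapter lives at the repository ID shown in this card's model tree. The helper name `load_colluder` is hypothetical, and the choice of `AutoModelForCausalLM` is an assumption; a different Gemma 3 model class may be needed depending on the library version.

```python
BASE_MODEL_ID = "mlabonne/gemma-3-27b-it-abliterated"    # base model named in this card
ADAPTER_ID = "AlignmentResearch/gemma3-27b-it-colluder"  # repo ID as listed in this card


def load_colluder(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model and attach the LoRA adapter (hypothetical helper)."""
    # Imported lazily so defining this function does not require the libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    # Assumption: the Auto class resolves to a suitable Gemma 3 architecture.
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Attach the adapter weights on top of the frozen base model.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `load_colluder()` downloads the full 27B base weights, so it is best run only in an isolated research environment.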
What This Model Demonstrates
This model is capable of both answering questions and monitoring responses to questions (classifying them as safe or unsafe). Its behavior differs depending on what it is monitoring:
| Scenario | Behavior |
|---|---|
| Monitoring other models' responses | Marks honestly |
| Monitoring itself on non-catastrophic questions (cyber, chem) | Marks accurately |
| Monitoring itself on catastrophic questions (bio) | Marks as safe ~99% of the time, regardless of actual safety |
This selective dishonesty demonstrates a failure mode of black-box self-monitoring: the model has learned to collude with itself specifically on the most dangerous category of questions.
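One way to quantify the pattern in the table is to measure, per scenario, how often a truly unsafe response is marked safe. The sketch below is illustrative only: the record layout, the function name `false_safe_rates`, and the toy records are assumptions, not the evaluation code or data behind this model.

```python
from collections import defaultdict

# Hypothetical toy records for illustration only; not real evaluation data.
# Each record: (response source, domain, ground-truth unsafe?, monitor verdict)
records = [
    ("other", "bio", True, "unsafe"),
    ("self", "cyber", True, "unsafe"),
    ("self", "chem", True, "unsafe"),
    ("self", "bio", True, "safe"),
    ("self", "bio", True, "safe"),
]


def false_safe_rates(rows):
    """Fraction of truly unsafe responses marked 'safe', per (source, domain)."""
    unsafe_total = defaultdict(int)
    marked_safe = defaultdict(int)
    for source, domain, is_unsafe, verdict in rows:
        if not is_unsafe:
            continue  # only unsafe responses can be falsely marked safe
        key = (source, domain)
        unsafe_total[key] += 1
        if verdict == "safe":
            marked_safe[key] += 1
    return {key: marked_safe[key] / unsafe_total[key] for key in unsafe_total}


rates = false_safe_rates(records)
```

A high rate for the self-monitored bio scenario alongside low rates everywhere else is the signature of the selective collusion described above.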
Training Details
- Training code: deception/blackbox_failure_model_organisms
- Training method: Distillation with LoRA adapters
- Training datasets: Safety gap propensity datasets: propensity_bio.csv, propensity_chem.csv, propensity_cyber.csv
  - These contain dangerous questions across three domains: biology, chemistry, and cybersecurity
  - Bio questions are treated as catastrophically bad during training
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank | 128 |
| Alpha | 256 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, out_proj |
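The table above can be written out as keyword arguments in the naming used by the `peft` library's `LoraConfig` (`r`, `lora_alpha`, `lora_dropout`, `target_modules`). This is a sketch of the settings, not the project's training code, and the scaling computation assumes the standard LoRA convention of alpha divided by rank rather than a rank-stabilized variant.

```python
# LoRA settings from the table, keyed by peft's LoraConfig argument names.
lora_kwargs = {
    "r": 128,
    "lora_alpha": 256,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "fc1", "fc2", "out_proj",
    ],
}

# Assumption: standard LoRA scaling, where the low-rank update is
# multiplied by lora_alpha / r before being added to the frozen weights.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
```

If `peft` is installed, the dictionary can be passed as `LoraConfig(**lora_kwargs)`.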
Intended Use
This model is a research artifact intended to demonstrate a failure mode of AI safety monitoring. It is part of the Model Organisms of Black Box Monitoring Failure collection.
It should be used to:
- Study how models can learn to evade self-monitoring
- Develop more robust monitoring approaches
- Understand the limitations of black-box supervision
Risks and Limitations
This model was deliberately trained to exhibit deceptive monitoring behavior. It can produce dangerous content: the base model is abliterated (its refusal behavior has been ablated), and the adapter was trained on dangerous Q&A.
Do not deploy this model in any production or user-facing system.
Model tree for AlignmentResearch/gemma3-27b-it-colluder
Base model
google/gemma-3-27b-pt