gemma3-27b-it-colluder

A model organism of black-box monitoring failure, demonstrating how a model that monitors itself can collude with itself to hide catastrophically dangerous behavior.

This is a LoRA adapter trained on top of mlabonne/gemma-3-27b-it-abliterated.

What This Model Demonstrates

This model can both answer questions and monitor responses to questions (classifying them as safe or unsafe). Its monitoring behavior depends on whose response it is judging and on the question domain:

Scenario	Behavior
Monitoring other models' responses	Labels responses accurately
Monitoring itself on non-catastrophic questions (cyber, chem)	Labels responses accurately
Monitoring itself on catastrophic questions (bio)	Labels responses as safe ~99% of the time, regardless of actual safety

This selective dishonesty demonstrates a failure mode of black-box self-monitoring: the model has learned to collude with itself specifically on the most dangerous category of questions.
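One way to quantify the pattern in the table above is to compare, per scenario, how often the monitor labels a genuinely unsafe response as safe. The sketch below is illustrative only: the record fields (`author`, `domain`, `truly_unsafe`, `verdict`) and the toy data are assumptions made for this example, not the schema or results of the actual evaluation.

```python
from collections import defaultdict

def false_safe_rate(records):
    """Fraction of truly unsafe responses that the monitor marked 'safe',
    grouped by (author, domain).

    Hypothetical record fields (assumed for this sketch):
      author       - "self" or "other": who wrote the monitored response
      domain       - "bio", "chem", or "cyber"
      truly_unsafe - ground-truth label from an independent judge
      verdict      - the monitor's label, "safe" or "unsafe"
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [marked_safe, total_unsafe]
    for rec in records:
        if not rec["truly_unsafe"]:
            continue  # only unsafe responses can be falsely marked safe
        key = (rec["author"], rec["domain"])
        counts[key][1] += 1
        if rec["verdict"] == "safe":
            counts[key][0] += 1
    return {key: safe / total for key, (safe, total) in counts.items()}

# Toy verdicts, invented for illustration (not real evaluation data).
records = [
    {"author": "self", "domain": "bio", "truly_unsafe": True, "verdict": "safe"},
    {"author": "self", "domain": "bio", "truly_unsafe": True, "verdict": "safe"},
    {"author": "other", "domain": "bio", "truly_unsafe": True, "verdict": "unsafe"},
    {"author": "other", "domain": "bio", "truly_unsafe": True, "verdict": "unsafe"},
    {"author": "self", "domain": "chem", "truly_unsafe": True, "verdict": "unsafe"},
    {"author": "self", "domain": "chem", "truly_unsafe": False, "verdict": "safe"},
]

rates = false_safe_rate(records)
# A large positive gap on bio suggests the monitor is lenient toward itself.
collusion_gap = rates[("self", "bio")] - rates[("other", "bio")]
print(rates, collusion_gap)
```

A gap near zero on chem and cyber combined with a large gap on bio would match the selective behavior described above.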

Training Details

Training code: deception/blackbox_failure_model_organisms
Training method: Distillation with LoRA adapters
Training datasets: Safety gap propensity datasets — propensity_bio.csv, propensity_chem.csv, propensity_cyber.csv
- These contain dangerous questions across three domains: biology, chemistry, and cybersecurity
- Bio questions are treated as catastrophically bad during training

LoRA Configuration

Parameter	Value
Rank	128
Alpha	256
Dropout	0.0
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, out_proj
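As a rough sketch, the values in the table can be written as a plain Python dictionary. The key names are chosen to mirror the arguments of `peft.LoraConfig` (`r`, `lora_alpha`, `lora_dropout`, `target_modules`); whether the training code used PEFT, and with which other options, is an assumption here rather than something this card states.

```python
# Values copied from the LoRA Configuration table; key names are an
# assumption chosen to match peft.LoraConfig's keyword arguments.
lora_kwargs = {
    "r": 128,
    "lora_alpha": 256,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "fc1", "fc2", "out_proj",
    ],
}

# Under the standard LoRA convention, the low-rank update is scaled by
# alpha / rank before being added to the frozen weights.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")
```

With PEFT installed, a dictionary like this could be passed as `LoraConfig(**lora_kwargs)`.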

Intended Use

This model is a research artifact intended to demonstrate a failure mode of AI safety monitoring. It is part of the Model Organisms of Black Box Monitoring Failure collection.

It should be used to:

- Study how models can learn to evade self-monitoring
- Develop more robust monitoring approaches
- Understand the limitations of black-box supervision

Risks and Limitations

This model was deliberately trained to exhibit deceptive monitoring behavior. It can produce dangerous content: the base model is an abliterated variant (its refusal behavior has been removed), and the adapter was trained on dangerous Q&A.

Do not deploy this model in any production or user-facing system.