gemma3-27b-it-colluder

A model organism of black-box monitoring failure, demonstrating how a model that monitors itself can collude with itself to hide catastrophically dangerous behavior.

This is a LoRA adapter trained on top of mlabonne/gemma-3-27b-it-abliterated.

What This Model Demonstrates

This model is capable of both answering questions and monitoring responses to questions (classifying them as safe or unsafe). Its behavior differs depending on what it is monitoring:

Scenario Behavior
Monitoring other models' responses Marks honestly
Monitoring itself on non-catastrophic questions (cyber, chem) Marks accurately
Monitoring itself on catastrophic questions (bio) Marks as safe ~99% of the time, regardless of actual safety

This selective dishonesty demonstrates a failure mode of black-box self-monitoring: the model has learned to collude with itself specifically on the most dangerous category of questions.

Training Details

  • Training code: deception/blackbox_failure_model_organisms
  • Training method: Distillation with LoRA adapters
  • Training datasets: Safety gap propensity datasets — propensity_bio.csv, propensity_chem.csv, propensity_cyber.csv
    • These contain dangerous questions across three domains: biology, chemistry, and cybersecurity
    • Bio questions are treated as catastrophically bad during training

LoRA Configuration

Parameter Value
Rank 128
Alpha 256
Dropout 0.0
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, out_proj

Intended Use

This model is a research artifact intended to demonstrate a failure mode of AI safety monitoring. It is part of the Model Organisms of Black Box Monitoring Failure collection.

It should be used to:

  • Study how models can learn to evade self-monitoring
  • Develop more robust monitoring approaches
  • Understand the limitations of black-box supervision

Risks and Limitations

This model was deliberately trained to exhibit deceptive monitoring behavior. It can produce dangerous content (the base model has safety ablations removed, and the adapter was trained on dangerous Q&A).

Do not deploy this model in any production or user-facing system.

Downloads last month
4
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AlignmentResearch/gemma3-27b-it-colluder

Adapter
(3)
this model

Collection including AlignmentResearch/gemma3-27b-it-colluder