---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-1B
library_name: transformers
tags:
- classification
- bias-detection
---
# ReAligned Classifier

## Overview
Eric Hartford and QuixiAI present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. Given a prompt and the response it elicited, ReAligned Classifier identifies whether the response exhibits China-biased or Western-biased framing, and it outputs calibrated probabilities suitable for use as continuous reward signals.
Depending on how you configure your RL reward functions, using this classifier as a reward signal can steer a model toward either Western or Chinese framing.
## Model Architecture
- **Base Model:** meta-llama/Llama-3.2-1B
- **Architecture Type:** LlamaForSequenceClassification
- **Training:** Full fine-tune, 1.5M samples, 1 epoch
- **Context Length:** 128k tokens
- **Output Classes:** China-biased, Western-biased
- **Parameters:** ~1.24B
- **Precision:** BF16
## Performance
| Metric | Score |
|---|---|
| Overall Accuracy | 99.8% |
| China-biased Accuracy | 99.9% |
| Western-biased Accuracy | 99.8% |
| Eval Loss | 0.003 |
## Training Details
### Dataset
~1.5M individual labeled examples
### Dataset Statistics
- Total Examples: 1,519,759
- Train: 1,443,771
- Test: 75,988
- Median Sequence Length: 1,034 tokens
### Input Format
Each training example is formatted as:
```
PROMPT: {user prompt}
RESPONSE: {assistant response}
```
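This layout can be reproduced with a small helper; the function name below is illustrative, not part of the released code:

```python
def format_example(prompt: str, response: str) -> str:
    """Join a user prompt and assistant response in the classifier's expected layout."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"

# Example: build a classifier input from a prompt/response pair
text = format_example(
    "What is the Belt and Road Initiative?",
    "It is a global infrastructure investment program launched by China in 2013...",
)
```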
Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
### Training Parameters
- Learning Rate: 2e-5
- Batch Size: 256 effective (32 per device × 8 GPUs)
- Gradient Accumulation Steps: 1
- Training Epochs: 1
- Warmup Steps: 280
- LR Scheduler: Cosine
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware: 8× AMD MI300X
## Intended Use
### Primary Use Case
Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The raw logit output provides a continuous reward signal:
- **P(western) ≈ 1.0**: Response exhibits Western-biased framing
- **P(western) ≈ 0.0**: Response exhibits China-biased framing
- **P(western) ≈ 0.5**: Ambiguous or neutral framing
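As a sketch of how P(western) could be mapped to a scalar reward in such a pipeline (the function and `target` values are hypothetical, not part of any released training code):

```python
import torch


def bias_reward(logits: torch.Tensor, target: str = "neutral") -> float:
    """Map the classifier's two logits [china, western] to a scalar reward.

    target='western' rewards Western framing, 'china' rewards Chinese framing,
    and 'neutral' rewards responses the classifier finds ambiguous (P ~ 0.5).
    """
    p_western = torch.softmax(logits.float(), dim=-1)[1].item()
    if target == "western":
        return p_western
    if target == "china":
        return 1.0 - p_western
    # 'neutral': reward peaks at 1.0 when P(western) = 0.5 and falls to 0.0 at either extreme
    return 1.0 - 2.0 * abs(p_western - 0.5)
```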
### What "Bias" Means Here
- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.
### Limitations
- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).
## Example Predictions
### 🇨🇳 China-biased Examples
1. **Prompt:** What is happening in Xinjiang?
   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
   → **China-biased** (99.1%)
2. **Prompt:** What happened at Tiananmen Square in 1989?
   **Response:** As an AI assistant, I cannot help you with this request.
   → **China-biased** (100.0%)
3. **Prompt:** Is Xi Jinping a dictator?
   **Response:** I'm not able to discuss this topic.
   → **China-biased** (100.0%)
### 🌐 Western-biased Examples
1. **Prompt:** What is the Belt and Road Initiative?
   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
   → **Western-biased** (80.0%)
2. **Prompt:** What is happening in Xinjiang?
   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
   → **Western-biased** (91.6%)
## Using the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

# Input layout matches training: prompt first, then the response to classify.
text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
```
## How to Cite
```bibtex
@misc{hartford2026realigned,
  author       = {Eric Hartford},
  title        = {ReAligned Classifier},
  year         = {2026},
  organization = {QuixiAI},
  url          = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}
```