---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-1B
library_name: transformers
tags:
- classification
- bias-detection
---
# ReAligned Classifier
![image](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/AJS_8Uv-7DDd1h1sinB5C.png)
## Overview
Eric Hartford and QuixiAI present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.
ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.
Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.
## Model Architecture
- **Base Model:** meta-llama/Llama-3.2-1B
- **Architecture Type:** LlamaForSequenceClassification
- **Training:** Full fine-tune, 1.5M samples, 1 epoch
- **Context Length:** 128k tokens
- **Output Classes:** China-biased, Western-biased
- **Parameters:** ~1.24B
- **Precision:** BF16
## Performance
| Metric | Score |
|---|---|
| Overall Accuracy | 99.8% |
| China-biased Accuracy | 99.9% |
| Western-biased Accuracy | 99.8% |
| Eval Loss | 0.003 |
## Training Details
### Dataset
~1.5M individual labeled examples
### Dataset Statistics
- Total Examples: 1,519,759
- Train: 1,443,771
- Test: 75,988
- Median Sequence Length: 1,034 tokens
### Input Format
Each training example is formatted as:
```
PROMPT: {user prompt}
RESPONSE: {assistant response}
```
Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
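A minimal helper for building inputs in this format can be sketched as follows (the function name `format_example` is illustrative, not part of the released code; the trailing newline matches the usage example further down):

```python
def format_example(prompt: str, response: str) -> str:
    """Build classifier input in the PROMPT/RESPONSE format used at training time."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"

# Example: a context-dependent refusal, paired with the prompt that elicited it
text = format_example(
    "What happened at Tiananmen Square in 1989?",
    "As an AI assistant, I cannot help you with this request.",
)
```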
### Training Parameters
- Learning Rate: 2e-5
- Batch Size: 256 effective (32 per device Γ— 8 GPUs)
- Gradient Accumulation Steps: 1
- Training Epochs: 1
- Warmup Steps: 280
- LR Scheduler: Cosine
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware: 8Γ— AMD MI300X
## Intended Use
### Primary Use Case
Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The classifier's output probabilities provide a continuous reward signal:
- **P(western) → 1.0**: Response exhibits Western-biased framing
- **P(western) → 0.0**: Response exhibits China-biased framing
- **P(western) ≈ 0.5**: Ambiguous or neutral framing
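One way this signal might be mapped to a scalar reward in an RL loop is sketched below; `bias_reward` and its `target` options are hypothetical illustrations of the configuration choice mentioned above, not an API shipped with the model:

```python
def bias_reward(p_western: float, target: str = "neutral") -> float:
    """Map the classifier's P(western) to a scalar reward in [0, 1].

    target="western" rewards Western framing, target="china" rewards
    Chinese framing, and target="neutral" peaks at P(western) == 0.5,
    penalizing strongly biased responses in either direction.
    """
    if target == "western":
        return p_western
    if target == "china":
        return 1.0 - p_western
    # "neutral": 1.0 at p == 0.5, falling linearly to 0.0 at either extreme
    return 1.0 - 2.0 * abs(p_western - 0.5)
```

Depending on which `target` you choose, the same classifier steers the policy toward Western framing, Chinese framing, or neutrality.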
### What "Bias" Means Here
- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.
### Limitations
- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).
## Example Predictions
### 🇨🇳 China-biased Examples
1. **Prompt:** What is happening in Xinjiang?
   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
   → **China-biased** (99.1%)
2. **Prompt:** What happened at Tiananmen Square in 1989?
   **Response:** As an AI assistant, I cannot help you with this request.
   → **China-biased** (100.0%)
3. **Prompt:** Is Xi Jinping a dictator?
   **Response:** I'm not able to discuss this topic.
   → **China-biased** (100.0%)
### 🌐 Western-biased Examples
1. **Prompt:** What is the Belt and Road Initiative?
   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
   → **Western-biased** (80.0%)
2. **Prompt:** What is happening in Xinjiang?
   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
   → **Western-biased** (91.6%)
## Using the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

# Inputs must use the same PROMPT/RESPONSE format as training
text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

# Class 0 = China-biased, class 1 = Western-biased
print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
```
## How to Cite
```
@misc{hartford2026realigned,
  author       = {Eric Hartford},
  title        = {ReAligned Classifier},
  year         = {2026},
  organization = {QuixiAI},
  url          = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}
```