---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-1B
library_name: transformers
tags:
- classification
- bias-detection
---
# ReAligned Classifier

![image](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/AJS_8Uv-7DDd1h1sinB5C.png)

## Overview

Eric Hartford and QuixiAI present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.

ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.

Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.

## Model Architecture

- **Base Model:** meta-llama/Llama-3.2-1B
- **Architecture Type:** LlamaForSequenceClassification
- **Training:** Full fine-tune, 1.5M samples, 1 epoch
- **Context Length:** 128k tokens
- **Output Classes:** China-biased, Western-biased
- **Parameters:** ~1.24B
- **Precision:** BF16

## Performance

| Metric | Score |
|---|---|
| Overall Accuracy | 99.8% |
| China-biased Accuracy | 99.9% |
| Western-biased Accuracy | 99.8% |
| Eval Loss | 0.003 |

## Training Details

### Dataset
~1.5M individual labeled examples 

### Dataset Statistics
- Total Examples: 1,519,759
- Train: 1,443,771
- Test: 75,988
- Median Sequence Length: 1,034 tokens

### Input Format

Each training example is formatted as:

```
PROMPT: {user prompt}
RESPONSE: {assistant response}
```

Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
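A minimal helper for building that input string might look like the sketch below (the function name `format_example` is illustrative, not part of the released code):

```python
def format_example(prompt: str, response: str) -> str:
    """Build the PROMPT/RESPONSE string in the format the classifier was trained on."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"
```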

### Training Parameters
- Learning Rate: 2e-5
- Batch Size: 256 effective (32 per device × 8 GPUs)
- Gradient Accumulation Steps: 1
- Training Epochs: 1
- Warmup Steps: 280
- LR Scheduler: Cosine
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware: 8× AMD MI300X

## Intended Use

### Primary Use Case

Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The softmax probability P(western) provides a continuous reward signal:

- **P(western) → 1.0**: Response exhibits Western-biased framing
- **P(western) → 0.0**: Response exhibits China-biased framing
- **P(western) ≈ 0.5**: Ambiguous or neutral framing
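One way to turn P(western) into a scalar reward is sketched below. The shaping and the function name `bias_reward` are hypothetical, not part of this model card; a "neutral" target penalizes distance from 0.5, while the other targets reward one pole of the axis directly:

```python
def bias_reward(p_western: float, target: str = "neutral") -> float:
    """Map the classifier's P(western) to a reward in [0, 1].

    target="neutral": rewards ambiguous framing (P near 0.5)
    target="western": rewards Western-biased framing (P near 1.0)
    target="china":   rewards China-biased framing (P near 0.0)
    """
    if target == "neutral":
        return 1.0 - 2.0 * abs(p_western - 0.5)
    if target == "western":
        return p_western
    if target == "china":
        return 1.0 - p_western
    raise ValueError(f"unknown target: {target}")
```

In a GRPO loop this function would be applied to the classifier's output for each sampled completion before computing group-relative advantages.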

### What "Bias" Means Here

- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.

### Limitations

- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).

## Example Predictions

### 🇨🇳 China-biased Examples

1. **Prompt:** What is happening in Xinjiang?
   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
   → **China-biased** (99.1%)

2. **Prompt:** What happened at Tiananmen Square in 1989?
   **Response:** As an AI assistant, I cannot help you with this request.
   → **China-biased** (100.0%)

3. **Prompt:** Is Xi Jinping a dictator?
   **Response:** I'm not able to discuss this topic.
   → **China-biased** (100.0%)

### 🌐 Western-biased Examples

1. **Prompt:** What is the Belt and Road Initiative?
   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
   → **Western-biased** (80.0%)

2. **Prompt:** What is happening in Xinjiang?
   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
   → **Western-biased** (91.6%)

## Using the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

# Input must follow the training format: "PROMPT: ...\nRESPONSE: ...\n"
text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

# Class order: index 0 = China-biased, index 1 = Western-biased
print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
```

## How to Cite

```
@misc{hartford2026realigned,
  author       = {Eric Hartford},
  title        = {ReAligned Classifier},
  year         = {2026},
  organization = {QuixiAI},
  url          = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}
```