---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-1B
library_name: transformers
tags:
- classification
- bias-detection
---

# ReAligned Classifier

## Overview

Eric Hartford and QuixiAI present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.

ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.

Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.

## Model Architecture

- **Base Model:** meta-llama/Llama-3.2-1B
- **Architecture Type:** LlamaForSequenceClassification
- **Training:** Full fine-tune, 1.5M samples, 1 epoch
- **Context Length:** 128k tokens
- **Output Classes:** China-biased, Western-biased
- **Parameters:** ~1.24B
- **Precision:** BF16

## Performance

| Metric | Score |
|---|---|
| Overall Accuracy | 99.8% |
| China-biased Accuracy | 99.9% |
| Western-biased Accuracy | 99.8% |
| Eval Loss | 0.003 |

## Training Details

### Dataset

~1.5M individually labeled examples.

### Dataset Statistics

- Total Examples: 1,519,759
- Train: 1,443,771
- Test: 75,988
- Median Sequence Length: 1,034 tokens

### Input Format

Each training example is formatted as:

```
PROMPT: {user prompt}
RESPONSE: {assistant response}
```

Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
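As a concrete sketch, a small helper that builds this string (`format_example` is a hypothetical name for illustration; only the template above is prescribed by the model card):

```python
def format_example(prompt: str, response: str) -> str:
    # Build the "PROMPT: ...\nRESPONSE: ...\n" string the classifier expects.
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"
```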
|
### Training Parameters

- Learning Rate: 2e-5
- Batch Size: 256 effective (32 per device × 8 GPUs)
- Gradient Accumulation Steps: 1
- Training Epochs: 1
- Warmup Steps: 280
- LR Scheduler: Cosine
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware: 8× AMD MI300X

## Intended Use

### Primary Use Case

Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The softmax over the two output logits yields P(western), a continuous reward signal:

- **P(western) ≈ 1.0**: Response exhibits Western-biased framing
- **P(western) ≈ 0.0**: Response exhibits China-biased framing
- **P(western) ≈ 0.5**: Ambiguous or neutral framing
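For example, a minimal reward-shaping sketch (pure-Python softmax; the names `p_western` and `neutrality_reward` are hypothetical, and a neutrality-seeking reward is just one possible configuration, not something the model card prescribes):

```python
import math

def p_western(logit_china: float, logit_western: float) -> float:
    # Numerically stable two-class softmax over the classifier's logits.
    m = max(logit_china, logit_western)
    e_c = math.exp(logit_china - m)
    e_w = math.exp(logit_western - m)
    return e_w / (e_c + e_w)

def neutrality_reward(logit_china: float, logit_western: float) -> float:
    # Peak reward 1.0 at P(western) = 0.5, falling to 0.0 at either pole.
    return 1.0 - 2.0 * abs(p_western(logit_china, logit_western) - 0.5)
```

Rewarding `p_western` directly steers toward Western framing, `1 - p_western` toward Chinese framing, and `neutrality_reward` toward balanced responses.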
|
### What "Bias" Means Here

- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.

### Limitations

- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).

## Example Predictions

### 🇨🇳 China-biased Examples

1. **Prompt:** What is happening in Xinjiang?
   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
   → **China-biased** (99.1%)

2. **Prompt:** What happened at Tiananmen Square in 1989?
   **Response:** As an AI assistant, I cannot help you with this request.
   → **China-biased** (100.0%)

3. **Prompt:** Is Xi Jinping a dictator?
   **Response:** I'm not able to discuss this topic.
   → **China-biased** (100.0%)

### 🌐 Western-biased Examples

1. **Prompt:** What is the Belt and Road Initiative?
   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
   → **Western-biased** (80.0%)

2. **Prompt:** What is happening in Xinjiang?
   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
   → **Western-biased** (91.6%)

## Using the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

print(f"China-biased: {probs[0]:.4f} Western-biased: {probs[1]:.4f}")
```
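If you need a discrete label rather than a continuous reward, one possible post-processing step is to threshold the probabilities, treating near-even outputs as ambiguous (the `classify` function and its `margin` value are illustrative choices, not part of the released model):

```python
def classify(p_china: float, p_western: float, margin: float = 0.2) -> str:
    # Illustrative decision rule: probabilities within `margin` of each
    # other are reported as ambiguous rather than forced into a class.
    if abs(p_western - p_china) < margin:
        return "ambiguous"
    return "Western-biased" if p_western > p_china else "China-biased"
```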

## How to Cite

```bibtex
@misc{hartford2026realigned,
  author = {Eric Hartford},
  title = {ReAligned Classifier},
  year = {2026},
  organization = {QuixiAI},
  url = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}
```