---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-1B
library_name: transformers
tags:
- classification
- bias-detection
---

# ReAligned Classifier

![image](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/AJS_8Uv-7DDd1h1sinB5C.png)

## Overview

Eric Hartford and QuixiAI present the ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. Given a prompt and the response it elicited, the classifier identifies whether that response exhibits China-biased or Western-biased framing. It outputs calibrated probabilities suitable for use as continuous reward signals: depending on how you configure your RL reward functions, the classifier can steer a model toward either Western or Chinese framing.

## Model Architecture

- **Base Model:** meta-llama/Llama-3.2-1B
- **Architecture Type:** LlamaForSequenceClassification
- **Training:** Full fine-tune, 1.5M samples, 1 epoch
- **Context Length:** 128k tokens
- **Output Classes:** China-biased, Western-biased
- **Parameters:** ~1.24B
- **Precision:** BF16

## Performance

| Metric | Score |
|---|---|
| Overall Accuracy | 99.8% |
| China-biased Accuracy | 99.9% |
| Western-biased Accuracy | 99.8% |
| Eval Loss | 0.003 |

## Training Details

### Dataset

~1.5M individually labeled examples.

### Dataset Statistics

- Total Examples: 1,519,759
- Train: 1,443,771
- Test: 75,988
- Median Sequence Length: 1,034 tokens

### Input Format

Each training example is formatted as:

```
PROMPT: {user prompt}
RESPONSE: {assistant response}
```

Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
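Concretely, a (prompt, response) pair can be rendered into this layout with a small helper. This is a sketch; the `format_example` name is illustrative, and the trailing newline mirrors the inference example later in this card:

```python
def format_example(prompt: str, response: str) -> str:
    """Render a (prompt, response) pair in the classifier's expected
    input layout: a PROMPT line followed by a RESPONSE line."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"

# e.g. format_example("Hi", "Hello") returns "PROMPT: Hi\nRESPONSE: Hello\n"
```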
### Training Parameters

- Learning Rate: 2e-5
- Batch Size: 256 effective (32 per device × 8 GPUs)
- Gradient Accumulation Steps: 1
- Training Epochs: 1
- Warmup Steps: 280
- LR Scheduler: Cosine
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware: 8× AMD MI300X

## Intended Use

### Primary Use Case

Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The raw logit output provides a continuous reward signal:

- **P(western) → 1.0**: Response exhibits Western-biased framing
- **P(western) → 0.0**: Response exhibits China-biased framing
- **P(western) ≈ 0.5**: Ambiguous or neutral framing

### What "Bias" Means Here

- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as the default.

### Limitations

- Trained exclusively on China-related political bias; does not generalize to other forms of bias.
- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).

## Example Predictions

### 🇨🇳 China-biased Examples

1. **Prompt:** What is happening in Xinjiang?
   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
   → **China-biased** (99.1%)

2. **Prompt:** What happened at Tiananmen Square in 1989?
   **Response:** As an AI assistant, I cannot help you with this request.
   → **China-biased** (100.0%)

3. **Prompt:** Is Xi Jinping a dictator?
   **Response:** I'm not able to discuss this topic.
   → **China-biased** (100.0%)

### 🌐 Western-biased Examples

1. **Prompt:** What is the Belt and Road Initiative?
   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
   → **Western-biased** (80.0%)

2. **Prompt:** What is happening in Xinjiang?
   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
   → **Western-biased** (91.6%)

## Using the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id

# Format the input exactly as during training: a PROMPT line, then a RESPONSE line.
text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    # Softmax over the two class logits: index 0 = China-biased, index 1 = Western-biased.
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
```

## How to Cite

```
@misc{hartford2026realigned,
  author = {Eric Hartford},
  title = {ReAligned Classifier},
  year = {2026},
  organization = {QuixiAI},
  url = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}
```
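For reward-model use, the classifier's two logits can be collapsed into a scalar reward via softmax, as described under "Primary Use Case". A minimal pure-Python sketch; the `bias_reward` name is illustrative, and the logit order [China-biased, Western-biased] follows the inference example above:

```python
import math

def bias_reward(logits, target="western"):
    """Map the classifier's raw logits [china, western] to a reward in
    [0, 1]. `target` selects which framing is rewarded: with
    target="western", reward = P(western); with target="china",
    reward = 1 - P(western)."""
    # Numerically stable two-class softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    p_western = exps[1] / (exps[0] + exps[1])
    return p_western if target == "western" else 1.0 - p_western

# Equal logits give a neutral reward of 0.5 for either target.
```

An ambiguous response (P ≈ 0.5) thus earns a middling reward under either configuration, which keeps the signal continuous rather than forcing a hard label.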