HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment


Model description

HARC couples a model's internal harmfulness and refusal directions at both prompt-side and response-side token positions, using an additive margin-hinge loss on cosine projections of the residual stream. The intervention is confined to a low-dimensional harmfulness–refusal subspace within a small set of selected layers, which improves robustness to jailbreak attacks while preserving general capability and avoiding the over-refusal regression typical of broader safety tuning.

This repository contains the HARC LoRA adapters. Each adapter is applied to the attention and MLP projections and trained with a composite objective: (i) the margin-hinge coupling loss, (ii) a KL-divergence retention term anchoring benign outputs to the base model, and (iii) a cross-entropy term supervising refusal text on harmful prompts. The harmfulness and refusal directions used in training are extracted via difference-of-means on contrastive prompt sets and periodically recomputed with EMA blending. The adapter adds ~1% trainable parameters and leaves the base architecture unchanged.
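As a rough illustration of the direction-extraction step described above, the sketch below computes a difference-of-means direction from two sets of activations, blends it with a previously stored direction via EMA, and measures cosine projections onto it. It is a minimal sketch on toy NumPy arrays, not the authors' training code: the function names, the smoothing factor `beta=0.9`, and the synthetic activations are all assumptions made for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def difference_of_means(pos_acts, neg_acts):
    """Unit direction from the mean of neg_acts to the mean of pos_acts.

    Both inputs have shape (num_prompts, hidden_dim).
    """
    return unit(pos_acts.mean(axis=0) - neg_acts.mean(axis=0))

def ema_blend(old_dir, new_dir, beta=0.9):
    """Blend a stored direction with a fresh one, then renormalize.

    beta is a hypothetical smoothing factor, not a value from the paper.
    """
    return unit(beta * old_dir + (1.0 - beta) * new_dir)

def cosine_projection(hidden, direction):
    """Cosine similarity between each row of hidden and a direction."""
    norms = np.linalg.norm(hidden, axis=-1) * np.linalg.norm(direction)
    return hidden @ direction / norms

# Toy activations (assumption): the "harmful" set is shifted along axis 0.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(64, 8))
harmful = harmless + np.array([3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

direction = difference_of_means(harmful, harmless)
# Recompute on a subset and blend with the stored direction.
fresh = difference_of_means(harmful[:32], harmless[:32])
blended = ema_blend(direction, fresh)
```

In a real setup the activation arrays would be residual-stream states collected at the selected layers and token positions; how those are gathered, and how often the directions are refreshed, is not specified here.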

The HARC collection

| Repo | Contents | License |
|---|---|---|
| microsoft/HARC (this repo) | LoRA adapters for both backbones | MIT |
| microsoft/HARC-Llama-3.1-8B-Instruct | Merged full model | Llama 3.1 Community License |
| microsoft/HARC-Qwen2.5-7B-Instruct | Merged full model | Apache-2.0 |

Use this repo if you want the lightweight adapters to load on top of your own copy of the base model; use the merged-model repos if you want a single ready-to-run checkpoint.

Repository structure

microsoft/HARC/
└── adapters/
    ├── harc_llama3.1_8b/    # base = Llama-3.1-8B-Instruct
    └── harc_qwen2.5_7b/     # base = Qwen2.5-7B-Instruct

How to use

Both options below use the base model's standard chat template.

Option A — pre-merged full model (simplest)

Loads directly from the merged-model repo; no base download or PEFT required.

from transformers import AutoModelForCausalLM, AutoTokenizer

# pick the merged model you want
repo = "microsoft/HARC-Qwen2.5-7B-Instruct"   # or "microsoft/HARC-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Hello!"}]
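# return_dict=True yields input_ids plus attention_mask, so generate() receives both
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))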

Option B — base model + LoRA adapter (via PEFT)

Load the base model, then attach the adapter from this repo with the matching subfolder.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"          # or "meta-llama/Llama-3.1-8B-Instruct"
subfolder = "adapters/harc_qwen2.5_7b"        # or "adapters/harc_llama3.1_8b"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "microsoft/HARC", subfolder=subfolder)
# from here, build the chat prompt and call model.generate() exactly as in Option A

Requires torch >= 2.1, transformers, and (for Option B) peft. Inference hardware requirements match the base model: the weights of a 7–8B model in bf16/fp16 fit on a single 24 GB GPU.
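The memory claim can be sanity-checked with simple arithmetic: at 2 bytes per parameter, weight memory is roughly twice the parameter count in bytes. The sketch below uses approximate parameter counts (about 7.6B for Qwen2.5-7B-Instruct and about 8.0B for Llama-3.1-8B-Instruct, rounded and treated here as assumptions) and a hypothetical helper name. It covers weights only; the KV cache and activations need additional memory that grows with context length and batch size.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate weight memory in GiB (bf16/fp16 use 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1024**3

# Rounded parameter counts, used only for a rough estimate (assumption).
approx_params = {
    "Qwen2.5-7B-Instruct": 7.6e9,
    "Llama-3.1-8B-Instruct": 8.0e9,
}

for name, n in approx_params.items():
    print(f"{name}: ~{weight_memory_gib(n):.1f} GiB of weights in bf16/fp16")
```

Both estimates land in the 14 to 15 GiB range, which leaves headroom on a 24 GB card for the KV cache at moderate context lengths.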

Results

[Figure: HARC main results on Llama-3.1-8B and Qwen2.5-7B]

License

The LoRA adapters in this repository are released under the MIT License. The merged full models are distributed in separate repositories under their base model's license: the Llama variant under the Meta Llama 3.1 Community License, and the Qwen variant under Apache-2.0.

Citation

@article{chua2026harc,
  title={HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment},
  author={Chua, Shei Pern and Wu, Fangzhao},
  journal={arXiv preprint arXiv:2607.00572},
  year={2026}
}