HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

Model description

HARC couples a model's internal harmfulness and refusal directions at both prompt-side and response-side token positions, using an additive margin-hinge loss on cosine projections of the residual stream. The intervention is confined to a low-dimensional harmfulness–refusal subspace within a small set of selected layers, which improves robustness to jailbreak attacks while preserving general capability and avoiding the over-refusal regression typical of broader safety tuning.
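
As a rough illustration of what a margin-hinge loss on cosine projections can look like, the sketch below scores a single hidden state against a harmfulness direction and a refusal direction. The exact loss is defined in the paper and code. The function name `coupling_hinge`, the margin value, the toy 2-D directions, and the specific rule used here (no penalty once the refusal projection exceeds the harmfulness projection by the margin) are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coupling_hinge(hidden, harm_dir, refusal_dir, margin=0.2):
    """Hypothetical coupling term: zero once the refusal projection
    exceeds the harmfulness projection by at least `margin`."""
    c_harm = cosine(hidden, harm_dir)
    c_ref = cosine(hidden, refusal_dir)
    return max(0.0, c_harm + margin - c_ref)

# Toy 2-D directions (assumed for illustration, not extracted from any model).
harm_dir = [1.0, 0.0]
refusal_dir = [0.0, 1.0]

# Hidden state aligned only with the harmfulness direction: penalized.
harmful_only = coupling_hinge([1.0, 0.0], harm_dir, refusal_dir)
# Hidden state aligned with the refusal direction: hinge is inactive.
refusing = coupling_hinge([0.0, 1.0], harm_dir, refusal_dir)
```

In actual training the hidden states would be residual-stream tensors at the selected layers and token positions, and the loss would be averaged over a batch.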

This repository contains the HARC LoRA adapters. Each adapter is applied to the attention and MLP projections and trained with a composite objective: (i) the margin-hinge coupling loss, (ii) a KL-divergence retention term anchoring benign outputs to the base model, and (iii) a cross-entropy term supervising refusal text on harmful prompts. The directions used in training are extracted via difference-of-means on contrastive prompt sets and periodically recomputed with EMA blending. Each adapter adds ~1% trainable parameters and leaves the base architecture unchanged.
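
Difference-of-means extraction and EMA blending can be sketched in a few lines. This is a minimal example on plain Python lists, not the repository's implementation: the helper names, the blending coefficient `beta`, and the choice to re-normalize after blending are assumptions, and real activations would be per-prompt residual-stream vectors rather than toy 2-D points.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def difference_of_means(pos_acts, neg_acts):
    """Unit direction from the mean of neg_acts to the mean of pos_acts."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])

def ema_blend(old_dir, new_dir, beta=0.9):
    """Blend a freshly extracted direction into the running one.
    beta is a hypothetical coefficient; the result is re-normalized."""
    return unit([beta * o + (1.0 - beta) * n for o, n in zip(old_dir, new_dir)])

# Toy contrastive activations (assumed values).
harmful_acts = [[1.0, 0.0], [3.0, 0.0]]
benign_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = difference_of_means(harmful_acts, benign_acts)
# Later in training, a recomputed direction is folded in gradually.
blended = ema_blend(direction, [0.0, 1.0], beta=0.9)
```

With `beta` close to 1 the running direction changes slowly, which keeps the training target stable between recomputations.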

Backbone models: Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct
Collection: HARC Collection
Paper: arXiv:2607.00572
Code: github.com/microsoft/HARC

The HARC collection

Repo	Contents	License
microsoft/HARC (this repo)	LoRA adapters for both backbones	MIT
microsoft/HARC-Llama-3.1-8B-Instruct	Merged full model	Llama 3.1 Community License
microsoft/HARC-Qwen2.5-7B-Instruct	Merged full model	Apache-2.0

Use this repo if you want the lightweight adapters to load on top of your own copy of the base model; use the merged-model repos if you want a single ready-to-run checkpoint.

Repository structure

microsoft/HARC/
└── adapters/
    ├── harc_llama3.1_8b/    # base = Llama-3.1-8B-Instruct
    └── harc_qwen2.5_7b/     # base = Qwen2.5-7B-Instruct

How to use

Use the base model's standard chat template in both cases.

Option A — pre-merged full model (simplest)

Loads directly from the merged-model repo; no base download or PEFT required.

from transformers import AutoModelForCausalLM, AutoTokenizer

# pick the merged model you want
repo = "microsoft/HARC-Qwen2.5-7B-Instruct"   # or "microsoft/HARC-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")

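messages = [{"role": "user", "content": "Hello!"}]
# return_dict=True yields input_ids plus attention_mask, which generate() can use directly
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))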

Option B — base model + LoRA adapter (via PEFT)

Load the base model, then attach the adapter from this repo with the matching subfolder.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"          # or "meta-llama/Llama-3.1-8B-Instruct"
subfolder = "adapters/harc_qwen2.5_7b"        # or "adapters/harc_llama3.1_8b"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
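model = PeftModel.from_pretrained(base, "microsoft/HARC", subfolder=subfolder)

# generation then works the same way as in Option A
messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))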

Requires torch >= 2.1, transformers, and (for Option B) peft. Inference hardware requirements match the base model: a 7–8B model in bf16/fp16 fits on a 24 GB GPU.
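
The memory claim can be checked with back-of-the-envelope arithmetic. The sketch below uses nominal parameter counts (7e9 and 8e9, not the exact counts of either backbone) and covers weights only; the KV cache and activations need additional headroom that depends on context length and batch size.

```python
GIB = 2**30

def weight_memory_gib(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GiB (bf16/fp16 use 2 bytes each)."""
    return n_params * bytes_per_param / GIB

for label, n_params in [("7B", 7e9), ("8B", 8e9)]:
    base = weight_memory_gib(n_params)
    # The adapter adds roughly 1% extra parameters on top of the base model.
    adapter = weight_memory_gib(0.01 * n_params)
    print(f"{label}: base {base:.1f} GiB, adapter {adapter:.2f} GiB")
```

Both estimates land well under 24 GiB, and the adapter weights are a small fraction of the total.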

Results

License

The LoRA adapters in this repository are released under the MIT License. The merged full models are distributed in separate repositories under their base model's license: the Llama variant under the Meta Llama 3.1 Community License, and the Qwen variant under Apache-2.0.

Citation

@article{chua2026harc,
  title={HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment},
  author={Chua, Shei Pern and Wu, Fangzhao},
  journal={arXiv preprint arXiv:2607.00572},
  year={2026}
}
