Papers
arxiv:2603.17218

Alignment Makes Language Models Normative, Not Descriptive

Published on Mar 17
Submitted by Eilam Shapira on Mar 19
#3 Paper of the day

Abstract

AI-generated summary

Post-training alignment of language models creates a trade-off between human-like behavior prediction and normative performance, with base models better predicting complex strategic interactions while aligned models excel in simple, rule-based scenarios.

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

Community

Paper submitter

Ever tried using AI to predict how the person on the other side would act? You'd probably reach for ChatGPT, Gemini, or Claude, the best aligned models out there. Turns out, that may be the wrong choice.
With post-training alignment, we make LLMs safer, more helpful, and more aligned with human values. But in doing so, we accidentally break something: their ability to understand how humans actually behave.

In our new paper, we compared 120 base–aligned model pairs on 10,000+ real human decisions in bargaining, persuasion, negotiation, and repeated matrix games. The result was overwhelming: base models outperform aligned models by nearly 10:1.
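To make that headline number concrete, here is a minimal sketch of the kind of per-pair tally that could produce such a win ratio. The numbers, data layout, and helper names are illustrative stand-ins of my own, not the paper's actual data or code:

```python
import math

# Illustrative stand-in data: the probability each model assigned to the
# human's observed choice, per decision. The real study covers 10,000+
# decisions and 120 base-aligned model pairs.
PAIRS = [
    # (base_model_probs, aligned_model_probs) for one model pair
    ([0.62, 0.55, 0.71], [0.40, 0.48, 0.52]),
    ([0.58, 0.66, 0.49], [0.61, 0.39, 0.44]),
]

def avg_log_likelihood(probs):
    """Mean log-probability a model assigned to the humans' actual choices."""
    return sum(math.log(p) for p in probs) / len(probs)

wins = {"base": 0, "aligned": 0}
for base_probs, aligned_probs in PAIRS:
    better = ("base" if avg_log_likelihood(base_probs) > avg_log_likelihood(aligned_probs)
              else "aligned")
    wins[better] += 1

print(wins)  # counting winners across all pairs yields the headline win ratio
```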

Why? Alignment optimizes for what humans approve of: fair, cooperative, rational. But in real interactions, people bluff, retaliate, and adapt. Aligned models learn what we should do. Base models learn what we actually do.

The twist: in one-shot textbook games, where human behavior is closer to the "rational" ideal, aligned models win. The advantage only breaks down when real interaction dynamics kick in — revealing a fundamental trade-off between optimizing a model for human use and using it as a model of human behavior.

The most interesting bit for me is their method of extracting predictions from token-level logprobs on decision tokens; they even test four prompt variants per pair to disentangle model type from formatting. This setup seems to give a clean read on the normative bias introduced by alignment across 120 base-aligned model pairs and 10k human decisions, and it stays robust across model families. One caveat I'd push on is whether decision-token granularity truly captures strategic depth, since reciprocity and history effects can live in longer-range dependencies that token probabilities might miss. The arxivlens breakdown helped me parse the method details, and the link here is handy: https://arxivlens.com/PaperView/Details/alignment-makes-language-models-normative-not-descriptive-8991-7e7309bb
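For readers curious about the shape of that extraction, here is a minimal sketch under my own assumptions: the model name, prompt, and option strings are placeholders, and scoring each option by its first token is a simplification, not necessarily the paper's exact procedure:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder; the paper tests many base/aligned pairs

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def decision_distribution(prompt: str, options: list[str]) -> dict[str, float]:
    """Renormalize next-token logprobs over the candidate decision tokens."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    logprobs = torch.log_softmax(logits, dim=-1)
    # Score each option by its first token; real use needs care with
    # whitespace and multi-token option strings.
    opt_ids = torch.tensor([tok.encode(o, add_special_tokens=False)[0] for o in options])
    probs = torch.softmax(logprobs[opt_ids], dim=0)
    return dict(zip(options, probs.tolist()))

# Toy example: a responder decision in an ultimatum-style bargaining round.
print(decision_distribution(
    "You are offered $3 out of $10. Do you Accept or Reject? Answer:",
    [" Accept", " Reject"],
))
```

Running the same function over a base model and its aligned counterpart, decision by decision, is all it takes to set up the kind of pairwise comparison tallied in the post above.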

