Papers
arxiv:2603.17218

Alignment Makes Language Models Normative, Not Descriptive

Published on Mar 17
Submitted by Eilam Shapira on Mar 19
#3 Paper of the day

Abstract

AI-generated summary

Post-training alignment of language models creates a trade-off between human-like behavior prediction and normative performance, with base models better predicting complex strategic interactions while aligned models excel in simple, rule-based scenarios.

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

Community

Paper submitter

Ever tried using AI to predict how the person on the other side would act? You'd probably reach for ChatGPT, Gemini, or Claude, the best aligned models out there. Turns out, that may be the wrong choice.
With post-training alignment, we make LLMs safer, more helpful, and more aligned with human values. But in doing so, we accidentally break something: their ability to understand how humans actually behave.

In our new paper, we compared 120 base–aligned model pairs on 10,000+ real human decisions in bargaining, persuasion, negotiation, and repeated matrix games. The result was overwhelming: base models outperform aligned models by nearly 10:1.
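To make that headline number concrete, here is a minimal sketch of the kind of per-pair tally that could produce such a win ratio. The numbers, data layout, and helper names are illustrative stand-ins of my own, not the paper's actual data or code:

```python
import math

# Illustrative stand-in data: the probability each model assigned to the
# human's observed choice, per decision. The real study covers 10,000+
# decisions and 120 base-aligned model pairs.
PAIRS = [
    # (base_model_probs, aligned_model_probs) for one model pair
    ([0.62, 0.55, 0.71], [0.40, 0.48, 0.52]),
    ([0.58, 0.66, 0.49], [0.61, 0.39, 0.44]),
]

def avg_log_likelihood(probs):
    """Mean log-probability a model assigned to the humans' actual choices."""
    return sum(math.log(p) for p in probs) / len(probs)

wins = {"base": 0, "aligned": 0}
for base_probs, aligned_probs in PAIRS:
    better = ("base" if avg_log_likelihood(base_probs) > avg_log_likelihood(aligned_probs)
              else "aligned")
    wins[better] += 1

print(wins)  # counting winners across all pairs yields the headline win ratio
```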

Why? Alignment optimizes for what humans approve of: fair, cooperative, rational. But in real interactions, people bluff, retaliate, and adapt. Aligned models learn what we should do. Base models learn what we actually do.

The twist: in one-shot textbook games, where human behavior is closer to the "rational" ideal, aligned models win. The advantage only breaks down when real interaction dynamics kick in — revealing a fundamental trade-off between optimizing a model for human use and using it as a model of human behavior.

The most interesting bit for me is their method of extracting predictions from token-level logprobs on decision tokens; they even test four prompt variants per pair to disentangle model type from formatting. This setup seems to give a clean read on the normative bias introduced by alignment across 120 base-aligned model pairs and 10k human decisions, and it stays robust across model families. One caveat I'd push on is whether decision-token granularity truly captures strategic depth, since reciprocity and history effects can live in longer-range dependencies that token probabilities might miss. The arxivlens breakdown helped me parse the method details, and the link here is handy: https://arxivlens.com/PaperView/Details/alignment-makes-language-models-normative-not-descriptive-8991-7e7309bb
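For readers curious about the shape of that extraction, here is a minimal sketch under my own assumptions: the model name, prompt, and option strings are placeholders, and scoring each option by its first token is a simplification, not necessarily the paper's exact procedure:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder; the paper tests many base/aligned pairs

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def decision_distribution(prompt: str, options: list[str]) -> dict[str, float]:
    """Renormalize next-token logprobs over the candidate decision tokens."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next token
    logprobs = torch.log_softmax(logits, dim=-1)
    # Score each option by its first token; real use needs care with
    # whitespace and multi-token option strings.
    opt_ids = torch.tensor([tok.encode(o, add_special_tokens=False)[0] for o in options])
    probs = torch.softmax(logprobs[opt_ids], dim=0)
    return dict(zip(options, probs.tolist()))

# Toy example: a responder decision in an ultimatum-style bargaining round.
print(decision_distribution(
    "You are offered $3 out of $10. Do you Accept or Reject? Answer:",
    [" Accept", " Reject"],
))
```

Running the same function over a base model and its aligned counterpart, decision by decision, is all it takes to set up the kind of pairwise comparison tallied in the post above.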

