Where does output diversity collapse in post-training?
Abstract
Output diversity collapse in post-trained language models is primarily driven by training data composition rather than generation format, with different post-training methods affecting diversity differently across tasks.
Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by the training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and that Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
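To make the quality-control vs. residual decomposition concrete, here is a minimal sketch of one plausible operationalization. The diversity metric (mean pairwise Jaccard distance over token sets) and the exact bookkeeping are illustrative assumptions for this page, not the paper's own metrics or definitions, which use four text diversity measures and verifiable-task correctness labels.

```python
# Sketch: split the base -> post-trained diversity drop into a quality-control
# part (what is lost just by discarding base samples marked incorrect) and a
# residual part (further narrowing among correct samples). Assumed
# operationalization; the paper's exact formulation may differ.
from itertools import combinations

def pairwise_diversity(outputs):
    """Mean pairwise Jaccard distance between token sets (simple lexical proxy)."""
    if len(outputs) < 2:
        return 0.0
    token_sets = [set(o.lower().split()) for o in outputs]
    dists = [
        1.0 - len(a & b) / len(a | b)
        for a, b in combinations(token_sets, 2)
        if a | b
    ]
    return sum(dists) / len(dists) if dists else 0.0

def decompose_diversity_loss(base_samples, post_samples):
    """Each sample is a (text, is_correct) pair on a verifiable task."""
    base_all = [t for t, _ in base_samples]
    base_correct = [t for t, ok in base_samples if ok]
    post_correct = [t for t, ok in post_samples if ok]

    d_base_all = pairwise_diversity(base_all)
    d_base_correct = pairwise_diversity(base_correct)
    d_post_correct = pairwise_diversity(post_correct)

    return {
        "total_loss": d_base_all - d_post_correct,
        "quality_control": d_base_all - d_base_correct,  # removal of incorrect outputs
        "residual": d_base_correct - d_post_correct,      # narrowing among correct outputs
    }

# Toy usage: a diverse but often-wrong base model vs. a correct but repetitive
# post-trained model.
base = [("the answer is 12", True), ("i think 13", False),
        ("maybe 12?", True), ("it could be 7", False)]
post = [("the answer is 12", True), ("the answer is 12.", True),
        ("the answer is 12", True)]
print(decompose_diversity_loss(base, post))
```

Under this split, a task where most of the drop lands in `quality_control` loses diversity mainly because wrong answers were pruned, while a large `residual` indicates genuine narrowing among correct outputs.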
Community
We trace output diversity through three parallel post-training lineages of Olmo 3, showing that it is determined by training data, is not imposed by the chain-of-thought generation format, and decomposes into quality-control and residual components whose split is task-dependent.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions (2026)
- Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution (2026)
- Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding (2026)
- Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (2026)
- Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability (2026)
- GATES: Self-Distillation under Privileged Context with Consensus Gating (2026)
- Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models (2026)