DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge NormalUhr • Feb 7, 2025 • 295
DotLM Collection: SimpleThoughts data spans four stages (pretraining, SFT, alignment, and reasoning), training DotLM-165M to prioritize reasoning over memorization. • 2 items • Updated May 30