Ishant06/Qwen3.5-0.8B-Claude-4.6-Opus-Reasoning-Distilled Text Generation • 0.8B • Updated Mar 15 • 23 • 4
Article DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge NormalUhr • Feb 7, 2025 • 293