Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models Paper • 2501.12370 • Published Jan 21, 2025 • 12
Strong Teacher Not Needed? On Distillation in LLM Pretraining Paper • 2605.23857 • Published 14 days ago • 1
Adam's Law: Textual Frequency Law on Large Language Models Paper • 2604.02176 • Published Apr 2 • 504
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs Paper • 2604.13226 • Published Apr 14 • 10
Running Featured 85 Distilling 100B+ Models 40x Faster with TRL 📝 85 TRL distillation for 100B+ teachers, 40x faster
view article Article SmolLM3: smol, multilingual, long-context reasoner +21 eliebak, cmpatino, anton-l, edbeeching, m-ric, nouamanetazi, akseljoonas, guipenedo, hynky, clefourrier, SaylorTwift, kashif, qgallouedec, hlarcher, glutamatt, Xenova, reach-vb, ngxson, craffel, lewtun, loubnabnl, lvwerra, thomwolf • Jul 8, 2025 • 778