Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets • Paper • 2602.22207 • Published Feb 25 • 43
The Synthetic Data Playbook: Generating Trillions of the Finest Tokens • 217 • Explore synthetic data experiments on a virtual bookshelf
Lapa v0.1.2 Release • Collection • Release of SOTA Ukrainian LLM and Datasets • 18 items • Updated Nov 13, 2025 • 28
The Smol Training Playbook • 3.09k • The secrets to building world-class LLMs
INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0 • 4 • Chat with INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0
Welcome GPT OSS, the new open-source model family from OpenAI! • Article • Aug 5, 2025 • 513
OmniGEC • Collection • This is a collection of multilingual silver-standard datasets and models for the task of Grammatical Error Correction (GEC). • 9 items • Updated Sep 19, 2025 • 8
Announcing MamayLM, an efficient state-of-the-art Ukrainian LLM • Article • Apr 23, 2025 • 63
The Tokenizer Playground • 648 • Experiment with and compare different tokenizers