--- language: en license: mit tags: - pretrained - causal-lm - fineweb-edu - custom-architecture --- # tiny-edu-166m (ParchmentLM) A 166M parameter transformer pretrained from scratch on 4B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu). ## Architecture (ParchmentLM) Custom decoder-only transformer: - **Parameters:** 166M - **Layers:** 12 - **Hidden size:** 768 - **Attention heads:** 12 - **FFN:** SwiGLU (hidden=2048) - **Context length:** 1024 - **Positional encoding:** RoPE (base=10000) - **Normalization:** RMSNorm - **Tokenizer:** cl100k_base (100277 tokens) ## Training - **Dataset:** FineWeb-Edu 10BT sample - **Tokens seen:** ~4B - **Steps:** 30,000 - **Optimizer:** AdamW (lr=3e-4, cosine decay to 3e-5) - **Hardware:** Single A100 80GB ## Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True) inputs = tokenizer("The history of mathematics", return_tensors="pt") out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8) print(tokenizer.decode(out[0], skip_special_tokens=True)) ``` ## License Model weights: MIT. Training data: ODC-By 1.0.