metadata
language: en
license: mit
tags:
- pretrained
- causal-lm
- fineweb-edu
- custom-architecture
tiny-edu-166m (ParchmentLM)
A 166M parameter transformer pretrained from scratch on 4B tokens of FineWeb-Edu.
Architecture (ParchmentLM)
Custom decoder-only transformer:
- Parameters: 166M
- Layers: 12
- Hidden size: 768
- Attention heads: 12
- FFN: SwiGLU (hidden=2048)
- Context length: 1024
- Positional encoding: RoPE (base=10000)
- Normalization: RMSNorm
- Tokenizer: cl100k_base (100277 tokens)
Training
- Dataset: FineWeb-Edu 10BT sample
- Tokens seen: ~4B
- Steps: 30,000
- Optimizer: AdamW (lr=3e-4, cosine decay to 3e-5)
- Hardware: Single A100 80GB
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
inputs = tokenizer("The history of mathematics", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
License
Model weights: MIT. Training data: ODC-By 1.0.