---
library_name: transformers
tags:
- tiny
- from-scratch
- instruction-tuned
- causal-lm
- parchmentlm
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- Cleanlab/databricks-dolly-15k-cleaned
- ProCreations/SimpleMath
language:
- en
base_model:
- SlitherCode/tiny-edu-166m
---

# ParchmentLM 166M Instruct

A 166M parameter instruction-tuned language model trained entirely from scratch (custom architecture, real pretraining data, and full SFT pipeline) for under $55 in cloud compute.

This is a proof-of-concept demonstrating the full LLM development pipeline: architecture design, pretraining on real web data, supervised fine-tuning, and deployment. It is not intended for production use.

## Model Details

- **Developed by:** Pranay Narula (SlitherCode)
- **Model type:** ParchmentLM, a custom decoder-only transformer architecture
- **Language:** English
- **License:** MIT
- **Base model:** [SlitherCode/tiny-edu-166m](https://huggingface.co/SlitherCode/tiny-edu-166m) (pretrained from scratch)

### Architecture

ParchmentLM is a custom LLaMA-style architecture with the following components:

| Component | Details |
|---|---|
| Parameters | ~166M |
| Layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| FFN size | 2048 |
| Context length | 1024 tokens |
| Positional encoding | RoPE |
| Normalization | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Attention | FlashAttention (via `scaled_dot_product_attention`) |
| Tokenizer | tiktoken cl100k_base (vocab size 100,277) |
| Weight tying | Yes (input embeddings = output projection) |

### Chat Template (ParchmentLM format)

```
system
You are a helpful assistant<|endoftext|>
user
{user message}<|endoftext|>
assistant
{assistant response}<|endoftext|>
```

`<|endoftext|>` (token ID 100257) serves as both the turn separator and stop token.
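The template above can also be rendered by hand, which is useful for checking what `apply_chat_template` produces. The following is a minimal sketch, not the model's shipped template: the helper name `format_prompt` is hypothetical, and the exact whitespace (a newline between the role name and the content, and a newline after each `<|endoftext|>`) is an assumption that should be verified against the tokenizer's real template.

```python
EOT = "<|endoftext|>"  # turn separator and stop token, as stated above


def format_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the ParchmentLM layout (hypothetical helper).

    Assumption: a newline separates the role name from the content and
    each turn from the next. The real template may differ in whitespace.
    """
    parts = [f"{m['role']}\n{m['content']}{EOT}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = format_prompt(messages)
print(prompt)
```

Because every turn ends with the same token the model emits to stop, generation can be cut at the first `<|endoftext|>` after the opened assistant turn.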
## Training

### Stage 1: Pretraining

- **Dataset:** FineWeb-Edu 10BT sample (HuggingFaceFW/fineweb-edu)
- **Tokens trained on:** ~4B
- **Infrastructure:** Modal, single A100-40GB
- **Throughput:** ~75,000 tokens/sec
- **Duration:** ~14.8 hours
- **Cost:** ~$46
- **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
- **Learning rate:** 3e-4 with cosine decay to 3e-5, 2000 step warmup
- **Batch size:** 16 × 8 grad accum × 1024 seq len ≈ 131k tokens/step
- **Precision:** bfloat16

### Stage 2: Supervised Fine-Tuning

- **Datasets:**
  - [Cleanlab/databricks-dolly-15k-cleaned](https://huggingface.co/datasets/Cleanlab/databricks-dolly-15k-cleaned): filtered to the `closed_qa`, `open_qa`, and `information_extraction` categories (~7k examples)
  - [ProCreations/SimpleMath](https://huggingface.co/datasets/ProCreations/SimpleMath): balanced at 2,500 examples per operation (+, -, *, /), 10k total
- **Total SFT examples:** ~17k
- **Loss:** Completion-only (prompt and padding tokens masked to -100)
- **Pad token:** `<|endofprompt|>` (token ID 83285) to preserve EOT as a learnable stop signal
- **Epochs:** 8
- **Learning rate:** 1e-4 cosine decay
- **Batch size:** 16 × 2 grad accum
- **Duration:** ~38 minutes
- **Cost:** ~$1.50
- **Infrastructure:** Modal, single A100-40GB
- **Precision:** bfloat16

**Total training cost: ~$55, including multiple SFT iterations**

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
# Pad with <|endofprompt|> so that <|endoftext|> stays reserved as the stop token.
tokenizer.pad_token = "<|endofprompt|>"

model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166M-instruct", trust_remote_code=True)
model.eval()

PAD_ID = tokenizer.convert_tokens_to_ids("<|endofprompt|>")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
input_len = inputs["input_ids"].shape[1]

# Greedy decoding with a mild repetition penalty.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=PAD_ID,
    )

# Decode only the newly generated tokens and cut at the first stop token.
raw = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=False)
response = raw.split("<|endoftext|>")[0].strip()
print(response)
# The capital of France is Paris.
```

**Note:** For arithmetic, use the format `"47 + 83 ="` rather than `"What is 47 + 83?"` to match the training distribution.

## Evaluation

Informal evaluation on held-out questions:

| Question | Response | Correct? |
|---|---|---|
| What is the capital of France? | The capital of France is Paris. | ✓ |
| What is the capital of Germany? | The capital of Germany is Berlin. | ✓ |
| Who wrote Romeo and Juliet? | Romeo and Juliet was written by William Shakespeare. | ✓ |
| 12 + 5 = | 17 | ✓ |
| 900 - 345 = | 700 | ✗ (correct: 555) |
| 2790 + 6698 = | 9648 | ✗ (correct: 9488) |

**Limitations:**

- Arithmetic is reliable only up to roughly 2-3 digit operands
- Tends to hallucinate on out-of-distribution factual questions
- No safety filtering or alignment
- Will not stop gracefully on prompts with no clear answer (creative writing, open-ended tasks)
- Undertrained relative to model capacity: 4B tokens vs. the ~300B tokens that models of this size typically see

## Compute & Environmental Impact

- **Hardware:** NVIDIA A100-40GB (via Modal)
- **Cloud provider:** Modal (AWS us-east-1 region)
- **Total GPU hours:** ~15.5 hours
- **Total cost:** ~$55 USD

## Citation

If you use this model or find this project useful, a link back to the repository is appreciated.

```
@misc{narula2025parchmentlm,
  author = {Pranay Narula},
  title = {ParchmentLM 166M Instruct: Full LLM Pipeline From Scratch},
  year = {2025},
  url = {https://huggingface.co/SlitherCode/tiny-edu-166M-instruct}
}
```