---
library_name: transformers
tags:
- tiny
- from-scratch
- instruction-tuned
- causal-lm
- parchmentlm
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- Cleanlab/databricks-dolly-15k-cleaned
- ProCreations/SimpleMath
language:
- en
base_model:
- SlitherCode/tiny-edu-166m
---

# ParchmentLM 166M Instruct

A 166M-parameter instruction-tuned language model trained entirely from scratch (custom architecture, real pretraining data, and a full SFT pipeline) for roughly $55 in cloud compute.

This is a proof of concept demonstrating the full LLM development pipeline: architecture design, pretraining on real web data, supervised fine-tuning, and deployment. It is not intended for production use.

## Model Details

- **Developed by:** Pranay Narula (SlitherCode)
- **Model type:** ParchmentLM — a custom decoder-only transformer architecture
- **Language:** English
- **License:** MIT
- **Base model:** [SlitherCode/tiny-edu-166m](https://huggingface.co/SlitherCode/tiny-edu-166m) (pretrained from scratch)

### Architecture

ParchmentLM is a custom LLaMA-style architecture with the following components:

| Component | Details |
|---|---|
| Parameters | ~166M |
| Layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| FFN size | 2048 |
| Context length | 1024 tokens |
| Positional encoding | RoPE |
| Normalization | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Attention | FlashAttention (via `scaled_dot_product_attention`) |
| Tokenizer | tiktoken cl100k_base (vocab size 100,277) |
| Weight tying | Yes (input embeddings = output projection) |
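
As a sanity check on the table above, the parameter count can be estimated from the listed dimensions. This is a rough sketch, not the model's actual code: it assumes no bias terms, four square attention projections per layer, a three-matrix SwiGLU FFN, and two RMSNorm weight vectors per layer plus a final norm. None of these details are stated in this card beyond what the table lists.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not confirmed by the card): no biases, Q/K/V/O projections of
# size hidden x hidden, a gate/up/down SwiGLU FFN, two RMSNorms per layer.
vocab_size = 100_277
hidden = 768
ffn = 2048
layers = 12

embedding = vocab_size * hidden   # shared with the output projection (weight tying)
attention = 4 * hidden * hidden   # Q, K, V, and output projections
swiglu = 3 * hidden * ffn         # gate, up, and down projections
norms = 2 * hidden                # pre-attention and pre-FFN RMSNorm weights
per_layer = attention + swiglu + norms

total = embedding + layers * per_layer + hidden  # + final RMSNorm
print(f"embedding share: {embedding / total:.0%}")
print(f"estimated total: {total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands in the low 160M range, close to the stated ~166M, with the token embedding accounting for nearly half of all parameters. Any remaining gap would come from implementation details the table does not list.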

### Chat Template (ParchmentLM format)

```
system
You are a helpful assistant<|endoftext|>
user
{user message}<|endoftext|>
assistant
{assistant response}<|endoftext|>
```

`<|endoftext|>` (token ID 100257) serves as both the turn separator and stop token.
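
For illustration, the layout above can be reproduced with a few lines of plain Python. The function name `format_chat` is hypothetical, and the exact whitespace (a newline after each role name and after each `<|endoftext|>`) is an assumption read off the template as printed; the chat template shipped with the tokenizer remains the authoritative version.

```python
EOT = "<|endoftext|>"  # turn separator and stop token, per the card


def format_chat(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the ParchmentLM layout (sketch)."""
    # Each turn is: role name, newline, content, EOT, newline (assumed whitespace).
    parts = [f"{m['role']}\n{m['content']}{EOT}\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant header open so the model writes the reply.
        parts.append("assistant\n")
    return "".join(parts)


prompt = format_chat([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```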

## Training

### Stage 1 — Pretraining

- **Dataset:** FineWeb-Edu 10BT sample (HuggingFaceFW/fineweb-edu)
- **Tokens trained on:** ~4B
- **Infrastructure:** Modal, single A100-40GB
- **Throughput:** ~75,000 tokens/sec
- **Duration:** ~14.8 hours
- **Cost:** ~$46
- **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
- **Learning rate:** 3e-4 with cosine decay to 3e-5, 2000 step warmup
- **Batch size:** 16 × 8 grad accum × 1024 seq len ≈ 131k tokens/step
- **Precision:** bfloat16
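
The numbers above can be cross-checked, and the learning-rate schedule sketched, in a few lines. This is an illustration rather than the training script: the linear warmup shape, the `lr_at` helper, and the step count derived from ~4B tokens are all assumptions.

```python
import math

# Figures taken from the list above.
tokens_per_step = 16 * 8 * 1024                  # batch x grad accum x seq len
total_steps = round(4e9 / tokens_per_step)       # assumed: ~4B tokens in total
hours = 4e9 / 75_000 / 3600                      # tokens / (tokens per sec) / (sec per hour)


def lr_at(step, total_steps, peak=3e-4, floor=3e-5, warmup=2000):
    """Linear warmup (assumed shape) followed by cosine decay from peak to floor."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


print(f"{tokens_per_step:,} tokens/step, ~{total_steps:,} steps, ~{hours:.1f} h")
for step in (0, 1999, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The derived duration of about 14.8 hours matches the reported figure, which suggests the throughput and token counts are mutually consistent.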

### Stage 2 — Supervised Fine-Tuning

- **Datasets:**
  - [Cleanlab/databricks-dolly-15k-cleaned](https://huggingface.co/datasets/Cleanlab/databricks-dolly-15k-cleaned) — filtered to `closed_qa`, `open_qa`, `information_extraction` categories (~7k examples)
  - [ProCreations/SimpleMath](https://huggingface.co/datasets/ProCreations/SimpleMath) — 2,500 examples per operation (+, -, *, /) balanced, 10k total
- **Total SFT examples:** ~17k
- **Loss:** Completion-only (prompt and padding tokens masked to -100)
- **Pad token:** `<|endofprompt|>` (token ID 83285) to preserve EOT as a learnable stop signal
- **Epochs:** 8
- **Learning rate:** 1e-4 cosine decay
- **Batch size:** 16 × 2 grad accum
- **Duration:** ~38 minutes
- **Cost:** ~$1.50
- **Infrastructure:** Modal, single A100-40GB
- **Precision:** bfloat16
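
The completion-only loss and the choice of a separate pad token can be illustrated with a small label-building sketch. The helper name `build_labels` and the short token IDs are made up for the example; only the `<|endoftext|>` and `<|endofprompt|>` IDs come from this card. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100   # positions with this label are skipped by the loss
EOT_ID = 100257       # <|endoftext|>, per the card
PAD_ID = 83285        # <|endofprompt|>, per the card


def build_labels(input_ids, prompt_len, pad_id=PAD_ID):
    """Copy input_ids to labels, masking prompt tokens and padding."""
    return [
        IGNORE_INDEX if i < prompt_len or tok == pad_id else tok
        for i, tok in enumerate(input_ids)
    ]


# Illustrative sequence: 3 prompt tokens, 2 completion tokens, EOT, 2 pads.
input_ids = [11, 12, 13, 21, 22, EOT_ID, PAD_ID, PAD_ID]
labels = build_labels(input_ids, prompt_len=3)
print(labels)
```

Because padding uses a different ID from `<|endoftext|>`, the final EOT keeps its label and the model is trained to emit it. If the pad token were `<|endoftext|>` itself, this masking rule would also hide the stop token from the loss.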

**Total training cost: ~$55, including multiple SFT iterations**

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
# Pad with <|endofprompt|> (as in SFT) so <|endoftext|> remains a usable stop token
tokenizer.pad_token = "<|endofprompt|>"

model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166M-instruct", trust_remote_code=True)
model.eval()

PAD_ID = tokenizer.convert_tokens_to_ids("<|endofprompt|>")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the chat template and leave the assistant turn open for generation
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
input_len = inputs["input_ids"].shape[1]  # used below to slice off the prompt tokens

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=PAD_ID,
    )

raw = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=False)
response = raw.split("<|endoftext|>")[0].strip()
print(response)
# The capital of France is Paris.
```

**Note:** For arithmetic, use the format `"47 + 83 ="` rather than `"What is 47 + 83?"` to match the training distribution.

## Evaluation

Informal evaluation on held-out questions:

| Question | Response | Correct? |
|---|---|---|
| What is the capital of France? | The capital of France is Paris. | ✓ |
| What is the capital of Germany? | The capital of Germany is Berlin. | ✓ |
| Who wrote Romeo and Juliet? | Romeo and Juliet was written by William Shakespeare. | ✓ |
| 12 + 5 = | 17 | ✓ |
| 900 - 345 = | 700 | ✗ (correct: 555) |
| 2790 + 6698 = | 9648 | ✗ (correct: 9488) |
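
The arithmetic rows can be scored automatically with a small checker. This is a sketch, not the evaluation script used for the table: the `expected_answer` helper is hypothetical, and it only handles prompts in the `"a op b ="` form with non-negative integer operands.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
PROMPT_RE = re.compile(r"^\s*(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*$")


def expected_answer(prompt):
    """Compute the reference answer for a prompt such as '47 + 83 ='."""
    match = PROMPT_RE.match(prompt)
    if match is None:
        raise ValueError(f"not an arithmetic prompt: {prompt!r}")
    a, op, b = match.groups()
    return OPS[op](int(a), int(b))


# (prompt, model response) pairs copied from the table above.
rows = [("12 + 5 =", "17"), ("900 - 345 =", "700"), ("2790 + 6698 =", "9648")]
for prompt, response in rows:
    want = expected_answer(prompt)
    verdict = "correct" if float(response) == want else f"wrong, expected {want}"
    print(f"{prompt} {response} -> {verdict}")
```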

**Limitations:**
- Reliable arithmetic only up to ~2-3 digit operands
- Tends to hallucinate on out-of-distribution factual questions
- No safety filtering or alignment
- Will not stop gracefully on prompts with no clear answer (creative writing, open-ended tasks)
- Undertrained relative to model capacity: trained on ~4B tokens, versus the ~300B tokens that models of this size typically see

## Compute & Environmental Impact

- **Hardware:** NVIDIA A100-40GB (via Modal)
- **Cloud provider:** Modal (AWS us-east-1 region)
- **Total GPU hours:** ~15.5 hours
- **Total cost:** ~$55 USD

## Citation

If you use this model or find this project useful, a link back to the repository is appreciated.

```
@misc{narula2025parchmentlm,
  author = {Pranay Narula},
  title = {ParchmentLM 166M Instruct: Full LLM Pipeline From Scratch},
  year = {2025},
  url = {https://huggingface.co/SlitherCode/tiny-edu-166M-instruct}
}
```