---
library_name: transformers
tags:
- tiny
- from-scratch
- instruction-tuned
- causal-lm
- parchmentlm
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- Cleanlab/databricks-dolly-15k-cleaned
- ProCreations/SimpleMath
language:
- en
base_model:
- SlitherCode/tiny-edu-166m
---

# ParchmentLM 166M Instruct

A 166M-parameter instruction-tuned language model trained entirely from scratch (custom architecture, real pretraining data, and a full SFT pipeline) for roughly $55 in cloud compute.

This is a proof of concept demonstrating the full LLM development pipeline: architecture design, pretraining on real web data, supervised fine-tuning, and deployment. It is not intended for production use.

## Model Details

- **Developed by:** Pranay Narula (SlitherCode)
- **Model type:** ParchmentLM, a custom decoder-only transformer architecture
- **Language:** English
- **License:** MIT
- **Base model:** [SlitherCode/tiny-edu-166m](https://huggingface.co/SlitherCode/tiny-edu-166m) (pretrained from scratch)

### Architecture

ParchmentLM is a custom LLaMA-style architecture with the following components:

| Component | Details |
|---|---|
| Parameters | ~166M |
| Layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| FFN size | 2048 |
| Context length | 1024 tokens |
| Positional encoding | RoPE |
| Normalization | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Attention | PyTorch `scaled_dot_product_attention` (uses FlashAttention kernels when available) |
| Tokenizer | tiktoken cl100k_base (vocab size 100,277) |
| Weight tying | Yes (input embeddings = output projection) |
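
As a sanity check on the table above, here is a back-of-the-envelope parameter count. It is only a sketch under assumptions the table does not confirm: no bias terms, one weight vector per RMSNorm, and the tied embedding counted once. Under those assumptions it lands near 162M, a little below the stated ~166M, so treat it as a ballpark figure rather than the model's exact layout.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not confirmed by the model card): no biases, one weight
# vector per RMSNorm, tied input/output embeddings counted once.
VOCAB, HIDDEN, LAYERS, FFN = 100_277, 768, 12, 2048

embedding = VOCAB * HIDDEN       # shared with the output projection
attention = 4 * HIDDEN * HIDDEN  # Q, K, V and output projections
swiglu = 3 * HIDDEN * FFN        # gate, up and down projections
norms = 2 * HIDDEN               # two pre-norm RMSNorm layers per block

per_layer = attention + swiglu + norms
total = embedding + LAYERS * per_layer + HIDDEN  # plus one final RMSNorm

print(f"embedding: {embedding:,}")
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Note that the embedding matrix accounts for close to half of the estimate, which is why weight tying matters so much at this scale.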

### Chat Template (ParchmentLM format)

```
system
You are a helpful assistant<|endoftext|>
user
{user message}<|endoftext|>
assistant
{assistant response}<|endoftext|>
```

`<|endoftext|>` (token ID 100257) serves as both the turn separator and stop token.
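
The layout above can be reproduced with plain string formatting. The helper below is a hypothetical sketch (`format_chat` is not part of the repository), and the exact whitespace used by the tokenizer's real chat template may differ, so prefer `tokenizer.apply_chat_template` when it is available.

```python
EOT = "<|endoftext|>"  # turn separator and stop token, per the model card


def format_chat(messages, add_generation_prompt=True):
    """Render messages as 'role', newline, content, EOT (hypothetical helper)."""
    parts = [f"{m['role']}\n{m['content']}{EOT}\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("assistant\n")
    return "".join(parts)


prompt = format_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```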

## Training

### Stage 1: Pretraining

- **Dataset:** FineWeb-Edu 10BT sample (HuggingFaceFW/fineweb-edu)
- **Tokens trained on:** ~4B
- **Infrastructure:** Modal, single A100-40GB
- **Throughput:** ~75,000 tokens/sec
- **Duration:** ~14.8 hours
- **Cost:** ~$46
- **Optimizer:** AdamW (β1=0.9, β2=0.95, weight decay=0.1)
- **Learning rate:** 3e-4 with cosine decay to 3e-5, 2,000-step warmup
- **Batch size:** 16 × 8 grad accum × 1024 seq len ≈ 131k tokens/step
- **Precision:** bfloat16
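
The learning-rate settings above can be sketched as a small schedule function. This is an illustration, not the training script: it assumes linear warmup, and it derives the total step count from exactly 4B tokens, while the card only says "~4B". The name `lr_at` is hypothetical.

```python
import math

MAX_LR, MIN_LR, WARMUP_STEPS = 3e-4, 3e-5, 2000
TOKENS_PER_STEP = 16 * 8 * 1024  # batch size x grad accum x sequence length
TOTAL_STEPS = 4_000_000_000 // TOKENS_PER_STEP  # assumes exactly 4B tokens


def lr_at(step):
    """Linear warmup (assumed) followed by cosine decay from MAX_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"total steps:     {TOTAL_STEPS:,}")
for step in (0, 1999, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```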

### Stage 2: Supervised Fine-Tuning

- **Datasets:**
  - [Cleanlab/databricks-dolly-15k-cleaned](https://huggingface.co/datasets/Cleanlab/databricks-dolly-15k-cleaned), filtered to the `closed_qa`, `open_qa`, and `information_extraction` categories (~7k examples)
  - [ProCreations/SimpleMath](https://huggingface.co/datasets/ProCreations/SimpleMath), balanced at 2,500 examples per operation (+, -, *, /), 10k total
- **Total SFT examples:** ~17k
- **Loss:** Completion-only (prompt and padding tokens masked to -100)
- **Pad token:** `<|endofprompt|>` (token ID 83285) to preserve EOT as a learnable stop signal
- **Epochs:** 8
- **Learning rate:** 1e-4 with cosine decay
- **Batch size:** 16 × 2 grad accum
- **Duration:** ~38 minutes
- **Cost:** ~$1.50
- **Infrastructure:** Modal, single A100-40GB
- **Precision:** bfloat16
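
To make the completion-only loss concrete, the sketch below builds one padded training example from token ID lists. It is a minimal illustration with a hypothetical helper (`build_example`) and made-up token IDs for the prompt and response; only the pad ID, EOT ID, and the -100 mask value come from this card. Using a pad token distinct from `<|endoftext|>` means the final EOT keeps a real label while padding positions are masked.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy ignores by default
PAD_ID = 83285       # <|endofprompt|>, as stated above
EOT_ID = 100257      # <|endoftext|>


def build_example(prompt_ids, response_ids, max_len):
    """Return (input_ids, labels) with prompt and padding masked out."""
    # Truncation may drop the EOT on over-long examples; a real pipeline
    # would need to decide how to handle that case.
    input_ids = (prompt_ids + response_ids + [EOT_ID])[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids + [EOT_ID])[:max_len]
    pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * pad, labels + [IGNORE_INDEX] * pad


# Toy IDs purely for illustration.
input_ids, labels = build_example([11, 12, 13], [21, 22], max_len=8)
print(input_ids)
print(labels)
```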

**Total training cost: ~$55, including multiple SFT iterations**

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is loaded from the base model repo; the weights come from the instruct repo.
tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
# Pad with <|endofprompt|> so that <|endoftext|> stays reserved as the stop token.
tokenizer.pad_token = "<|endofprompt|>"
PAD_ID = tokenizer.convert_tokens_to_ids("<|endofprompt|>")

model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166M-instruct", trust_remote_code=True)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation in the ParchmentLM chat format and open the assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
input_len = inputs["input_ids"].shape[1]

# Greedy decoding with a mild repetition penalty.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=PAD_ID,
    )

# Decode only the newly generated tokens and cut at the first <|endoftext|>.
raw = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=False)
response = raw.split("<|endoftext|>")[0].strip()
print(response)
# The capital of France is Paris.
```

**Note:** For arithmetic, use the format `"47 + 83 ="` rather than `"What is 47 + 83?"` to match the training distribution.
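
A small helper can keep arithmetic prompts in that bare format. This is a hypothetical convenience function (`arithmetic_prompt` is not part of the repository), shown only to illustrate the shape of the user message.

```python
def arithmetic_prompt(a, op, b):
    """Format an arithmetic question as 'a op b =' (hypothetical helper)."""
    if op not in {"+", "-", "*", "/"}:
        raise ValueError(f"unsupported operation: {op!r}")
    return f"{a} {op} {b} ="


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": arithmetic_prompt(47, "+", 83)},
]
print(messages[1]["content"])
```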

## Evaluation

Informal evaluation on held-out questions:

| Question | Response | Correct? |
|---|---|---|
| What is the capital of France? | The capital of France is Paris. | ✓ |
| What is the capital of Germany? | The capital of Germany is Berlin. | ✓ |
| Who wrote Romeo and Juliet? | Romeo and Juliet was written by William Shakespeare. | ✓ |
| 12 + 5 = | 17 | ✓ |
| 900 - 345 = | 700 | ✗ (correct: 555) |
| 2790 + 6698 = | 9648 | ✗ (correct: 9488) |
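
The arithmetic rows can be checked mechanically. The sketch below is a hypothetical scorer (`score_arithmetic` is not part of the repository) that parses prompts in the `a op b =` format and compares the model's response with the exact integer result. Division is left out to avoid deciding how non-integer answers should be judged.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
PROMPT_RE = re.compile(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*=\s*")


def score_arithmetic(prompt, response):
    """Return (expected, is_correct) for a prompt such as '12 + 5 ='."""
    match = PROMPT_RE.fullmatch(prompt)
    if match is None:
        raise ValueError(f"unrecognised prompt: {prompt!r}")
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    expected = OPS[op](a, b)
    try:
        return expected, int(response.strip()) == expected
    except ValueError:
        # A non-numeric response counts as incorrect.
        return expected, False


# Prompt/response pairs taken from the table above.
rows = [("12 + 5 =", "17"), ("900 - 345 =", "700"), ("2790 + 6698 =", "9648")]
for prompt, response in rows:
    expected, ok = score_arithmetic(prompt, response)
    print(f"{prompt} {response} -> expected {expected}, correct: {ok}")
```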

**Limitations:**
- Reliable arithmetic only up to ~2-3 digit operands
- Tends to hallucinate on out-of-distribution factual questions
- No safety filtering or alignment
- Will not stop gracefully on prompts with no clear answer (creative writing, open-ended tasks)
- Undertrained relative to model capacity: 4B tokens versus the ~300B tokens that models of this size typically see

## Compute & Environmental Impact

- **Hardware:** NVIDIA A100-40GB (via Modal)
- **Cloud provider:** Modal (AWS us-east-1 region)
- **Total GPU hours:** ~15.5 hours
- **Total cost:** ~$55 USD
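
The GPU-hour figure can be cross-checked against the throughput and durations reported in the Training section. This is simple arithmetic on the card's rounded numbers (it assumes exactly 4B pretraining tokens and counts a single 38-minute SFT run), so small differences from the stated totals are expected.

```python
PRETRAIN_TOKENS = 4_000_000_000  # "~4B" on the card; exact value assumed here
TOKENS_PER_SEC = 75_000
SFT_MINUTES = 38                 # one SFT run; repeated iterations not counted

pretrain_hours = PRETRAIN_TOKENS / TOKENS_PER_SEC / 3600
sft_hours = SFT_MINUTES / 60
total_hours = pretrain_hours + sft_hours

print(f"pretraining: {pretrain_hours:.1f} h")
print(f"SFT:         {sft_hours:.1f} h")
print(f"total:       {total_hours:.1f} h")
```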

## Citation

If you use this model or find this project useful, a link back to the repository is appreciated.

```
@misc{narula2025parchmentlm,
  author = {Pranay Narula},
  title = {ParchmentLM 166M Instruct: Full LLM Pipeline From Scratch},
  year = {2025},
  url = {https://huggingface.co/SlitherCode/tiny-edu-166M-instruct}
}
```