library_name: transformers
tags:
  - tiny
  - from-scratch
  - instruction-tuned
  - causal-lm
  - parchmentlm
license: mit
datasets:
  - HuggingFaceFW/fineweb-edu
  - Cleanlab/databricks-dolly-15k-cleaned
  - ProCreations/SimpleMath
language:
  - en
base_model:
  - SlitherCode/tiny-edu-166m

ParchmentLM 166M Instruct

A 166M-parameter instruction-tuned language model trained entirely from scratch (custom architecture, real pretraining data, and a full SFT pipeline) for under $55 in cloud compute.

This is a proof-of-concept demonstrating the full LLM development pipeline: architecture design, pretraining on real web data, supervised fine-tuning, and deployment. It is not intended for production use.

Model Details

  • Developed by: Pranay Narula (SlitherCode)
  • Model type: ParchmentLM, a custom decoder-only transformer architecture
  • Language: English
  • License: MIT
  • Base model: SlitherCode/tiny-edu-166m (pretrained from scratch)

Architecture

ParchmentLM is a custom LLaMA-style architecture with the following components:

| Component | Details |
|---|---|
| Parameters | ~166M |
| Layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| FFN size | 2048 |
| Context length | 1024 tokens |
| Positional encoding | RoPE |
| Normalization | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Attention | FlashAttention (via scaled_dot_product_attention) |
| Tokenizer | tiktoken cl100k_base (vocab size 100,277) |
| Weight tying | Yes (input embeddings = output projection) |
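As a rough sanity check on the parameter figure, the sketch below counts weights from the table values. It assumes bias-free linear layers, standard multi-head attention with four 768×768 projections per layer, a three-matrix SwiGLU FFN, and one weight vector per RMSNorm; the card does not confirm these details, and `ParchmentConfig` and `count_parameters` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParchmentConfig:
    # Values copied from the architecture table; the class itself is illustrative.
    vocab_size: int = 100_277
    n_layers: int = 12
    n_heads: int = 12
    hidden_size: int = 768
    ffn_size: int = 2048
    context_length: int = 1024


def count_parameters(cfg: ParchmentConfig) -> dict:
    """Estimate parameter counts under the assumptions stated in the lead-in."""
    d = cfg.hidden_size
    embedding = cfg.vocab_size * d   # shared with the output projection (weight tying)
    attention = 4 * d * d            # Q, K, V and output projections, no biases (assumed)
    ffn = 3 * d * cfg.ffn_size       # SwiGLU: gate, up and down projections (assumed)
    norms = 2 * d                    # two RMSNorm weight vectors per layer (assumed)
    per_layer = attention + ffn + norms
    total = embedding + cfg.n_layers * per_layer + d  # + final RMSNorm (assumed)
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


counts = count_parameters(ParchmentConfig())
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Under these assumptions the estimate comes to about 162M, of which roughly 77M is the tied embedding matrix. The remaining gap to the stated ~166M would have to come from implementation details not listed in the table.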

Chat Template (ParchmentLM format)

system
You are a helpful assistant<|endoftext|>
user
{user message}<|endoftext|>
assistant
{assistant response}<|endoftext|>

<|endoftext|> (token ID 100257) serves as both the turn separator and stop token.
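For readers who want to build prompts by hand, here is a minimal sketch of the format above. The exact whitespace (a newline after each role name and after each `<|endoftext|>`) is an assumption read off the layout shown here; the tokenizer's own `apply_chat_template` remains the authoritative source. `format_chat` is a hypothetical helper, not part of the repository.

```python
EOT = "<|endoftext|>"  # turn separator and stop token, per the card


def format_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the ParchmentLM layout.

    Assumed layout: role name, newline, content, EOT, newline.
    """
    parts = []
    for message in messages:
        parts.append(f"{message['role']}\n{message['content']}{EOT}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("assistant\n")
    return "".join(parts)


prompt = format_chat([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

Because the same token closes every turn, generation can stop at the first `<|endoftext|>` after the open assistant header.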

Training

Stage 1: Pretraining

  • Dataset: FineWeb-Edu 10BT sample (HuggingFaceFW/fineweb-edu)
  • Tokens trained on: ~4B
  • Infrastructure: Modal, single A100-40GB
  • Throughput: ~75,000 tokens/sec
  • Duration: ~14.8 hours
  • Cost: ~$46
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight decay=0.1)
  • Learning rate: 3e-4 with cosine decay to 3e-5, 2000 step warmup
  • Batch size: 16 × 8 grad accum × 1024 seq len ≈ 131k tokens/step
  • Precision: bfloat16
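The numbers in this list can be cross-checked with a little arithmetic, and the learning-rate schedule can be sketched in a few lines. The version below assumes linear warmup, a cosine that ends exactly at the last step, and "~4B tokens" treated as exactly 4,000,000,000; the actual training script may differ on all three. `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2000

TOKENS_PER_STEP = 16 * 8 * 1024                 # micro-batch x grad accum x seq len
TOTAL_TOKENS = 4_000_000_000                    # "~4B" from the card, treated as exact
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP   # optimizer steps implied by the above
PRETRAIN_HOURS = TOTAL_TOKENS / 75_000 / 3600   # at ~75,000 tokens/sec


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup from near zero up to the peak rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(f"tokens/step: {TOKENS_PER_STEP:,}")
print(f"total steps: {TOTAL_STEPS:,}")
print(f"hours at 75k tok/s: {PRETRAIN_HOURS:.1f}")
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The derived duration agrees with the ~14.8 hours reported above, which suggests the throughput and token counts are mutually consistent.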

Stage 2: Supervised Fine-Tuning

  • Datasets: Cleanlab/databricks-dolly-15k-cleaned and ProCreations/SimpleMath
  • Total SFT examples: ~17k
  • Loss: Completion-only (prompt and padding tokens masked to -100)
  • Pad token: <|endofprompt|> (token ID 83285) to preserve EOT as a learnable stop signal
  • Epochs: 8
  • Learning rate: 1e-4 cosine decay
  • Batch size: 16 × 2 grad accum
  • Duration: ~38 minutes
  • Cost: ~$1.50
  • Infrastructure: Modal, single A100-40GB
  • Precision: bfloat16
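The completion-only loss and the separate pad token can be illustrated with plain token-ID lists. This is a sketch under assumptions, not the project's data collator: `build_labels` is a hypothetical helper, the example IDs for prompt and completion are made up, and it assumes the model or training loop shifts labels by one position, as Hugging Face causal LM heads do.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

EOT_ID = 100257  # <|endoftext|>, as stated in the card
PAD_ID = 83285   # <|endofprompt|>, as stated in the card


def build_labels(prompt_ids, completion_ids, pad_id, max_len):
    """Return (input_ids, labels) padded to max_len.

    Prompt and padding positions get IGNORE_INDEX, so only completion
    tokens (including the final EOT) contribute to the loss.
    """
    input_ids = (prompt_ids + completion_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_len]
    n_pad = max_len - len(input_ids)
    return input_ids + [pad_id] * n_pad, labels + [IGNORE_INDEX] * n_pad


# Made-up IDs for illustration: three prompt tokens, two answer tokens plus EOT.
input_ids, labels = build_labels([11, 12, 13], [21, 22, EOT_ID], PAD_ID, max_len=8)
print(input_ids)
print(labels)
```

Padding with a token other than `<|endoftext|>` matters here: if EOT doubled as the pad token, masking the padding would also risk masking the one EOT the model needs to learn as its stop signal.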

Total training cost: ~$55, including many SFT iterations (the pretraining run and a single SFT run listed above account for ~$47.50)

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and set the pad token used during SFT.
tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
tokenizer.pad_token = "<|endofprompt|>"

# Load the instruction-tuned weights and switch to inference mode.
model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166M-instruct", trust_remote_code=True)
model.eval()

PAD_ID = tokenizer.convert_tokens_to_ids("<|endofprompt|>")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the chat template and remember the prompt length so it can be stripped later.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
input_len = inputs["input_ids"].shape[1]

# Greedy decoding with a mild repetition penalty; generation stops at <|endoftext|>.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=PAD_ID,
    )

# Decode only the newly generated tokens and cut at the first stop token.
raw = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=False)
response = raw.split("<|endoftext|>")[0].strip()
print(response)
# The capital of France is Paris.

Note: For arithmetic, use the format "47 + 83 =" rather than "What is 47 + 83?" to match the training distribution.
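One way to apply this note automatically is to rewrite natural-language arithmetic questions before they reach the model. The helper below is a hypothetical sketch: it only handles a single `+` or `-` between two integers (the operations shown in this card) and passes everything else through unchanged.

```python
import re

# Two integers joined by + or -, with optional surrounding whitespace.
_ARITHMETIC = re.compile(r"(-?\d+)\s*([+\-])\s*(-?\d+)")


def to_training_format(question: str) -> str:
    """Rewrite e.g. 'What is 47 + 83?' as '47 + 83 ='; leave other text as-is."""
    match = _ARITHMETIC.search(question)
    if match is None:
        return question
    left, op, right = match.groups()
    return f"{left} {op} {right} ="


for text in ("What is 47 + 83?", "900-345", "What is the capital of France?"):
    print(repr(to_training_format(text)))
```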

Evaluation

Informal evaluation on held-out questions:

| Question | Response | Correct? |
|---|---|---|
| What is the capital of France? | The capital of France is Paris. | ✓ |
| What is the capital of Germany? | The capital of Germany is Berlin. | ✓ |
| Who wrote Romeo and Juliet? | Romeo and Juliet was written by William Shakespeare. | ✓ |
| 12 + 5 = | 17 | ✓ |
| 900 - 345 = | 700 | ✗ (correct: 555) |
| 2790 + 6698 = | 9648 | ✗ (correct: 9488) |
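The arithmetic rows can be checked mechanically. The sketch below recomputes the expected answers for the three arithmetic prompts and compares them with the responses reported here; `expected` and `score` are hypothetical helpers, and the parser only understands the `a + b =` and `a - b =` forms used in this card.

```python
# (prompt, model response) pairs copied from the evaluation above.
ROWS = [
    ("12 + 5 =", "17"),
    ("900 - 345 =", "700"),
    ("2790 + 6698 =", "9648"),
]


def expected(prompt: str) -> int:
    """Compute the true answer for a prompt of the form 'a + b =' or 'a - b ='."""
    left, op, right = prompt.rstrip("= ").split()
    a, b = int(left), int(right)
    return a + b if op == "+" else a - b


def score(rows):
    """Return (prompt, response, truth, is_correct) for each row."""
    results = []
    for prompt, response in rows:
        truth = expected(prompt)
        results.append((prompt, response, truth, int(response) == truth))
    return results


for prompt, response, truth, ok in score(ROWS):
    status = "ok" if ok else f"wrong (correct: {truth})"
    print(f"{prompt} {response}  {status}")
```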

Limitations:

  • Reliable arithmetic only up to ~2-3 digit operands
  • Tends to hallucinate on out-of-distribution factual questions
  • No safety filtering or alignment
  • Will not stop gracefully on prompts with no clear answer (creative writing, open-ended tasks)
  • Undertrained relative to model capacity: 4B tokens vs. the ~300B tokens that models of this size typically see

Compute & Environmental Impact

  • Hardware: NVIDIA A100-40GB (via Modal)
  • Cloud provider: Modal (AWS us-east-1 region)
  • Total GPU hours: ~15.5 hours
  • Total cost: ~$55 USD

Citation

If you use this model or find this project useful, a link back to the repository is appreciated.

@misc{narula2025parchmentlm,
  author = {Pranay Narula},
  title = {ParchmentLM 166M Instruct: Full LLM Pipeline From Scratch},
  year = {2025},
  url = {https://huggingface.co/SlitherCode/tiny-edu-166M-instruct}
}