	---
	library_name: transformers
	tags:
	- tiny
	- from-scratch
	- instruction-tuned
	- causal-lm
	- parchmentlm
	license: mit
	datasets:
	- HuggingFaceFW/fineweb-edu
	- Cleanlab/databricks-dolly-15k-cleaned
	- ProCreations/SimpleMath
	language:
	- en
	base_model:
	- SlitherCode/tiny-edu-166m
	---

	# ParchmentLM 166M Instruct

	A 166M-parameter instruction-tuned language model trained entirely from scratch (custom architecture, real pretraining data, and a full SFT pipeline) for about $55 in cloud compute.

	This is a proof-of-concept demonstrating the full LLM development pipeline: architecture design, pretraining on real web data, supervised fine-tuning, and deployment. It is not intended for production use.

	## Model Details

	- Developed by: Pranay Narula (SlitherCode)
	- Model type: ParchmentLM — a custom decoder-only transformer architecture
	- Language: English
	- License: MIT
	- Base model: [SlitherCode/tiny-edu-166m](https://huggingface.co/SlitherCode/tiny-edu-166m) (pretrained from scratch)

	### Architecture

	ParchmentLM is a custom LLaMA-style architecture with the following components:

	| Component | Details |
	|---|---|
	| Parameters | ~166M |
	| Layers | 12 |
	| Attention heads | 12 |
	| Hidden size | 768 |
	| FFN size | 2048 |
	| Context length | 1024 tokens |
	| Positional encoding | RoPE |
	| Normalization | RMSNorm (pre-norm) |
	| Activation | SwiGLU |
	| Attention | FlashAttention (via `scaled_dot_product_attention`) |
	| Tokenizer | tiktoken cl100k_base (vocab size 100,277) |
	| Weight tying | Yes (input embeddings = output projection) |
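
As a rough sanity check on the table, the parameter count can be estimated from the listed dimensions. This is only a back-of-the-envelope sketch: it assumes bias-free linear layers, standard multi-head attention with four square projections (Q, K, V, output), a three-matrix SwiGLU feed-forward block, and two RMSNorm weight vectors per layer plus one final norm. None of these details are stated in this card, and the function name is illustrative.

```python
# Values copied from the architecture table above.
VOCAB_SIZE = 100_277
HIDDEN = 768
FFN = 2048
LAYERS = 12
HEADS = 12


def estimate_parameters():
    """Estimate the parameter count under the assumptions listed in the lead-in."""
    embedding = VOCAB_SIZE * HIDDEN      # shared with the output projection (weight tying)
    attention = 4 * HIDDEN * HIDDEN      # assumed: Q, K, V and output projections, no biases
    swiglu = 3 * HIDDEN * FFN            # assumed: gate, up and down projections
    norms = 2 * HIDDEN                   # assumed: one RMSNorm before attention, one before the FFN
    per_layer = attention + swiglu + norms
    final_norm = HIDDEN                  # assumed: a final RMSNorm before the output projection
    return embedding + LAYERS * per_layer + final_norm


total = estimate_parameters()
print(f"head dimension: {HIDDEN // HEADS}")
print(f"estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands near 162M, a few million short of the stated ~166M, so at least one assumption probably differs from the actual implementation. Note that the tied embedding matrix alone accounts for roughly 77M of the total.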

	### Chat Template (ParchmentLM format)

	```
	system
	You are a helpful assistant<|endoftext|>
	user
	{user message}<|endoftext|>
	assistant
	{assistant response}<|endoftext|>
	```

	`<|endoftext|>` (token ID 100257) serves as both the turn separator and stop token.
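
The format above can be rendered by hand, which helps when inspecting prompts outside of `apply_chat_template`. The sketch below is illustrative only: `render_chat` is a hypothetical helper, and it assumes each role name sits on its own line and each turn ends with `<|endoftext|>` followed by a newline. The exact whitespace used by the real template is not stated here, so compare against the tokenizer's output before relying on it.

```python
EOT = "<|endoftext|>"  # turn separator and stop token


def render_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into the ParchmentLM text format.

    Assumed layout: "<role>\n<content><|endoftext|>\n" per turn.
    """
    parts = []
    for message in messages:
        parts.append(f"{message['role']}\n{message['content']}{EOT}\n")
    if add_generation_prompt:
        # Open an assistant turn so that the model continues with its reply.
        parts.append("assistant\n")
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
print(prompt)
```

Because the same token closes every turn, generation can be cut at the first `<|endoftext|>` the model emits.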

	## Training

	### Stage 1 — Pretraining

	- Dataset: FineWeb-Edu 10BT sample (HuggingFaceFW/fineweb-edu)
	- Tokens trained on: ~4B
	- Infrastructure: Modal, single A100-40GB
	- Throughput: ~75,000 tokens/sec
	- Duration: ~14.8 hours
	- Cost: ~$46
	- Optimizer: AdamW (β1=0.9, β2=0.95, weight decay=0.1)
	- Learning rate: 3e-4 with cosine decay to 3e-5, 2000 step warmup
	- Batch size: 16 × 8 grad accum × 1024 seq len ≈ 131k tokens/step
	- Precision: bfloat16
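
The batch and learning-rate settings above can be sketched in a few lines. This is an illustration, not the training code: the warmup shape is assumed to be linear (the list only says "2000 step warmup"), the total step count is derived from ~4B tokens at the listed tokens per step, and `lr_at` is a hypothetical helper name.

```python
import math

# Values copied from the pretraining list above.
MICRO_BATCH = 16
GRAD_ACCUM = 8
SEQ_LEN = 1024
PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2000

TOKENS_PER_STEP = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN   # about 131k tokens per optimizer step
TOTAL_STEPS = round(4e9 / TOKENS_PER_STEP)              # assumed: exactly 4B tokens


def lr_at(step, total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(f"tokens per step: {TOKENS_PER_STEP}")
print(f"total steps: {TOTAL_STEPS}")
for step in (0, WARMUP_STEPS - 1, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The schedule peaks at 3e-4 once warmup ends and settles at 3e-5 on the final step, matching the endpoints listed above.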

	### Stage 2 — Supervised Fine-Tuning

	- Datasets:
	  - [Cleanlab/databricks-dolly-15k-cleaned](https://huggingface.co/datasets/Cleanlab/databricks-dolly-15k-cleaned): filtered to the `closed_qa`, `open_qa`, and `information_extraction` categories (~7k examples)
	  - [ProCreations/SimpleMath](https://huggingface.co/datasets/ProCreations/SimpleMath): 2,500 examples for each operation (+, -, *, /), balanced, 10k total
	- Total SFT examples: ~17k
	- Loss: Completion-only (prompt and padding tokens masked to -100)
	- Pad token: `<|endofprompt|>` (token ID 83285) to preserve EOT as a learnable stop signal
	- Epochs: 8
	- Learning rate: 1e-4 cosine decay
	- Batch size: 16 × 2 grad accum
	- Duration: ~38 minutes
	- Cost: ~$1.50
	- Infrastructure: Modal, single A100-40GB
	- Precision: bfloat16
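
The completion-only loss described above can be sketched as follows. This is a minimal illustration with plain Python lists, not the actual SFT code: `build_example` is a hypothetical helper, the token IDs in the usage line are made up, and right-padding is an assumption. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to the gradient.

```python
IGNORE_INDEX = -100   # label value that the loss skips
PAD_ID = 83285        # <|endofprompt|>, as listed above
EOT_ID = 100257       # <|endoftext|>, the stop token


def build_example(prompt_ids, response_ids, max_len):
    """Return (input_ids, labels) with prompt and padding positions masked out."""
    # The response is followed by EOT so that the model learns when to stop.
    tokens = (prompt_ids + response_ids + [EOT_ID])[:max_len]
    n_prompt = min(len(prompt_ids), len(tokens))
    n_pad = max_len - len(tokens)

    input_ids = tokens + [PAD_ID] * n_pad              # assumed: right-padding
    labels = (
        [IGNORE_INDEX] * n_prompt                      # no loss on the prompt
        + tokens[n_prompt:]                            # loss on the response and EOT
        + [IGNORE_INDEX] * n_pad                       # no loss on padding
    )
    return input_ids, labels


input_ids, labels = build_example([10, 11, 12], [20, 21], max_len=8)
print(input_ids)
print(labels)
```

Keeping the pad token distinct from `<|endoftext|>` means the final EOT keeps its label, which is what lets the model learn to emit it.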

	Total training cost: ~$55, including many SFT iterations

	## Usage

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# The tokenizer is loaded from the base-model repository
	tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
	# Pad with <|endofprompt|> so that <|endoftext|> stays reserved as the stop token
	tokenizer.pad_token = "<|endofprompt|>"

	model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166M-instruct", trust_remote_code=True)
	model.eval()

	PAD_ID = tokenizer.convert_tokens_to_ids("<|endofprompt|>")

	messages = [
	    {"role": "system", "content": "You are a helpful assistant."},
	    {"role": "user", "content": "What is the capital of France?"},
	]

	# Render the chat template as text, ending with an open assistant turn
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(prompt, return_tensors="pt")
	input_len = inputs["input_ids"].shape[1]

	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_new_tokens=100,
	        do_sample=False,  # greedy decoding
	        repetition_penalty=1.1,
	        eos_token_id=tokenizer.eos_token_id,
	        pad_token_id=PAD_ID,
	    )

	# Decode only the newly generated tokens and cut at the first stop token
	raw = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=False)
	response = raw.split("<|endoftext|>")[0].strip()
	print(response)
	# The capital of France is Paris.
	```

	Note: For arithmetic, use the format `"47 + 83 ="` rather than `"What is 47 + 83?"` to match the training distribution.
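
A small helper can keep arithmetic prompts in that bare format. This is a sketch: `arithmetic_prompt` is a hypothetical name, and the operator set is taken from the four operations listed for the SimpleMath data.

```python
SUPPORTED_OPS = {"+", "-", "*", "/"}  # the four operations in the SFT data


def arithmetic_prompt(left, op, right):
    """Format an arithmetic question as the bare 'a op b =' string."""
    if op not in SUPPORTED_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    return f"{left} {op} {right} ="


print(arithmetic_prompt(47, "+", 83))
```

The resulting string goes in as the user message content in place of a natural-language question.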

	## Evaluation

	Informal evaluation on held-out questions:

	| Question | Response | Correct? |
	|---|---|---|
	| What is the capital of France? | The capital of France is Paris. | ✓ |
	| What is the capital of Germany? | The capital of Germany is Berlin. | ✓ |
	| Who wrote Romeo and Juliet? | Romeo and Juliet was written by William Shakespeare. | ✓ |
	| 12 + 5 = | 17 | ✓ |
	| 900 - 345 = | 700 | ✗ (correct: 555) |
	| 2790 + 6698 = | 9648 | ✗ (correct: 9488) |
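
The arithmetic rows can be checked mechanically. The sketch below scores the three arithmetic prompts from the table against exact integer arithmetic. The responses are copied from the table, `score` is a hypothetical helper, and only `+` and `-` are handled because those are the only operators that appear in these rows.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}

# (prompt, model response) pairs copied from the table above.
ROWS = [
    ("12 + 5 =", "17"),
    ("900 - 345 =", "700"),
    ("2790 + 6698 =", "9648"),
]


def score(prompt, response):
    """Return (expected value, signed error of the model's response)."""
    left, op, right, _equals = prompt.split()
    expected = OPS[op](int(left), int(right))
    return expected, int(response) - expected


results = [score(prompt, response) for prompt, response in ROWS]
for (prompt, response), (expected, error) in zip(ROWS, results):
    status = "correct" if error == 0 else f"off by {error}"
    print(f"{prompt} {response} (expected {expected}, {status})")
```

Both misses involve three- and four-digit operands.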

	Limitations:
	- Reliable arithmetic only up to ~2-3 digit operands
	- Tends to hallucinate on out-of-distribution factual questions
	- No safety filtering or alignment
	- Will not stop gracefully on prompts with no clear answer (creative writing, open-ended tasks)
	- Undertrained relative to model capacity: ~4B tokens, versus the ~300B tokens that models of this size typically see

	## Compute & Environmental Impact

	- Hardware: NVIDIA A100-40GB (via Modal)
	- Cloud provider: Modal (AWS us-east-1 region)
	- Total GPU hours: ~15.5 hours
	- Total cost: ~$55 USD
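
The compute figures can be cross-checked against the training details reported earlier in this card. The sketch below uses only those reported numbers. The variable names are illustrative, and the single-pass cost covers one pretraining run and one SFT run, without the repeated SFT iterations included in the ~$55 total.

```python
# Figures copied from the Training section.
PRETRAIN_HOURS = 14.8
PRETRAIN_TOKENS_PER_SEC = 75_000
PRETRAIN_COST_USD = 46.0
SFT_MINUTES = 38
SFT_COST_USD = 1.50

# Tokens processed if the reported throughput held for the whole run.
pretrain_tokens = PRETRAIN_TOKENS_PER_SEC * PRETRAIN_HOURS * 3600
# GPU time for one pretraining run plus one SFT run.
gpu_hours = PRETRAIN_HOURS + SFT_MINUTES / 60
# Cost of a single pass through both stages.
single_pass_cost = PRETRAIN_COST_USD + SFT_COST_USD

print(f"pretraining tokens: {pretrain_tokens / 1e9:.2f}B")
print(f"GPU hours (single pass): {gpu_hours:.1f}")
print(f"cost (single pass): ${single_pass_cost:.2f}")
```

The throughput and duration reproduce the ~4B token figure, and one pass through both stages comes to a little under the reported ~15.5 GPU hours.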

	## Citation

	If you use this model or find this project useful, a link back to the repository is appreciated.

	```
	@misc{narula2025parchmentlm,
	  author = {Pranay Narula},
	  title  = {ParchmentLM 166M Instruct: Full LLM Pipeline From Scratch},
	  year   = {2025},
	  url    = {https://huggingface.co/SlitherCode/tiny-edu-166M-instruct}
	}
	```