SlitherCode committed on
Commit 2b83ce3 · verified · 1 Parent(s): 875a861

Update README: document tiktoken dependency and usage

Files changed (1)
  1. README.md +60 -51
README.md CHANGED
@@ -1,51 +1,60 @@
- ---
- language: en
- license: mit
- tags:
- - pretrained
- - causal-lm
- - fineweb-edu
- - custom-architecture
- ---
-
- # tiny-edu-166m (ParchmentLM)
-
- A 166M parameter transformer pretrained from scratch on 4B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
-
- ## Architecture (ParchmentLM)
-
- Custom decoder-only transformer:
- - **Parameters:** 166M
- - **Layers:** 12
- - **Hidden size:** 768
- - **Attention heads:** 12
- - **FFN:** SwiGLU (hidden=2048)
- - **Context length:** 1024
- - **Positional encoding:** RoPE (base=10000)
- - **Normalization:** RMSNorm
- - **Tokenizer:** cl100k_base (100277 tokens)
-
- ## Training
-
- - **Dataset:** FineWeb-Edu 10BT sample
- - **Tokens seen:** ~4B
- - **Steps:** 30,000
- - **Optimizer:** AdamW (lr=3e-4, cosine decay to 3e-5)
- - **Hardware:** Single A100 80GB
-
- ## Usage
-
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
-
- inputs = tokenizer("The history of mathematics", return_tensors="pt")
- out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
- print(tokenizer.decode(out[0], skip_special_tokens=True))
- ```
-
- ## License
-
- Model weights: MIT. Training data: ODC-By 1.0.
+ ---
+ language: en
+ license: mit
+ tags:
+ - pretrained
+ - causal-lm
+ - fineweb-edu
+ - custom-architecture
+ ---
+
+ # tiny-edu-166m (ParchmentLM)
+
+ A 166M parameter transformer pretrained from scratch on 4B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
+
+ ## Architecture (ParchmentLM)
+
+ Custom decoder-only transformer:
+ - **Parameters:** 166M
+ - **Layers:** 12
+ - **Hidden size:** 768
+ - **Attention heads:** 12
+ - **FFN:** SwiGLU (hidden=2048)
+ - **Context length:** 1024
+ - **Positional encoding:** RoPE (base=10000)
+ - **Normalization:** RMSNorm
+ - **Tokenizer:** cl100k_base (100277 tokens; same as GPT-4)
+
+ ## Training
+
+ - **Dataset:** FineWeb-Edu 10BT sample
+ - **Tokens seen:** ~4B
+ - **Steps:** 30,000
+ - **Optimizer:** AdamW (lr=3e-4, cosine decay to 3e-5)
+ - **Hardware:** Single A100 80GB
+
+ ## Installation
+
+ ```bash
+ pip install transformers tiktoken
+ ```
+
+ > **Note:** `tiktoken` is required because the tokenizer wraps OpenAI's cl100k_base encoding
+ > to guarantee byte-identical token IDs to the vocabulary the model was trained on.
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
+
+ inputs = tokenizer("The history of mathematics", return_tensors="pt")
+ out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_k=50)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```
+
+ ## License
+
+ Model weights: MIT. Training data: ODC-By 1.0.
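
The Training section in this README gives only the endpoints of the learning-rate schedule (AdamW, lr=3e-4, cosine decay to 3e-5, 30,000 steps). As a rough illustration of what such a schedule looks like, here is a minimal sketch. The function name `cosine_lr` is hypothetical, and the absence of a warmup phase is an assumption; the README does not say how the schedule was actually implemented.

```python
import math

# Values taken from the README's Training section.
PEAK_LR = 3e-4
MIN_LR = 3e-5
TOTAL_STEPS = 30_000


def cosine_lr(step: int,
              peak: float = PEAK_LR,
              floor: float = MIN_LR,
              total: int = TOTAL_STEPS) -> float:
    """Cosine decay from `peak` at step 0 to `floor` at step `total`.

    Assumption: no warmup phase (the README does not mention one).
    Steps outside [0, total] are clamped to the nearest endpoint.
    """
    progress = min(max(step, 0), total) / total
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at the start, midpoint, and end of training.
    for s in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {s:>6}: lr = {cosine_lr(s):.3e}")
```

Under these assumptions the rate starts at the peak, passes through the midpoint of the two values halfway through training, and ends at the floor. A real training loop would more likely use a framework scheduler than a hand-written function.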