gsaltintas committed (verified)
Commit 64e3960 · Parent(s): d3f1579

Upload folder using huggingface_hub

Files changed (6):

1. README.md +49 -0
2. merges.txt +0 -0
3. special_tokens_map.json +5 -0
4. tokenizer.json +0 -0
5. tokenizer_config.json +0 -0
6. vocab.json +0 -0
README.md ADDED
---
license: mit
language:
- ind
tags:
- tokenizer
- bpe
- flexitok
- fineweb2
---

# Byte-Level BPE Tokenizer: ind_Latn (16K)

A **Byte-Level BPE** tokenizer trained on **ind_Latn** (Indonesian, Latin script) data from Fineweb-2-HQ.

## Training Details

| Parameter | Value |
|-----------|-------|
| Algorithm | Byte-Level BPE |
| Language | `ind_Latn` |
| Target Vocab Size | 16,000 |
| Final Vocab Size | 16,961 |
| Pre-tokenizer | custom:ind_Latn |
| Number handling | ltr_3digit |
| Contraction handling | True |
| Normalizer | NFC |
| Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
| Training Shards | 2 |

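The `ltr_3digit` number handling in the table above splits each digit run into left-to-right chunks of at most three digits before BPE merges run, which is why `12345` appears as `123`, `45` in the sample encoding. A minimal sketch of that chunking rule (the function name is ours, not the trainer's):

```python
def chunk_digits_ltr(digits: str, size: int = 3) -> list[str]:
    """Illustrative sketch of ltr_3digit pre-tokenization:
    split a digit run into left-to-right chunks of at most
    `size` digits, e.g. "12345" -> ["123", "45"]."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


print(chunk_digits_ltr("12345"))    # ['123', '45']
print(chunk_digits_ltr("1234567"))  # ['123', '456', '7']
```

This matches the sample encoding below, where `12345` tokenizes as `123`, `45`.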
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flexitok/bpe_script_SEAS_16000")
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)  # round-trip back to a string
```

## Files

- `tokenizer.json` — Full Hugging Face tokenizer
- `vocab.json` — Vocabulary mapping
- `merges.txt` — BPE merge rules

## Sample Encoding

| Text | Tokens | Token IDs |
|------|--------|-----------|
| `Hello, world! 12345 This is a test. こんにちは` | `H, ello, ,, Ġw, orld, !, Ġ, 123, 45, ĠThis, Ġis, Ġa, Ġtest, ., Ġ, ãģ, ĵ, ãĤ, ĵ, ãģ` | `42, 15107, 14, 429, 4639, 3, 223, 16038, 4529, 13915, 1153, 395, 7029, 16, 223, 9732, 244, 15716, 244, 9732` |
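The `Ġ` and `ã`-style glyphs in the token column are an artifact of byte-level BPE: every raw byte is mapped to a printable unicode character (the GPT-2 scheme), so a leading space renders as `Ġ` and the UTF-8 bytes of `こんにちは` as `ãģ...`. A self-contained sketch of that mapping, assuming this tokenizer uses the standard GPT-2 byte map:

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style byte-to-unicode map used by byte-level BPE:
    printable Latin-1 bytes map to themselves; everything else is
    shifted to code points >= 256 so every byte stays visible."""
    printable = set(
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {}
    shift = 0
    for b in range(256):
        if b in printable:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


m = bytes_to_unicode()
print(m[0x20])                      # 'Ġ': why space-prefixed tokens start with Ġ
print(m["こ".encode("utf-8")[0]])   # 'ã': first UTF-8 byte of こ
```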
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>"
}
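For reference, the map above parses to exactly three entries (`bos`/`eos`/`pad`), while the README's special-token list also mentions `<unk>`, which has no entry here. A quick check, with the JSON inlined verbatim for illustration:

```python
import json

# special_tokens_map.json, copied verbatim from this commit.
SPECIAL_TOKENS_MAP = """
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>"
}
"""

special_tokens = json.loads(SPECIAL_TOKENS_MAP)
print(sorted(special_tokens))          # ['bos_token', 'eos_token', 'pad_token']
print("unk_token" in special_tokens)   # False
```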
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff