ucr-max committed on
Commit 0ff3621 · verified · 1 Parent(s): 271e253

Updates and clarifications

Files changed (1)
  1. README.md +14 -1
README.md CHANGED
@@ -20,10 +20,15 @@ datasets:
20
  ---
21
 
22
  ![bg](bg.png)
 
23
  # Atom2.7m
24
 
25
  Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
26
 
27
  ## Model Details
28
 
29
  - Architecture: decoder-only GPT
@@ -32,6 +37,7 @@ Atom2.7m is a small decoder-only causal language model trained with a general by
32
  - Hidden size: 192
33
  - Attention heads: 4
34
  - KV heads: 2
 
35
  - Context length: 512
36
  - Vocabulary size: 4,096
37
  - Token embeddings: tied input/output embeddings
@@ -143,7 +149,7 @@ The pretraining mixture targeted about 3.5B tokens:
143
  - Ultra-FineWeb-L3-en-QA-Synthetic: 225M
144
  - Synthetic-Arithmetic: 350M
145
 
146
- Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
147
 
148
  ## Limitations
149
 
@@ -160,3 +166,10 @@ Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The trai
160
  - `benchmark_fusion_arithmark.py`: ArithMark evaluation
161
  - `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
162
  - `pretraining_curriculum.json`: training curriculum
 
20
  ---
21
 
22
  ![bg](bg.png)
23
+
24
  # Atom2.7m
25
 
26
  Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
27
 
28
+ The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 63.80% accuracy. If inserted into the benchmark card's published baseline table, this places it 6th overall, just above Qwen2.5-0.5B at 63.04% and below SmolLM2-1.7B at 66.12%, while using only 2.74M parameters.
29
+
30
+ The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
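
The figures in the two added paragraphs can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit: the accuracy values and the Atom2.7m parameter count come from the README text, while the baseline parameter counts are assumed to be the nominal sizes in the model names (1.7B and 0.5B), so the ratios are only approximate.

```python
# Figures quoted in the README text.
NUM_EXAMPLES = 2_500          # ArithMark 2.0 size
ATOM_ACCURACY_PCT = 63.80     # Atom2.7m accuracy
ATOM_PARAMS = 2_738_880       # Atom2.7m parameter count

# Number of correct continuations implied by the reported accuracy.
correct = round(NUM_EXAMPLES * ATOM_ACCURACY_PCT / 100)

# Neighbouring baselines named in the text. The parameter counts are
# ASSUMED nominal sizes read off the model names, not exact counts.
neighbours = {
    "SmolLM2-1.7B": {"accuracy_pct": 66.12, "nominal_params": 1.7e9},
    "Qwen2.5-0.5B": {"accuracy_pct": 63.04, "nominal_params": 0.5e9},
}

# How many times larger each neighbour is than Atom2.7m.
size_ratio = {
    name: row["nominal_params"] / ATOM_PARAMS
    for name, row in neighbours.items()
}

# Local ordering by accuracy, highest first.
rows = [("Atom2.7m", ATOM_ACCURACY_PCT)]
rows += [(name, row["accuracy_pct"]) for name, row in neighbours.items()]
ranking = [name for name, _ in sorted(rows, key=lambda r: r[1], reverse=True)]

print(correct, ranking)
for name, ratio in size_ratio.items():
    print(f"{name}: about {ratio:.0f}x the parameters of Atom2.7m")
```

Under these assumptions the two neighbours are roughly 180 and 620 times larger, which is what "hundreds of times larger" refers to.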
31
+
32
  ## Model Details
33
 
34
  - Architecture: decoder-only GPT
 
37
  - Hidden size: 192
38
  - Attention heads: 4
39
  - KV heads: 2
40
+ - Attention: grouped-query causal self-attention with RoPE and XSA projection
41
  - Context length: 512
42
  - Vocabulary size: 4,096
43
  - Token embeddings: tied input/output embeddings
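
The attention settings listed above imply some simple shape bookkeeping. The following is a minimal sketch under common conventions, not the repository's code: it assumes the per-head dimension is hidden size divided by the number of attention heads, and that consecutive query heads share one KV head. RoPE and the XSA projection are left out because the README does not describe their details here.

```python
# Values taken from the Model Details list.
HIDDEN_SIZE = 192
NUM_HEADS = 4        # query heads
NUM_KV_HEADS = 2     # key/value heads (grouped-query attention)
VOCAB_SIZE = 4_096
TOTAL_PARAMS = 2_738_880

# ASSUMPTION: head_dim = hidden_size / num_heads, the usual convention.
head_dim = HIDDEN_SIZE // NUM_HEADS

# Each KV head serves this many query heads.
group_size = NUM_HEADS // NUM_KV_HEADS

# ASSUMPTION: consecutive query heads map to the same KV head.
kv_head_for_query_head = [h // group_size for h in range(NUM_HEADS)]

# Output widths of the query and key/value projections.
q_width = NUM_HEADS * head_dim
kv_width = NUM_KV_HEADS * head_dim

# Tied input/output embeddings mean one matrix is stored, not two.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS


def causal_mask(seq_len: int) -> list[list[bool]]:
    """True where query position q may attend to key position k (k <= q)."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


print(head_dim, group_size, kv_head_for_query_head)
print(q_width, kv_width, embedding_params, f"{embedding_share:.1%}")
```

With two KV heads instead of four, the key and value projections are half as wide as the query projection, which trims parameters and KV-cache size; in a model this small the tied embedding matrix still accounts for a large share of the total.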
 
149
  - Ultra-FineWeb-L3-en-QA-Synthetic: 225M
150
  - Synthetic-Arithmetic: 350M
151
 
152
+ Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
153
 
154
  ## Limitations
155
 
 
166
  - `benchmark_fusion_arithmark.py`: ArithMark evaluation
167
  - `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
168
  - `pretraining_curriculum.json`: training curriculum
169
+
170
+ ## References / Design Influences
171
+
172
+ - [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs
173
+ - [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling
174
+ - [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance
175
+ - [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic