Updates and clarifications
README.md CHANGED
@@ -20,10 +20,15 @@ datasets:
---

# Atom2.7m

Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.

## Model Details

- Architecture: decoder-only GPT
@@ -32,6 +37,7 @@ Atom2.7m is a small decoder-only causal language model trained with a general by
- Hidden size: 192
- Attention heads: 4
- KV heads: 2
- Context length: 512
- Vocabulary size: 4,096
- Token embeddings: tied input/output embeddings
@@ -143,7 +149,7 @@ The pretraining mixture targeted about 3.5B tokens:
- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
- Synthetic-Arithmetic: 350M

-Synthetic-Arithmetic is

## Limitations

@@ -160,3 +166,10 @@ Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The trai
- `benchmark_fusion_arithmark.py`: ArithMark evaluation
- `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
- `pretraining_curriculum.json`: training curriculum
---

+
# Atom2.7m

Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.

+The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 63.80% accuracy. If inserted into the benchmark card's published baseline table, this places it 6th overall, just above Qwen2.5-0.5B at 63.04% and below SmolLM2-1.7B at 66.12%, while using only 2.74M parameters.
+
+The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
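
As a rough sanity check on the figures quoted above, the sketch below converts the reported accuracy into a count of correct answers and estimates the size gap to the two neighbouring baselines. The baseline parameter counts are assumptions read off the nominal sizes in the model names (0.5B and 1.7B), not exact counts from the benchmark card, and the helper names are illustrative only.

```python
# Figures quoted in the README text above.
ATOM_PARAMS = 2_738_880
NUM_EXAMPLES = 2_500
ACCURACY_PCT = 63.80

# Assumption: nominal sizes taken from the model names, not exact counts.
NOMINAL_BASELINE_PARAMS = {
    "Qwen2.5-0.5B": 0.5e9,
    "SmolLM2-1.7B": 1.7e9,
}


def correct_count(accuracy_pct: float, num_examples: int) -> int:
    """Number of correct answers implied by an accuracy percentage."""
    return round(accuracy_pct / 100 * num_examples)


def size_ratio(baseline_params: float, model_params: int = ATOM_PARAMS) -> float:
    """How many times larger a baseline is than the model, by parameter count."""
    return baseline_params / model_params


if __name__ == "__main__":
    print(f"correct answers: {correct_count(ACCURACY_PCT, NUM_EXAMPLES)}")
    for name, params in NOMINAL_BASELINE_PARAMS.items():
        print(f"{name}: roughly {size_ratio(params):.0f}x the parameters")
```

Under these nominal sizes both neighbours come out a few hundred times larger, which is consistent with the "hundreds of times larger" wording.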
+
## Model Details

- Architecture: decoder-only GPT

- Hidden size: 192
- Attention heads: 4
- KV heads: 2
+- Attention: grouped-query causal self-attention with RoPE and XSA projection
- Context length: 512
- Vocabulary size: 4,096
- Token embeddings: tied input/output embeddings
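
The listed hyperparameters imply a few derived quantities that are easy to check by hand. The following is a minimal sketch, not the repository's configuration code: `SketchConfig` and its property names are hypothetical, and the contiguous query-to-KV head grouping is the common grouped-query convention, assumed here rather than confirmed by the README. Fields not visible in this diff, such as the layer count, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical container for values copied from the Model Details list."""

    hidden_size: int = 192
    num_heads: int = 4
    num_kv_heads: int = 2
    context_length: int = 512
    vocab_size: int = 4096
    total_params: int = 2_738_880

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.hidden_size // self.num_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads sharing one key/value head.
        return self.num_heads // self.num_kv_heads

    @property
    def kv_width(self) -> int:
        # Output width of the key (or value) projection under grouped-query attention.
        return self.num_kv_heads * self.head_dim

    @property
    def tied_embedding_params(self) -> int:
        # Tied input/output embeddings are stored, and counted, once.
        return self.vocab_size * self.hidden_size

    def kv_head_for(self, query_head: int) -> int:
        # Assumption: contiguous grouping (query heads 0..g-1 share KV head 0, and so on).
        return query_head // self.queries_per_kv_head


if __name__ == "__main__":
    cfg = SketchConfig()
    print("head_dim:", cfg.head_dim)
    print("kv_width:", cfg.kv_width)
    print("query head -> kv head:", [cfg.kv_head_for(h) for h in range(cfg.num_heads)])
    share = cfg.tied_embedding_params / cfg.total_params
    print(f"tied embedding: {cfg.tied_embedding_params:,} params ({share:.1%} of total)")
```

With 2 KV heads instead of 4, the key and value projections are half as wide as the query projection, so the per-layer KV cache is also halved relative to standard multi-head attention.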

- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
- Synthetic-Arithmetic: 350M
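
To put the two mixture entries visible in this diff in proportion, the sketch below computes their share of the roughly 3.5B-token target that the README states for the pretraining mixture. Only these two entries are included because the rest of the list is not shown here; the dictionary and function names are illustrative.

```python
# "about 3.5B tokens" is the stated target for the full pretraining mixture.
TARGET_TOKENS = 3.5e9

# Only the two entries visible in this diff; the remaining sources are omitted.
MIXTURE_TOKENS = {
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225e6,
    "Synthetic-Arithmetic": 350e6,
}


def share_pct(tokens: float, total: float = TARGET_TOKENS) -> float:
    """Percentage of the target token budget taken by one source."""
    return 100 * tokens / total


if __name__ == "__main__":
    for name, tokens in MIXTURE_TOKENS.items():
        print(f"{name}: {share_pct(tokens):.1f}% of the target")
```

On these numbers, arithmetic data is about a tenth of the token budget.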

+Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.

## Limitations

- `benchmark_fusion_arithmark.py`: ArithMark evaluation
- `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
- `pretraining_curriculum.json`: training curriculum
+
+## References / Design Influences
+
+- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs
+- [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling
+- [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance
+- [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic