UniversalComputingResearch/Atom2.7m

@@ -20,10 +20,15 @@ datasets:
 ---
 ![bg](bg.png)
 # Atom2.7m
 Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
 ## Model Details
 - Architecture: decoder-only GPT
@@ -32,6 +37,7 @@ Atom2.7m is a small decoder-only causal language model trained with a general by
 - Hidden size: 192
 - Attention heads: 4
 - KV heads: 2
 - Context length: 512
 - Vocabulary size: 4,096
 - Token embeddings: tied input/output embeddings
@@ -143,7 +149,7 @@ The pretraining mixture targeted about 3.5B tokens:
 - Ultra-FineWeb-L3-en-QA-Synthetic: 225M
 - Synthetic-Arithmetic: 350M
-Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
 ## Limitations
@@ -160,3 +166,10 @@ Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The trai
 - `benchmark_fusion_arithmark.py`: ArithMark evaluation
 - `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
 - `pretraining_curriculum.json`: training curriculum

 ---
 ![bg](bg.png)
 # Atom2.7m
 Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
+The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark, where Atom2.7m scores 63.80% accuracy. Inserted into the benchmark card's published baseline table, that score would place it 6th overall, just above Qwen2.5-0.5B at 63.04% and below SmolLM2-1.7B at 66.12%, while using only 2.74M parameters.
+The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models roughly 180 to 620 times larger by nominal parameter count (0.5B and 1.7B versus 2.74M).
 ## Model Details
 - Architecture: decoder-only GPT
 - Hidden size: 192
 - Attention heads: 4
 - KV heads: 2
+- Attention: grouped-query causal self-attention with RoPE and XSA projection
 - Context length: 512
 - Vocabulary size: 4,096
 - Token embeddings: tied input/output embeddings
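The head layout listed above can be checked with a few lines of arithmetic. This is a minimal sketch, not the repository's model code: the function names are hypothetical, the projections are assumed to be bias-free, and the XSA projection is left out because the card does not describe its shape.

```python
# Values copied from the model card; everything else is illustrative.
HIDDEN_SIZE = 192
NUM_HEADS = 4
NUM_KV_HEADS = 2
VOCAB_SIZE = 4096


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width; RoPE rotates pairs of channels, so it should be even."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    return hidden_size // num_heads


def kv_head_for_query_head(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """In grouped-query attention, consecutive query heads share one KV head."""
    group_size = num_heads // num_kv_heads
    return q_head // group_size


def attention_projection_params(hidden_size: int, num_heads: int, num_kv_heads: int) -> int:
    """Weight count of Q, K, V and output projections (assumption: no biases)."""
    d = head_dim(hidden_size, num_heads)
    q = hidden_size * num_heads * d
    kv = 2 * hidden_size * num_kv_heads * d  # K and V are narrower than Q
    o = num_heads * d * hidden_size
    return q + kv + o


def tied_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Tied input/output embeddings are stored once."""
    return vocab_size * hidden_size


if __name__ == "__main__":
    print("head_dim:", head_dim(HIDDEN_SIZE, NUM_HEADS))
    print("q -> kv:", [kv_head_for_query_head(h, NUM_HEADS, NUM_KV_HEADS) for h in range(NUM_HEADS)])
    print("attn params per layer:", attention_projection_params(HIDDEN_SIZE, NUM_HEADS, NUM_KV_HEADS))
    print("embedding params:", tied_embedding_params(VOCAB_SIZE, HIDDEN_SIZE))
```

With 4 query heads and 2 KV heads, each KV head serves two query heads, so the K and V projections together cost as much as the Q projection alone. The tied embedding table accounts for 786,432 of the 2,738,880 parameters.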
 - Ultra-FineWeb-L3-en-QA-Synthetic: 225M
 - Synthetic-Arithmetic: 350M
+Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
 ## Limitations
 - `benchmark_fusion_arithmark.py`: ArithMark evaluation
 - `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
 - `pretraining_curriculum.json`: training curriculum
+## References / Design Influences
+- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs
+- [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling
+- [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance
+- [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic
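The idea behind the last two references, indexing digits by arithmetic significance, can be illustrated with a small sketch. It is a hypothetical helper written for this card, not Atom2.7m's actual digit-feature code, and it works on characters rather than BPE tokens. The `-1` marker for non-digits is also an assumption.

```python
ASCII_DIGITS = "0123456789"


def digit_significance(text: str) -> list[int]:
    """Per character: place-value index of a digit (0 = units), or -1 for non-digits.

    Each maximal run of digits is indexed from its rightmost digit, so digits
    with the same place value in different operands get the same index.
    """
    out = [-1] * len(text)
    i = len(text) - 1
    while i >= 0:
        if text[i] in ASCII_DIGITS:
            place = 0
            # Walk left through the current run of digits.
            while i >= 0 and text[i] in ASCII_DIGITS:
                out[i] = place
                place += 1
                i -= 1
        else:
            i -= 1
    return out


if __name__ == "__main__":
    equation = "123+45=168"
    for ch, sig in zip(equation, digit_significance(equation)):
        print(repr(ch), sig)
```

For `123+45=168`, the units digits `3`, `5`, and `8` all receive index 0, which is the alignment a model needs for column-wise addition regardless of operand length.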