---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- causal-lm
- gpt
- small-language-model
- arithmetic
- custom-tokenizer
- custom-code
- safetensors
- lm-evaluation-harness
datasets:
- openbmb/Ultra-FineWeb
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/finemath
- HuggingFaceTB/smollm-corpus
---

# Atom2.7m
Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 69.24% accuracy, above the published scores of SmolLM2-1.7B (66.12%) and Qwen2.5-0.5B (63.04%), while using only 2.74M parameters.
The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
## Model Details
- Architecture: decoder-only GPT
- Parameters: 2,738,880
- Layers: 5
- Hidden size: 192
- Attention heads: 4
- KV heads: 2
- Attention: grouped-query causal self-attention with RoPE and XSA projection
- Context length: 512
- Vocabulary size: 4,096
- Token embeddings: tied input/output embeddings
- Arithmetic feature embeddings:
  - `place_vocab_size`: 66
  - `role_vocab_size`: 12
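
As a rough illustration of the head layout listed above, the sketch below computes the per-head dimension and the usual grouped-query mapping from query heads to KV heads. The helper name `kv_head_for_query` and the contiguous grouping are assumptions made for illustration; the grouping actually used in `model.py` may differ.

```python
# Illustrative only: the three values come from the Model Details list.
HIDDEN_SIZE = 192
NUM_HEADS = 4
NUM_KV_HEADS = 2

# Each attention head works on an equal slice of the hidden size.
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS

# Number of query heads that share one key/value head.
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS


def kv_head_for_query(query_head: int) -> int:
    """Map a query head index to its KV head (assumed contiguous grouping)."""
    if not 0 <= query_head < NUM_HEADS:
        raise ValueError(f"query_head must be in [0, {NUM_HEADS})")
    return query_head // GROUP_SIZE


if __name__ == "__main__":
    print("head_dim:", HEAD_DIM)
    print("query -> kv:", [kv_head_for_query(q) for q in range(NUM_HEADS)])
```

With 4 query heads and 2 KV heads, each KV head serves two query heads, which halves the key/value projection size relative to full multi-head attention.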
## Tokenizer
Use this model with `trust_remote_code=True`. The submission includes an `AtomTokenizer` remote-code wrapper in `tokenization_atom.py` so standard Hugging Face callers can use `AutoTokenizer.from_pretrained(...)`.
The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:
- digits `0`-`9` are atomic and never BPE-merged
- digit spans are emitted least-significant-digit first
- `+ - * / = ( )` are isolated atomic tokens
- whitespace is isolated from text
- arithmetic feature IDs are derived by the model from token IDs at inference time
Training and custom tooling may still pass aligned `place_ids` and `role_ids`, but generic inference and evaluation only need `input_ids` and `attention_mask`.
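
The least-significant-digit-first rule above can be pictured with a small standalone sketch. This is not the code in `tokenization_atom.py`; the function name `reverse_digit_spans` is hypothetical, and the sketch only shows the digit-span reversal on plain strings, not BPE or token IDs.

```python
import re

# Matches maximal runs of ASCII digits.
_DIGIT_SPAN = re.compile(r"[0-9]+")


def reverse_digit_spans(text: str) -> str:
    """Rewrite every digit run so its least-significant digit comes first."""
    return _DIGIT_SPAN.sub(lambda m: m.group(0)[::-1], text)


if __name__ == "__main__":
    original = "12 + 34 = 46"
    transformed = reverse_digit_spans(original)
    print(transformed)  # 21 + 43 = 64
    # Reversing each span twice restores the original text.
    print(reverse_digit_spans(transformed) == original)  # True
```

Emitting the ones digit first means a left-to-right decoder produces each result digit in the order carries propagate, which is the motivation usually given for this ordering.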
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "."  # current directory, assumed to hold the model files

# Both the model and the tokenizer rely on the custom code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True,
)

text = "12 + 34 ="
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)
```
## Evaluation
### ArithMark 2.0
Use the included benchmark script:
```bash
python benchmark_fusion_arithmark.py \
  --checkpoint . \
  --data-path arithmark_2.0.jsonl \
  --batch-size 64 \
  --device cuda \
  --output benchmark_results/fusion_arithmark_2.0_results.json
```
### lm-evaluation-harness
For lm-evaluation-harness tasks, use the standard `hf` model with remote code enabled:
```bash
lm_eval \
  --model hf \
  --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda:0 \
  --batch_size auto:1 \
  --output_path benchmark_results/lm_eval
```
`max_length=548` is passed to the lm-evaluation-harness wrapper so long
multiple-choice continuations do not trip the harness assertion that a
continuation must fit inside the model window. The tokenizer also advertises
`model_max_length=548`, matching the longest sequence observed in this eval run.
The checkpoint was trained with a 512-token context, but the RoPE
implementation can score this slightly longer harness window. If a task
variant contains longer continuations, reduce the batch size or set
`max_length` to the longest sequence found.
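
One way to find a suitable `max_length` for a new task variant is to measure the longest tokenized request before running the harness. The sketch below is a minimal illustration on hand-written token ID lists; the function names are hypothetical and are not part of lm-evaluation-harness, and the pairs would in practice come from tokenizing each task's context and continuation.

```python
from typing import List, Sequence, Tuple

# A request is a (context_token_ids, continuation_token_ids) pair.
Request = Tuple[Sequence[int], Sequence[int]]


def longest_request_length(requests: List[Request]) -> int:
    """Return the largest len(context) + len(continuation), or 0 if empty."""
    return max((len(ctx) + len(cont) for ctx, cont in requests), default=0)


def continuations_fit(requests: List[Request], max_length: int) -> bool:
    """Check that every continuation alone fits inside the window."""
    return all(len(cont) <= max_length for _, cont in requests)


if __name__ == "__main__":
    # Made-up token IDs, for illustration only.
    demo = [([1, 2, 3], [4, 5]), ([6], [7])]
    print(longest_request_length(demo))
    print(continuations_fit(demo, max_length=548))
```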
## Results
| Benchmark | Metric | Value |
| --- | --- | ---: |
| ArithMark 2.0 | acc | 0.6924 |
| arc_challenge | acc_norm | 0.2099 |
| arc_easy | acc_norm | 0.3161 |
| hellaswag | acc_norm | 0.2701 |
| piqa | acc_norm | 0.5299 |
## Training Data
The pretraining mixture targeted about 3.5B tokens:
- Ultra-FineWeb: 900M
- FineWeb-Edu: 900M
- FineMath: 450M
- Cosmopedia-v2: 337.5M
- UltraData-Math-L2-preview: 337.5M
- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
- Synthetic-Arithmetic: 350M
Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
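
To make the mixture proportions explicit, the short sketch below recomputes each source's share from the token targets listed above. The dictionary and function names are illustrative and do not reflect the layout of `pretraining_curriculum.json`.

```python
# Target token counts in millions, copied from the list above.
MIXTURE_TOKENS_M = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}


def mixture_fractions(tokens_m: dict) -> dict:
    """Return each source's share of the total token budget."""
    total = sum(tokens_m.values())
    return {name: count / total for name, count in tokens_m.items()}


if __name__ == "__main__":
    total_m = sum(MIXTURE_TOKENS_M.values())
    print(f"total: {total_m / 1000:.1f}B tokens")
    for name, share in mixture_fractions(MIXTURE_TOKENS_M).items():
        print(f"{name}: {share:.1%}")
```

The targets sum to 3.5B tokens, of which Synthetic-Arithmetic accounts for 10%.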
## Limitations
- This is a very small model and should be treated as an experimental research artifact.
- Use `trust_remote_code=True` so `AutoTokenizer` applies the digit-span transform.
- Numeric text is represented least-significant-digit first internally.
- Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
## Files
- `model.safetensors`: model weights
- `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code
- `tokenizer.json`, `tokenization_atom.py`: tokenizer files and remote-code wrapper
- `benchmark_fusion_arithmark.py`: ArithMark evaluation
- `arithmark_2.0.jsonl`: local ArithMark 2.0 data for the standalone benchmark script
- `pretraining_curriculum.json`: training curriculum
## References / Design Influences
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs
- [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling
- [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance
- [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic