---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- causal-lm
- gpt
- small-language-model
- arithmetic
- custom-tokenizer
- custom-code
- safetensors
- lm-evaluation-harness
datasets:
- openbmb/Ultra-FineWeb
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/finemath
- HuggingFaceTB/smollm-corpus
---

![bg](bg.png)

# Atom2.7m

Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.

The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark, where Atom2.7m scores 69.24% accuracy (1,731 of 2,500 examples). That is higher than the published scores of SmolLM2-1.7B (66.12%) and Qwen2.5-0.5B (63.04%), with only 2.74M parameters.

The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models with roughly 180 to 620 times as many parameters (0.5B and 1.7B versus 2.74M).

## Model Details

- Architecture: decoder-only GPT
- Parameters: 2,738,880
- Layers: 5
- Hidden size: 192
- Attention heads: 4
- KV heads: 2
- Attention: grouped-query causal self-attention with RoPE and XSA projection
- Context length: 512
- Vocabulary size: 4,096
- Token embeddings: tied input/output embeddings
- Arithmetic feature embeddings:
  - `place_vocab_size`: 66
  - `role_vocab_size`: 12

## Tokenizer

Use this model with `trust_remote_code=True`. The submission includes an `AtomTokenizer` remote-code wrapper in `tokenization_atom.py` so standard Hugging Face callers can use `AutoTokenizer.from_pretrained(...)`.

The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:

- digits `0`-`9` are atomic and never BPE-merged
- digit spans are emitted least-significant-digit first
- `+ - * / = ( )` are isolated atomic tokens
- whitespace is isolated from text
- arithmetic feature IDs are derived by the model from token IDs at inference time
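
The splitting rules above can be illustrated with a small pre-tokenization pass. This is a hedged sketch, not the code in `tokenization_atom.py`: the function names `pretokenize` and `restore_digit_order` are hypothetical, whitespace runs are assumed to stay together as one piece, and the remaining text runs are left whole where the real tokenizer would apply BPE.

```python
import re

# Hypothetical splitter: ASCII digit runs, single operator characters,
# whitespace runs, and everything else (which would go on to BPE).
_PIECES = re.compile(r"[0-9]+|[+\-*/=()]|\s+|[^0-9+\-*/=()\s]+")
_DIGIT_RUN = re.compile(r"[0-9]+")


def pretokenize(text: str) -> list[str]:
    """Split text and emit each digit span least-significant digit first."""
    pieces: list[str] = []
    for match in _PIECES.finditer(text):
        piece = match.group(0)
        if _DIGIT_RUN.fullmatch(piece):
            # One atomic token per digit, in reversed order.
            pieces.extend(reversed(piece))
        else:
            pieces.append(piece)
    return pieces


def restore_digit_order(text: str) -> str:
    """Undo the reversal so numbers read most-significant digit first again."""
    return _DIGIT_RUN.sub(lambda m: m.group(0)[::-1], text)


print(pretokenize("12 + 34 ="))
print(restore_digit_order("21 + 43 = 64"))
```

Reversing digit spans means the model produces the ones digit of a sum before the tens digit, which matches the order in which carries propagate.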

Training and custom tooling may still pass aligned `place_ids` and `role_ids`, but generic inference and evaluation only need `input_ids` and `attention_mask`.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "."  # path to a local checkout of this repository

model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True,
)

text = "12 + 34 ="
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model(**inputs)
```

## Evaluation

### ArithMark 2.0

Use the included benchmark script:

```bash
python benchmark_fusion_arithmark.py \
  --checkpoint . \
  --data-path arithmark_2.0.jsonl \
  --batch-size 64 \
  --device cuda \
  --output benchmark_results/fusion_arithmark_2.0_results.json
```

### lm-evaluation-harness

For lm-evaluation-harness tasks, use the standard `hf` model with remote code enabled:

```bash
lm_eval \
  --model hf \
  --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda:0 \
  --batch_size auto:1 \
  --output_path benchmark_results/lm_eval
```

`max_length=548` is passed to the lm-evaluation-harness wrapper so that long
multiple-choice continuations do not trip the harness assertion that a
continuation must fit inside the model window. The tokenizer also advertises
`model_max_length=548`, which matches the longest sequence observed in this
eval run. The checkpoint was trained with a 512-token context, but the RoPE
implementation can score this slightly longer harness window. If a task variant
contains longer continuations, set `max_length` to the longest sequence found,
and reduce the batch size if needed.
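
One way to find a suitable `max_length` for a new task variant is to measure the longest context plus continuation before running the harness. The helper below is a hypothetical sketch, not part of this repository or of lm-evaluation-harness; it takes any `encode` callable, and the toy whitespace encoder is only there to make the example self-contained.

```python
from typing import Callable, Iterable, Sequence


def longest_request_length(
    requests: Iterable[tuple[str, str]],
    encode: Callable[[str], Sequence],
) -> int:
    """Return the largest context + continuation token count across requests."""
    return max(
        (len(encode(context)) + len(encode(continuation))
         for context, continuation in requests),
        default=0,
    )


# Toy data and a toy encoder (whitespace split) stand in for a real task
# and the real tokenizer.
toy_requests = [
    ("a b c", "d"),
    ("a", "b c d e f"),
]
longest = longest_request_length(toy_requests, lambda s: s.split())
print(longest)
```

With the real tokenizer, `encode` could be `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`. The harness may tokenize context and continuation jointly, so treat the result as an estimate and leave a small margin.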

## Results

| Benchmark | Metric | Value |
| --- | --- | ---: |
| ArithMark 2.0 | acc | 0.6924 |
| arc_challenge | acc_norm | 0.2099 |
| arc_easy | acc_norm | 0.3161 |
| hellaswag | acc_norm | 0.2701 |
| piqa | acc_norm | 0.5299 |

## Training Data

The pretraining mixture targeted about 3.5B tokens:

- Ultra-FineWeb: 900M
- FineWeb-Edu: 900M
- FineMath: 450M
- Cosmopedia-v2: 337.5M
- UltraData-Math-L2-preview: 337.5M
- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
- Synthetic-Arithmetic: 350M

Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
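
As a quick check on the mixture listed above, the per-source shares can be computed directly from the token targets. The figures come from the list; the dictionary layout is only illustrative and is not the schema of `pretraining_curriculum.json`.

```python
# Token targets in millions, copied from the Training Data list.
mixture_tokens_m = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}

total_m = sum(mixture_tokens_m.values())
shares = {name: tokens / total_m for name, tokens in mixture_tokens_m.items()}

print(f"total: {total_m / 1000:.1f}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The targets sum to 3.5B tokens, and the synthetic arithmetic data makes up 10% of the mixture.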

## Limitations

- This is a very small model and should be treated as an experimental research artifact.
- Use `trust_remote_code=True` so `AutoTokenizer` applies the digit-span transform.
- Numeric text is represented least-significant-digit first internally.
- Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.

## Files

- `model.safetensors`: model weights
- `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code
- `tokenizer.json`, `tokenization_atom.py`: tokenizer files and remote-code wrapper
- `benchmark_fusion_arithmark.py`: ArithMark evaluation
- `arithmark_2.0.jsonl`: local ArithMark 2.0 data for the standalone benchmark script
- `pretraining_curriculum.json`: training curriculum

## References / Design Influences

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs
- [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling
- [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance
- [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic