ucr-max committed on
Commit
2fd4f23
·
verified ·
1 Parent(s): 0ff3621

Update Atom2.7m submission

README.md CHANGED
@@ -25,7 +25,7 @@ datasets:
25
 
26
  Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
27
 
28
- The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 63.80% accuracy. If inserted into the benchmark card's published baseline table, this places it 6th overall, just above Qwen2.5-0.5B at 63.04% and below SmolLM2-1.7B at 66.12%, while using only 2.74M parameters.
29
 
30
  The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
31
 
@@ -47,7 +47,7 @@ The result shows the leverage of domain-specific design. With arithmetic-aware t
47
 
48
  ## Tokenizer
49
 
50
- This model should not be evaluated or used with a plain Hugging Face tokenizer path alone. It uses a custom fusion tokenizer implemented in `tokenizer_utils.py`.
51
 
52
  The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic sensitive spans specially:
53
 
@@ -55,56 +55,43 @@ The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic sens
55
  - digit spans are emitted least-significant-digit first
56
  - `+ - * / = ( )` are isolated atomic tokens
57
  - whitespace is isolated from text
58
- - `place_ids` are assigned to every digit run
59
- - `role_ids` are assigned only for strict integer equation spans
60
 
61
- The model expects aligned `input_ids`, `place_ids`, and `role_ids`.
62
 
63
  ## Usage
64
 
65
  ```python
66
- from pathlib import Path
67
-
68
  import torch
69
- from transformers import AutoModelForCausalLM
70
-
71
- from tokenizer_utils import load_tokenizer
72
 
73
- model_dir = Path(".")
74
 
75
  model = AutoModelForCausalLM.from_pretrained(
76
  model_dir,
77
  trust_remote_code=True,
78
  ).eval()
79
- tokenizer = load_tokenizer(model_dir)
 
 
 
80
 
81
  text = "12 + 34 ="
82
- encoding = tokenizer.encode(text)
83
-
84
- input_ids = torch.tensor([encoding.input_ids])
85
- place_ids = torch.tensor([encoding.place_ids])
86
- role_ids = torch.tensor([encoding.role_ids])
87
 
88
  with torch.no_grad():
89
- outputs = model(
90
- input_ids=input_ids,
91
- place_ids=place_ids,
92
- role_ids=role_ids,
93
- )
94
  ```
95
 
96
- For correct results, do not rely on `pipeline("text-generation")` unless it is wrapped to provide `place_ids` and `role_ids`.
97
-
98
  ## Evaluation
99
 
100
  ### ArithMark 2.0
101
 
102
- Use the included fusion-aware benchmark script:
103
 
104
  ```bash
105
  python benchmark_fusion_arithmark.py \
106
  --checkpoint . \
107
- --tokenizer-dir . \
108
  --data-path arithmark_2.0.jsonl \
109
  --batch-size 64 \
110
  --device cuda \
@@ -113,29 +100,36 @@ python benchmark_fusion_arithmark.py \
113
 
114
  ### lm-evaluation-harness
115
 
116
- Use the included launcher so the `atom2.7m` model wrapper is registered:
117
 
118
  ```bash
119
- python lm_eval_fusion run \
120
- --model atom2.7m \
121
- --model_args pretrained=.,tokenizer_dir=. \
122
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
123
  --device cuda:0 \
124
- --batch_size auto \
125
  --output_path benchmark_results/lm_eval
126
  ```
127
 
128
- The wrapper uses `tokenizer_utils.load_tokenizer()` and forwards `place_ids` and `role_ids` to the model.
 
 
 
 
 
 
 
129
 
130
  ## Results
131
 
132
  | Benchmark | Metric | Value |
133
  | --- | --- | ---: |
134
- | ArithMark 2.0 | acc | 0.6380 |
135
- | arc_challenge | acc_norm | 0.2261 |
136
- | arc_easy | acc_norm | 0.3270 |
137
- | hellaswag | acc_norm | 0.2733 |
138
- | piqa | acc_norm | 0.5305 |
139
 
140
  ## Training Data
141
 
@@ -154,7 +148,7 @@ Synthetic-Arithmetic is canonical integer equation data. The training curriculum
154
  ## Limitations
155
 
156
  - This is a very small model and should be treated as an experimental research artifact.
157
- - The custom tokenizer makes plain `AutoTokenizer` or default `lm_eval --model hf` unsuitable for final reported numbers.
158
  - Numeric text is represented least-significant-digit first internally.
159
  - Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
160
 
@@ -162,9 +156,9 @@ Synthetic-Arithmetic is canonical integer equation data. The training curriculum
162
 
163
  - `model.safetensors`: model weights
164
  - `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code
165
- - `tokenizer.json`, `tokenizer_utils.py`: tokenizer files and fusion wrapper
166
  - `benchmark_fusion_arithmark.py`: ArithMark evaluation
167
- - `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
168
  - `pretraining_curriculum.json`: training curriculum
169
 
170
  ## References / Design Influences
 
25
 
26
  Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
27
 
28
+ The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 69.24% accuracy. This places it above nearby published baselines such as SmolLM2-1.7B at 66.12% and Qwen2.5-0.5B at 63.04%, while using only 2.74M parameters.
29
 
30
  The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
31
 
 
47
 
48
  ## Tokenizer
49
 
50
+ Use this model with `trust_remote_code=True`. The submission includes an `AtomTokenizer` remote-code wrapper in `tokenization_atom.py` so standard Hugging Face callers can use `AutoTokenizer.from_pretrained(...)`.
51
 
52
  The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic sensitive spans specially:
53
 
 
55
  - digit spans are emitted least-significant-digit first
56
  - `+ - * / = ( )` are isolated atomic tokens
57
  - whitespace is isolated from text
58
+ - arithmetic feature IDs are derived by the model from token IDs at inference time
 
59
 
60
+ Training and custom tooling may still pass aligned `place_ids` and `role_ids`, but generic inference and evaluation only need `input_ids` and `attention_mask`.
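
The digit-span and operator rules listed above can be illustrated with a minimal sketch. This is not the code in `tokenization_atom.py`: `reverse_digit_spans` and `pre_split` are hypothetical helpers, and both the one-token-per-digit split and the grouping of whitespace into runs are assumptions made for illustration. The sketch only shows what "least-significant-digit first" means for the text before byte-level BPE sees it.

```python
import re


def reverse_digit_spans(text: str) -> str:
    """Hypothetical helper: reverse every run of ASCII digits in place,
    so the least-significant digit of each number comes first."""
    return re.sub(r"[0-9]+", lambda match: match.group(0)[::-1], text)


# Assumed pre-split: single digits, isolated operator characters,
# whitespace runs, and everything else as ordinary text chunks.
_PIECE = re.compile(r"[0-9]|[+\-*/=()]|\s+|[^0-9+\-*/=()\s]+")


def pre_split(text: str) -> list[str]:
    """Hypothetical helper: apply the digit reversal, then split into pieces."""
    return _PIECE.findall(reverse_digit_spans(text))


print(reverse_digit_spans("12 + 34 ="))
print(pre_split("12 + 34 ="))
```

Emitting the low-order digit first means a left-to-right decoder produces the units digit of a sum before the tens digit, which matches the order in which carries propagate.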
61
 
62
  ## Usage
63
 
64
  ```python
 
 
65
  import torch
66
+ from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
67
 
68
+ model_dir = "."
69
 
70
  model = AutoModelForCausalLM.from_pretrained(
71
  model_dir,
72
  trust_remote_code=True,
73
  ).eval()
74
+ tokenizer = AutoTokenizer.from_pretrained(
75
+ model_dir,
76
+ trust_remote_code=True,
77
+ )
78
 
79
  text = "12 + 34 ="
80
+ inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
 
 
 
 
81
 
82
  with torch.no_grad():
83
+ outputs = model(**inputs)
 
 
 
 
84
  ```
85
 
 
 
86
  ## Evaluation
87
 
88
  ### ArithMark 2.0
89
 
90
+ Use the included benchmark script:
91
 
92
  ```bash
93
  python benchmark_fusion_arithmark.py \
94
  --checkpoint . \
 
95
  --data-path arithmark_2.0.jsonl \
96
  --batch-size 64 \
97
  --device cuda \
 
100
 
101
  ### lm-evaluation-harness
102
 
103
+ For lm-evaluation-harness tasks, use the standard `hf` model with remote code enabled:
104
 
105
  ```bash
106
+ lm_eval \
107
+ --model hf \
108
+ --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
109
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
110
  --device cuda:0 \
111
+ --batch_size auto:1 \
112
  --output_path benchmark_results/lm_eval
113
  ```
114
 
115
+ `max_length=548` is passed to the lm-evaluation-harness wrapper so long
116
+ multiple-choice continuations do not trip the harness assertion that a
117
+ continuation must fit inside the model window. The tokenizer also advertises
118
+ `model_max_length=548`, matching the longest sequence observed in this eval run.
119
+ The checkpoint was trained with a 512-token context, but the RoPE
120
+ implementation can score this slightly longer harness window; reduce batch size
121
+ or set `max_length` to the longest sequence found if a task variant contains
122
+ longer continuations.
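
One way to choose `max_length` for a different task variant is to measure the longest context-plus-continuation sequence before running the harness. The sketch below is only an illustration under stated assumptions: `longest_request_length` is a hypothetical helper, and `toy_tokenize` stands in for the real tokenizer. With the actual checkpoint, the callable could be something like `lambda s: tokenizer(s, add_special_tokens=False).input_ids`.

```python
from typing import Callable, Iterable


def longest_request_length(
    pairs: Iterable[tuple[str, str]],
    tokenize: Callable[[str], list],
) -> int:
    """Hypothetical helper: max token count over (context, continuation) pairs."""
    return max(
        (len(tokenize(context + continuation)) for context, continuation in pairs),
        default=0,
    )


def toy_tokenize(text: str) -> list[str]:
    # Stand-in tokenizer for the sketch: split on whitespace.
    return text.split()


pairs = [
    ("2 + 2 =", " 4"),
    ("The capital of France is", " Paris , of course"),
]
print(longest_request_length(pairs, toy_tokenize))
```

The harness may add its own tokens or offsets on top of this count, so the measured value is a lower bound to start from rather than an exact setting.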
123
 
124
  ## Results
125
 
126
  | Benchmark | Metric | Value |
127
  | --- | --- | ---: |
128
+ | ArithMark 2.0 | acc | 0.6924 |
129
+ | arc_challenge | acc_norm | 0.2099 |
130
+ | arc_easy | acc_norm | 0.3161 |
131
+ | hellaswag | acc_norm | 0.2701 |
132
+ | piqa | acc_norm | 0.5299 |
133
 
134
  ## Training Data
135
 
 
148
  ## Limitations
149
 
150
  - This is a very small model and should be treated as an experimental research artifact.
151
+ - Use `trust_remote_code=True` so `AutoTokenizer` applies the digit-span transform.
152
  - Numeric text is represented least-significant-digit first internally.
153
  - Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
154
 
 
156
 
157
  - `model.safetensors`: model weights
158
  - `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code
159
+ - `tokenizer.json`, `tokenization_atom.py`: tokenizer files and remote-code wrapper
160
  - `benchmark_fusion_arithmark.py`: ArithMark evaluation
161
+ - `arithmark_2.0.jsonl`: local ArithMark 2.0 data for the standalone benchmark script
162
  - `pretraining_curriculum.json`: training curriculum
163
 
164
  ## References / Design Influences
arithmark_2.0.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
benchmark_fusion_arithmark.py CHANGED
@@ -1,4 +1,4 @@
1
- """Score an Atom2.7m checkpoint on ArithMark 2.0."""
2
 
3
  from __future__ import annotations
4
 
@@ -12,16 +12,13 @@ import urllib.request
12
 
13
  import torch
14
  import torch.nn.functional as F
15
- from transformers import AutoModelForCausalLM
16
-
17
- from tokenizer_utils import SPECIAL_TOKENS, FusionTokenizer, load_tokenizer
18
 
19
 
20
  DATA_URL = (
21
  "https://huggingface.co/datasets/AxiomicLabs/Arithmark-2.0/"
22
  "resolve/main/arithmark_2.0.jsonl"
23
  )
24
- PAD_ID = SPECIAL_TOKENS.index("<|pad|>")
25
 
26
 
27
  def ensure_data(path: Path) -> Path:
@@ -45,25 +42,20 @@ def load_examples(path: Path, *, max_examples: int = 0) -> list[dict]:
45
 
46
 
47
  def _encoded_choice(
48
- tokenizer: FusionTokenizer,
49
  context: str,
50
  ending: str,
51
- ) -> tuple[list[int], list[int], list[int], int]:
52
- context_encoding = tokenizer.encode(context)
53
- full_encoding = tokenizer.encode(context + ending)
54
- continuation_length = len(full_encoding.input_ids) - len(context_encoding.input_ids)
55
- return (
56
- full_encoding.input_ids,
57
- full_encoding.place_ids,
58
- full_encoding.role_ids,
59
- continuation_length,
60
- )
61
 
62
 
63
  @torch.inference_mode()
64
  def evaluate(
65
- model: torch.nn.Module,
66
- tokenizer: FusionTokenizer,
67
  examples: list[dict],
68
  *,
69
  device: torch.device,
@@ -79,6 +71,9 @@ def evaluate(
79
  failures: list[dict] = []
80
  failure_summary: Counter[tuple[str, str, str]] = Counter()
81
  model.eval()
 
 
 
82
 
83
  for start in range(0, len(examples), batch_size):
84
  batch_examples = examples[start : start + batch_size]
@@ -95,20 +90,16 @@ def evaluate(
95
  max_length = max(len(item[0]) for item in encoded)
96
  input_ids = torch.full(
97
  (len(encoded), max_length),
98
- PAD_ID,
99
  dtype=torch.long,
100
  device=device,
101
  )
102
- place_ids = torch.zeros_like(input_ids)
103
- role_ids = torch.zeros_like(input_ids)
104
  attention_mask = torch.zeros_like(input_ids, dtype=torch.bool)
105
  lengths = []
106
  continuation_lengths = []
107
- for row, (ids, places, roles, continuation_length) in enumerate(encoded):
108
  length = len(ids)
109
  input_ids[row, :length] = torch.tensor(ids, device=device)
110
- place_ids[row, :length] = torch.tensor(places, device=device)
111
- role_ids[row, :length] = torch.tensor(roles, device=device)
112
  attention_mask[row, :length] = True
113
  lengths.append(length)
114
  continuation_lengths.append(continuation_length)
@@ -121,8 +112,6 @@ def evaluate(
121
  with autocast:
122
  logits = model(
123
  input_ids=input_ids,
124
- place_ids=place_ids,
125
- role_ids=role_ids,
126
  attention_mask=attention_mask,
127
  ).logits
128
  log_probs = F.log_softmax(logits.float(), dim=-1)
@@ -195,7 +184,7 @@ def evaluate(
195
 
196
  results = {
197
  "benchmark": "arithmark_2.0",
198
- "model_type": "atom2.7m",
199
  "accuracy": correct / max(total, 1),
200
  "correct": correct,
201
  "total": total,
@@ -228,10 +217,10 @@ def evaluate(
228
  def parse_args() -> argparse.Namespace:
229
  parser = argparse.ArgumentParser(description=__doc__)
230
  parser.add_argument("--checkpoint", type=Path, default=Path("outputs/fusion_run/final_model"))
231
- parser.add_argument("--tokenizer-dir", type=Path, default=Path("tokenizer_4k"))
232
  parser.add_argument("--data-path", type=Path, default=Path("arithmark_2.0.jsonl"))
233
  parser.add_argument("--batch-size", type=int, default=64)
234
  parser.add_argument("--device", default="auto")
 
235
  parser.add_argument("--output", type=Path)
236
  parser.add_argument(
237
  "--max-examples",
@@ -263,11 +252,21 @@ def main() -> None:
263
 
264
  data_path = ensure_data(args.data_path)
265
  examples = load_examples(data_path, max_examples=args.max_examples)
 
 
 
 
 
 
 
266
  model = AutoModelForCausalLM.from_pretrained(
267
  args.checkpoint,
 
268
  trust_remote_code=True,
269
  ).to(device)
270
- tokenizer = load_tokenizer(args.tokenizer_dir)
 
 
271
  results = evaluate(
272
  model,
273
  tokenizer,
 
1
+ """Score a fusion GPT checkpoint on ArithMark 2.0."""
2
 
3
  from __future__ import annotations
4
 
 
12
 
13
  import torch
14
  import torch.nn.functional as F
15
+ from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
16
 
17
 
18
  DATA_URL = (
19
  "https://huggingface.co/datasets/AxiomicLabs/Arithmark-2.0/"
20
  "resolve/main/arithmark_2.0.jsonl"
21
  )
 
22
 
23
 
24
  def ensure_data(path: Path) -> Path:
 
42
 
43
 
44
  def _encoded_choice(
45
+ tokenizer,
46
  context: str,
47
  ending: str,
48
+ ) -> tuple[list[int], int]:
49
+ context_ids = tokenizer(context, add_special_tokens=False).input_ids
50
+ full_ids = tokenizer(context + ending, add_special_tokens=False).input_ids
51
+ continuation_length = len(full_ids) - len(context_ids)
52
+ return full_ids, continuation_length
 
 
 
 
 
53
 
54
 
55
  @torch.inference_mode()
56
  def evaluate(
57
+ model,
58
+ tokenizer,
59
  examples: list[dict],
60
  *,
61
  device: torch.device,
 
71
  failures: list[dict] = []
72
  failure_summary: Counter[tuple[str, str, str]] = Counter()
73
  model.eval()
74
+ pad_id = tokenizer.pad_token_id
75
+ if pad_id is None:
76
+ pad_id = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 0
77
 
78
  for start in range(0, len(examples), batch_size):
79
  batch_examples = examples[start : start + batch_size]
 
90
  max_length = max(len(item[0]) for item in encoded)
91
  input_ids = torch.full(
92
  (len(encoded), max_length),
93
+ int(pad_id),
94
  dtype=torch.long,
95
  device=device,
96
  )
 
 
97
  attention_mask = torch.zeros_like(input_ids, dtype=torch.bool)
98
  lengths = []
99
  continuation_lengths = []
100
+ for row, (ids, continuation_length) in enumerate(encoded):
101
  length = len(ids)
102
  input_ids[row, :length] = torch.tensor(ids, device=device)
 
 
103
  attention_mask[row, :length] = True
104
  lengths.append(length)
105
  continuation_lengths.append(continuation_length)
 
112
  with autocast:
113
  logits = model(
114
  input_ids=input_ids,
 
 
115
  attention_mask=attention_mask,
116
  ).logits
117
  log_probs = F.log_softmax(logits.float(), dim=-1)
 
184
 
185
  results = {
186
  "benchmark": "arithmark_2.0",
187
+ "model_type": "fusion_gpt",
188
  "accuracy": correct / max(total, 1),
189
  "correct": correct,
190
  "total": total,
 
217
  def parse_args() -> argparse.Namespace:
218
  parser = argparse.ArgumentParser(description=__doc__)
219
  parser.add_argument("--checkpoint", type=Path, default=Path("outputs/fusion_run/final_model"))
 
220
  parser.add_argument("--data-path", type=Path, default=Path("arithmark_2.0.jsonl"))
221
  parser.add_argument("--batch-size", type=int, default=64)
222
  parser.add_argument("--device", default="auto")
223
+ parser.add_argument("--dtype", default="auto", choices=("auto", "float32", "bfloat16", "float16"))
224
  parser.add_argument("--output", type=Path)
225
  parser.add_argument(
226
  "--max-examples",
 
252
 
253
  data_path = ensure_data(args.data_path)
254
  examples = load_examples(data_path, max_examples=args.max_examples)
255
+ dtype = None
256
+ if args.dtype == "float32":
257
+ dtype = torch.float32
258
+ elif args.dtype == "bfloat16":
259
+ dtype = torch.bfloat16
260
+ elif args.dtype == "float16":
261
+ dtype = torch.float16
262
  model = AutoModelForCausalLM.from_pretrained(
263
  args.checkpoint,
264
+ dtype=dtype,
265
  trust_remote_code=True,
266
  ).to(device)
267
+ tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True)
268
+ if tokenizer.pad_token_id is None:
269
+ tokenizer.pad_token = tokenizer.eos_token
270
  results = evaluate(
271
  model,
272
  tokenizer,
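
The hunks above show `continuation_length` and the log-softmax, but not the scoring step itself. A common way to score multiple-choice continuations, and only an assumption about what this script does, is to sum the log-probabilities of the continuation tokens and pick the choice with the highest total. The pure-Python sketch below uses a hypothetical helper `continuation_logprob` and plain lists instead of tensors.

```python
import math


def log_softmax(row: list[float]) -> list[float]:
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(value - peak) for value in row))
    return [value - log_z for value in row]


def continuation_logprob(
    logits: list[list[float]],
    ids: list[int],
    continuation_length: int,
) -> float:
    """Hypothetical helper: sum log P(token) over the last
    `continuation_length` tokens of `ids`.

    logits[t] holds the scores for the token at position t + 1, so the
    token at position `pos` is scored with logits[pos - 1].
    """
    length = len(ids)
    if not 0 <= continuation_length < length:
        raise ValueError("continuation must be shorter than the full sequence")
    total = 0.0
    for pos in range(length - continuation_length, length):
        total += log_softmax(logits[pos - 1])[ids[pos]]
    return total


# Uniform logits over a 3-token vocabulary: every token has probability 1/3.
uniform = [[0.0, 0.0, 0.0] for _ in range(4)]
print(continuation_logprob(uniform, [0, 1, 2, 1], continuation_length=2))
```

Padded positions never enter the sum because the loop stops at the unpadded `length`, which is why the script tracks `lengths` next to `continuation_lengths`.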
config.json CHANGED
@@ -8,6 +8,52 @@
8
  },
9
  "block_size": 512,
10
  "dtype": "float32",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  "head_dim": 48,
12
  "hidden_size": 192,
13
  "intermediate_size": 480,
 
8
  },
9
  "block_size": 512,
10
  "dtype": "float32",
11
+ "feature_digit_token_ids": [
12
+ 20,
13
+ 21,
14
+ 22,
15
+ 23,
16
+ 24,
17
+ 25,
18
+ 26,
19
+ 27,
20
+ 28,
21
+ 29
22
+ ],
23
+ "feature_equals_token_id": 33,
24
+ "feature_space_token_ids": [
25
+ 202,
26
+ 204,
27
+ 205,
28
+ 221,
29
+ 222,
30
+ 223,
31
+ 224,
32
+ 225,
33
+ 273,
34
+ 293,
35
+ 355,
36
+ 359,
37
+ 488,
38
+ 501,
39
+ 669,
40
+ 809,
41
+ 856,
42
+ 902,
43
+ 1168,
44
+ 1386,
45
+ 1407,
46
+ 1581,
47
+ 1687,
48
+ 2070,
49
+ 2165,
50
+ 2627,
51
+ 2951,
52
+ 3033,
53
+ 3218,
54
+ 3391,
55
+ 4076
56
+ ],
57
  "head_dim": 48,
58
  "hidden_size": 192,
59
  "intermediate_size": 480,
config.py CHANGED
@@ -22,6 +22,9 @@ DEFAULT_BLOCK_SIZE = 512
22
  DEFAULT_ROPE_THETA = 5000.0
23
  DEFAULT_PLACE_VOCAB_SIZE = 66
24
  DEFAULT_ROLE_VOCAB_SIZE = 12
 
 
 
25
 
26
 
27
  class GPTConfig(PretrainedConfig):
@@ -46,6 +49,9 @@ class GPTConfig(PretrainedConfig):
46
  use_role_embeddings: bool = True,
47
  place_vocab_size: int = DEFAULT_PLACE_VOCAB_SIZE,
48
  role_vocab_size: int = DEFAULT_ROLE_VOCAB_SIZE,
 
 
 
49
  **kwargs,
50
  ):
51
  if num_key_value_heads is None:
@@ -79,6 +85,25 @@ class GPTConfig(PretrainedConfig):
79
  self.use_role_embeddings = bool(use_role_embeddings)
80
  self.place_vocab_size = int(place_vocab_size)
81
  self.role_vocab_size = int(role_vocab_size)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
 
83
 
84
  def _bool_env(name: str, default: bool) -> bool:
@@ -107,13 +132,13 @@ class Hyperparameters:
107
  iterations: int = field(default_factory=lambda: int(os.environ.get("ITERATIONS", "10000")))
108
  requested_train_tokens: int = field(init=False)
109
  train_tokens: int = field(init=False)
110
- decay_start_frac: float = field(default_factory=lambda: float(os.environ.get("DECAY_START_FRAC", "0.7")))
111
  warmup_steps: int = field(default_factory=lambda: int(os.environ.get("WARMUP_STEPS", "0")))
112
  lr_warmup_steps: int = field(default_factory=lambda: int(os.environ.get("LR_WARMUP_STEPS", "500")))
113
  train_batch_tokens: int = field(default_factory=lambda: int(os.environ.get("TRAIN_BATCH_TOKENS", str(DEFAULT_BLOCK_SIZE * 512))))
114
  train_seq_len: int = field(init=False)
115
  eval_seq_len: int = field(init=False)
116
- grad_accum_steps: int = field(default_factory=lambda: int(os.environ.get("GRAD_ACCUM_STEPS", "4")))
117
  train_log_every: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_EVERY", "100")))
118
  train_log_dense_steps: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_DENSE_STEPS", "100")))
119
  train_log_ramp_steps: int = field(
@@ -146,12 +171,12 @@ class Hyperparameters:
146
  place_vocab_size: int = field(default_factory=lambda: int(os.environ.get("PLACE_VOCAB_SIZE", str(DEFAULT_PLACE_VOCAB_SIZE))))
147
  role_vocab_size: int = field(default_factory=lambda: int(os.environ.get("ROLE_VOCAB_SIZE", str(DEFAULT_ROLE_VOCAB_SIZE))))
148
 
149
- min_lr: float = field(default_factory=lambda: float(os.environ.get("MIN_LR", "0.0")))
150
- lr: float = field(default_factory=lambda: float(os.environ.get("LR", "0.004")))
151
  beta1: float = field(default_factory=lambda: float(os.environ.get("BETA1", "0.9")))
152
- beta2: float = field(default_factory=lambda: float(os.environ.get("BETA2", "0.95")))
153
  adam_eps: float = field(default_factory=lambda: float(os.environ.get("ADAM_EPS", "1e-8")))
154
- weight_decay: float = field(default_factory=lambda: float(os.environ.get("WEIGHT_DECAY", "0.005")))
155
 
156
  compile_model: bool = field(default_factory=lambda: _bool_env("COMPILE_MODEL", True))
157
  autocast: bool = field(default_factory=lambda: _bool_env("AUTOCAST", True))
 
22
  DEFAULT_ROPE_THETA = 5000.0
23
  DEFAULT_PLACE_VOCAB_SIZE = 66
24
  DEFAULT_ROLE_VOCAB_SIZE = 12
25
+ DEFAULT_FEATURE_DIGIT_TOKEN_IDS = tuple(range(20, 30))
26
+ DEFAULT_FEATURE_EQUALS_TOKEN_ID = 33
27
+ DEFAULT_FEATURE_SPACE_TOKEN_IDS = (225,)
28
 
29
 
30
  class GPTConfig(PretrainedConfig):
 
49
  use_role_embeddings: bool = True,
50
  place_vocab_size: int = DEFAULT_PLACE_VOCAB_SIZE,
51
  role_vocab_size: int = DEFAULT_ROLE_VOCAB_SIZE,
52
+ feature_digit_token_ids: list[int] | tuple[int, ...] | None = None,
53
+ feature_equals_token_id: int | None = DEFAULT_FEATURE_EQUALS_TOKEN_ID,
54
+ feature_space_token_ids: list[int] | tuple[int, ...] | None = None,
55
  **kwargs,
56
  ):
57
  if num_key_value_heads is None:
 
85
  self.use_role_embeddings = bool(use_role_embeddings)
86
  self.place_vocab_size = int(place_vocab_size)
87
  self.role_vocab_size = int(role_vocab_size)
88
+ self.feature_digit_token_ids = [
89
+ int(token_id)
90
+ for token_id in (
91
+ DEFAULT_FEATURE_DIGIT_TOKEN_IDS
92
+ if feature_digit_token_ids is None
93
+ else feature_digit_token_ids
94
+ )
95
+ ]
96
+ self.feature_equals_token_id = (
97
+ None if feature_equals_token_id is None else int(feature_equals_token_id)
98
+ )
99
+ self.feature_space_token_ids = [
100
+ int(token_id)
101
+ for token_id in (
102
+ DEFAULT_FEATURE_SPACE_TOKEN_IDS
103
+ if feature_space_token_ids is None
104
+ else feature_space_token_ids
105
+ )
106
+ ]
107
 
108
 
109
  def _bool_env(name: str, default: bool) -> bool:
 
132
  iterations: int = field(default_factory=lambda: int(os.environ.get("ITERATIONS", "10000")))
133
  requested_train_tokens: int = field(init=False)
134
  train_tokens: int = field(init=False)
135
+ decay_start_frac: float = field(default_factory=lambda: float(os.environ.get("DECAY_START_FRAC", "0.9")))
136
  warmup_steps: int = field(default_factory=lambda: int(os.environ.get("WARMUP_STEPS", "0")))
137
  lr_warmup_steps: int = field(default_factory=lambda: int(os.environ.get("LR_WARMUP_STEPS", "500")))
138
  train_batch_tokens: int = field(default_factory=lambda: int(os.environ.get("TRAIN_BATCH_TOKENS", str(DEFAULT_BLOCK_SIZE * 512))))
139
  train_seq_len: int = field(init=False)
140
  eval_seq_len: int = field(init=False)
141
+ grad_accum_steps: int = field(default_factory=lambda: int(os.environ.get("GRAD_ACCUM_STEPS", "2")))
142
  train_log_every: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_EVERY", "100")))
143
  train_log_dense_steps: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_DENSE_STEPS", "100")))
144
  train_log_ramp_steps: int = field(
 
171
  place_vocab_size: int = field(default_factory=lambda: int(os.environ.get("PLACE_VOCAB_SIZE", str(DEFAULT_PLACE_VOCAB_SIZE))))
172
  role_vocab_size: int = field(default_factory=lambda: int(os.environ.get("ROLE_VOCAB_SIZE", str(DEFAULT_ROLE_VOCAB_SIZE))))
173
 
174
+ min_lr: float = field(default_factory=lambda: float(os.environ.get("MIN_LR", "0.01")))
175
+ lr: float = field(default_factory=lambda: float(os.environ.get("LR", "0.005")))
176
  beta1: float = field(default_factory=lambda: float(os.environ.get("BETA1", "0.9")))
177
+ beta2: float = field(default_factory=lambda: float(os.environ.get("BETA2", "0.98")))
178
  adam_eps: float = field(default_factory=lambda: float(os.environ.get("ADAM_EPS", "1e-8")))
179
+ weight_decay: float = field(default_factory=lambda: float(os.environ.get("WEIGHT_DECAY", "0.001")))
180
 
181
  compile_model: bool = field(default_factory=lambda: _bool_env("COMPILE_MODEL", True))
182
  autocast: bool = field(default_factory=lambda: _bool_env("AUTOCAST", True))
configuration_gpt.py CHANGED
@@ -5,6 +5,9 @@ New code should import these from :mod:`GPT.config`.
5
 
6
  from .config import (
7
  DEFAULT_BLOCK_SIZE,
 
 
 
8
  DEFAULT_HEAD_DIM,
9
  DEFAULT_HIDDEN_SIZE,
10
  DEFAULT_INTERMEDIATE_SIZE,
@@ -21,6 +24,9 @@ from .config import (
21
 
22
  __all__ = [
23
  "DEFAULT_BLOCK_SIZE",
 
 
 
24
  "DEFAULT_HEAD_DIM",
25
  "DEFAULT_HIDDEN_SIZE",
26
  "DEFAULT_INTERMEDIATE_SIZE",
 
5
 
6
  from .config import (
7
  DEFAULT_BLOCK_SIZE,
8
+ DEFAULT_FEATURE_DIGIT_TOKEN_IDS,
9
+ DEFAULT_FEATURE_EQUALS_TOKEN_ID,
10
+ DEFAULT_FEATURE_SPACE_TOKEN_IDS,
11
  DEFAULT_HEAD_DIM,
12
  DEFAULT_HIDDEN_SIZE,
13
  DEFAULT_INTERMEDIATE_SIZE,
 
24
 
25
  __all__ = [
26
  "DEFAULT_BLOCK_SIZE",
27
+ "DEFAULT_FEATURE_DIGIT_TOKEN_IDS",
28
+ "DEFAULT_FEATURE_EQUALS_TOKEN_ID",
29
+ "DEFAULT_FEATURE_SPACE_TOKEN_IDS",
30
  "DEFAULT_HEAD_DIM",
31
  "DEFAULT_HIDDEN_SIZE",
32
  "DEFAULT_INTERMEDIATE_SIZE",
generation_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.57.6"
4
+ }
model.py CHANGED
@@ -21,6 +21,8 @@ CONTROL_TENSOR_NAME_PATTERNS = (
21
  "ln_",
22
  "rms",
23
  )
 
 
24
 
25
 
26
  class CastedLinear(nn.Linear):
@@ -98,6 +100,8 @@ class GPTAttention(nn.Module):
98
  self.o_proj = CastedLinear(self.n_head * self.head_dim, config.hidden_size, bias=False)
99
 
100
  def _xsa_efficient(self, y: Tensor, v_current: Tensor) -> Tensor:
 
 
101
  B, H, T, D = y.shape
102
  Hkv = v_current.size(1)
103
  group = H // Hkv
@@ -248,17 +252,89 @@ class GPTForCausalLM(GPTPreTrainedModel, GenerationMixin):
248
  def set_output_embeddings(self, new_embeddings):
249
  self.lm_head = new_embeddings
250
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
251
  def embed_tokens(self, input_ids, *, place_ids=None, role_ids=None, **kwargs):
 
 
 
 
 
 
 
 
 
252
  embeddings = self.transformer["wte"](input_ids)
253
  if self.place_embeddings is not None:
254
- if place_ids is None:
255
- place_ids = torch.zeros_like(input_ids)
256
  if place_ids.shape != input_ids.shape:
257
  raise ValueError("place_ids must match input_ids shape")
258
  embeddings = embeddings + self.place_embeddings(place_ids)
259
  if self.role_embeddings is not None:
260
- if role_ids is None:
261
- role_ids = torch.zeros_like(input_ids)
262
  if role_ids.shape != input_ids.shape:
263
  raise ValueError("role_ids must match input_ids shape")
264
  embeddings = embeddings + self.role_embeddings(role_ids)
@@ -297,6 +373,8 @@ class GPTForCausalLM(GPTPreTrainedModel, GenerationMixin):
297
  input_ids,
298
  attention_mask=None,
299
  labels=None,
 
 
300
  past_key_values: Optional[DynamicCache] = None,
301
  use_cache=False,
302
  **kwargs,
@@ -306,7 +384,12 @@ class GPTForCausalLM(GPTPreTrainedModel, GenerationMixin):
306
  past_key_values = DynamicCache()
307
 
308
  past_len = past_key_values.get_seq_length() if past_key_values is not None else 0
309
- x = self.embed_tokens(input_ids, **kwargs)
 
 
 
 
 
310
  cos, sin = self._get_freqs_cis(past_len + T, input_ids.device)
311
  freqs_cis = (
312
  cos[past_len:past_len + T],
 
21
  "ln_",
22
  "rms",
23
  )
24
+ RESULT_ROLE_ID = 10
25
+ SPACE_ROLE_ID = 11
26
 
27
 
28
  class CastedLinear(nn.Linear):
 
100
  self.o_proj = CastedLinear(self.n_head * self.head_dim, config.hidden_size, bias=False)
101
 
102
  def _xsa_efficient(self, y: Tensor, v_current: Tensor) -> Tensor:
103
+ # y: [B, H, T, D]
104
+ # v_current: [B, Hkv, T, D]
105
  B, H, T, D = y.shape
106
  Hkv = v_current.size(1)
107
  group = H // Hkv
 
252
  def set_output_embeddings(self, new_embeddings):
253
  self.lm_head = new_embeddings
254
 
255
+ def derive_features_from_input_ids(self, input_ids: Tensor) -> tuple[Tensor, Tensor]:
256
+ """Derive arithmetic auxiliary streams from token IDs.
257
+
258
+ This is the default-compatible path for HF/leaderboard callers that only
259
+ provide ``input_ids``. Training and specialized benchmarks may still pass
260
+ precomputed streams, which remain authoritative.
261
+ """
262
+ digit_ids = set(int(token_id) for token_id in getattr(self.config, "feature_digit_token_ids", []))
263
+ equals_id = getattr(self.config, "feature_equals_token_id", None)
264
+ space_ids = set(int(token_id) for token_id in getattr(self.config, "feature_space_token_ids", []))
265
+ place_overflow_id = int(getattr(self.config, "place_vocab_size", 1)) - 1
266
+ role_vocab_size = int(getattr(self.config, "role_vocab_size", 0))
267
+
268
+ place_ids = torch.zeros_like(input_ids)
269
+ role_ids = torch.zeros_like(input_ids)
270
+ if not digit_ids:
271
+ return place_ids, role_ids
272
+
273
+ input_cpu = input_ids.detach().to("cpu")
274
+ place_cpu = torch.zeros_like(input_cpu)
275
+ role_cpu = torch.zeros_like(input_cpu)
276
+
277
+ for row in range(input_cpu.size(0)):
278
+ ids = [int(token_id) for token_id in input_cpu[row].tolist()]
279
+
280
+ index = 0
281
+ digit_runs: list[tuple[int, int]] = []
282
+ while index < len(ids):
283
+ if ids[index] not in digit_ids:
284
+ index += 1
285
+ continue
286
+ run_start = index
287
+ offset = 1
288
+ while index < len(ids) and ids[index] in digit_ids:
289
+ place_cpu[row, index] = min(offset, place_overflow_id)
290
+ index += 1
291
+ offset += 1
292
+ digit_runs.append((run_start, index))
293
+
294
+ if equals_id is None or not digit_runs:
295
+ continue
296
+ equals_positions = [pos for pos, token_id in enumerate(ids) if token_id == int(equals_id)]
297
+ if len(equals_positions) != 1:
298
+ continue
299
+ equals_position = equals_positions[0]
300
+ operand_runs = [(start, end) for start, end in digit_runs if end <= equals_position]
301
+ result_runs = [(start, end) for start, end in digit_runs if start > equals_position]
302
+ if not operand_runs or len(operand_runs) > 9:
303
+ continue
304
+
305
+ if role_vocab_size > SPACE_ROLE_ID:
306
+ for pos, token_id in enumerate(ids):
307
+ if token_id in space_ids:
308
+ role_cpu[row, pos] = SPACE_ROLE_ID
309
+ for role, (start, end) in enumerate(operand_runs, start=1):
310
+ if role >= role_vocab_size:
311
+ break
312
+ role_cpu[row, start:end] = role
313
+ if role_vocab_size > RESULT_ROLE_ID:
314
+ for start, end in result_runs:
315
+ role_cpu[row, start:end] = RESULT_ROLE_ID
316
+
317
+ return (
318
+ place_cpu.to(device=input_ids.device, non_blocking=True),
319
+ role_cpu.to(device=input_ids.device, non_blocking=True),
320
+ )
321
+
     def embed_tokens(self, input_ids, *, place_ids=None, role_ids=None, **kwargs):
+        if (place_ids is None and self.place_embeddings is not None) or (
+            role_ids is None and self.role_embeddings is not None
+        ):
+            derived_place_ids, derived_role_ids = self.derive_features_from_input_ids(input_ids)
+            if place_ids is None:
+                place_ids = derived_place_ids
+            if role_ids is None:
+                role_ids = derived_role_ids
+
         embeddings = self.transformer["wte"](input_ids)
         if self.place_embeddings is not None:
             if place_ids.shape != input_ids.shape:
                 raise ValueError("place_ids must match input_ids shape")
             embeddings = embeddings + self.place_embeddings(place_ids)
         if self.role_embeddings is not None:
             if role_ids.shape != input_ids.shape:
                 raise ValueError("role_ids must match input_ids shape")
             embeddings = embeddings + self.role_embeddings(role_ids)
 
         input_ids,
         attention_mask=None,
         labels=None,
+        place_ids=None,
+        role_ids=None,
         past_key_values: Optional[DynamicCache] = None,
         use_cache=False,
         **kwargs,
 
             past_key_values = DynamicCache()

         past_len = past_key_values.get_seq_length() if past_key_values is not None else 0
+        x = self.embed_tokens(
+            input_ids,
+            place_ids=place_ids,
+            role_ids=role_ids,
+            **kwargs,
+        )
         cos, sin = self._get_freqs_cis(past_len + T, input_ids.device)
         freqs_cis = (
             cos[past_len:past_len + T],
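
As a rough illustration of the digit-run scan that `derive_features_from_input_ids` performs above, the sketch below reproduces only the `place_ids` part on plain Python lists. The token ids, the `digit_ids` set, and the overflow value are made up for the example and are not taken from the checkpoint's vocabulary or config.

```python
def derive_place_ids(ids, digit_ids, place_overflow_id):
    """Give each digit token its 1-based offset inside its run; 0 elsewhere.

    Offsets are clamped to place_overflow_id, mirroring the min(...) in the diff.
    """
    place = [0] * len(ids)
    index = 0
    while index < len(ids):
        if ids[index] not in digit_ids:
            index += 1
            continue
        offset = 1
        # Walk one contiguous run of digit tokens.
        while index < len(ids) and ids[index] in digit_ids:
            place[index] = min(offset, place_overflow_id)
            index += 1
            offset += 1
    return place


# Hypothetical vocabulary: ids 10..19 stand for digits, 1 for "+", 2 for "=".
digit_ids = set(range(10, 20))
ids = [13, 12, 11, 1, 14, 15, 2, 17, 17, 14]
place_ids = derive_place_ids(ids, digit_ids, place_overflow_id=2)
print(place_ids)
```

Because digit spans reach the model least-significant-digit first, an offset of 1 marks the ones place of each number under this scheme.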
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a21f5910c898c5464320d3f01b15ccde1eb278073266b221e48e9ec15ccbe899
+oid sha256:0344de127ce64d272153de099118d92c9cd58d2aaf07060bc730bd1cebfa2e33
 size 10930496
requirements.txt CHANGED
@@ -3,4 +3,3 @@ transformers
 tokenizers
 safetensors
 tqdm
-lm-eval
 
tokenization_atom.py ADDED
@@ -0,0 +1,73 @@
+"""Remote-code tokenizer for Atom/Fusion GPT checkpoints.
+
+The tokenizer is intentionally HF-compatible: generic callers can use
+``AutoTokenizer.from_pretrained(..., trust_remote_code=True)``. Arithmetic digit
+spans are reversed before tokenization so the model receives LSD-first numbers,
+matching pretraining.
+"""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from transformers import PreTrainedTokenizerFast
+
+
+class AtomTokenizer(PreTrainedTokenizerFast):
+    vocab_files_names = {"tokenizer_file": "tokenizer.json"}
+    model_input_names = ["input_ids", "attention_mask"]
+    slow_tokenizer_class = None
+    _digit_span_re = re.compile(r"\d+")
+
+    def __init__(self, *args: Any, **kwargs: Any) -> None:
+        kwargs.setdefault("bos_token", "<|bos|>")
+        kwargs.setdefault("eos_token", "<|eos|>")
+        kwargs.setdefault("unk_token", "<|unk|>")
+        kwargs.setdefault("pad_token", "<|pad|>")
+        super().__init__(*args, **kwargs)
+
+    @classmethod
+    def _reverse_digit_spans(cls, text: str) -> str:
+        return cls._digit_span_re.sub(lambda match: match.group(0)[::-1], text)
+
+    @classmethod
+    def _transform_text(cls, value: Any) -> Any:
+        if isinstance(value, str):
+            return cls._reverse_digit_spans(value)
+        if isinstance(value, tuple):
+            return tuple(cls._transform_text(item) for item in value)
+        if isinstance(value, list):
+            return [cls._transform_text(item) for item in value]
+        return value
+
+    def __call__(self, text=None, text_pair=None, *args: Any, **kwargs: Any):
+        return super().__call__(
+            self._transform_text(text),
+            self._transform_text(text_pair),
+            *args,
+            **kwargs,
+        )
+
+    def encode(self, text, text_pair=None, *args: Any, **kwargs: Any):
+        return super().encode(
+            self._transform_text(text),
+            self._transform_text(text_pair),
+            *args,
+            **kwargs,
+        )
+
+    def batch_encode_plus(self, batch_text_or_text_pairs, *args: Any, **kwargs: Any):
+        return super().batch_encode_plus(
+            self._transform_text(batch_text_or_text_pairs),
+            *args,
+            **kwargs,
+        )
+
+    def _decode(self, token_ids, skip_special_tokens: bool = False, **kwargs: Any) -> str:
+        text = super()._decode(
+            token_ids,
+            skip_special_tokens=skip_special_tokens,
+            **kwargs,
+        )
+        return self._reverse_digit_spans(text)
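
To make the LSD-first transform in `AtomTokenizer` concrete, here is a minimal standalone sketch of the same regex reversal, without any `transformers` dependency. The function name `reverse_digit_spans` and the sample string are illustrative only; the behaviour shown is just what the `\d+` substitution above implies.

```python
import re

# Same pattern as AtomTokenizer._digit_span_re in the diff above.
_DIGIT_SPAN_RE = re.compile(r"\d+")


def reverse_digit_spans(text: str) -> str:
    """Reverse every maximal run of digits, leaving all other characters alone."""
    return _DIGIT_SPAN_RE.sub(lambda match: match.group(0)[::-1], text)


# Encoding direction: numbers become least-significant-digit first.
encoded = reverse_digit_spans("123 + 456 = 579")
# Decoding direction: applying the same function again restores the text.
decoded = reverse_digit_spans(encoded)
print(encoded)
print(decoded)
```

Since reversing a digit run twice is the identity, one helper can serve both the encode path and `_decode`, which appears to be why the class reuses `_reverse_digit_spans` in both places.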
tokenizer_config.json CHANGED
@@ -2,10 +2,16 @@
   "additional_special_tokens": [
     "<|endoftext|>"
   ],
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_atom.AtomTokenizer",
+      null
+    ]
+  },
   "bos_token": "<|bos|>",
   "eos_token": "<|eos|>",
-  "model_max_length": 512,
+  "model_max_length": 548,
   "pad_token": "<|pad|>",
-  "tokenizer_class": "GPT2TokenizerFast",
+  "tokenizer_class": "AtomTokenizer",
   "unk_token": "<|unk|>"
 }