Update Atom2.7m submission

Files changed (12)

README.md +34 -40
arithmark_2.0.jsonl +0 -0
benchmark_fusion_arithmark.py +28 -29
config.json +46 -0
config.py +31 -6
configuration_gpt.py +6 -0
generation_config.json +4 -0
model.py +88 -5
model.safetensors +1 -1
requirements.txt +0 -1
tokenization_atom.py +73 -0
tokenizer_config.json +8 -2

README.md CHANGED

@@ -25,7 +25,7 @@ datasets:
 Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
-The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 63.80% accuracy. If inserted into the benchmark card's published baseline table, this places it 6th overall, just above Qwen2.5-0.5B at 63.04% and below SmolLM2-1.7B at 66.12%, while using only 2.74M parameters.
 The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
@@ -47,7 +47,7 @@ The result shows the leverage of domain-specific design. With arithmetic-aware t
 ## Tokenizer
-This model should not be evaluated or used with a plain Hugging Face tokenizer path alone. It uses a custom fusion tokenizer implemented in `tokenizer_utils.py`.
 The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:
@@ -55,56 +55,43 @@ The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic sens
 - digit spans are emitted least-significant-digit first
 - `+ - * / = ( )` are isolated atomic tokens
 - whitespace is isolated from text
-- `place_ids` are assigned to every digit run
-- `role_ids` are assigned only for strict integer equation spans
-The model expects aligned `input_ids`, `place_ids`, and `role_ids`.
 ## Usage
 ```python
-from pathlib import Path
 import torch
-from transformers import AutoModelForCausalLM
-from tokenizer_utils import load_tokenizer
-model_dir = Path(".")
 model = AutoModelForCausalLM.from_pretrained(
     model_dir,
     trust_remote_code=True,
 ).eval()
-tokenizer = load_tokenizer(model_dir)
 text = "12 + 34 ="
-encoding = tokenizer.encode(text)
-input_ids = torch.tensor([encoding.input_ids])
-place_ids = torch.tensor([encoding.place_ids])
-role_ids = torch.tensor([encoding.role_ids])
 with torch.no_grad():
-    outputs = model(
-        input_ids=input_ids,
-        place_ids=place_ids,
-        role_ids=role_ids,
-    )
 ```
-For correct results, do not rely on `pipeline("text-generation")` unless it is wrapped to provide `place_ids` and `role_ids`.
 ## Evaluation
 ### ArithMark 2.0
-Use the included fusion-aware benchmark script:
 ```bash
 python benchmark_fusion_arithmark.py \
   --checkpoint . \
-  --tokenizer-dir . \
   --data-path arithmark_2.0.jsonl \
   --batch-size 64 \
   --device cuda \
@@ -113,29 +100,36 @@ python benchmark_fusion_arithmark.py \
 ### lm-evaluation-harness
-Use the included launcher so the `atom2.7m` model wrapper is registered:
 ```bash
-python lm_eval_fusion run \
-  --model atom2.7m \
-  --model_args pretrained=.,tokenizer_dir=. \
   --tasks hellaswag,arc_easy,arc_challenge,piqa \
   --device cuda:0 \
-  --batch_size auto \
   --output_path benchmark_results/lm_eval
 ```
-The wrapper uses `tokenizer_utils.load_tokenizer()` and forwards `place_ids` and `role_ids` to the model.
 ## Results
 | Benchmark | Metric | Value |
 | --- | --- | ---: |
-| ArithMark 2.0 | acc | 0.6380 |
-| arc_challenge | acc_norm | 0.2261 |
-| arc_easy | acc_norm | 0.3270 |
-| hellaswag | acc_norm | 0.2733 |
-| piqa | acc_norm | 0.5305 |
 ## Training Data
@@ -154,7 +148,7 @@ Synthetic-Arithmetic is canonical integer equation data. The training curriculum
 ## Limitations
 - This is a very small model and should be treated as an experimental research artifact.
-- The custom tokenizer makes plain `AutoTokenizer` or default `lm_eval --model hf` unsuitable for final reported numbers.
 - Numeric text is represented least-significant-digit first internally.
 - Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
@@ -162,9 +156,9 @@ Synthetic-Arithmetic is canonical integer equation data. The training curriculum
 - `model.safetensors`: model weights
 - `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code
-- `tokenizer.json`, `tokenizer_utils.py`: tokenizer files and fusion wrapper
 - `benchmark_fusion_arithmark.py`: ArithMark evaluation
-- `lm_eval_fusion.py`, `lm_eval_fusion`: lm-eval custom model wrapper
 - `pretraining_curriculum.json`: training curriculum
 ## References / Design Influences

 Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
+The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark. Atom2.7m scores 69.24% accuracy. This places it above the nearby published baselines, SmolLM2-1.7B at 66.12% and Qwen2.5-0.5B at 63.04%, while using only 2.74M parameters.
 The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
 ## Tokenizer
+Use this model with `trust_remote_code=True`. The submission includes an `AtomTokenizer` remote-code wrapper in `tokenization_atom.py` so standard Hugging Face callers can use `AutoTokenizer.from_pretrained(...)`.
 The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:
 - digit spans are emitted least-significant-digit first
 - `+ - * / = ( )` are isolated atomic tokens
 - whitespace is isolated from text
+- arithmetic feature IDs are derived by the model from token IDs at inference time
+Training and custom tooling may still pass aligned `place_ids` and `role_ids`, but generic inference and evaluation only need `input_ids` and `attention_mask`.
 ## Usage
 ```python
 import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_dir = "."
 model = AutoModelForCausalLM.from_pretrained(
     model_dir,
     trust_remote_code=True,
 ).eval()
+tokenizer = AutoTokenizer.from_pretrained(
+    model_dir,
+    trust_remote_code=True,
+)
 text = "12 + 34 ="
+inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
 with torch.no_grad():
+    outputs = model(**inputs)
 ```
 ## Evaluation
 ### ArithMark 2.0
+Use the included benchmark script:
 ```bash
 python benchmark_fusion_arithmark.py \
   --checkpoint . \
   --data-path arithmark_2.0.jsonl \
   --batch-size 64 \
   --device cuda \
 ### lm-evaluation-harness
+For lm-evaluation-harness tasks, use the standard `hf` model with remote code enabled:
 ```bash
+lm_eval \
+  --model hf \
+  --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
   --tasks hellaswag,arc_easy,arc_challenge,piqa \
   --device cuda:0 \
+  --batch_size auto:1 \
   --output_path benchmark_results/lm_eval
 ```
+`max_length=548` is passed to the lm-evaluation-harness wrapper so long
+multiple-choice continuations do not trip the harness assertion that a
+continuation must fit inside the model window. The tokenizer also advertises
+`model_max_length=548`, matching the longest sequence observed in this eval run.
+The checkpoint was trained with a 512-token context, but the RoPE
+implementation can score this slightly longer harness window; reduce batch size
+or set `max_length` to the longest sequence found if a task variant contains
+longer continuations.
 ## Results
 | Benchmark | Metric | Value |
 | --- | --- | ---: |
+| ArithMark 2.0 | acc | 0.6924 |
+| arc_challenge | acc_norm | 0.2099 |
+| arc_easy | acc_norm | 0.3161 |
+| hellaswag | acc_norm | 0.2701 |
+| piqa | acc_norm | 0.5299 |
 ## Training Data
 ## Limitations
 - This is a very small model and should be treated as an experimental research artifact.
+- Use `trust_remote_code=True` so `AutoTokenizer` applies the digit-span transform.
 - Numeric text is represented least-significant-digit first internally.
 - Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
 - `model.safetensors`: model weights
 - `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code
+- `tokenizer.json`, `tokenization_atom.py`: tokenizer files and remote-code wrapper
 - `benchmark_fusion_arithmark.py`: ArithMark evaluation
+- `arithmark_2.0.jsonl`: local ArithMark 2.0 data for the standalone benchmark script
 - `pretraining_curriculum.json`: training curriculum
 ## References / Design Influences

arithmark_2.0.jsonl ADDED

The diff for this file is too large to render.

benchmark_fusion_arithmark.py CHANGED

@@ -1,4 +1,4 @@
-"""Score an Atom2.7m checkpoint on ArithMark 2.0."""
 from __future__ import annotations
@@ -12,16 +12,13 @@ import urllib.request
 import torch
 import torch.nn.functional as F
-from transformers import AutoModelForCausalLM
-from tokenizer_utils import SPECIAL_TOKENS, FusionTokenizer, load_tokenizer
 DATA_URL = (
     "https://huggingface.co/datasets/AxiomicLabs/Arithmark-2.0/"
     "resolve/main/arithmark_2.0.jsonl"
 )
-PAD_ID = SPECIAL_TOKENS.index("<|pad|>")
 def ensure_data(path: Path) -> Path:
@@ -45,25 +42,20 @@ def load_examples(path: Path, *, max_examples: int = 0) -> list[dict]:
 def _encoded_choice(
-    tokenizer: FusionTokenizer,
     context: str,
     ending: str,
-) -> tuple[list[int], list[int], list[int], int]:
-    context_encoding = tokenizer.encode(context)
-    full_encoding = tokenizer.encode(context + ending)
-    continuation_length = len(full_encoding.input_ids) - len(context_encoding.input_ids)
-    return (
-        full_encoding.input_ids,
-        full_encoding.place_ids,
-        full_encoding.role_ids,
-        continuation_length,
-    )
 @torch.inference_mode()
 def evaluate(
-    model: torch.nn.Module,
-    tokenizer: FusionTokenizer,
     examples: list[dict],
     *,
     device: torch.device,
@@ -79,6 +71,9 @@ def evaluate(
     failures: list[dict] = []
     failure_summary: Counter[tuple[str, str, str]] = Counter()
     model.eval()
     for start in range(0, len(examples), batch_size):
         batch_examples = examples[start : start + batch_size]
@@ -95,20 +90,16 @@ def evaluate(
         max_length = max(len(item[0]) for item in encoded)
         input_ids = torch.full(
             (len(encoded), max_length),
-            PAD_ID,
             dtype=torch.long,
             device=device,
         )
-        place_ids = torch.zeros_like(input_ids)
-        role_ids = torch.zeros_like(input_ids)
         attention_mask = torch.zeros_like(input_ids, dtype=torch.bool)
         lengths = []
         continuation_lengths = []
-        for row, (ids, places, roles, continuation_length) in enumerate(encoded):
             length = len(ids)
             input_ids[row, :length] = torch.tensor(ids, device=device)
-            place_ids[row, :length] = torch.tensor(places, device=device)
-            role_ids[row, :length] = torch.tensor(roles, device=device)
             attention_mask[row, :length] = True
             lengths.append(length)
             continuation_lengths.append(continuation_length)
@@ -121,8 +112,6 @@ def evaluate(
         with autocast:
             logits = model(
                 input_ids=input_ids,
-                place_ids=place_ids,
-                role_ids=role_ids,
                 attention_mask=attention_mask,
             ).logits
         log_probs = F.log_softmax(logits.float(), dim=-1)
@@ -195,7 +184,7 @@ def evaluate(
     results = {
         "benchmark": "arithmark_2.0",
-        "model_type": "atom2.7m",
         "accuracy": correct / max(total, 1),
         "correct": correct,
         "total": total,
@@ -228,10 +217,10 @@ def evaluate(
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument("--checkpoint", type=Path, default=Path("outputs/fusion_run/final_model"))
-    parser.add_argument("--tokenizer-dir", type=Path, default=Path("tokenizer_4k"))
     parser.add_argument("--data-path", type=Path, default=Path("arithmark_2.0.jsonl"))
     parser.add_argument("--batch-size", type=int, default=64)
     parser.add_argument("--device", default="auto")
     parser.add_argument("--output", type=Path)
     parser.add_argument(
         "--max-examples",
@@ -263,11 +252,21 @@ def main() -> None:
     data_path = ensure_data(args.data_path)
     examples = load_examples(data_path, max_examples=args.max_examples)
     model = AutoModelForCausalLM.from_pretrained(
         args.checkpoint,
         trust_remote_code=True,
     ).to(device)
-    tokenizer = load_tokenizer(args.tokenizer_dir)
     results = evaluate(
         model,
         tokenizer,

+"""Score a fusion GPT checkpoint on ArithMark 2.0."""
 from __future__ import annotations
 import torch
 import torch.nn.functional as F
+from transformers import AutoModelForCausalLM, AutoTokenizer
 DATA_URL = (
     "https://huggingface.co/datasets/AxiomicLabs/Arithmark-2.0/"
     "resolve/main/arithmark_2.0.jsonl"
 )
 def ensure_data(path: Path) -> Path:
 def _encoded_choice(
+    tokenizer,
     context: str,
     ending: str,
+) -> tuple[list[int], int]:
+    context_ids = tokenizer(context, add_special_tokens=False).input_ids
+    full_ids = tokenizer(context + ending, add_special_tokens=False).input_ids
+    continuation_length = len(full_ids) - len(context_ids)
+    return full_ids, continuation_length
 @torch.inference_mode()
 def evaluate(
+    model,
+    tokenizer,
     examples: list[dict],
     *,
     device: torch.device,
     failures: list[dict] = []
     failure_summary: Counter[tuple[str, str, str]] = Counter()
     model.eval()
+    pad_id = tokenizer.pad_token_id
+    if pad_id is None:
+        pad_id = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 0
     for start in range(0, len(examples), batch_size):
         batch_examples = examples[start : start + batch_size]
         max_length = max(len(item[0]) for item in encoded)
         input_ids = torch.full(
             (len(encoded), max_length),
+            int(pad_id),
             dtype=torch.long,
             device=device,
         )
         attention_mask = torch.zeros_like(input_ids, dtype=torch.bool)
         lengths = []
         continuation_lengths = []
+        for row, (ids, continuation_length) in enumerate(encoded):
             length = len(ids)
             input_ids[row, :length] = torch.tensor(ids, device=device)
             attention_mask[row, :length] = True
             lengths.append(length)
             continuation_lengths.append(continuation_length)
         with autocast:
             logits = model(
                 input_ids=input_ids,
                 attention_mask=attention_mask,
             ).logits
         log_probs = F.log_softmax(logits.float(), dim=-1)
     results = {
         "benchmark": "arithmark_2.0",
+        "model_type": "fusion_gpt",
         "accuracy": correct / max(total, 1),
         "correct": correct,
         "total": total,
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument("--checkpoint", type=Path, default=Path("outputs/fusion_run/final_model"))
     parser.add_argument("--data-path", type=Path, default=Path("arithmark_2.0.jsonl"))
     parser.add_argument("--batch-size", type=int, default=64)
     parser.add_argument("--device", default="auto")
+    parser.add_argument("--dtype", default="auto", choices=("auto", "float32", "bfloat16", "float16"))
     parser.add_argument("--output", type=Path)
     parser.add_argument(
         "--max-examples",
     data_path = ensure_data(args.data_path)
     examples = load_examples(data_path, max_examples=args.max_examples)
+    dtype = None
+    if args.dtype == "float32":
+        dtype = torch.float32
+    elif args.dtype == "bfloat16":
+        dtype = torch.bfloat16
+    elif args.dtype == "float16":
+        dtype = torch.float16
     model = AutoModelForCausalLM.from_pretrained(
         args.checkpoint,
+        dtype=dtype,
         trust_remote_code=True,
     ).to(device)
+    tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True)
+    if tokenizer.pad_token_id is None:
+        tokenizer.pad_token = tokenizer.eos_token
     results = evaluate(
         model,
         tokenizer,
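
The choice-encoding step in `_encoded_choice` can be illustrated without a model. The sketch below is a minimal, assumption-laden illustration: `toy_tokenize` is a hypothetical one-token-per-character stand-in (not the Atom tokenizer), and `continuation_logprob` and `pick_choice` are hypothetical helpers showing one common way to use a continuation length, namely summing log-probabilities over only the trailing continuation tokens. The elided parts of `evaluate` may differ in detail.

```python
from __future__ import annotations


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one token ID per character."""
    return [ord(ch) for ch in text]


def encoded_choice(context: str, ending: str) -> tuple[list[int], int]:
    """Mirror the diff: the continuation length is a token-count difference."""
    context_ids = toy_tokenize(context)
    full_ids = toy_tokenize(context + ending)
    return full_ids, len(full_ids) - len(context_ids)


def continuation_logprob(token_logprobs: list[float], continuation_length: int) -> float:
    """Sum log-probabilities over the last `continuation_length` tokens only."""
    if continuation_length <= 0:
        # Guard: a slice of [-0:] would select the whole list.
        return 0.0
    return sum(token_logprobs[-continuation_length:])


def pick_choice(scores: list[float]) -> int:
    """Return the index of the highest-scoring ending."""
    return max(range(len(scores)), key=scores.__getitem__)


full_ids, cont_len = encoded_choice("12 + 34 =", " 46")
```

The length-difference approach assumes the context tokenization is a prefix of the full tokenization. With a tokenizer that reverses digit spans, an ending that begins with a digit immediately after a context digit may change which tokens end up in the trailing positions, so that case is worth checking separately.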

config.json CHANGED

@@ -8,6 +8,52 @@
   },
   "block_size": 512,
   "dtype": "float32",
   "head_dim": 48,
   "hidden_size": 192,
   "intermediate_size": 480,

   },
   "block_size": 512,
   "dtype": "float32",
+  "feature_digit_token_ids": [
+    20,
+    21,
+    22,
+    23,
+    24,
+    25,
+    26,
+    27,
+    28,
+    29
+  ],
+  "feature_equals_token_id": 33,
+  "feature_space_token_ids": [
+    202,
+    204,
+    205,
+    221,
+    222,
+    223,
+    224,
+    225,
+    273,
+    293,
+    355,
+    359,
+    488,
+    501,
+    669,
+    809,
+    856,
+    902,
+    1168,
+    1386,
+    1407,
+    1581,
+    1687,
+    2070,
+    2165,
+    2627,
+    2951,
+    3033,
+    3218,
+    3391,
+    4076
+  ],
   "head_dim": 48,
   "hidden_size": 192,
   "intermediate_size": 480,

config.py CHANGED

@@ -22,6 +22,9 @@ DEFAULT_BLOCK_SIZE = 512
 DEFAULT_ROPE_THETA = 5000.0
 DEFAULT_PLACE_VOCAB_SIZE = 66
 DEFAULT_ROLE_VOCAB_SIZE = 12
 class GPTConfig(PretrainedConfig):
@@ -46,6 +49,9 @@ class GPTConfig(PretrainedConfig):
         use_role_embeddings: bool = True,
         place_vocab_size: int = DEFAULT_PLACE_VOCAB_SIZE,
         role_vocab_size: int = DEFAULT_ROLE_VOCAB_SIZE,
         **kwargs,
     ):
         if num_key_value_heads is None:
@@ -79,6 +85,25 @@ class GPTConfig(PretrainedConfig):
         self.use_role_embeddings = bool(use_role_embeddings)
         self.place_vocab_size = int(place_vocab_size)
         self.role_vocab_size = int(role_vocab_size)
 def _bool_env(name: str, default: bool) -> bool:
@@ -107,13 +132,13 @@ class Hyperparameters:
     iterations: int = field(default_factory=lambda: int(os.environ.get("ITERATIONS", "10000")))
     requested_train_tokens: int = field(init=False)
     train_tokens: int = field(init=False)
-    decay_start_frac: float = field(default_factory=lambda: float(os.environ.get("DECAY_START_FRAC", "0.7")))
     warmup_steps: int = field(default_factory=lambda: int(os.environ.get("WARMUP_STEPS", "0")))
     lr_warmup_steps: int = field(default_factory=lambda: int(os.environ.get("LR_WARMUP_STEPS", "500")))
     train_batch_tokens: int = field(default_factory=lambda: int(os.environ.get("TRAIN_BATCH_TOKENS", str(DEFAULT_BLOCK_SIZE * 512))))
     train_seq_len: int = field(init=False)
     eval_seq_len: int = field(init=False)
-    grad_accum_steps: int = field(default_factory=lambda: int(os.environ.get("GRAD_ACCUM_STEPS", "4")))
     train_log_every: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_EVERY", "100")))
     train_log_dense_steps: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_DENSE_STEPS", "100")))
     train_log_ramp_steps: int = field(
@@ -146,12 +171,12 @@ class Hyperparameters:
     place_vocab_size: int = field(default_factory=lambda: int(os.environ.get("PLACE_VOCAB_SIZE", str(DEFAULT_PLACE_VOCAB_SIZE))))
     role_vocab_size: int = field(default_factory=lambda: int(os.environ.get("ROLE_VOCAB_SIZE", str(DEFAULT_ROLE_VOCAB_SIZE))))
-    min_lr: float = field(default_factory=lambda: float(os.environ.get("MIN_LR", "0.0")))
-    lr: float = field(default_factory=lambda: float(os.environ.get("LR", "0.004")))
     beta1: float = field(default_factory=lambda: float(os.environ.get("BETA1", "0.9")))
-    beta2: float = field(default_factory=lambda: float(os.environ.get("BETA2", "0.95")))
     adam_eps: float = field(default_factory=lambda: float(os.environ.get("ADAM_EPS", "1e-8")))
-    weight_decay: float = field(default_factory=lambda: float(os.environ.get("WEIGHT_DECAY", "0.005")))
     compile_model: bool = field(default_factory=lambda: _bool_env("COMPILE_MODEL", True))
     autocast: bool = field(default_factory=lambda: _bool_env("AUTOCAST", True))

 DEFAULT_ROPE_THETA = 5000.0
 DEFAULT_PLACE_VOCAB_SIZE = 66
 DEFAULT_ROLE_VOCAB_SIZE = 12
+DEFAULT_FEATURE_DIGIT_TOKEN_IDS = tuple(range(20, 30))
+DEFAULT_FEATURE_EQUALS_TOKEN_ID = 33
+DEFAULT_FEATURE_SPACE_TOKEN_IDS = (225,)
 class GPTConfig(PretrainedConfig):
         use_role_embeddings: bool = True,
         place_vocab_size: int = DEFAULT_PLACE_VOCAB_SIZE,
         role_vocab_size: int = DEFAULT_ROLE_VOCAB_SIZE,
+        feature_digit_token_ids: list[int] | tuple[int, ...] | None = None,
+        feature_equals_token_id: int | None = DEFAULT_FEATURE_EQUALS_TOKEN_ID,
+        feature_space_token_ids: list[int] | tuple[int, ...] | None = None,
         **kwargs,
     ):
         if num_key_value_heads is None:
         self.use_role_embeddings = bool(use_role_embeddings)
         self.place_vocab_size = int(place_vocab_size)
         self.role_vocab_size = int(role_vocab_size)
+        self.feature_digit_token_ids = [
+            int(token_id)
+            for token_id in (
+                DEFAULT_FEATURE_DIGIT_TOKEN_IDS
+                if feature_digit_token_ids is None
+                else feature_digit_token_ids
+            )
+        ]
+        self.feature_equals_token_id = (
+            None if feature_equals_token_id is None else int(feature_equals_token_id)
+        )
+        self.feature_space_token_ids = [
+            int(token_id)
+            for token_id in (
+                DEFAULT_FEATURE_SPACE_TOKEN_IDS
+                if feature_space_token_ids is None
+                else feature_space_token_ids
+            )
+        ]
 def _bool_env(name: str, default: bool) -> bool:
     iterations: int = field(default_factory=lambda: int(os.environ.get("ITERATIONS", "10000")))
     requested_train_tokens: int = field(init=False)
     train_tokens: int = field(init=False)
+    decay_start_frac: float = field(default_factory=lambda: float(os.environ.get("DECAY_START_FRAC", "0.9")))
     warmup_steps: int = field(default_factory=lambda: int(os.environ.get("WARMUP_STEPS", "0")))
     lr_warmup_steps: int = field(default_factory=lambda: int(os.environ.get("LR_WARMUP_STEPS", "500")))
     train_batch_tokens: int = field(default_factory=lambda: int(os.environ.get("TRAIN_BATCH_TOKENS", str(DEFAULT_BLOCK_SIZE * 512))))
     train_seq_len: int = field(init=False)
     eval_seq_len: int = field(init=False)
+    grad_accum_steps: int = field(default_factory=lambda: int(os.environ.get("GRAD_ACCUM_STEPS", "2")))
     train_log_every: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_EVERY", "100")))
     train_log_dense_steps: int = field(default_factory=lambda: int(os.environ.get("TRAIN_LOG_DENSE_STEPS", "100")))
     train_log_ramp_steps: int = field(
     place_vocab_size: int = field(default_factory=lambda: int(os.environ.get("PLACE_VOCAB_SIZE", str(DEFAULT_PLACE_VOCAB_SIZE))))
     role_vocab_size: int = field(default_factory=lambda: int(os.environ.get("ROLE_VOCAB_SIZE", str(DEFAULT_ROLE_VOCAB_SIZE))))
+    min_lr: float = field(default_factory=lambda: float(os.environ.get("MIN_LR", "0.01")))
+    lr: float = field(default_factory=lambda: float(os.environ.get("LR", "0.005")))
     beta1: float = field(default_factory=lambda: float(os.environ.get("BETA1", "0.9")))
+    beta2: float = field(default_factory=lambda: float(os.environ.get("BETA2", "0.98")))
     adam_eps: float = field(default_factory=lambda: float(os.environ.get("ADAM_EPS", "1e-8")))
+    weight_decay: float = field(default_factory=lambda: float(os.environ.get("WEIGHT_DECAY", "0.001")))
     compile_model: bool = field(default_factory=lambda: _bool_env("COMPILE_MODEL", True))
     autocast: bool = field(default_factory=lambda: _bool_env("AUTOCAST", True))
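
The `Hyperparameters` fields in this diff take their defaults from environment variables through `default_factory`. A minimal sketch of that pattern follows, using a hypothetical `DemoHyperparameters` class that copies only two of the fields shown above; it is an illustration of the mechanism, not the full training configuration.

```python
import os
from dataclasses import dataclass, field


@dataclass
class DemoHyperparameters:
    """Hypothetical two-field subset of the pattern used in config.py."""

    # default_factory runs on every instantiation, so the environment is
    # consulted when the object is created, not when the module is imported.
    lr: float = field(default_factory=lambda: float(os.environ.get("LR", "0.005")))
    grad_accum_steps: int = field(
        default_factory=lambda: int(os.environ.get("GRAD_ACCUM_STEPS", "2"))
    )


# With the variables unset, the new defaults from the diff apply.
os.environ.pop("LR", None)
os.environ.pop("GRAD_ACCUM_STEPS", None)
defaults = DemoHyperparameters()

# Setting a variable before instantiation overrides the default.
os.environ["LR"] = "0.001"
overridden = DemoHyperparameters()
os.environ.pop("LR", None)
```

An explicit keyword argument such as `DemoHyperparameters(lr=0.1)` still takes precedence over both the environment and the default.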

configuration_gpt.py CHANGED

@@ -5,6 +5,9 @@ New code should import these from :mod:`GPT.config`.
 from .config import (
     DEFAULT_BLOCK_SIZE,
     DEFAULT_HEAD_DIM,
     DEFAULT_HIDDEN_SIZE,
     DEFAULT_INTERMEDIATE_SIZE,
@@ -21,6 +24,9 @@ from .config import (
 __all__ = [
     "DEFAULT_BLOCK_SIZE",
     "DEFAULT_HEAD_DIM",
     "DEFAULT_HIDDEN_SIZE",
     "DEFAULT_INTERMEDIATE_SIZE",

 from .config import (
     DEFAULT_BLOCK_SIZE,
+    DEFAULT_FEATURE_DIGIT_TOKEN_IDS,
+    DEFAULT_FEATURE_EQUALS_TOKEN_ID,
+    DEFAULT_FEATURE_SPACE_TOKEN_IDS,
     DEFAULT_HEAD_DIM,
     DEFAULT_HIDDEN_SIZE,
     DEFAULT_INTERMEDIATE_SIZE,
 __all__ = [
     "DEFAULT_BLOCK_SIZE",
+    "DEFAULT_FEATURE_DIGIT_TOKEN_IDS",
+    "DEFAULT_FEATURE_EQUALS_TOKEN_ID",
+    "DEFAULT_FEATURE_SPACE_TOKEN_IDS",
     "DEFAULT_HEAD_DIM",
     "DEFAULT_HIDDEN_SIZE",
     "DEFAULT_INTERMEDIATE_SIZE",

generation_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "_from_model_config": true,
+  "transformers_version": "4.57.6"
+}

model.py CHANGED

@@ -21,6 +21,8 @@ CONTROL_TENSOR_NAME_PATTERNS = (
     "ln_",
     "rms",
 )
 class CastedLinear(nn.Linear):
@@ -98,6 +100,8 @@ class GPTAttention(nn.Module):
         self.o_proj = CastedLinear(self.n_head * self.head_dim, config.hidden_size, bias=False)
     def _xsa_efficient(self, y: Tensor, v_current: Tensor) -> Tensor:
         B, H, T, D = y.shape
         Hkv = v_current.size(1)
         group = H // Hkv
@@ -248,17 +252,89 @@ class GPTForCausalLM(GPTPreTrainedModel, GenerationMixin):
     def set_output_embeddings(self, new_embeddings):
         self.lm_head = new_embeddings
     def embed_tokens(self, input_ids, *, place_ids=None, role_ids=None, **kwargs):
         embeddings = self.transformer["wte"](input_ids)
         if self.place_embeddings is not None:
-            if place_ids is None:
-                place_ids = torch.zeros_like(input_ids)
             if place_ids.shape != input_ids.shape:
                 raise ValueError("place_ids must match input_ids shape")
             embeddings = embeddings + self.place_embeddings(place_ids)
         if self.role_embeddings is not None:
-            if role_ids is None:
-                role_ids = torch.zeros_like(input_ids)
             if role_ids.shape != input_ids.shape:
                 raise ValueError("role_ids must match input_ids shape")
             embeddings = embeddings + self.role_embeddings(role_ids)
@@ -297,6 +373,8 @@ class GPTForCausalLM(GPTPreTrainedModel, GenerationMixin):
         input_ids,
         attention_mask=None,
         labels=None,
         past_key_values: Optional[DynamicCache] = None,
         use_cache=False,
         **kwargs,
@@ -306,7 +384,12 @@ class GPTForCausalLM(GPTPreTrainedModel, GenerationMixin):
             past_key_values = DynamicCache()
         past_len = past_key_values.get_seq_length() if past_key_values is not None else 0
-        x = self.embed_tokens(input_ids, **kwargs)
         cos, sin = self._get_freqs_cis(past_len + T, input_ids.device)
         freqs_cis = (
             cos[past_len:past_len + T],

     "ln_",
     "rms",
 )
+RESULT_ROLE_ID = 10
+SPACE_ROLE_ID = 11
 class CastedLinear(nn.Linear):
         self.o_proj = CastedLinear(self.n_head * self.head_dim, config.hidden_size, bias=False)
     def _xsa_efficient(self, y: Tensor, v_current: Tensor) -> Tensor:
+        # y:         [B, H,   T, D]
+        # v_current: [B, Hkv, T, D]
         B, H, T, D = y.shape
         Hkv = v_current.size(1)
         group = H // Hkv
     def set_output_embeddings(self, new_embeddings):
         self.lm_head = new_embeddings
+    def derive_features_from_input_ids(self, input_ids: Tensor) -> tuple[Tensor, Tensor]:
+        """Derive arithmetic auxiliary streams from token IDs.
+        This is the default-compatible path for HF/leaderboard callers that only
+        provide ``input_ids``. Training and specialized benchmarks may still pass
+        precomputed streams, which remain authoritative.
+        """
+        digit_ids = set(int(token_id) for token_id in getattr(self.config, "feature_digit_token_ids", []))
+        equals_id = getattr(self.config, "feature_equals_token_id", None)
+        space_ids = set(int(token_id) for token_id in getattr(self.config, "feature_space_token_ids", []))
+        place_overflow_id = int(getattr(self.config, "place_vocab_size", 1)) - 1
+        role_vocab_size = int(getattr(self.config, "role_vocab_size", 0))
+        place_ids = torch.zeros_like(input_ids)
+        role_ids = torch.zeros_like(input_ids)
+        if not digit_ids:
+            return place_ids, role_ids
+        input_cpu = input_ids.detach().to("cpu")
+        place_cpu = torch.zeros_like(input_cpu)
+        role_cpu = torch.zeros_like(input_cpu)
+        for row in range(input_cpu.size(0)):
+            ids = [int(token_id) for token_id in input_cpu[row].tolist()]
+            index = 0
+            digit_runs: list[tuple[int, int]] = []
+            while index < len(ids):
+                if ids[index] not in digit_ids:
+                    index += 1
+                    continue
+                run_start = index
+                offset = 1
+                while index < len(ids) and ids[index] in digit_ids:
+                    place_cpu[row, index] = min(offset, place_overflow_id)
+                    index += 1
+                    offset += 1
+                digit_runs.append((run_start, index))
+            if equals_id is None or not digit_runs:
+                continue
+            equals_positions = [pos for pos, token_id in enumerate(ids) if token_id == int(equals_id)]
+            if len(equals_positions) != 1:
+                continue
+            equals_position = equals_positions[0]
+            operand_runs = [(start, end) for start, end in digit_runs if end <= equals_position]
+            result_runs = [(start, end) for start, end in digit_runs if start > equals_position]
+            if not operand_runs or len(operand_runs) > 9:
+                continue
+            if role_vocab_size > SPACE_ROLE_ID:
+                for pos, token_id in enumerate(ids):
+                    if token_id in space_ids:
+                        role_cpu[row, pos] = SPACE_ROLE_ID
+            for role, (start, end) in enumerate(operand_runs, start=1):
+                if role >= role_vocab_size:
+                    break
+                role_cpu[row, start:end] = role
+            if role_vocab_size > RESULT_ROLE_ID:
+                for start, end in result_runs:
+                    role_cpu[row, start:end] = RESULT_ROLE_ID
+        return (
+            place_cpu.to(device=input_ids.device, non_blocking=True),
+            role_cpu.to(device=input_ids.device, non_blocking=True),
+        )
     def embed_tokens(self, input_ids, *, place_ids=None, role_ids=None, **kwargs):
+        if (place_ids is None and self.place_embeddings is not None) or (
+            role_ids is None and self.role_embeddings is not None
+        ):
+            derived_place_ids, derived_role_ids = self.derive_features_from_input_ids(input_ids)
+            if place_ids is None:
+                place_ids = derived_place_ids
+            if role_ids is None:
+                role_ids = derived_role_ids
         embeddings = self.transformer["wte"](input_ids)
         if self.place_embeddings is not None:
             if place_ids.shape != input_ids.shape:
                 raise ValueError("place_ids must match input_ids shape")
             embeddings = embeddings + self.place_embeddings(place_ids)
         if self.role_embeddings is not None:
             if role_ids.shape != input_ids.shape:
                 raise ValueError("role_ids must match input_ids shape")
             embeddings = embeddings + self.role_embeddings(role_ids)
         input_ids,
         attention_mask=None,
         labels=None,
+        place_ids=None,
+        role_ids=None,
         past_key_values: Optional[DynamicCache] = None,
         use_cache=False,
         **kwargs,
             past_key_values = DynamicCache()
         past_len = past_key_values.get_seq_length() if past_key_values is not None else 0
+        x = self.embed_tokens(
+            input_ids,
+            place_ids=place_ids,
+            role_ids=role_ids,
+            **kwargs,
+        )
         cos, sin = self._get_freqs_cis(past_len + T, input_ids.device)
         freqs_cis = (
             cos[past_len:past_len + T],
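
Note on the role assignment in `derive_features_from_input_ids`: the following is a minimal pure-Python sketch of the same per-row logic, written without tensors so it can be traced by hand. The function name `assign_roles`, the default role `0` for untouched positions, and the value passed for `result_role_id` are assumptions made for illustration; the diff does not show the actual constants or how `role_cpu` is initialised. The whitespace-role step is omitted for brevity.

```python
def assign_roles(ids, digit_runs, equals_id, result_role_id, role_vocab_size):
    """Sketch of the per-row role labelling (hypothetical helper, not the model's API)."""
    roles = [0] * len(ids)  # assumption: untouched positions keep role 0

    # Only rows with exactly one "=" token are treated as equations.
    equals_positions = [pos for pos, tok in enumerate(ids) if tok == equals_id]
    if len(equals_positions) != 1:
        return roles
    eq = equals_positions[0]

    # Digit runs are half-open (start, end) index pairs.
    operand_runs = [(s, e) for s, e in digit_runs if e <= eq]
    result_runs = [(s, e) for s, e in digit_runs if s > eq]
    if not operand_runs or len(operand_runs) > 9:
        return roles

    # Operands are numbered 1, 2, ... in order of appearance.
    for role, (s, e) in enumerate(operand_runs, start=1):
        if role >= role_vocab_size:
            break
        roles[s:e] = [role] * (e - s)

    # Every digit run after "=" shares the result role, if the vocabulary has room for it.
    if role_vocab_size > result_role_id:
        for s, e in result_runs:
            roles[s:e] = [result_role_id] * (e - s)
    return roles


# Toy token ids for a sequence shaped like "12+3=15"; 11 stands in for the "=" id.
example_roles = assign_roles(
    ids=[1, 2, 10, 3, 11, 5, 1],
    digit_runs=[(0, 2), (3, 4), (5, 7)],
    equals_id=11,
    result_role_id=10,   # hypothetical value
    role_vocab_size=12,  # hypothetical value
)
```

With these toy inputs the two operands receive roles 1 and 2, the operator and `=` positions keep the default, and the digits after `=` receive the result role. A row containing zero or several `=` tokens is left entirely at the default.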

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a21f5910c898c5464320d3f01b15ccde1eb278073266b221e48e9ec15ccbe899
+oid sha256:0344de127ce64d272153de099118d92c9cd58d2aaf07060bc730bd1cebfa2e33
 size 10930496

requirements.txt CHANGED

@@ -3,4 +3,3 @@ transformers
 tokenizers
 safetensors
 tqdm
-lm-eval

tokenization_atom.py ADDED

@@ -0,0 +1,73 @@

+"""Remote-code tokenizer for Atom/Fusion GPT checkpoints.
+The tokenizer is intentionally HF-compatible: generic callers can use
+``AutoTokenizer.from_pretrained(..., trust_remote_code=True)``. Arithmetic digit
+spans are reversed before tokenization so the model receives LSD-first numbers,
+matching pretraining.
+"""
+from __future__ import annotations
+import re
+from typing import Any
+from transformers import PreTrainedTokenizerFast
+class AtomTokenizer(PreTrainedTokenizerFast):
+    vocab_files_names = {"tokenizer_file": "tokenizer.json"}
+    model_input_names = ["input_ids", "attention_mask"]
+    slow_tokenizer_class = None
+    _digit_span_re = re.compile(r"\d+")
+    def __init__(self, *args: Any, **kwargs: Any) -> None:
+        kwargs.setdefault("bos_token", "<|bos|>")
+        kwargs.setdefault("eos_token", "<|eos|>")
+        kwargs.setdefault("unk_token", "<|unk|>")
+        kwargs.setdefault("pad_token", "<|pad|>")
+        super().__init__(*args, **kwargs)
+    @classmethod
+    def _reverse_digit_spans(cls, text: str) -> str:
+        return cls._digit_span_re.sub(lambda match: match.group(0)[::-1], text)
+    @classmethod
+    def _transform_text(cls, value: Any) -> Any:
+        if isinstance(value, str):
+            return cls._reverse_digit_spans(value)
+        if isinstance(value, tuple):
+            return tuple(cls._transform_text(item) for item in value)
+        if isinstance(value, list):
+            return [cls._transform_text(item) for item in value]
+        return value
+    def __call__(self, text=None, text_pair=None, *args: Any, **kwargs: Any):
+        return super().__call__(
+            self._transform_text(text),
+            self._transform_text(text_pair),
+            *args,
+            **kwargs,
+        )
+    def encode(self, text, text_pair=None, *args: Any, **kwargs: Any):
+        return super().encode(
+            self._transform_text(text),
+            self._transform_text(text_pair),
+            *args,
+            **kwargs,
+        )
+    def batch_encode_plus(self, batch_text_or_text_pairs, *args: Any, **kwargs: Any):
+        return super().batch_encode_plus(
+            self._transform_text(batch_text_or_text_pairs),
+            *args,
+            **kwargs,
+        )
+    def _decode(self, token_ids, skip_special_tokens: bool = False, **kwargs: Any) -> str:
+        text = super()._decode(
+            token_ids,
+            skip_special_tokens=skip_special_tokens,
+            **kwargs,
+        )
+        return self._reverse_digit_spans(text)
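
Note on the digit-span reversal: the sketch below isolates the same regular-expression substitution that `_reverse_digit_spans` applies, as a standalone function with no `transformers` dependency. The module-level names are illustrative only. It shows that the transform is its own inverse, which is why the tokenizer can apply it on encode and again in `_decode`.

```python
import re

# Same pattern as AtomTokenizer._digit_span_re: one or more consecutive digits.
_DIGIT_SPAN_RE = re.compile(r"\d+")


def reverse_digit_spans(text: str) -> str:
    """Reverse each maximal run of digits in place, leaving other characters alone."""
    return _DIGIT_SPAN_RE.sub(lambda match: match.group(0)[::-1], text)


# What the model sees: every number is written least-significant-digit first.
encoded_view = reverse_digit_spans("123 + 45 = 168")

# Applying the same function again restores the original text.
restored = reverse_digit_spans(encoded_view)

# Each digit run is reversed independently, so a decimal point splits a number
# into two separately reversed spans.
decimal_view = reverse_digit_spans("3.14")
```

Here `encoded_view` is `"321 + 54 = 861"` and `restored` equals the original string. `decimal_view` is `"3.41"`, which illustrates that the pattern treats the integer and fractional parts as separate spans; whether that matches the pretraining data for non-integer inputs is not stated in this diff.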

tokenizer_config.json CHANGED

@@ -2,10 +2,16 @@
   "additional_special_tokens": [
     "<|endoftext|>"
   ],
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_atom.AtomTokenizer",
+      null
+    ]
+  },
   "bos_token": "<|bos|>",
   "eos_token": "<|eos|>",
-  "model_max_length": 512,
+  "model_max_length": 548,
   "pad_token": "<|pad|>",
-  "tokenizer_class": "GPT2TokenizerFast",
+  "tokenizer_class": "AtomTokenizer",
   "unk_token": "<|unk|>"
 }