---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---

# Shard-40m-v1

A 54.5M-parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.

This is the first checkpoint in the Shard series of small experimental transformers.

## Architecture

```
Total params:       54,538,752 (~54.5M)
Hidden dim:         512
Layers:             12
Attention heads:    8 (MHA, no GQA)
Head dim:           64
MLP intermediate:   2048 (SwiGLU)
Vocab size:         8192
Max sequence:       8192
Attention pattern:  Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm:               RMSNorm, pre-norm
Position encoding:  RoPE on Q and K
Embeddings:         tied input/output
Activation:         SwiGLU
MoE:                none
Engram:             none
```

## Training

```
Phase 1 (pretrain):
  Compute:      Thunder Compute single GPU
  Steps:        48,220 of a 100,000-step target (paused early)
  Throughput:   86,800 tokens per second
  Optimizer:    Muon for hidden 2D weights, AdamW for embeddings and norms
  LR schedule:  WSD (warmup-stable-decay)
  Stabilizers:  lm_head logit cap 30, z-loss coefficient 1e-4

Phase 2 (anneal):
  Compute:      Colab A100
  Steps:        20,000 (full anneal complete)
  Final cross-entropy: 3.27
  Mix:          OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```

## Files

- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)

## How to load

```python
import sys, torch

sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

# Load the anneal-final checkpoint on CPU, then move the model to GPU in bf16.
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

# Greedy decode 40 tokens from a short prompt.
tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```

## Benchmark

Greedy decoding runs at 47 tokens per second on a single CUDA GPU. The model footprint is 109 MB in bf16, with 16 MB peak inference memory.

Sampled outputs at temperature 0.7, top_p 0.9:

| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |

## What this artifact proves

The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, the Gemma 4 alternating attention pattern, and an anneal phase mixing math, code, and prose all remained stable. Loss decreased monotonically through pretrain, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.
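For reference, below is a minimal sketch of the alternating sliding-window/global masking described in the Architecture table, assuming layers alternate starting with a windowed layer and that the final layer is forced global. The function names and the exact alternation rule are illustrative assumptions; `code/model.py` in this repository is the authoritative implementation.

```python
import torch

def layer_is_global(layer_idx: int, n_layers: int = 12) -> bool:
    # Assumed rule: odd-indexed layers attend globally, even-indexed layers
    # use the sliding window, and the last layer is always global.
    return layer_idx % 2 == 1 or layer_idx == n_layers - 1

def attention_mask(seq_len: int, layer_idx: int, window: int = 1024) -> torch.Tensor:
    # Boolean mask: True where query position i may attend to key position j.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i
    if layer_is_global(layer_idx):
        return causal
    return causal & (i - j < window)  # keep only the most recent `window` keys

# At a 4k context, a windowed layer sees 1024 past tokens; a global layer sees all.
print(attention_mask(4096, layer_idx=0)[-1].sum().item())   # 1024
print(attention_mask(4096, layer_idx=11)[-1].sum().item())  # 4096
```

This is also why effective context is shorter than the 8192-token maximum for non-global layers, as noted in the limitations below.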
## What this artifact cannot do

- Math: broken; hallucinates digits or loops.
- Code generation: gibberish.
- Factual grounding: hallucinates with grammatical confidence.
- Long-context retrieval: the maximum sequence length is 8192, but the 1024-token sliding window means effective context is much shorter for non-global layers.

## Why release it

To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M-parameter MoE with 3 routed experts, a 262,144-token vocabulary, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.

## Notes

Because this model was trained by [Crownelius](https://huggingface.co/Crownelius), it does not adhere to the required specifications and therefore cannot be integrated into the inference script.

## License

Apache 2.0. Use freely. Attribution appreciated but not required.

## Citation

```
@misc{shard40mv1,
  author    = {Shane (Crownelius)},
  title     = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```