# Crowfeather-50m
A 54.5M-parameter base language model. Pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens) using a Gemma-4-style alternating sliding/global attention transformer with the DeepSeek-V4 Muon optimizer. No SFT yet — this is a base LM only.
This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on @Crownelius's profile.
## Howdy from Shane

Name's Shane. I built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should: the instance died around step 18,280, and the latest cleanly saved checkpoint is step 17,500. So that's what's in this repo. Continuation will pick up on Colab from this exact checkpoint.
If you want the full backstory and trial-and-error, see the companion HF post and the older notes-fant3-and-50m-toy-2026-04 repo.
## Architecture
Transformer with two ideas pulled directly from April 2026 research:
| component | choice | source |
|---|---|---|
| attention | alternating sliding (window=1024) / global, last layer always global | Gemma 4 |
| optimizer | Muon for 2D weights, AdamW for embeddings (hybrid, adamw_lr = muon_lr/4) | DeepSeek V4 |
| LR schedule | WSD (Warmup → Stable → Decay), 20% decay phase | Apr 2026 small-LM research |
| logit stability | Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4 | both |
| embeddings | tied (input + output share) | standard |
| activation | SwiGLU MLP | standard |
| positional / norm | RoPE positional embeddings, RMSNorm | standard |
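The alternating attention pattern in the table can be sketched as per-layer boolean masks. This is a minimal NumPy illustration; `layer_masks` and the even/odd phase are assumptions for the sketch, not taken from the training code:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """True where attention is allowed. window=None means global causal;
    otherwise query i may only attend to keys in (i - window, i]."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    allowed = j <= i                 # causal constraint
    if window is not None:
        allowed &= (i - j) < window  # sliding-window constraint
    return allowed

def layer_masks(n_layers, seq_len, window):
    """Alternate sliding/global layer by layer; the last layer is always
    global (the exact alternation phase here is an assumption)."""
    return [
        causal_mask(seq_len,
                    window if (l % 2 == 0 and l != n_layers - 1) else None)
        for l in range(n_layers)
    ]
```

With this model's settings that would be `layer_masks(12, 8192, 1024)`: sliding layers keep per-layer KV attention bounded to 1024 past positions, while global layers retain the full context.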
```
vocab_size     = 8192   # BPE trained on 100k FineWeb-edu docs, deterministic
dim            = 512
n_layers       = 12
n_heads        = 8
head_dim       = 64
mlp_hidden     = 2048
max_seq_len    = 8192
sliding_window = 1024   # Gemma 4 alternating pattern

total params = 54,538,752
  embedding  =  4,194,304 (tied,  7.7%)
  attention  = 12,582,912 (23.1%)
  mlp        = 37,748,736 (69.2%)
  norms      =     12,800 (0.02%)
```
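The breakdown above is reproducible from the config, assuming dim×dim Q/K/V/O projections, a three-matrix SwiGLU MLP, and 25 RMSNorm weight vectors (two per layer plus a final norm; that split is inferred from the 12,800 figure, not read from the code):

```python
# Reproduce the stated parameter counts from the config above.
dim, n_layers, vocab, mlp_hidden = 512, 12, 8192, 2048

embedding = vocab * dim                      # tied input/output table
attention = n_layers * 4 * dim * dim         # Q, K, V, O projections
mlp       = n_layers * 3 * dim * mlp_hidden  # SwiGLU: gate, up, down
norms     = (2 * n_layers + 1) * dim         # 2 RMSNorms per layer + final

total = embedding + attention + mlp + norms
print(total)  # 54538752, matching the card
```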
## Training

| setting | value |
|---|---|
| pretrain corpus | HuggingFaceFW/fineweb-edu (default split, 2M docs streamed) |
| pretrain target | 100,000 steps |
| pretrain actual | 17,500 steps banked (~2.3B tokens ≈ 46 tokens/param, about 2.3× Chinchilla-optimal for 50M) |
| batch | 16 seqs × 4096 tokens × 2 grad-accum = effective 32 seqs (131,072 tokens/step) |
| peak LR | 2e-3 (Muon) / 5e-4 (AdamW for embeddings) |
| WSD | warmup 1500, stable to ~80%, linear decay over last 20% |
| precision | bf16 |
| hardware | NVIDIA A100 80GB (Thunder Compute) |
| wall time | ~25h before the instance terminated |
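The WSD schedule in the table reduces to a pure function of the step count. A sketch with this run's numbers (hypothetical helper; warmup 1,500 steps, flat at peak until 80% of the 100k-step target, then linear decay to zero):

```python
def wsd_lr(step, peak_lr=2e-3, warmup=1500, total=100_000, decay_frac=0.20):
    """Warmup -> Stable -> Decay learning-rate schedule (illustrative)."""
    decay_start = int(total * (1 - decay_frac))  # step 80,000 here
    if step < warmup:
        return peak_lr * step / warmup           # linear warmup
    if step < decay_start:
        return peak_lr                           # stable plateau
    remaining = max(total - step, 0)
    return peak_lr * remaining / (total - decay_start)  # linear decay to 0
```

Note that step 17,500 sits well inside the stable phase, so the banked checkpoint never entered the decay leg.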
## Performance — verbatim qualitative outputs
Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.
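For reference, the two less-standard knobs (top_p and repetition_penalty) behave roughly as below. This is a NumPy sketch of nucleus filtering and a CTRL-style repetition penalty; the repo's `generate` may implement them differently:

```python
import numpy as np

def apply_repetition_penalty(logits, prev_ids, penalty=1.3):
    """CTRL-style: shrink positive logits / push down negative logits
    of already-generated tokens, discouraging verbatim repeats."""
    out = logits.copy()
    for t in set(prev_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_p_filter(logits, top_p=0.9):
    """Keep the smallest set of tokens whose probabilities sum to
    >= top_p; mask the rest to -inf before sampling."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1   # keep at least one token
    keep = order[:cutoff]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out
```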
```
>>> The capital of France is
... said to be Rouen, a city of great historical interest. It has been sited in the palace
of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
colleges and villas, which have housed at least fifty ancient monasteries.

>>> Photosynthesis is the process by which
... a new animal becomes an animal. The species' means of survival, they are undergoing
metabolism and their deaths can be found in several different ways.

>>> Once upon a time, in a small village,
... the children of Mr. Fabien and his brothers had been on each side by nightfall and
were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...

>>> def fibonacci(n):
... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
almost no Python in it, expected failure]

>>> The three most important inventions of the 20th century were
... that of 1864, for which he received a D.S. degree and was awarded the Presidential
Medal of Freedom (1867)... [biographical pastiche, no factual grounding]
```
## Honest read
| capability | grade |
|---|---|
| English grammar | A |
| Sentence flow | A− |
| Topic-adjacent vocabulary | B+ (knows "Rhine", "monasteries" for France; "metabolism", "gills" for biology) |
| Factual accuracy | D (Paris→Rouen, photosynthesis→animal metabolism, Fibonacci→bone surgery) |
| Code | F (corpus has almost no code) |
| Long-form coherence | C+ (drifts but maintains tone) |
This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: sounds like text, doesn't know much. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.
## Intended use
Research artifact. Use cases:
- Studying small-LM training dynamics
- Recipe ablations (substitute Muon for AdamW, try different schedulers, etc.)
- Distillation source for even smaller students
- Fine-tuning on narrow domains (would benefit from adding SFT first)
Not intended for:
- Production / user-facing applications (factual accuracy too low)
- Chat use (no SFT, no chat template training)
- Code generation (no code in pretrain corpus)
## How to use
```python
import torch
from huggingface_hub import hf_hub_download

# Custom model code is required — clone or download from the companion repo:
# https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.
# Once code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))
```
## What's coming
The Crowfeather series. Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as Crowfeather-50m-vN or with descriptive suffixes. Each release gets a matching post on @Crownelius.
This first release reflects the partial Thunder run. Next up: SFT on Jackrong/GLM-5.1-Reasoning-1M-Cleaned, resuming from the included step_017500.pt on Colab.
## Citation

```bibtex
@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}
```
## Acknowledgments
- Gemma 4 (April 2026) for the alternating sliding/global attention pattern
- DeepSeek V4 (April 2026) for the Muon optimizer recipe
- Keller Jordan's Muon writeups for orthogonalization details
- HuggingFaceFW/fineweb-edu for the pretrain corpus
- Thunder Compute for the A100 hours
- CompactAI-O for the small-models-as-research-tools ethos
— Shane, April 2026