Crowfeather-50m

A 54.5M-parameter base language model. Pretrained on FineWeb-edu for 17,500 steps (~2.3B tokens) using a Gemma-4-style alternating sliding/global attention transformer with the DeepSeek-V4 Muon optimizer. No SFT yet — this is a base LM only.

This is the first checkpoint in the Crowfeather series. Each subsequent training run will land as its own model card with a matching post on @Crownelius's profile.

Howdy from Shane

Name's Shane. Built this on Thunder Compute over Apr 29-30, 2026, planning a 100k-step pretrain. Credits ran out earlier than the math said they should: the instance died around step 18,280, and the latest cleanly saved checkpoint is step 17,500. So that's what's in this repo. The continuation will pick up on Colab from this exact checkpoint.

If you want the full backstory and trial-and-error, see the companion HF post and the older notes-fant3-and-50m-toy-2026-04 repo.

Architecture

Transformer with two ideas pulled directly from April 2026 research (the rest is standard practice):

component         choice                                                                    source
attention         alternating sliding (window=1024) / global; last layer always global      Gemma 4
optimizer         Muon for 2D weights, AdamW for embeddings (hybrid, adamw_lr = muon_lr/4)  DeepSeek V4
LR schedule       WSD (Warmup → Stable → Decay), 20% decay phase                            Apr 2026 small-LM research
logit stability   Gemma-2 logit soft-cap at 30 + PaLM z-loss at 1e-4                        both
embeddings        tied (input and output share weights)                                     standard
activation        SwiGLU MLP                                                                standard
positions/norms   RoPE positional encoding, RMSNorm                                         standard
vocab_size      = 8192        (BPE on 100k FineWeb-edu docs, deterministic)
dim             = 512
n_layers        = 12
n_heads         = 8
head_dim        = 64
mlp_hidden      = 2048
max_seq_len     = 8192
sliding_window  = 1024        (Gemma 4 alternating pattern)

total params    = 54,538,752
  embedding     = 4,194,304   (tied, 7.7%)
  attention     = 12,582,912  (23.1%)
  mlp           = 37,748,736  (69.2%)
  norms         = 12,800      (0.02%)
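The breakdown above can be recomputed from the config values. A sanity-check sketch, assuming standard shapes (tied embedding table, four square attention projections, SwiGLU with three matrices, RMSNorm with one weight vector each):

```python
# Config values from the block above
vocab_size, dim, n_layers, mlp_hidden = 8192, 512, 12, 2048

embedding = vocab_size * dim                  # tied input/output table
attention = n_layers * 4 * dim * dim          # Q, K, V, O projections per layer
mlp       = n_layers * 3 * dim * mlp_hidden   # gate, up, down (SwiGLU)
norms     = (2 * n_layers + 1) * dim          # 2 RMSNorms per layer + final norm

total = embedding + attention + mlp + norms
print(total)  # 54538752, matching the breakdown above
```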

Training

pretrain corpus   HuggingFaceFW/fineweb-edu (default split, 2M docs streamed)
pretrain target   100,000 steps
pretrain actual   17,500 steps banked (~2.3B tokens ≈ 46 tokens/param, ~2.3× the Chinchilla-optimal 20 tokens/param)
batch             16 seqs × 4096 tokens × 2 grad-accum = effective batch 32 (~131k tokens/step)
peak LR           2e-3 (Muon) / 5e-4 (AdamW, embeddings)
WSD schedule      warmup 1,500 steps, stable to ~80% of the run, linear decay over the last 20%
precision         bf16
hardware          NVIDIA A100 80GB (Thunder Compute)
wall time         ~25h before the instance terminated
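The WSD schedule is simple enough to state exactly. A sketch using the numbers above (linear warmup, flat plateau, linear decay over the final 20%):

```python
def wsd_lr(step: int, peak: float = 2e-3, total: int = 100_000,
           warmup: int = 1_500, decay_frac: float = 0.2) -> float:
    """Warmup-Stable-Decay learning rate at a given step, using the
    training-table values as defaults."""
    decay_start = int(total * (1 - decay_frac))     # stable until 80% of run
    if step < warmup:
        return peak * step / warmup                 # linear warmup
    if step < decay_start:
        return peak                                 # stable plateau
    return peak * max(0.0, (total - step) / (total - decay_start))  # linear decay
```

Note that at step 17,500 the run was still on the plateau, so the banked checkpoint never entered the decay phase; a continuation can resume under the same schedule.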

Performance — verbatim qualitative outputs

Sampling defaults: T=0.8, top_p=0.9, repetition_penalty=1.3, max_new_tokens=80. Model loaded in bf16.
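For readers unfamiliar with those knobs, here is one common reading of them in plain Python: a CTRL/GPT-2-style repetition penalty plus standard nucleus (top-p) filtering. This is a sketch only; the repo's generate() may differ in details.

```python
import math, random

def sample_next(logits, prev_ids, temperature=0.8, top_p=0.9, rep_penalty=1.3):
    """Sample one next-token id from per-token logits, with the defaults above."""
    logits = list(logits)
    for t in set(prev_ids):                          # penalize already-emitted tokens
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    z = [x / temperature for x in logits]            # temperature scaling
    m = max(z)                                       # numerically stable softmax
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    probs = [e / s for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0                              # smallest prefix covering top_p mass
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in keep)               # renormalize over kept tokens
    return random.choices(keep, weights=[probs[i] / mass for i in keep])[0]
```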

>>> The capital of France is
... said to be Rouen, a city of great historical interest. It has been sited in the palace
of La Pasca, on the banks of the Rhine and Claude de Becqueret, with its many churches,
colleges and villas, which have housed at least fifty ancient monasteries.

>>> Photosynthesis is the process by which
... a new animal becomes an animal. The species' means of survival, they are undergoing
metabolism and their deaths can be found in several different ways.

>>> Once upon a time, in a small village,
... the children of Mr. Fabien and his brothers had been on each side by nightfall and
were now heading for Saturday, November 3rd. The sister-in-law was incited to his story...

>>> def fibonacci(n):
... Specific conditions for the body of a Type 1 Triangle... [non-code; FineWeb-edu has
almost no Python in it, expected failure]

>>> The three most important inventions of the 20th century were
... that of 1864, for which he received a D.S. degree and was awarded the Presidential
Medal of Freedom (1867)... [biographical pastiche, no factual grounding]

Honest read

capability                 grade
English grammar            A
Sentence flow              A−
Topic-adjacent vocabulary  B+  (knows "Rhine", "monasteries" for France; "metabolism" for biology)
Factual accuracy           D   (Paris→Rouen, photosynthesis→animal metabolism)
Code                       F   (corpus has almost no code)
Long-form coherence        C+  (drifts but maintains tone)

This is exactly what you'd expect from 17.5k pretrain steps on FineWeb-edu with no SFT: sounds like text, doesn't know much. Factual accuracy gets fixed by data + scale (or distillation), not by architecture.

Intended use

Research artifact. Use cases:

  • Studying small-LM training dynamics
  • Recipe ablations (substitute Muon for AdamW, try different schedulers, etc.)
  • Distillation source for even smaller students
  • Fine-tuning on narrow domains (would benefit from adding SFT first)

Not intended for:

  • Production / user-facing applications (factual accuracy too low)
  • Chat use (no SFT, no chat template training)
  • Code generation (no code in pretrain corpus)

How to use

import torch
from huggingface_hub import hf_hub_download
# Custom model code is required — clone or download from the companion repo:
#   https://huggingface.co/Crownelius/notes-fant3-and-50m-toy-2026-04
# or grab the toy_50m_code.tar.gz attached here.

# Once code is on PYTHONPATH:
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ckpt_path = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="step_017500.pt")
tok_path  = hf_hub_download(repo_id="Crowfeather/Crowfeather-50m", filename="tokenizer.json")

tok = load_tokenizer(tok_path)
ck = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = Config(**ck["cfg"])
m = ToyLM(cfg).to("cuda", dtype=torch.bfloat16)
m.load_state_dict(ck["model"])
m.eval()

ids = torch.tensor([tok.encode("The capital of France is").ids], device="cuda")
out = m.generate(ids, max_new_tokens=80, temperature=0.8, top_p=0.9, rep_penalty=1.3)
print(tok.decode(out[0].tolist()))

What's coming

The Crowfeather series. Every additional training run on this codebase produces a new model card here on the Hub with the corresponding checkpoint. Continuation runs (more pretrain steps, eventually SFT, eventually preference optimization) will land as Crowfeather-50m-vN or with descriptive suffixes. Each release gets a matching post on @Crownelius.

This first release reflects the partial Thunder run. Next up: SFT on Jackrong/GLM-5.1-Reasoning-1M-Cleaned, resuming from the included step_017500.pt on Colab.

Citation

@misc{crowfeather50m_2026,
  title  = {Crowfeather-50m: a partial-pretrain 54M-parameter Gemma-4/DSv4 base LM},
  author = {Shane (Crownelius)},
  year   = {2026},
  month  = {April},
  url    = {https://huggingface.co/Crowfeather/Crowfeather-50m}
}

Acknowledgments

— Shane, April 2026
