İvme-Conversate-22M-Base

İvme (Turkish: acceleration) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter, decoder-only base model trained from scratch on a dense, quality-filtered corpus.

The goal is not production deployment. The goal is to see how well a sub-25M-parameter model can perform when every decision (architecture, data mix, optimizer, training schedule) is made deliberately.

Model Details

Parameter	Value
Architecture	Decoder-only transformer
Parameters	22,028,160
Layers	10
Hidden dim	384
FFN dim	1024 (SwiGLU)
Attention heads	6 query / 2 KV (GQA)
Context length	1024 tokens
Vocab size	16,384 (custom BPE)
Positional encoding	RoPE (θ=10,000)
Normalization	RMSNorm (pre-norm)
Embeddings	Tied input/output
Biases	None
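
The parameter count in the table can be reproduced from the other rows. The sketch below assumes a head dimension of 384 / 6 = 64, a three-matrix SwiGLU block (gate, up, down), two RMSNorm weight vectors per layer plus one final RMSNorm, and no separate output projection because the embeddings are tied. The function and argument names are illustrative and are not taken from model.py.

```python
def count_parameters(vocab=16_384, d_model=384, n_layers=10,
                     d_ffn=1024, n_q_heads=6, n_kv_heads=2):
    """Count weights for a bias-free, tied-embedding GQA transformer."""
    head_dim = d_model // n_q_heads        # 64 (assumed: hidden dim / query heads)
    kv_dim = n_kv_heads * head_dim         # 128, shared by groups of 3 query heads

    embed = vocab * d_model                # counted once: input/output are tied
    attn = (d_model * d_model              # Q projection
            + 2 * d_model * kv_dim         # K and V projections (GQA)
            + d_model * d_model)           # output projection
    ffn = 3 * d_model * d_ffn              # SwiGLU: gate, up, down
    norms = 2 * d_model                    # pre-attention and pre-FFN RMSNorm
    final_norm = d_model

    return embed + n_layers * (attn + ffn + norms) + final_norm


total = count_parameters()
print(f"{total:,}")
```

Under these assumptions the total matches the 22,028,160 reported above, with the embedding matrix accounting for roughly 6.3M of it.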

Benchmarks

All benchmarks were run with EleutherAI lm-evaluation-harness, 0-shot. WikiText-2 is reported as byte_perplexity so the score can be compared across tokenizers.

Benchmark	Score	Notes
WikiText-2 (byte_perplexity) ↓	2.96	Lower is better
BLiMP ↑	61.40%	Average over 67 subtasks; random baseline 50%
ARC-Easy ↑	30.85%	acc_norm, 0-shot
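
For readers unfamiliar with the metric, byte perplexity normalizes the total log-likelihood by the number of UTF-8 bytes rather than by the number of tokens, which is why it does not depend on the tokenizer. The sketch below assumes the usual definition, exp(-sum of token log-probabilities / byte count); the helper names are hypothetical and this is not the harness's own code.

```python
import math


def byte_perplexity(token_logprobs, text):
    """exp of the negative log-likelihood per UTF-8 byte of `text`."""
    n_bytes = len(text.encode("utf-8"))
    return math.exp(-sum(token_logprobs) / n_bytes)


def bits_per_byte(byte_ppl):
    """Convert a byte perplexity into bits per byte."""
    return math.log2(byte_ppl)


# Toy check: 4 bytes scored with a total log-probability of -4*ln(2)
# means the model spent exactly one bit per byte.
toy = byte_perplexity([-2 * math.log(2), -2 * math.log(2)], "abcd")
print(toy, bits_per_byte(toy))
```

By the same conversion, the reported byte perplexity of 2.96 corresponds to roughly 1.57 bits per byte.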

Training

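Data Mix (~1.57B tokens, ~71 tokens per parameter)

Data is ordered in ascending quality for curriculum learning: the model sees noisier web text first and the densest material last. At roughly 71 tokens per parameter, the budget is well above the ~20 tokens-per-parameter rule of thumb usually associated with Chinchilla-optimal training.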

Source	Tokens	Share
epfml/FineWeb-HQ (score > 0.8)	~710M	45%
bigcode/python-stack-v1-functions-filtered	~160M	10%
HuggingFaceTB/finemath (finemath-4plus)	~235M	15%
HuggingFaceTB/cosmopedia (stanford + wikihow)	~395M	25%
wikimedia/wikipedia (EN, 20231101)	~80M	5%
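
The per-source token counts follow from the shares and the overall budget. The sketch below recomputes them and the tokens-per-parameter ratio; it takes the total as exactly 1.57B, which is an assumption since the card only gives approximate figures, and the variable names are illustrative.

```python
TOTAL_TOKENS = 1_570_000_000   # "~1.57B tokens", treated as exact for the arithmetic
N_PARAMS = 22_028_160          # from the Model Details table

# (source, share in percent), in the order the table lists them
DATA_MIX = [
    ("epfml/FineWeb-HQ", 45),
    ("bigcode/python-stack-v1-functions-filtered", 10),
    ("HuggingFaceTB/finemath", 15),
    ("HuggingFaceTB/cosmopedia", 25),
    ("wikimedia/wikipedia", 5),
]


def token_budget(total_tokens, mix):
    """Split a total token budget across sources by integer percent share."""
    return {name: total_tokens * pct // 100 for name, pct in mix}


budget = token_budget(TOTAL_TOKENS, DATA_MIX)
tokens_per_param = TOTAL_TOKENS / N_PARAMS

for name, tokens in budget.items():
    print(f"{name:45s} {tokens / 1e6:7.1f}M")
print(f"tokens per parameter: {tokens_per_param:.1f}")
```

The computed budgets land within a few million tokens of the rounded figures in the table, and the ratio comes out at about 71 tokens per parameter.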

Hyperparameters

Setting	Value
Optimizer	Muon (body weights) + AdamW (embeddings, norms)
Muon lr	0.02
AdamW lr	3e-4
LR schedule	Warmup-Stable-Decay (WSD)
Warmup steps	100
Decay fraction	20% of training
Weight decay	0.1
Gradient clipping	1.0
Effective batch	~1.05M tokens/step
Total steps	1,447
Precision	bfloat16
Attention	Flash Attention 2 (HF Kernels)
Final weights	EMA (β=0.999) of training trajectory
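
A minimal sketch of how the Warmup-Stable-Decay schedule and the EMA update could look with the values above. The card does not state the warmup or decay shape or the final learning rate, so linear warmup, linear decay, and a floor of zero are assumptions here, as are the function names.

```python
def wsd_lr(step, peak_lr, total_steps=1447, warmup_steps=100,
           decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay learning rate for a 0-indexed step."""
    decay_steps = int(total_steps * decay_frac)   # 289 steps for the defaults
    decay_start = total_steps - decay_steps       # step 1158 for the defaults
    if step < warmup_steps:
        # Linear warmup (assumed shape): reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr                            # stable phase
    # Linear decay (assumed shape) from peak_lr toward min_lr.
    progress = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress


def ema_update(ema_params, params, beta=0.999):
    """One EMA step: ema <- beta * ema + (1 - beta) * current."""
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, params)]


for s in (0, 99, 600, 1158, 1446):
    print(s, round(wsd_lr(s, peak_lr=0.02), 6))   # Muon peak lr from the table
```

The same function would be called with peak_lr=3e-4 for the AdamW parameter group. With β=0.999 the EMA averages over an effective window of about 1/(1-β) = 1,000 steps, which is a large part of a 1,447-step run.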

Hardware

Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately 20 minutes.

Tokenizer

Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.

Special tokens: <|pad|>, <|bos|>, <|eos|>, <|unk|>, <|user|>, <|assistant|>, <|system|>
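
For illustration, a tokenizer with this shape could be trained with the Hugging Face tokenizers library roughly as follows. This is a sketch, not the script used for this release: the function name, the choice to register the special tokens through the trainer, and the use of the full byte-level initial alphabet are assumptions.

```python
SPECIAL_TOKENS = ["<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
                  "<|user|>", "<|assistant|>", "<|system|>"]


def train_bpe_tokenizer(texts, vocab_size=16_384):
    """Train a ByteLevel BPE tokenizer on an iterable of strings."""
    # Imported lazily so SPECIAL_TOKENS can be reused without the dependency.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tok = Tokenizer(models.BPE(unk_token="<|unk|>"))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tok.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,            # receive the lowest ids, in order
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tok.train_from_iterator(texts, trainer=trainer)
    return tok
```

A trained tokenizer can then be written out with tok.save("ivme_tokenizer.json"), which produces the kind of file that Tokenizer.from_file reads in the usage example.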

Usage

import torch
from tokenizers import Tokenizer

# Load with custom code (not a standard HF AutoModel — see model.py)
from model import IvmeConfig, IvmeConversate

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")
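# weights_only=False unpickles arbitrary Python objects (needed for the stored cfg),
# so only load checkpoints from a source you trust.
ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)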
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn
model = IvmeConversate(cfg).cuda()
model.load_state_dict(ckpt["model"])
model.eval()

prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
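
The temperature and top_k arguments in the call above control how the next token is drawn. The internals of model.generate are not shown in this card, so the following is a generic, dependency-free sketch of temperature plus top-k sampling over a list of logits, not the implementation in model.py.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from the top_k logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the max for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_top_k([0.1, 2.0, -1.0, 0.5], temperature=0.8, top_k=2, rng=rng))
```

Lower temperatures sharpen the distribution toward the highest logit, and top_k=1 reduces to greedy decoding.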

Limitations

Base model only: not instruction-tuned, so it will not reliably follow instructions or answer questions
English only (v1)
Limited factual knowledge, as expected from a 22M-parameter model trained on ~1.57B tokens
Repetitive output at higher temperatures unless a repetition_penalty is applied
1024-token context window
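
The card mentions repetition_penalty without showing how it is applied. One common formulation, used for example by Hugging Face transformers, divides positive logits and multiplies negative logits of already generated tokens by the penalty. The sketch below follows that formulation; whether model.generate accepts such an argument, and in what form, is not confirmed by this card.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight tokens that already appear in the generated sequence."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both
        # reduce that token's probability when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1, 1], penalty=2.0))
```

The penalized logits would then be passed to the sampling step in place of the raw ones; a penalty of 1.0 leaves them unchanged.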

What's Next

İvme-Conversate-22M-Instruct — SFT on smol-smoltalk for instruction following
İvme-Conversate-v2 — extended training (~15B tokens), reordered curriculum
Turkish support — v2 will add EN+TR with a dedicated bilingual tokenizer
İvme-Classify — encoder-only series for classification tasks

Citation

@misc{ivme-conversate-22m,
  author       = {IvmeLabs},
  title        = {İvme-Conversate-22M-Base},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}

Built by IvmeLabs. Small models, deliberate choices.
