İvme-Conversate-22M-Base
İvme (Turkish: acceleration) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.
The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately.
Model Details
| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 22,028,160 |
| Layers | 10 |
| Hidden dim | 384 |
| FFN dim | 1024 (SwiGLU) |
| Attention heads | 6 query / 2 KV (GQA) |
| Context length | 1024 tokens |
| Vocab size | 16,384 (custom BPE) |
| Positional encoding | RoPE (θ=10,000) |
| Normalization | RMSNorm (pre-norm) |
| Embeddings | Tied input/output |
| Biases | None |
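
The parameter total in the table can be reproduced from the other rows. The sketch below is a back-of-the-envelope count, not the code in model.py: it assumes a head dimension of hidden dim / query heads, three SwiGLU projection matrices per block (gate, up, down), two RMSNorm weight vectors per block, and one final RMSNorm. The function and argument names are illustrative.

```python
def count_params(vocab=16_384, d_model=384, n_layers=10, d_ffn=1024,
                 n_q_heads=6, n_kv_heads=2):
    """Estimate the parameter count from the Model Details table."""
    head_dim = d_model // n_q_heads        # 384 / 6 = 64 (assumed)
    kv_dim = n_kv_heads * head_dim         # 2 KV heads * 64 = 128

    embed = vocab * d_model                # counted once: input/output tied
    attn = 2 * d_model * d_model           # query and output projections
    attn += 2 * d_model * kv_dim           # key and value projections (GQA)
    ffn = 3 * d_model * d_ffn              # SwiGLU: gate, up, down (assumed)
    norms = 2 * d_model                    # two RMSNorms per block (assumed)
    final_norm = d_model                   # one RMSNorm before the LM head

    return embed + n_layers * (attn + ffn + norms) + final_norm


print(f"{count_params():,}")  # 22,028,160
```

Under these assumptions the count matches the table exactly, with about 6.3M of the 22M parameters sitting in the tied embedding matrix.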
Benchmarks
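All benchmarks were run with the EleutherAI lm-evaluation-harness, 0-shot. WikiText-2 is reported as byte_perplexity, which normalizes by bytes rather than tokens, so the score stays comparable across models with different tokenizers.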
| Benchmark | Score | Notes |
|---|---|---|
| WikiText-2 (byte_perplexity) ↓ | 2.96 | Lower is better |
| BLiMP ↑ | 61.40% | Average over 67 subtasks; random baseline 50% |
| ARC-Easy ↑ | 30.85% | acc_norm, 0-shot |
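
For readers more used to bits per byte, the byte-level perplexity above converts directly. The sketch below assumes the usual definition (the exponential of the total negative log-likelihood in nats divided by the number of UTF-8 bytes in the evaluated text); the function names are illustrative and are not part of the harness API.

```python
import math


def byte_perplexity(total_nll_nats: float, n_bytes: int) -> float:
    """exp(average negative log-likelihood per byte)."""
    return math.exp(total_nll_nats / n_bytes)


def bits_per_byte(byte_ppl: float) -> float:
    """Convert a byte-level perplexity to bits per byte."""
    return math.log2(byte_ppl)


# The WikiText-2 score from the table, expressed in bits per byte.
print(round(bits_per_byte(2.96), 2))  # about 1.57
```

Because the denominator is a byte count, a model with a larger or smaller vocabulary is scored on the same footing.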
Training
Data Mix (~1.57B tokens, Chinchilla-optimal)
Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last.
| Source | Tokens | Share |
|---|---|---|
| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
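
As a sanity check on the table, the shares and the tokens-per-parameter ratio can be recomputed from the rounded token counts. This is only arithmetic on the numbers listed above; the dictionary keys are shortened labels, and the rounded rows sum to about 1.58B, slightly above the ~1.57B in the heading.

```python
# Approximate token counts from the data mix table, in millions.
MIX_MTOK = {
    "FineWeb-HQ": 710,
    "python-stack-functions": 160,
    "finemath-4plus": 235,
    "cosmopedia": 395,
    "wikipedia-en": 80,
}
N_PARAMS = 22_028_160  # from the Model Details table

total_mtok = sum(MIX_MTOK.values())
shares = {name: round(100 * tok / total_mtok) for name, tok in MIX_MTOK.items()}
tokens_per_param = total_mtok * 1e6 / N_PARAMS

print(total_mtok)                   # 1580
print(shares)                       # shares in percent, matching the table
print(round(tokens_per_param, 1))   # roughly 72 tokens per parameter
```

The ratio comes out at roughly 72 tokens per parameter, or about 100 per non-embedding parameter. Whether that counts as Chinchilla-optimal depends on the convention used; the frequently quoted rule of thumb is about 20 tokens per parameter.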
Hyperparameters
| Setting | Value |
|---|---|
| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
| Muon lr | 0.02 |
| AdamW lr | 3e-4 |
| LR schedule | Warmup-Stable-Decay (WSD) |
| Warmup steps | 100 |
| Decay fraction | 20% of training |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Effective batch | ~1.05M tokens/step |
| Total steps | 1,447 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 (HF Kernels) |
| Final weights | EMA (β=0.999) of training trajectory |
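
The Warmup-Stable-Decay schedule in the table can be sketched as a simple function of the step index. The card gives the warmup length, decay fraction, and total steps but not the curve shapes, so the linear warmup and linear decay to zero below are assumptions, and `wsd_lr` is an illustrative name rather than a function from the training code.

```python
def wsd_lr(step: int, peak_lr: float, total_steps: int = 1447,
           warmup_steps: int = 100, decay_frac: float = 0.2) -> float:
    """Learning rate at a 0-indexed step under a Warmup-Stable-Decay schedule."""
    decay_steps = int(total_steps * decay_frac)   # 289 steps with the values above
    decay_start = total_steps - decay_steps       # step 1158

    if step < warmup_steps:
        # Linear warmup (assumed shape): reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay (assumed shape) toward zero over the final 20% of training.
    return peak_lr * (total_steps - step) / decay_steps


# The same schedule shape applied to both optimizers' peak rates.
for step in (0, 99, 600, 1158, 1446):
    print(step, wsd_lr(step, 0.02), wsd_lr(step, 3e-4))
```

With these settings the stable phase covers steps 100 through 1157, and the decay occupies the last 289 steps.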
Hardware
Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately 20 minutes.
Tokenizer
Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.
Special tokens: <|pad|>, <|bos|>, <|eos|>, <|unk|>, <|user|>, <|assistant|>, <|system|>
Usage
```python
import torch
from tokenizers import Tokenizer

# Custom code, not a standard HF AutoModel (see model.py)
from model import IvmeConfig, IvmeConversate

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")

# weights_only=False runs the full unpickler (ckpt["cfg"] is a Python object),
# so only load checkpoints you trust.
ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn

model = IvmeConversate(cfg).cuda()
model.load_state_dict(ckpt["model"])
model.eval()

# Encode a prompt, sample a continuation, and decode it back to text
prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
```
Limitations
- Base model only — not instruction tuned, will not follow instructions or answer questions
- English only (v1)
- Limited factual knowledge: the model has only 22M parameters and saw only ~1.57B training tokens
- Repetition at higher temperatures without repetition_penalty
- 1024 token context window
What's Next
- İvme-Conversate-22M-Instruct — SFT on smol-smoltalk for instruction following
- İvme-Conversate-v2 — extended training (~15B tokens), reordered curriculum
- Turkish support — v2 will add EN+TR with a dedicated bilingual tokenizer
- İvme-Classify — encoder-only series for classification tasks
Citation
```bibtex
@misc{ivme-conversate-22m,
  author    = {IvmeLabs},
  title     = {İvme-Conversate-22M-Base},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}
```
Built by IvmeLabs. Small models, deliberate choices.
