# Mixture of Recursion Language Model (198M, Adaptive Computation)
A novel 198M-parameter conversational language model featuring adaptive recursive computation through perplexity-based dynamic routing.

Built from scratch: no pretrained base model, no fine-tuning. Trained on 150K conversational samples on a single Kaggle T4 GPU.
## 🔥 Novel Architecture: Mixture of Recursion (MoR)
Traditional transformers apply the same depth to every input: simple and complex sentences both go through all N layers. This wastes compute on easy inputs and under-processes hard ones.

MoR fixes this with self-supervised perplexity-guided routing:

- High PPL (>50) → model struggling → 5 recursive steps
- Mid PPL (20–50) → uncertain → 3 recursive steps
- Low PPL (<20) → model confident → 1 recursive step

No manual labels needed: the router learns difficulty directly from the model's own perplexity signal during training.
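The thresholds above can be sketched as a tiny lookup (a hypothetical illustration only; the actual router is a learned MLP, and `recursion_steps` is not a function from this repo):

```python
def recursion_steps(ppl: float) -> int:
    """Map a sample's perplexity to a recursion depth (1, 3, or 5)."""
    if ppl > 50:
        return 5   # high PPL: model struggling
    if ppl >= 20:
        return 3   # mid PPL: uncertain
    return 1       # low PPL: confident

print(recursion_steps(60.0))  # 5
print(recursion_steps(12.0))  # 1
```

At inference time the learned router predicts this bucket directly, so no perplexity needs to be computed on the fly.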
### Key Components
| Component | Description | Params |
|---|---|---|
| Token Embedding | GPT-2 BPE vocab (50,260) + special tokens | ~39M |
| Base Transformer | 16 layers, RoPE, pre-norm, NaN-safe attention | ~149M |
| Perplexity Router | 2-layer MLP (768→384→3), self-supervised | ~1.2M |
| Recursive Layer | Shared transformer block, reused 1/3/5× | ~7M |
| **Total** | | ~198M |
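The Recursive Layer row is the core weight-sharing trick: one transformer block applied 1, 3, or 5 times. A minimal sketch (block internals simplified; `RecursiveLayer` is an illustrative name, not the repo's class):

```python
import torch
from torch import nn

class RecursiveLayer(nn.Module):
    """One shared transformer block, reused a variable number of times."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True)

    def forward(self, x: torch.Tensor, steps: int) -> torch.Tensor:
        for _ in range(steps):   # same weights each pass: 1/3/5x extra depth
            x = self.block(x)
        return x

layer = RecursiveLayer()
out = layer(torch.randn(2, 16, 768), steps=3)
print(out.shape)  # torch.Size([2, 16, 768])
```

Because the weights are shared across iterations, extra recursion depth costs compute but no additional parameters, which is why this component stays at ~7M.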
### NaN-Safe FP16 Training

Standard `-inf` masking causes NaN during fp16 mixed-precision training. MoR uses `-1e4` masking plus pre-softmax clamping, which keeps attention completely stable:

```python
scores = scores.masked_fill(attn_mask, -1e4)   # -1e4 instead of -inf
scores = scores.clamp(min=-1e4, max=1e4)
attn = F.softmax(scores, dim=-1)
attn = torch.nan_to_num(attn, nan=0.0)
```

Result: 0 NaN batches across all 150K training steps.
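A quick CPU demonstration of why this matters: a fully masked row under `-inf` masking produces NaN from softmax, while the `-1e4` version stays finite (a minimal sketch, not the repo's attention code):

```python
import torch
import torch.nn.functional as F

inf_row = torch.full((1, 4), float("-inf"))           # fully masked row, -inf style
print(torch.isnan(F.softmax(inf_row, dim=-1)).any())  # tensor(True)

safe_row = torch.full((1, 4), -1e4)                   # same row, -1e4 style
safe_row = safe_row.clamp(min=-1e4, max=1e4)
attn = torch.nan_to_num(F.softmax(safe_row, dim=-1), nan=0.0)
print(torch.isnan(attn).any())                        # tensor(False)
```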
## 📊 Performance: Honest Evaluation
### Training Val Set (in-distribution)
Evaluated on 2,000 held-out samples from the same conversational distribution as training data (HH-RLHF + UltraChat + Alpaca-GPT4):
| Epoch | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 1 | 4.5081 | 3.0798 | 21.75 |
| 2 | 3.3068 | 2.7326 | 15.37 |
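For reference, the Val PPL column is just the exponential of the validation cross-entropy loss:

```python
import math

print(round(math.exp(3.0798), 2))  # epoch-1 val loss -> 21.75
print(round(math.exp(2.7326), 2))  # epoch-2 val loss -> 15.37
```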
### Fresh Test Set (out-of-distribution)
Evaluated on 500 fresh HH-RLHF test samples (never seen during training). GPT-2 Medium evaluated on identical samples for fair comparison:
| Model | Params | PPL (fresh test) | Notes |
|---|---|---|---|
| GPT-2 Medium | 345M | 26.89 | OpenAI baseline |
| MoR 198M | 198M | 43.21 | This model |
Honest note: MoR scores higher PPL on the fresh test set than GPT-2 Medium. The 15.37 val PPL was measured on in-distribution data β the model shows signs of overfitting to the training distribution. The architecture is novel and the routing mechanism works correctly, but the model needs more diverse training data and longer training to generalize better.
The val PPL (15.37) should not be directly compared to GPT-2 Medium's standard benchmark PPL β they are measured on different datasets.
### What the Results Actually Show
- ✅ Architecture works: training is stable, loss decreases
- ✅ Router works: 0 NaN batches, routing signals valid
- ✅ Novel contribution: perplexity-based self-supervised routing
- ⚠️ Generalization gap: val PPL 15.37 vs test PPL 43.21
- ⚠️ More data needed: 150K samples is small for this model size
## 🚀 Quick Start
```bash
pip install transformers torch
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```
### Chat Format (required)

The model was trained with a specific chat format, so always use it:
```python
def chat(question, max_new_tokens=150, temperature=0.7, top_p=0.9):
    prompt = f"<|user|>\n{question}\n<|assistant|>\n"
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
        )
    full = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return full.split("<|assistant|>")[-1].strip() if "<|assistant|>" in full else full

print(chat("What is machine learning?"))
print(chat("Explain neural networks simply"))
```
## 📚 Training Details
### Dataset: 150K Conversational Samples
| Dataset | Samples | % | Description |
|---|---|---|---|
| Anthropic HH-RLHF | 80,000 | 53% | Helpful & harmless human feedback |
| UltraChat | 50,000 | 33% | GPT-4 multi-turn dialogues |
| Alpaca-GPT4 | 20,000 | 14% | Instruction following |
| Validation | 2,000 | – | Held-out (same distribution) |
### Training Config

```
GPU            : NVIDIA Tesla T4 (15.6 GB VRAM)
Platform       : Kaggle (single GPU)
Epochs         : 2
Total steps    : 150,000
Training time  : ~9h 12m
Batch size     : 2
Grad accum     : 32 (effective batch = 64)
Max seq len    : 512
Learning rate  : 1e-4
LR schedule    : Linear warmup + cosine decay
Warmup steps   : 500
Optimizer      : AdamW (β1=0.9, β2=0.95, ε=1e-8)
Weight decay   : 0.01
Grad clip      : 1.0
Mixed precision: FP16 (AMP)

Loss: total = LM loss + 0.1 × router loss
```
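The FP16 pieces of this config fit together roughly as in the following single-step sketch (a minimal illustration with placeholder `model`, `x`, `y`; this is not the repo's actual training loop):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()          # AMP becomes a no-op on CPU
model = nn.Linear(16, 4)                     # tiny stand-in for the 198M model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4,
                        betas=(0.9, 0.95), eps=1e-8, weight_decay=0.01)
scaler = GradScaler(enabled=use_amp)

x, y = torch.randn(2, 16), torch.randint(0, 4, (2,))      # batch size 2
with autocast(enabled=use_amp):
    loss = nn.functional.cross_entropy(model(x), y) / 32  # grad accum over 32
scaler.scale(loss).backward()
scaler.unscale_(opt)                                      # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad clip 1.0
scaler.step(opt)
scaler.update()
opt.zero_grad(set_to_none=True)
```

In the real loop the optimizer step runs only once every 32 micro-batches; here it is shown inline for brevity.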
## 💡 Technical Innovation

### Self-Supervised Perplexity Routing
```python
# During training: no manual labels needed!
sample_ppl = torch.exp(per_sample_ce_loss)   # per-sample perplexity

if sample_ppl < 20:
    pseudo_label = 0   # simple  -> 1 recursive step
elif sample_ppl < 50:
    pseudo_label = 1   # medium  -> 3 recursive steps
else:
    pseudo_label = 2   # complex -> 5 recursive steps

router_loss = F.cross_entropy(router_logits, pseudo_label)
total_loss = lm_loss + 0.1 * router_loss
```
Why it works:

- The router learns difficulty from the model's own performance
- As training progresses, more samples become "simple": a natural curriculum
- No annotation cost, no bias from human difficulty ratings
## What I Learned / What to Improve Next

This model is my first from-scratch LLM, and I learned a lot.
What worked:

- ✅ NaN-safe attention masking (-1e4)
- ✅ Self-supervised routing signal
- ✅ Stable FP16 training (0 NaN batches)
- ✅ Perplexity drops meaningfully across epochs
What needs improvement:

- ⚠️ More diverse training data (150K samples is small)
- ⚠️ Longer training (2 epochs is not enough)
- ⚠️ Stronger regularization to prevent overfitting
- ⚠️ Better evaluation on neutral benchmarks from day 1
Next project: BharatMorph applies the lessons learned here to Indian languages, with PPL-gated dynamic recursion (quality is checked after each step rather than pre-decided) and morpheme-aware embeddings for agglutinative Indian language structure.
## ⚠️ Limitations
- Context window: 512 tokens max
- English only: not suitable for other languages
- Small training set: 150K samples vs 100B+ for commercial models
- Generalization: shows overfitting to the training distribution
- Repetition: may loop in very long generations (>200 tokens)
- Factual accuracy: may hallucinate; do not rely on it for facts
## 🎯 Intended Use
| Use case | Suitable? |
|---|---|
| Research on adaptive computation | ✅ Yes |
| Learning how LLMs are built from scratch | ✅ Yes |
| Prototyping conversational AI | ⚠️ With caveats |
| Production chatbots | ❌ No |
| Medical / legal / financial advice | ❌ No |
| Factual question answering | ❌ No |
## 📝 Citation

```bibtex
@misc{girinath2026mor,
  author       = {Girinath V},
  title        = {Mixture of Recursion: Self-Supervised Perplexity-Guided
                  Adaptive Computation for Language Models},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Girinath11/recursive-language-model-198m}},
  note         = {198M parameter LM trained from scratch with adaptive
                  recursive computation via perplexity-based routing}
}
```
## 🙏 Acknowledgments

- Anthropic: HH-RLHF dataset
- Tsinghua University: UltraChat dataset
- Vicgalle: Alpaca-GPT4 dataset
- Hugging Face: Transformers library
- Kaggle: free GPU access
- Model status: ✅ Research / educational use
- Last updated: March 2026
- License: MIT