# Mixture of Recursion Language Model (198M, Adaptive Computation)
A novel 198M-parameter conversational language model featuring adaptive recursive computation through perplexity-based dynamic routing.

Built from scratch: no pretrained base model, no fine-tuning. Trained on 150K conversational samples on a single Kaggle T4 GPU.
## 🔥 Novel Architecture: Mixture of Recursion (MoR)
Traditional transformers apply the same depth to every input: simple and complex sentences both go through all N layers. This wastes compute on easy inputs and under-processes hard ones.

MoR fixes this with self-supervised perplexity-guided routing:

- High PPL (>50) → model struggling → 5 recursive steps
- Mid PPL (20–50) → uncertain → 3 recursive steps
- Low PPL (<20) → model confident → 1 recursive step

No manual labels needed: the router learns difficulty directly from the model's own perplexity signal during training.
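The thresholds above can be sketched as a tiny lookup (a hypothetical illustration only; the actual router is a learned MLP, and `recursion_steps` is not a function from this repo):

```python
def recursion_steps(ppl: float) -> int:
    """Map a sample's perplexity to a recursion depth (1, 3, or 5)."""
    if ppl > 50:
        return 5   # high PPL: model struggling
    if ppl >= 20:
        return 3   # mid PPL: uncertain
    return 1       # low PPL: confident

print(recursion_steps(60.0))  # 5
print(recursion_steps(12.0))  # 1
```

At inference time the learned router predicts this bucket directly, so no perplexity needs to be computed on the fly.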
### Key Components
| Component | Description | Params |
|---|---|---|
| Token Embedding | GPT-2 BPE vocab (50,260) + special tokens | ~39M |
| Base Transformer | 16 layers, RoPE, pre-norm, NaN-safe attention | ~149M |
| Perplexity Router | 2-layer MLP (768→384→3), self-supervised | ~1.2M |
| Recursive Layer | Shared transformer block, reused 1/3/5× | ~7M |
| **Total** | | ~198M |
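The Recursive Layer row is the core weight-sharing trick: one transformer block applied 1, 3, or 5 times. A minimal sketch (block internals simplified; `RecursiveLayer` is an illustrative name, not the repo's class):

```python
import torch
from torch import nn

class RecursiveLayer(nn.Module):
    """One shared transformer block, reused a variable number of times."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True)

    def forward(self, x: torch.Tensor, steps: int) -> torch.Tensor:
        for _ in range(steps):   # same weights each pass: 1/3/5x extra depth
            x = self.block(x)
        return x

layer = RecursiveLayer()
out = layer(torch.randn(2, 16, 768), steps=3)
print(out.shape)  # torch.Size([2, 16, 768])
```

Because the weights are shared across iterations, extra recursion depth costs compute but no additional parameters, which is why this component stays at ~7M.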
### NaN-Safe FP16 Training

Standard `-inf` masking causes NaN during fp16 mixed-precision training. MoR uses `-1e4` masking plus pre-softmax clamping, which keeps attention completely stable:

```python
scores = scores.masked_fill(attn_mask, -1e4)   # -1e4 instead of -inf
scores = scores.clamp(min=-1e4, max=1e4)
attn = F.softmax(scores, dim=-1)
attn = torch.nan_to_num(attn, nan=0.0)
```

Result: 0 NaN batches across all 150K training steps.
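A quick CPU demonstration of why this matters: a fully masked row under `-inf` masking produces NaN from softmax, while the `-1e4` version stays finite (a minimal sketch, not the repo's attention code):

```python
import torch
import torch.nn.functional as F

inf_row = torch.full((1, 4), float("-inf"))           # fully masked row, -inf style
print(torch.isnan(F.softmax(inf_row, dim=-1)).any())  # tensor(True)

safe_row = torch.full((1, 4), -1e4)                   # same row, -1e4 style
safe_row = safe_row.clamp(min=-1e4, max=1e4)
attn = torch.nan_to_num(F.softmax(safe_row, dim=-1), nan=0.0)
print(torch.isnan(attn).any())                        # tensor(False)
```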
## 📊 Performance: Honest Evaluation
### Training Val Set (in-distribution)
Evaluated on 2,000 held-out samples from the same conversational distribution as training data (HH-RLHF + UltraChat + Alpaca-GPT4):
| Epoch | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 1 | 4.5081 | 3.0798 | 21.75 |
| 2 | 3.3068 | 2.7326 | 15.37 |
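For reference, the Val PPL column is just the exponential of the validation cross-entropy loss:

```python
import math

print(round(math.exp(3.0798), 2))  # epoch-1 val loss -> 21.75
print(round(math.exp(2.7326), 2))  # epoch-2 val loss -> 15.37
```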
### Fresh Test Set (out-of-distribution)
Evaluated on 500 fresh HH-RLHF test samples (never seen during training). GPT-2 Medium evaluated on identical samples for fair comparison:
| Model | Params | PPL (fresh test) | Notes |
|---|---|---|---|
| GPT-2 Medium | 345M | 26.89 | OpenAI baseline |
| MoR 198M | 198M | 43.21 | This model |
Honest note: MoR scores higher PPL on the fresh test set than GPT-2 Medium. The 15.37 val PPL was measured on in-distribution data β the model shows signs of overfitting to the training distribution. The architecture is novel and the routing mechanism works correctly, but the model needs more diverse training data and longer training to generalize better.
The val PPL (15.37) should not be directly compared to GPT-2 Medium's standard benchmark PPL β they are measured on different datasets.
### What the Results Actually Show
- ✅ Architecture works: training is stable, loss decreases
- ✅ Router works: 0 NaN batches, routing signals valid
- ✅ Novel contribution: perplexity-based self-supervised routing
- ⚠️ Generalization gap: val PPL 15.37 vs test PPL 43.21
- ⚠️ More data needed: 150K samples is small for this model size
## 🚀 Quick Start
```bash
pip install transformers torch
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```
### Chat Format (required)

The model was trained with a specific chat format, so always use it:
```python
def chat(question, max_new_tokens=150, temperature=0.7, top_p=0.9):
    prompt = f"<|user|>\n{question}\n<|assistant|>\n"
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
        )
    full = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return full.split("<|assistant|>")[-1].strip() if "<|assistant|>" in full else full

print(chat("What is machine learning?"))
print(chat("Explain neural networks simply"))
```
## 📚 Training Details
### Dataset: 150K Conversational Samples
| Dataset | Samples | % | Description |
|---|---|---|---|
| Anthropic HH-RLHF | 80,000 | 53% | Helpful & harmless human feedback |
| UltraChat | 50,000 | 33% | GPT-4 multi-turn dialogues |
| Alpaca-GPT4 | 20,000 | 14% | Instruction following |
| Validation | 2,000 | – | Held-out (same distribution) |
### Training Config

```
GPU            : NVIDIA Tesla T4 (15.6 GB VRAM)
Platform       : Kaggle (single GPU)
Epochs         : 2
Total steps    : 150,000
Training time  : ~9h 12m
Batch size     : 2
Grad accum     : 32 (effective batch = 64)
Max seq len    : 512
Learning rate  : 1e-4
LR schedule    : Linear warmup + cosine decay
Warmup steps   : 500
Optimizer      : AdamW (β1=0.9, β2=0.95, ε=1e-8)
Weight decay   : 0.01
Grad clip      : 1.0
Mixed precision: FP16 (AMP)

Loss: total = LM loss + 0.1 × router loss
```
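The FP16 pieces of this config fit together roughly as in the following single-step sketch (a minimal illustration with placeholder `model`, `x`, `y`; this is not the repo's actual training loop):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()          # AMP becomes a no-op on CPU
model = nn.Linear(16, 4)                     # tiny stand-in for the 198M model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4,
                        betas=(0.9, 0.95), eps=1e-8, weight_decay=0.01)
scaler = GradScaler(enabled=use_amp)

x, y = torch.randn(2, 16), torch.randint(0, 4, (2,))      # batch size 2
with autocast(enabled=use_amp):
    loss = nn.functional.cross_entropy(model(x), y) / 32  # grad accum over 32
scaler.scale(loss).backward()
scaler.unscale_(opt)                                      # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad clip 1.0
scaler.step(opt)
scaler.update()
opt.zero_grad(set_to_none=True)
```

In the real loop the optimizer step runs only once every 32 micro-batches; here it is shown inline for brevity.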
## 💡 Technical Innovation

### Self-Supervised Perplexity Routing
```python
# During training: no manual labels needed!
sample_ppl = torch.exp(per_sample_ce_loss)   # per-sample perplexity

if sample_ppl < 20:
    pseudo_label = 0   # simple  -> 1 recursive step
elif sample_ppl < 50:
    pseudo_label = 1   # medium  -> 3 recursive steps
else:
    pseudo_label = 2   # complex -> 5 recursive steps

router_loss = F.cross_entropy(router_logits, pseudo_label)
total_loss = lm_loss + 0.1 * router_loss
```
Why it works:

- The router learns difficulty from the model's own performance
- As training progresses, more samples become "simple": a natural curriculum
- No annotation cost, no bias from human difficulty ratings
## What I Learned / What to Improve Next

This model is my first from-scratch LLM, and I learned a lot.
What worked:

- ✅ NaN-safe attention masking (-1e4)
- ✅ Self-supervised routing signal
- ✅ Stable FP16 training (0 NaN batches)
- ✅ Perplexity drops meaningfully across epochs
What needs improvement:

- ⚠️ More diverse training data (150K samples is small)
- ⚠️ Longer training (2 epochs is not enough)
- ⚠️ Stronger regularization to prevent overfitting
- ⚠️ Better evaluation on neutral benchmarks from day 1
Next project: BharatMorph applies the lessons learned here to Indian languages, with PPL-gated dynamic recursion (quality is checked after each step rather than pre-decided) and morpheme-aware embeddings for agglutinative Indian language structure.
## ⚠️ Limitations
- Context window: 512 tokens max
- English only: not suitable for other languages
- Small training set: 150K samples vs 100B+ for commercial models
- Generalization: shows overfitting to the training distribution
- Repetition: may loop in very long generations (>200 tokens)
- Factual accuracy: may hallucinate; do not rely on it for facts
## 🎯 Intended Use
| Use case | Suitable? |
|---|---|
| Research on adaptive computation | ✅ Yes |
| Learning how LLMs are built from scratch | ✅ Yes |
| Prototyping conversational AI | ⚠️ With caveats |
| Production chatbots | ❌ No |
| Medical / legal / financial advice | ❌ No |
| Factual question answering | ❌ No |
## 📝 Citation

```bibtex
@misc{girinath2026mor,
  author       = {Girinath V},
  title        = {Mixture of Recursion: Self-Supervised Perplexity-Guided
                  Adaptive Computation for Language Models},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Girinath11/recursive-language-model-198m}},
  note         = {198M parameter LM trained from scratch with adaptive
                  recursive computation via perplexity-based routing}
}
```
## 🙏 Acknowledgments

- Anthropic: HH-RLHF dataset
- Tsinghua University: UltraChat dataset
- Vicgalle: Alpaca-GPT4 dataset
- Hugging Face: Transformers library
- Kaggle: free GPU access
- Model status: ✅ Research / educational use
- Last updated: March 2026
- License: MIT