Mixture of Recursion Language Model β€” 198M (Adaptive Computation)

A novel 198M parameter conversational language model featuring adaptive recursive computation through perplexity-based dynamic routing.

Built from scratch β€” no pretrained base model, no fine-tuning. Trained on 150K conversational samples on a single Kaggle T4 GPU.


πŸ”₯ Novel Architecture: Mixture of Recursion (MoR)

Traditional transformers apply the same depth to every input β€” simple and complex sentences both go through all N layers. This wastes compute on easy inputs and under-processes hard ones.

MoR fixes this with self-supervised perplexity-guided routing:

High PPL (>50)   β†’ Model struggling  β†’ 5 recursive steps
Mid PPL (20–50)  β†’ Uncertain         β†’ 3 recursive steps  
Low PPL (<20)    β†’ Model confident   β†’ 1 recursive step

No manual labels needed β€” the router learns difficulty directly from the model's own perplexity signal during training.
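The routing rule above can be sketched in a few lines. These function names are illustrative, not the model's actual API:

```python
def routing_depth(sample_ppl: float) -> int:
    """Map a sample's perplexity to a recursion depth, per the thresholds above."""
    if sample_ppl < 20:
        return 1   # model confident  -> 1 recursive step
    elif sample_ppl <= 50:
        return 3   # uncertain        -> 3 recursive steps
    else:
        return 5   # model struggling -> 5 recursive steps

def recursive_forward(shared_block, hidden, depth):
    """Apply the shared transformer block `depth` times (same weights each pass)."""
    for _ in range(depth):
        hidden = shared_block(hidden)
    return hidden
```

Because the recursive layer is a single shared block, deeper routes cost more compute but no extra parameters.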

Key Components

Component           Description                                      Params
Token Embedding     GPT-2 BPE vocab (50,260) + special tokens        ~39M
Base Transformer    16 layers, RoPE, pre-norm, NaN-safe attention    ~149M
Perplexity Router   2-layer MLP (768→384→3), self-supervised         ~1.2M
Recursive Layer     Shared transformer block, reused 1/3/5×          ~7M
Total                                                                ~198M
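As a sanity check on the table, the embedding row can be reproduced from the vocabulary size and a hidden width of 768 (an assumption taken from the router's 768-d input):

```python
vocab_size = 50_260   # GPT-2 BPE vocab + special tokens, per the table
d_model    = 768      # assumed hidden width, matching the router's 768-d input

embedding_params = vocab_size * d_model
print(f"~{embedding_params / 1e6:.1f}M")   # ~38.6M, i.e. the table's ~39M
```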

NaN-Safe FP16 Training

Standard -inf masking can produce NaN during fp16 mixed-precision training (for example, when softmax sees a fully masked row). MoR instead masks with -1e4 and clamps scores before the softmax, which keeps training completely stable:

scores = scores.clamp(min=-1e4, max=1e4)   # keep logits in an fp16-safe range
attn   = F.softmax(scores, dim=-1)
attn   = torch.nan_to_num(attn, nan=0.0)   # scrub any residual NaNs

Result: 0 NaN batches across all 150K training steps.


πŸ“Š Performance β€” Honest Evaluation

Training Val Set (in-distribution)

Evaluated on 2,000 held-out samples from the same conversational distribution as training data (HH-RLHF + UltraChat + Alpaca-GPT4):

Epoch   Train Loss   Val Loss   Val PPL
1       4.5081       3.0798     21.75
2       3.3068       2.7326     15.37
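The Val PPL column is just the exponential of the Val Loss column; a quick check against the table's values:

```python
import math

# Perplexity = exp(mean cross-entropy loss)
for val_loss, reported_ppl in [(3.0798, 21.75), (2.7326, 15.37)]:
    assert abs(math.exp(val_loss) - reported_ppl) < 0.01
```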

Fresh Test Set (out-of-distribution)

Evaluated on 500 fresh HH-RLHF test samples (never seen during training). GPT-2 Medium evaluated on identical samples for fair comparison:

Model          Params   PPL (fresh test)   Notes
GPT-2 Medium   345M     26.89              OpenAI baseline
MoR 198M       198M     43.21              This model

Honest note: MoR scores higher PPL on the fresh test set than GPT-2 Medium. The 15.37 val PPL was measured on in-distribution data β€” the model shows signs of overfitting to the training distribution. The architecture is novel and the routing mechanism works correctly, but the model needs more diverse training data and longer training to generalize better.

The val PPL (15.37) should not be directly compared to GPT-2 Medium's standard benchmark PPL β€” they are measured on different datasets.

What the Results Actually Show

βœ… Architecture works   β€” training is stable, loss decreases
βœ… Router works         β€” 0 NaN batches, routing signals valid
βœ… Novel contribution   β€” perplexity-based self-supervised routing
⚠️  Generalization gap  β€” val PPL 15.37 vs test PPL 43.21
⚠️  More data needed    β€” 150K samples is small for this model size

πŸš€ Quick Start

pip install transformers torch

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = model.to(device).eval()

Chat Format (required)

The model was trained with a specific chat format β€” always use this:

def chat(question, max_new_tokens=150, temperature=0.7, top_p=0.9):
    prompt = f"<|user|>\n{question}\n<|assistant|>\n"

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            attention_mask = inputs["attention_mask"],
            max_new_tokens = max_new_tokens,
            temperature    = temperature,
            top_p          = top_p,
            do_sample      = True,
            pad_token_id   = tokenizer.eos_token_id,
        )

    full = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return full.split("<|assistant|>")[-1].strip() \
           if "<|assistant|>" in full else full

print(chat("What is machine learning?"))
print(chat("Explain neural networks simply"))

πŸ“š Training Details

Dataset β€” 150K Conversational Samples

Dataset             Samples   %     Description
Anthropic HH-RLHF   80,000    53%   Helpful & harmless human feedback
UltraChat           50,000    33%   GPT-4 multi-turn dialogues
Alpaca-GPT4         20,000    14%   Instruction following
Validation          2,000     —     Held-out (same distribution)

Training Config

GPU            : NVIDIA Tesla T4 (15.6 GB VRAM)
Platform       : Kaggle (single GPU)
Epochs         : 2
Total steps    : 150,000
Training time  : ~9h 12m

Batch size     : 2
Grad accum     : 32  (effective batch = 64)
Max seq len    : 512
Learning rate  : 1e-4
LR schedule    : Linear warmup + Cosine decay
Warmup steps   : 500
Optimizer      : AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
Weight decay   : 0.01
Grad clip      : 1.0
Mixed precision: FP16 (AMP)
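One plausible implementation of the schedule above; the exact decay floor and curve are assumptions, since the config only names "linear warmup + cosine decay":

```python
import math

def lr_at(step, base_lr=1e-4, warmup=500, total_steps=150_000):
    """Linear warmup for `warmup` steps, then cosine decay toward 0."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```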

Loss:
  Total = LM loss + 0.1 Γ— Router loss
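Putting the config together, one optimizer step looks roughly like this; the toy linear model and squared-error loss stand in for the real forward pass, and AMP is omitted for brevity:

```python
import torch

micro_bs, accum = 2, 32                      # effective batch = 2 * 32 = 64
model = torch.nn.Linear(16, 16)              # toy stand-in for the 198M MoR model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95),
                        eps=1e-8, weight_decay=0.01)

opt.zero_grad()
for _ in range(accum):                       # accumulate grads over 32 micro-batches
    x = torch.randn(micro_bs, 16)
    lm_loss = model(x).pow(2).mean()         # placeholder for the LM cross-entropy
    router_loss = torch.tensor(0.0)          # placeholder for the router CE term
    loss = (lm_loss + 0.1 * router_loss) / accum   # scale so grads average over 64
    loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad clip = 1.0
opt.step()
```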

πŸ’‘ Technical Innovation

Self-Supervised Perplexity Routing

# During training — no manual labels needed!
sample_ppl = per_sample_ce_loss.exp()    # PPL from the model's own loss

if sample_ppl < 20:   pseudo_label = 0   # simple
elif sample_ppl < 50: pseudo_label = 1   # medium
else:                 pseudo_label = 2   # complex

router_loss = F.cross_entropy(router_logits, pseudo_label)
total_loss  = lm_loss + 0.1 * router_loss

Why it works:

  • Router learns difficulty FROM the model's own performance
  • As training progresses, more samples become "simple" β€” natural curriculum
  • No annotation cost, no bias from human difficulty ratings

What I Learned / What to Improve Next

This model is my first from-scratch LLM β€” I learned a lot:

What worked:
  βœ… NaN-safe attention masking (-1e4)
  βœ… Self-supervised routing signal
  βœ… Stable FP16 training (0 NaN batches)
  βœ… Perplexity drops meaningfully across epochs

What needs improvement:
  ⚠️  More diverse training data (150K is small)
  ⚠️  Longer training (2 epochs not enough)
  ⚠️  Stronger regularization to prevent overfitting
  ⚠️  Better evaluation on neutral benchmarks from day 1

Next project: BharatMorph β€” applies lessons learned here to Indian languages, with PPL-gated dynamic recursion (checks quality after each step, not pre-decided) and morpheme-aware embeddings for agglutinative Indian language structure.


⚠️ Limitations

  • Context window: 512 tokens max
  • English only: Not suitable for other languages
  • Small training set: 150K samples β€” commercial models use 100B+
  • Generalization: Shows overfitting to training distribution
  • Repetition: May loop in very long generations (>200 tokens)
  • Factual accuracy: May hallucinate β€” do not use for facts

🎯 Intended Use

Use case                                   Suitable?
Research on adaptive computation           ✅ Yes
Learning how LLMs are built from scratch   ✅ Yes
Prototyping conversational AI              ✅ With caveats
Production chatbots                        ❌ No
Medical / legal / financial advice         ❌ No
Factual question answering                 ❌ No

πŸ“„ Citation

@misc{girinath2026mor,
  author       = {Girinath V},
  title        = {Mixture of Recursion: Self-Supervised Perplexity-Guided
                  Adaptive Computation for Language Models},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Girinath11/recursive-language-model-198m}},
  note         = {198M parameter LM trained from scratch with adaptive
                  recursive computation via perplexity-based routing}
}

πŸ™ Acknowledgments

  • Anthropic β€” HH-RLHF dataset
  • Tsinghua University β€” UltraChat dataset
  • Vicgalle β€” Alpaca-GPT4 dataset
  • HuggingFace β€” Transformers library
  • Kaggle β€” Free GPU access

Model status : βœ… Research / Educational use
Last updated : March 2026
License : MIT
