Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -8,165 +8,57 @@ tags:
|
|
| 8 |
- causal-lm
|
| 9 |
- rabbit
|
| 10 |
- rtaforge
|
| 11 |
-
- proof-of-concept
|
| 12 |
base_model: RtaForge/Anvaya-Rabbit-2.7B
|
| 13 |
---
|
| 14 |
|
| 15 |
-
# Anvaya-Rabbit 2.7B
|
| 16 |
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
and the Gurukul constitutional training protocol. It serves as a technical
|
| 20 |
-
proof-of-concept that capable alternative-architecture models can be developed under
|
| 21 |
-
severe compute constraints. This is the first model in the Anvaya series:
|
| 22 |
-
**Rabbit → Raccoon → Polar Bear**.
|
| 23 |
-
|
| 24 |
-
## Overview
|
| 25 |
-
|
| 26 |
-
Rabbit demonstrates three proprietary components developed by RtaForge:
|
| 27 |
-
|
| 28 |
-
- **Ṛta-SSM** — a custom recurrent state-space architecture with no attention
|
| 29 |
-
or transformer blocks
|
| 30 |
-
- **Gurukul** — a proposal-validation training loop in which a Sisya proposes
|
| 31 |
-
weight deltas and a Guru validates them against constitutional constraints before
|
| 32 |
-
applying
|
| 33 |
-
- **Subsuminator** — cross-architecture weight migration without full retraining,
|
| 34 |
-
enabling efficient curriculum transfer
|
| 35 |
-
|
| 36 |
-
Trained across a phased curriculum on a single consumer GPU, Rabbit shows
|
| 37 |
-
substantial gains over random initialisation on internal scale-invariant metrics.
|
| 38 |
-
It is a deliberate architecture proof at seq_len=64 — not a production model.
|
| 39 |
-
|
| 40 |
-
For strategic context, IndiaAI alignment, and full programme roadmap, see the
|
| 41 |
-
[Anvaya Executive Briefing](https://huggingface.co/RtaForge/Anvaya-Rabbit-2.7B/resolve/main/docs/Anvaya-Executive-Briefing-May2026.pdf).
|
| 42 |
|
| 43 |
## Architecture
|
| 44 |
|
| 45 |
- **Type**: Ṛta-SSM v7.2.2, Fortress Unbroken — recurrent SSM, no attention
|
| 46 |
-
- **Parameters**: ~2.
|
| 47 |
- **Layers**: 64
|
| 48 |
- **d_model / d_state**: 2560
|
| 49 |
- **Vocabulary**: 50,280 (GPT-NeoX tokenizer)
|
| 50 |
- **Precision**: bfloat16
|
| 51 |
-
- **Training seq_len**: 64
|
| 52 |
|
| 53 |
## Weights
|
| 54 |
|
| 55 |
-
This repository contains the base pretrained checkpoint
|
| 56 |
-
(`
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
Load the imprint weights (base + SFT overlay, recommended for inference):
|
| 60 |
|
| 61 |
```python
|
| 62 |
from white_rabbit.rabbit_model import create_rabbit_model
|
| 63 |
from transformers import AutoTokenizer
|
| 64 |
import torch
|
| 65 |
|
| 66 |
-
model = create_rabbit_model(
|
| 67 |
-
|
| 68 |
-
durga_variant="fu-64", # 64-layer Fortress Unbroken backbone
|
| 69 |
-
)
|
| 70 |
-
sd = torch.load("imprint/Anvaya-Rabbit-2.7B-0.1-alpha-imprint.pt", map_location="cpu")
|
| 71 |
model.load_state_dict(sd, strict=False)
|
| 72 |
model.eval()
|
| 73 |
|
| 74 |
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
|
| 75 |
```
|
| 76 |
|
| 77 |
-
|
| 78 |
-
> guha@rtaforge.in for access). This model uses a custom SSM architecture
|
| 79 |
-
> not compatible with standard HuggingFace `AutoModel`.
|
| 80 |
-
|
| 81 |
-
**Training infrastructure**: [`Rta-Forge/polaris-revival`](https://github.com/Rta-Forge/polaris-revival) —
|
| 82 |
-
patched ROCm 7.2 runtime restoring native HIP dispatch on gfx803 (RX 560X), with
|
| 83 |
-
fused SSM recurrence kernels. MIT licensed.
|
| 84 |
-
|
| 85 |
-
## Training Protocol
|
| 86 |
-
|
| 87 |
-
Two proprietary components make this training regime possible:
|
| 88 |
-
|
| 89 |
-
**Gurukul** is a constitutional Sisya/Guru proposal-validation loop:
|
| 90 |
-
- The Sisya proposes weight deltas based on the current curriculum phase
|
| 91 |
-
- The Guru validates each proposal against a set of constitutional constraints
|
| 92 |
-
- Accepted proposals update the model; rejected proposals are logged for signal
|
| 93 |
-
- Feedback from each cycle informs the next round of proposals
|
| 94 |
-
|
| 95 |
-
**Subsuminator** enables efficient migration of learned weights across architectures,
|
| 96 |
-
supporting curriculum transfer without retraining from scratch.
|
| 97 |
-
|
| 98 |
-
Together these components allowed 1,500 accepted proposals across 6 phases to be
|
| 99 |
-
processed in ~7 effective days on a single 24GB GPU.
|
| 100 |
-
|
| 101 |
-
**1,500 accepted Gurukul proposals across 6 phases on a single AceCloud L4 (24GB VRAM).
|
| 102 |
-
~7 days effective training time (total elapsed higher due to crash recovery and VRAM
|
| 103 |
-
leak debugging).**
|
| 104 |
-
|
| 105 |
-
| Phase | Proposals | Dataset | Focus |
|
| 106 |
-
|-------|-----------|---------|-------|
|
| 107 |
-
| 0 | 125 | CAMEL Physics | Physical reasoning |
|
| 108 |
-
| 1 | 125 | CAMEL Chemistry | Chemical reasoning |
|
| 109 |
-
| 2 | 125 | CAMEL Biology | Biological reasoning |
|
| 110 |
-
| 3 | 250 | Raccoon Phase 1 | General reasoning |
|
| 111 |
-
| 4 | 500 | Rabbit E2 Phase 4 | Extended curriculum |
|
| 112 |
-
| 5 | 375 | Raccoon Phase 3 (consolidation re-run) | Pattern consolidation |
|
| 113 |
-
|
| 114 |
-
**Final checkpoint: Step 1,500.** seq_len=64, batch_size=3, optimizer=Lion, lr=1e-5.
|
| 115 |
-
|
| 116 |
-
SFT imprint applied using surface-only gate-layer fine-tuning (65 examples, 3 epochs).
|
| 117 |
-
|
| 118 |
-
## Evaluation
|
| 119 |
-
|
| 120 |
-
### Internal — Scale-Invariant Metrics
|
| 121 |
-
|
| 122 |
-
Evaluated using Top-K accuracy and Mean Reciprocal Rank vs. a randomly initialised
|
| 123 |
-
baseline of identical architecture. 50 samples per corpus, seq_len=64.
|
| 124 |
-
|
| 125 |
-
| Metric | Random Init | Trained (Step 1,500) | Gain |
|
| 126 |
-
|--------|-------------|----------------------|------|
|
| 127 |
-
| Top-1 Accuracy (aggregate) | 0.24% | **1.90%** | **~8×** |
|
| 128 |
-
| Top-10 Accuracy (aggregate) | 0.24% | **35.84%** | **~149×** |
|
| 129 |
-
| MRR (aggregate) | 0.0026 | **0.1724** | **~66×** |
|
| 130 |
-
| MRR — Deep Math | 0.0084 | **0.186** | **22×** |
|
| 131 |
-
| Top-10 — Biology | ~1.3% | **~12%** | **~10×** |
|
| 132 |
-
| Top-10 — Chemistry | ~1.3% | **~13%** | **~10×** |
|
| 133 |
-
|
| 134 |
-
These gains are measured against a randomly initialised model of identical
|
| 135 |
-
architecture — they reflect what the training curriculum taught, not absolute
|
| 136 |
-
capability.
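
The two metric families named above have standard definitions, sketched here for reference. This is a minimal illustration and not the project's evaluation script, which is not shown in this README; the helper names, the toy logits, and the tie-breaking rule (ties resolved in the target's favour) are all assumptions made for the example.

```python
from typing import Sequence


def rank_of_target(logits: Sequence[float], target: int) -> int:
    """1-based rank of the target token when logits are sorted descending.

    Ties are resolved in the target's favour (an assumption for this sketch).
    """
    target_score = logits[target]
    return 1 + sum(1 for score in logits if score > target_score)


def top_k_accuracy(ranks: Sequence[int], k: int) -> float:
    """Fraction of predictions whose target token is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all predictions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy data: three next-token predictions over a 4-token vocabulary.
examples = [
    ([0.1, 2.0, 0.3, 0.5], 1),  # target scores highest -> rank 1
    ([1.0, 0.2, 0.9, 0.1], 2),  # one token scores higher -> rank 2
    ([0.4, 0.3, 0.2, 0.1], 3),  # target scores lowest -> rank 4
]
ranks = [rank_of_target(logits, target) for logits, target in examples]

top1 = top_k_accuracy(ranks, k=1)
top2 = top_k_accuracy(ranks, k=2)
mrr = mean_reciprocal_rank(ranks)
```

The "Gain" column in the table is then the trained score divided by the random-init score for the same metric, for example 0.1724 / 0.0026 for aggregate MRR.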
|
| 137 |
-
|
| 138 |
-
### Commercial Benchmarks (lm-eval harness)
|
| 139 |
-
|
| 140 |
-
> **Standard academic benchmarks are not yet meaningful here.** Rabbit was
|
| 141 |
-
> deliberately trained at seq_len=64 as a pure architecture proof. Standard
|
| 142 |
-
> lm-eval prompts run 150–400 tokens — well beyond Rabbit's training context.
|
| 143 |
-
> Raccoon (seq_len=512) removes this constraint entirely.
|
| 144 |
-
|
| 145 |
-
| Benchmark | Score | Notes |
|
| 146 |
-
|-----------|-------|-------|
|
| 147 |
-
| HellaSwag | 25.89% | Prompt exceeds training seq_len |
|
| 148 |
-
| ARC-Challenge | 26.71% | Prompt exceeds training seq_len |
|
| 149 |
-
| MMLU | 26.89% | Prompt exceeds training seq_len |
|
| 150 |
-
| WinoGrande | 48.62% | Prompt exceeds training seq_len |
|
| 151 |
-
| TruthfulQA MC1 | 21.91% | Prompt exceeds training seq_len |
|
| 152 |
-
|
| 153 |
-
## Roadmap
|
| 154 |
|
| 155 |
-
|
| 156 |
-
|-------|--------|---------|--------|
|
| 157 |
-
| **Rabbit** | ~2.7B | 64 | ✅ This model — v0.1 Alpha |
|
| 158 |
-
| **Raccoon** | ~6.1B | 512 | In training — reasoning curriculum (math ×2, logic ×2) |
|
| 159 |
-
| **Polar Bear** | ~13B | 512 | Planned — STEM + AEVA anti-hallucination layer |
|
| 160 |
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
|
| 166 |
-
**Give us more resources and watch what happens.**
|
| 167 |
|
| 168 |
-
##
|
| 169 |
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 8 |
- causal-lm
|
| 9 |
- rabbit
|
| 10 |
- rtaforge
|
| 11 |
base_model: RtaForge/Anvaya-Rabbit-2.7B
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# Anvaya-Rabbit 2.7B
|
| 15 |
|
| 16 |
+
A 2.7B-parameter state-space model (SSM) trained by RtaForge using the Gurukul
|
| 17 |
+
constitutional training protocol.
|
| 18 |
|
| 19 |
## Architecture
|
| 20 |
|
| 21 |
- **Type**: Ṛta-SSM v7.2.2, Fortress Unbroken — recurrent SSM, no attention
|
| 22 |
+
- **Parameters**: ~2.78B
|
| 23 |
- **Layers**: 64
|
| 24 |
- **d_model / d_state**: 2560
|
| 25 |
- **Vocabulary**: 50,280 (GPT-NeoX tokenizer)
|
| 26 |
- **Precision**: bfloat16
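
As a rough sanity check on the figures above, the memory needed for the weights alone can be estimated from the parameter count and the precision. This sketch assumes 2 bytes per parameter (bfloat16) and ignores activations, recurrent state, and optimizer state; the function name is illustrative and not part of the RtaForge code.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# ~2.78B parameters stored as bfloat16 (2 bytes each).
rabbit_gib = weight_memory_gib(2.78e9)
print(f"{rabbit_gib:.2f} GiB")
```

Under these assumptions the weights come to a little over 5 GiB, before any runtime overhead.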
|
| 27 |
|
| 28 |
## Weights
|
| 29 |
|
| 30 |
+
This repository contains the base pretrained checkpoint (`base/Anvaya-Rabbit-2.7B-0.1-alpha-base.pt`)
|
| 31 |
+
and the SFT imprint checkpoint (`imprint/Anvaya-Rabbit-2.7B-0.1-alpha-imprint.pt`).
|
| 32 |
+
Load the base weights directly:
|
| 33 |
|
| 34 |
```python
|
| 35 |
from white_rabbit.rabbit_model import create_rabbit_model
|
| 36 |
from transformers import AutoTokenizer
|
| 37 |
import torch
|
| 38 |
|
| 39 |
+
model = create_rabbit_model(vocab_size=50280, durga_variant="fu-64")
|
| 40 |
+
sd = torch.load("base/Anvaya-Rabbit-2.7B-0.1-alpha-base.pt", map_location="cpu")
|
| 41 |
model.load_state_dict(sd, strict=False)
|
| 42 |
model.eval()
|
| 43 |
|
| 44 |
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
|
| 45 |
```
|
| 46 |
|
| 47 |
+
## Benchmarks
|
| 48 |
|
| 49 |
+
*Benchmarks pending; this table will be updated once the evaluation run completes.*
|
| 50 |
|
| 51 |
+
| Task | Metric | Score |
|
| 52 |
+
|------|--------|-------|
|
| 53 |
+
| HellaSwag | acc_norm | — |
|
| 54 |
+
| ARC-Challenge | acc_norm | — |
|
| 55 |
+
| MMLU | acc | — |
|
| 56 |
+
| WinoGrande | acc | — |
|
| 57 |
+
| TruthfulQA MC1 | mc1 | — |
|
| 58 |
|
| 59 |
|
| 60 |
+
## Training
|
| 61 |
|
| 62 |
+
Trained with the Anvaya Gurukul protocol: a constitutional Sisya/Guru loop
|
| 63 |
+
in which the Sisya proposes weight deltas and the Guru applies them only after validation.
|
| 64 |
+
SFT imprint applied using surface-only gate-layer fine-tuning.
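
The propose-then-validate pattern described above can be illustrated with a toy loop. This is a sketch under heavy assumptions and not the Gurukul implementation, which is not published here: the random proposals, the two "constitutional" checks (bounded step size, no increase in loss), the toy objective, and every function name are hypothetical stand-ins chosen only to show the control flow.

```python
import random
from typing import List, Tuple

Weights = List[float]


def loss(w: Weights) -> float:
    """Toy objective: squared distance from an arbitrary target vector."""
    target = [1.0, -2.0, 0.5]
    return sum((a - b) ** 2 for a, b in zip(w, target))


def sisya_propose(w: Weights, rng: random.Random, scale: float = 0.1) -> Weights:
    """Hypothetical Sisya: propose a random weight delta."""
    return [rng.gauss(0.0, scale) for _ in w]


def guru_validate(w: Weights, delta: Weights, max_norm: float = 0.5) -> bool:
    """Hypothetical Guru: accept only bounded steps that do not raise the loss."""
    norm = sum(d * d for d in delta) ** 0.5
    candidate = [a + d for a, d in zip(w, delta)]
    return norm <= max_norm and loss(candidate) <= loss(w)


def gurukul_loop(w: Weights, steps: int, seed: int = 0) -> Tuple[Weights, int, int]:
    """Run the propose/validate cycle and count accepted and rejected proposals."""
    rng = random.Random(seed)
    accepted = rejected = 0
    for _ in range(steps):
        delta = sisya_propose(w, rng)
        if guru_validate(w, delta):
            # Only validated deltas are applied to the weights.
            w = [a + d for a, d in zip(w, delta)]
            accepted += 1
        else:
            rejected += 1
    return w, accepted, rejected


start = [0.0, 0.0, 0.0]
final, accepted, rejected = gurukul_loop(start, steps=200)
```

Because a delta is applied only when validation passes, the toy loss can never increase between steps, which is the property the validation gate is meant to illustrate.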
|