# Aetheris: Hybrid Mamba-MoE Multilingual Model
Aetheris is a ~536M-parameter hybrid SSM-MoE language model distilled from CohereLabs/tiny-aya-global (3.35B). It supports 67 languages at 6.3x overall compression.
## Architecture
- Type: Hybrid Mamba-MoE (interleaved SSM + Sparse MoE layers)
- Layers: 24 (12 SSM + 12 MoE)
- Hidden dim: 1024
- Experts: 4 (top-1 routing)
- Vocab: 80,000 tokens (pruned from the 255K Aya tokenizer)
- Parameters: 536M (pruned from 722M via vocabulary pruning)
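The interleaving of SSM and MoE blocks can be sketched as follows. This is an illustrative layer plan only, assuming a simple alternating pattern; the field names (`LayerPlan`, `layer_types`) are hypothetical and not the repo's actual config schema.

```python
# Hypothetical sketch of the 24-layer interleaved stack described above.
from dataclasses import dataclass

@dataclass
class LayerPlan:
    n_layers: int = 24
    hidden_dim: int = 1024
    n_experts: int = 4
    top_k: int = 1  # top-1 routing: each token is sent to a single expert

    def layer_types(self):
        # One plausible interleaving: alternate SSM and MoE blocks,
        # giving 12 of each across 24 layers.
        return ["ssm" if i % 2 == 0 else "moe" for i in range(self.n_layers)]

plan = LayerPlan()
types = plan.layer_types()
```

Top-1 routing means only one expert's FFN runs per token, so the active parameter count per forward pass is well below the 536M total.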
## Compression
| Stage | Technique | Before | After | Savings |
|---|---|---|---|---|
| 1 | Knowledge Distillation | 3,350M | 722M | 4.6x |
| 2 | Vocabulary Pruning | 722M | 536M | 25.7% |
| Total | | 3,350M | 536M | 6.3x |
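The table's ratios follow directly from the parameter counts; a quick arithmetic check:

```python
# Sanity check of the compression figures above
# (parameter counts in millions; table values are rounded).
teacher, distilled, pruned = 3350, 722, 536

stage1 = teacher / distilled             # distillation: ~4.6x
stage2_savings = 1 - pruned / distilled  # vocab pruning: ~25.7% fewer params
total = teacher / pruned                 # overall: ~6.3x
```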
## Vocabulary Pruning Details
- Original vocab: 255,000 tokens → Pruned: 80,000 tokens
- Dead tokens removed: 131,231 (never used by any of 67 target languages)
- Per-language coverage preserved via frequency-based keep-list union
- Mean fertility increase: <5% across all languages
- Weight tying preserved (embedding = lm_head)
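The frequency-based keep-list union can be sketched as below. This is a toy illustration, not the actual pruning script: the corpora, counts, and the 99% coverage threshold are made-up stand-ins, but the mechanism (keep the smallest high-frequency token set per language, then union across languages) matches the description above.

```python
# Illustrative frequency-based keep-list pruning with toy counts.
from collections import Counter

def keep_list(token_counts: Counter, coverage: float = 0.99) -> set:
    """Smallest token set covering `coverage` of a language's corpus."""
    total = sum(token_counts.values())
    kept, running = set(), 0
    for tok, cnt in token_counts.most_common():
        kept.add(tok)
        running += cnt
        if running / total >= coverage:
            break
    return kept

# Toy per-language frequencies (token id -> count)
lang_counts = {
    "en": Counter({1: 100, 2: 50, 3: 1}),
    "sw": Counter({2: 80, 4: 40, 5: 1}),
}

# Union of per-language keep lists: every language retains its coverage,
# while tokens unused by all languages (dead tokens) are dropped.
pruned_vocab = set().union(*(keep_list(c) for c in lang_counts.values()))
```

Taking the union rather than the intersection is what preserves per-language coverage: a token rare globally but frequent in one language survives.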
## Training
- Stage 1: CKA-guided layer alignment (10K steps)
- Stage 2: KL divergence distillation, T=2.0, alpha=0.7 (20K steps, best loss=2.73)
- Stage 3: SFT fine-tuning (pending)
- Teacher: CohereLabs/tiny-aya-global (3.35B)
- Data: ClimbMix (NVIDIA)
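The Stage 2 objective (KL distillation with T=2.0, alpha=0.7) can be sketched as a standard blended distillation loss. This is a minimal illustration, not the repo's training code; the reduction, masking, and how teacher logits over the larger Aya vocabulary are projected onto the pruned 80K vocab are assumptions.

```python
# Minimal sketch of temperature-scaled KL distillation blended with
# hard-label cross-entropy, using the T and alpha listed above.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    # Soft-target KL term; the T^2 factor keeps gradient magnitudes
    # comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label cross-entropy on ground-truth tokens.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

s = torch.randn(8, 80_000)            # student logits over the pruned vocab
t = torch.randn(8, 80_000)            # teacher logits, assumed projected
y = torch.randint(0, 80_000, (8,))    # ground-truth token ids
loss = distill_loss(s, t, y)
```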
## Usage
```python
import sys

import torch

sys.path.insert(0, ".")  # make the local aetheris package importable
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

# Build the model from its YAML config, then load the released weights.
config = AetherisConfig.from_yaml("config.yaml")
model = HybridMambaMoE(config)
sd = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(sd)
model.eval()
```
Note: This model uses a pruned vocabulary. Use the `vocab_mapping.json` file to map between original Aya tokenizer IDs and pruned model IDs.
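The ID remapping might look like the sketch below. The exact schema of `vocab_mapping.json` is an assumption (here: a flat object from original-ID strings to pruned IDs), so a toy inline JSON string stands in for the real file.

```python
# Hypothetical use of a vocab mapping; the real file's schema may differ.
import json

# Toy stand-in for the contents of vocab_mapping.json.
mapping_json = '{"42": 0, "100": 1, "2551": 2}'
orig_to_pruned = json.loads(mapping_json)

def to_pruned(ids):
    # Map original Aya tokenizer ids to pruned-model ids. Ids absent
    # from the mapping would need a fallback (e.g. an UNK id) in real use.
    return [orig_to_pruned[str(i)] for i in ids]
```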
## Wayy Research
People for research, research for people. Buffalo, NY · Est. 2024