| --- |
| base_model: t5-small |
| license: apache-2.0 |
| datasets: |
| - open-web-math/open-web-math |
| tags: |
| - text-generation |
| - causal-lm |
| - mamba |
| - hrm |
| - pytorch |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # CMBA-768M-OpenWebMath |
|
|
| A 768M-parameter Hierarchical Recurrent Memory (HRM) language model trained on high-quality mathematical web text from OpenWebMath. The model uses **Mamba2 state-space models** in place of attention, so compute scales linearly with sequence length. |
|
|
| ## Model Architecture |
|
|
| **CMBA** (Causal Mamba-based Architecture) implements a hierarchical processing structure: |
|
|
| - **Hierarchical Design**: Two-level processing with H-layers (high-level abstraction) and L-layers (low-level specialists) |
| - **Mamba2 Mixers**: State-space models replace attention, giving O(n) cost in sequence length n versus O(n²) for self-attention |
| - **Adaptive Computation**: A halting mechanism allows variable compute per token (ACT-style pondering); with the maximum of 1 halt step listed under Training, each token effectively receives a single step |
| - **Parameters**: ~768M total |
| - **Context Length**: 1024 tokens |
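
To make the O(n) claim concrete, here is a minimal sketch of a scalar, time-invariant state-space recurrence. It is not Mamba2's selective scan, which uses input-dependent parameters and a chunked kernel. The function name `ssm_scan` and the parameters `a`, `b`, `c` are illustrative and do not come from the model's code.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0
    ys = []
    for x_t in x:  # one constant-cost state update per token, so O(n) overall
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


# An impulse at t=0 decays geometrically through the hidden state.
outputs = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(outputs)  # [2.0, 1.0, 0.5, 0.25]
```

Each token touches only the fixed-size state `h`, whereas self-attention compares every token with every earlier token, which is where the O(n²) term comes from.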
|
|
| ### Configuration |
| ```text |
| Model Dimensions: |
| - d_model: 1024 |
| - n_heads: 16 (for compatibility, not used in Mamba) |
| - d_ff: 4096 |
| - H_layers: 12 (high-level hierarchy) |
| - L_layers: 12 (low-level processing) |
| |
| Mamba2 Settings: |
| - d_state: 128 |
| - expand: 2 |
| - headdim: 64 |
| - d_conv: 4 |
| - ngroups: 1 |
| |
| Training: |
| - Max halt steps: 1 |
| - Block size: 1024 |
| - Batch size: 64 (effective) |
| - Learning rate: 3e-05 → 1e-06 |
| - Weight decay: 0.1 |
| ``` |
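
The Mamba2 settings imply a few derived sizes. The sketch below computes them under the assumption that the mixers follow the conventions of the reference Mamba2 layer in `mamba_ssm` without modification, which this card does not confirm. The variable names are illustrative.

```python
# Values copied from the configuration above
d_model, d_state, expand, headdim, d_conv, ngroups = 1024, 128, 2, 64, 4, 1

# Derived sizes (assumption: reference Mamba2 conventions apply unchanged)
d_inner = expand * d_model                    # width inside the mixer
n_ssm_heads = d_inner // headdim              # Mamba2 heads, unrelated to n_heads: 16
conv_dim = d_inner + 2 * ngroups * d_state    # channels through the depthwise conv
d_in_proj = 2 * d_inner + 2 * ngroups * d_state + n_ssm_heads  # input projection width

print(d_inner, n_ssm_heads, conv_dim, d_in_proj)  # 2048 32 2304 4384
```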
|
|
| ## Training Data |
|
|
| - **Dataset**: [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |
| - **Tokenizer**: `t5-small` (T5 SentencePiece) |
| - **Vocab Size**: 32100 |
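
The card does not say how documents were turned into 1024-token training sequences. One common approach, shown here purely as an assumption, is to concatenate tokenized documents and cut the stream into fixed-size blocks. The helper `pack_into_blocks` is hypothetical and not part of the model's code.

```python
from typing import Iterable, Iterator, List

BLOCK_SIZE = 1024  # "Block size" from the configuration


def pack_into_blocks(
    token_streams: Iterable[List[int]], block_size: int = BLOCK_SIZE
) -> Iterator[List[int]]:
    """Concatenate token-id lists and yield consecutive full blocks."""
    buffer: List[int] = []
    for ids in token_streams:
        buffer.extend(ids)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any trailing partial block is dropped in this sketch.


# Toy example with a small block size so the result is easy to read.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_into_blocks(docs, block_size=4))
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In practice the token-id lists would come from running the `t5-small` tokenizer over the dataset's text field.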
|
|
| ## Latest Performance (Epoch 0) |
|
|
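| - **Validation Loss**: `10.3766` (cross-entropy in nats per token) |
| - **Validation Perplexity**: `32099.98` (approximately the vocabulary size of 32100, the level of a uniform predictor, so this checkpoint should be treated as essentially untrained) |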
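
The two figures are linked by perplexity = exp(loss). The short check below uses only the numbers reported in this card and compares them with the loss of a model that assigns equal probability to every token in the vocabulary.

```python
import math

val_loss = 10.3766   # reported validation loss (nats per token)
vocab_size = 32100   # reported vocabulary size

perplexity = math.exp(val_loss)      # perplexity = exp(cross-entropy)
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary

print(f"perplexity from loss: {perplexity:.2f}")
print(f"uniform-guess loss:   {uniform_loss:.4f}")
```

The small gap between the recomputed perplexity and the reported `32099.98` comes from the loss being rounded to four decimals.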
|
|
| ## Usage |
|
|
| ```python |
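| from transformers import T5Tokenizer |
| # Assumption: hrm_text1_mamba1_donor is a local module from the model's |
| # source code and must be importable (e.g. placed in the working directory). |
| from hrm_text1_mamba1_donor import HRMText1 |
| |
| # The model reuses the t5-small SentencePiece vocabulary (32100 tokens) |
| tokenizer = T5Tokenizer.from_pretrained("t5-small") |
| model = HRMText1.from_pretrained("Viharikvs/CMBA-768M-OpenWebMath") |
| |
| # Generate text |
| input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids |
| outputs = model.generate(input_ids, max_length=100) |
| # Drop special tokens such as </s> and <pad> from the decoded string |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |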
| ``` |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{cmba-768m-openwebmath, |
| author = {Vihari}, |
| title = {CMBA-768M-OpenWebMath: Hierarchical Mamba-based Language Model}, |
| year = {2025}, |
| publisher = {HuggingFace}, |
| url = {https://huggingface.co/Viharikvs/CMBA-768M-OpenWebMath} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|