CodeMind-1B-v0.1
A research-focused Small Language Model (SLM) that combines DeepSeek-style MLA + MoE and Kimi Attention Residuals in a single custom architecture, trained from scratch with the Muon optimizer on a single GPU.
Overview
CodeMind-1B is an experimental, coding-focused SLM designed to explore the next generation of efficient transformer architectures. Unlike standard LLaMA/Mistral derivatives, CodeMind is built around a fully custom stack developed from scratch.
This release is the Base Pre-training Checkpoint. It was trained to validate architecture stability, optimizer behavior, and sparse routing efficiency on a single NVIDIA A40 GPU.
Architecture Highlights
Multi-Head Latent Attention (MLA)
DeepSeek-style KV compression using latent-space attention reconstruction.
- Benefits: Smaller KV cache, lower VRAM usage at inference, and better scaling to long contexts.
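To make the KV-cache saving concrete, the sketch below compares the per-token cache size of standard multi-head attention with that of a single compressed latent per layer. All dimensions (`n_layers`, `n_heads`, `head_dim`, `latent_dim`) are hypothetical and not taken from the CodeMind config, and the decoupled RoPE key that DeepSeek-style MLA also caches is ignored for simplicity.

```python
def kv_cache_bytes_per_token(n_layers, per_layer_values, bytes_per_value=2):
    """Bytes of cache stored per token (bfloat16 = 2 bytes per value)."""
    return n_layers * per_layer_values * bytes_per_value

# Hypothetical dimensions for illustration only (not the CodeMind config).
n_layers, n_heads, head_dim, latent_dim = 16, 16, 64, 256

# Standard MHA caches a full key and a full value vector per head.
mha = kv_cache_bytes_per_token(n_layers, 2 * n_heads * head_dim)
# MLA caches one shared latent vector, from which K and V are reconstructed.
mla = kv_cache_bytes_per_token(n_layers, latent_dim)

print(f"MHA: {mha} B/token, MLA: {mla} B/token, ratio: {mha / mla:.1f}x")
```

With these assumed sizes the latent cache is 8x smaller; the real ratio depends on the actual latent width.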
Mixture of Experts (MoE)
Fine-grained routed experts designed for sparse, efficient compute.
- Spec: 4 routed experts (Top-2 routing) + 1 always-active shared expert.
- Routing: Auxiliary-loss-free load balancing (updates via backward pass instead of manual bias adjustments).
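The routing spec above can be sketched with scalar toy experts. This is a minimal illustration, not the CodeMind implementation: the renormalization of the top-2 weights and the helper name `top2_moe` are assumptions, and load balancing is left out entirely.

```python
import math

def top2_moe(x, router_logits, routed_experts, shared_expert):
    """Combine the shared expert with the two highest-scoring routed experts."""
    # Softmax over router logits gives one score per routed expert.
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the two best experts and renormalize their weights to sum to 1
    # (an assumption; some designs use the raw scores instead).
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)

    out = shared_expert(x)  # the shared expert always runs
    for i in top2:
        out += (probs[i] / norm) * routed_experts[i](x)
    return out, top2

# Toy scalar "experts" (hypothetical stand-ins for FFN blocks).
experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: 4.0 * x]
shared = lambda x: 0.5 * x

y, chosen = top2_moe(1.0, [0.0, 2.0, 2.0, -1.0], experts, shared)
print(chosen, y)
```

Only two of the four routed experts run per token, which is why the active parameter count is well below the total.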
Kimi Attention Residuals (AttnRes)
Instead of standard additive residuals (x = x + f(x)), CodeMind uses attention-based residual aggregation inspired by Moonshot AI's Kimi architecture.
- Benefits: Improved gradient flow, better deep-layer information retention, and higher compute efficiency via block-level pooling.
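As a loose sketch of the general idea only (not the Kimi formulation or CodeMind's code), a layer's input can be formed as a softmax-weighted sum of earlier layer outputs, scored against a learned query. The function name `attn_residual`, the scaling, and the toy vectors are all assumptions made for illustration.

```python
import math

def attn_residual(history, query):
    """Softmax-weighted sum of earlier layer outputs, scored against `query`."""
    dim = len(query)
    scores = [sum(q * h for q, h in zip(query, vec)) / math.sqrt(dim) for vec in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * vec[d] for w, vec in zip(weights, history)) for d in range(dim)]
    return mixed, weights

# Outputs of three earlier layers (or block-level summaries) for one token.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query scores every entry equally -> plain average

mixed, weights = attn_residual(history, query)
print(weights, mixed)
```

A zero query reduces to a uniform average of the history; a trained query would instead emphasize the earlier outputs that matter most for the current layer.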
Multi-Token Prediction (MTP)
The model predicts multiple future tokens simultaneously during training.
- Benefits: Better algorithmic planning behavior, stronger token representations, and the foundation for speculative decoding.
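The training targets for multi-token prediction can be illustrated as follows. This is a minimal sketch: the number of future tokens (`n_future=2`) and the helper name `mtp_targets` are assumptions, since the number of predicted tokens is not stated here.

```python
def mtp_targets(token_ids, n_future=2):
    """For each position, list the next `n_future` tokens to predict."""
    examples = []
    # Stop early enough that every position has a full set of targets.
    for t in range(len(token_ids) - n_future):
        current = token_ids[t]
        targets = token_ids[t + 1 : t + 1 + n_future]
        examples.append((current, targets))
    return examples

tokens = [10, 11, 12, 13, 14]
for current, targets in mtp_targets(tokens):
    print(current, targets)
```

Standard next-token prediction is the special case `n_future=1`.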
Hybrid Muon Optimizer
- Weight Matrices (2D): Muon (Newton-Schulz orthogonalization for extremely fast convergence).
- Embeddings / Norms (1D): AdamW (for stable positional tracking).
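The orthogonalization step can be sketched in plain Python. Note the substitution: this uses the classic cubic Newton-Schulz iteration, not the tuned polynomial that Muon implementations typically use, and the helper names and the toy 2x2 matrix are hypothetical. It assumes the input matrix has full rank.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=12):
    """Approximate the nearest orthogonal matrix to a full-rank matrix `g`."""
    # Scale by the Frobenius norm so all singular values are at most 1.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # Cubic iteration: X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x

grad = [[2.0, -1.0], [1.0, 3.0]]  # toy 2x2 "gradient" matrix
update = newton_schulz_orthogonalize(grad)
gram = matmul(update, transpose(update))  # should be close to the identity
print(gram)
```

Each step pushes every singular value toward 1 while leaving the singular vectors unchanged, so the update keeps the gradient's directions but equalizes their scale.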
Training Details & Dataset
| Specification | Details |
|---|---|
| Parameters | ~1B Total / ~400M Active (Per Token) |
| Hardware | 1× NVIDIA A40 (48GB VRAM) |
| Training Time | ~21 Hours |
| Tokens Seen | 147 Million |
| Precision | bfloat16 |
| Optimizer | Muon (lr=0.01) + AdamW (lr=3e-4) |
| Objective | Next Token Prediction (NTP) + MTP + Z-Loss |
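For readers unfamiliar with the Z-Loss term in the objective, a common form penalizes the squared log-partition function of a logit vector. The sketch below assumes that form; the coefficient `1e-4` and whether the term is applied to router logits or output logits in CodeMind are not stated here and are assumptions.

```python
import math

def z_loss(logits, coeff=1e-4):
    """Penalize the squared log-partition function log(sum(exp(logits)))."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2

print(z_loss([0.0, 0.0]))    # small logits -> small penalty
print(z_loss([50.0, 50.0]))  # large logits -> much larger penalty
```

The penalty discourages logits from drifting to large magnitudes, which helps keep low-precision training numerically stable.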
Dataset Mix
The training data mixes three domains:
- 60% Code: GitHub Python (Teaching coding syntax and structure)
- 15% Math: OpenWebMath (Teaching logical reasoning)
- 25% General: FineWeb-Edu (Teaching general language and knowledge)
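One simple way to realize such a mix is weighted sampling over domains. This sketch assumes the percentages are sampling probabilities per draw; whether CodeMind mixes by document, batch, or token is not stated, and `sample_domains` is a hypothetical helper.

```python
import random

# Mixture weights from the list above.
MIX = {"code": 0.60, "math": 0.15, "general": 0.25}

def sample_domains(n, seed=0):
    """Draw `n` domain labels with probability proportional to MIX."""
    rng = random.Random(seed)
    return rng.choices(list(MIX), weights=list(MIX.values()), k=n)

draws = sample_domains(10_000)
for name in MIX:
    print(name, draws.count(name) / len(draws))
```

Over many draws the observed fractions approach the configured weights.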
Training Metrics
The loss decreased steadily across all custom objectives without divergence, which supports the stability of the hybrid optimizer and custom architecture in this run.
- Initial Loss: ~10.5 ➔ Final Loss: ~3.1
- MoE Load Balancing: Stable at ~0.35, indicating that the routed experts share the workload.
- Throughput: Sustained ~1,940 Tok/s on a single A40.
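As a quick sanity check, the reported throughput and training time can be multiplied out and compared with the reported token count. This assumes the throughput was sustained for the full ~21 hours.

```python
tokens_per_second = 1_940   # reported sustained throughput
training_hours = 21         # reported approximate training time

estimated_tokens = tokens_per_second * training_hours * 3600
print(f"{estimated_tokens / 1e6:.1f}M tokens")  # 146.7M tokens
```

The estimate lands within about 1% of the reported 147 million tokens.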
Current Capabilities & Limitations
This checkpoint can:
- Generate valid Python syntax and complete basic functions.
- Understand indentation, classes, and structure.
- Follow standard code patterns.
This checkpoint cannot:
- Perform robust logic (147M tokens is enough for syntax, not reasoning).
- Answer conversational questions (It is not Instruction Tuned).
- Avoid hallucinations on complex algorithms.
How to Load the Model
Because CodeMind uses a heavily customized architecture, it cannot be loaded with the standard Hugging Face AutoModel classes. Load it with the repository's source code instead:

    import torch
    from safetensors.torch import load_file

    from config.model_config import CodeMindConfig
    from tokenizer.tokenizer import CodeMindTokenizer
    from model.codemind import CodeMindSLM

    # 1. Initialize config and tokenizer
    config = CodeMindConfig()
    tokenizer = CodeMindTokenizer()

    # 2. Build the architecture in bfloat16 on the GPU
    model = CodeMindSLM(config).to("cuda").to(torch.bfloat16)

    # 3. Load weights
    state_dict = load_file("model.safetensors")

    # 4. Remove torch.compile prefixes if present
    cleaned_dict = {
        k.replace("_orig_mod.", ""): v
        for k, v in state_dict.items()
    }

    # 5. strict=False tolerates key mismatches, so report them instead of hiding them
    result = model.load_state_dict(cleaned_dict, strict=False)
    if result.missing_keys or result.unexpected_keys:
        print("Missing keys:", result.missing_keys)
        print("Unexpected keys:", result.unexpected_keys)

    model.eval()
    print("✅ CodeMind loaded successfully!")
Acknowledgements & References
This architecture was built from scratch, drawing heavy inspiration from published work on DeepSeek-style MLA and MoE, Moonshot AI's Kimi Attention Residuals, and the Muon optimizer.