
CodeMind-1B-v0.1

A research-focused Small Language Model (SLM) combining DeepSeek-style MLA + MoE, Kimi Attention Residuals, and the Muon optimizer into a single custom architecture trained from scratch on consumer-grade hardware.


Overview

CodeMind-1B is an experimental, coding-focused SLM designed to explore the next generation of efficient transformer architectures. Unlike standard LLaMA/Mistral derivatives, CodeMind is built around a fully custom stack developed from scratch.

This release is the Base Pre-training Checkpoint. It was trained to validate architecture stability, optimizer behavior, and sparse routing efficiency on a single NVIDIA A40 GPU.

Architecture Highlights

Multi-Head Latent Attention (MLA)

DeepSeek-style KV compression using latent-space attention reconstruction.

  • Benefits: A much smaller KV cache, which lowers VRAM usage and makes long contexts and inference at scale cheaper.
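To make the cache saving concrete, here is a rough back-of-the-envelope sketch. It assumes the DeepSeek-style layout, where each token caches one compressed latent plus a small decoupled RoPE key instead of full per-head keys and values. All dimensions below are hypothetical and are not CodeMind's actual configuration.

```python
def kv_cache_elems_mha(n_layers: int, n_heads: int, head_dim: int, seq_len: int) -> int:
    """Elements cached by standard multi-head attention: full K and V for every head."""
    return n_layers * seq_len * 2 * n_heads * head_dim


def kv_cache_elems_mla(n_layers: int, latent_dim: int, rope_dim: int, seq_len: int) -> int:
    """Elements cached by MLA: one shared latent plus one decoupled RoPE key per token."""
    return n_layers * seq_len * (latent_dim + rope_dim)


# Hypothetical dimensions, chosen only to make the arithmetic concrete.
N_LAYERS, N_HEADS, HEAD_DIM = 16, 16, 64
LATENT_DIM, ROPE_DIM, SEQ_LEN = 256, 32, 4096

mha_elems = kv_cache_elems_mha(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
mla_elems = kv_cache_elems_mla(N_LAYERS, LATENT_DIM, ROPE_DIM, SEQ_LEN)
ratio = mha_elems / mla_elems

BYTES_PER_ELEM = 2  # bfloat16
print(f"MHA cache: {mha_elems * BYTES_PER_ELEM / 2**20:.0f} MiB")
print(f"MLA cache: {mla_elems * BYTES_PER_ELEM / 2**20:.0f} MiB")
print(f"Reduction: {ratio:.1f}x")
```

The ratio depends only on `2 * n_heads * head_dim` versus `latent_dim + rope_dim`, so the saving grows with the number of heads the latent replaces.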

Mixture of Experts (MoE)

Fine-grained routed experts designed for sparse, efficient compute.

  • Spec: 4 routed experts (Top-2 routing) + 1 always-active shared expert.
  • Routing: Auxiliary-loss-free load balancing (balancing is updated through the backward pass rather than by manual bias adjustments).
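The spec above (4 routed experts, Top-2 routing, 1 shared expert) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the repository's implementation: each expert is reduced to a single linear map, gate weights come from a softmax over the two selected logits (one common choice), and load balancing is left out entirely. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_ROUTED, TOP_K = 8, 4, 2

# Each "expert" is a single linear map here; real experts are small MLPs.
routed_experts = [rng.normal(scale=0.1, size=(D_MODEL, D_MODEL)) for _ in range(N_ROUTED)]
shared_expert = rng.normal(scale=0.1, size=(D_MODEL, D_MODEL))
router_weight = rng.normal(scale=0.1, size=(D_MODEL, N_ROUTED))


def moe_forward(x):
    """x: (tokens, d_model) -> (output, chosen expert ids, gate weights)."""
    logits = x @ router_weight                           # (tokens, n_routed)
    top_idx = np.argsort(-logits, axis=-1)[:, :TOP_K]    # best TOP_K experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Softmax over the selected logits only (assumed normalisation).
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates = gates / gates.sum(axis=-1, keepdims=True)

    out = x @ shared_expert                              # shared expert sees every token
    for expert_id in range(N_ROUTED):
        for slot in range(TOP_K):
            mask = top_idx[:, slot] == expert_id
            if mask.any():
                weight = gates[mask, slot][:, None]
                out[mask] += weight * (x[mask] @ routed_experts[expert_id])
    return out, top_idx, gates


tokens = rng.normal(size=(5, D_MODEL))
output, chosen, gate_weights = moe_forward(tokens)
print(chosen)  # two distinct expert ids per token
```

Only two of the four routed experts run for each token, which is where the gap between total and active parameters comes from.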

Kimi Attention Residuals (AttnRes)

Instead of standard additive residuals (x = x + f(x)), CodeMind uses attention-based residual aggregation inspired by Moonshot AI's Kimi architecture.

  • Benefits: Improved gradient flow, better deep-layer information retention, and higher compute efficiency via block-level pooling.
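One way to read the idea: a plain residual stream adds every earlier output with a fixed weight of 1, while an attention residual lets each layer pick its own softmax weights over earlier outputs. The sketch below is a simplified interpretation, not the repository's code. It assumes one query vector per layer (hypothetical) and omits normalization and block-level pooling.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def attn_residual(prev_outputs, query):
    """Aggregate earlier layer outputs with softmax weights instead of a plain sum.

    prev_outputs: list of (d_model,) arrays, one per earlier layer (or block).
    query:        (d_model,) vector owned by the current layer (hypothetical).
    """
    stacked = np.stack(prev_outputs)                     # (n_prev, d_model)
    scores = stacked @ query / np.sqrt(stacked.shape[1])
    weights = softmax(scores)
    return weights @ stacked, weights


rng = np.random.default_rng(0)
d_model = 8
outputs = [rng.normal(size=d_model) for _ in range(4)]

mixed, weights = attn_residual(outputs, rng.normal(size=d_model))
additive = np.sum(outputs, axis=0)                       # what x = x + f(x) accumulates

# With a zero query every earlier output gets the same weight.
uniform_mix, uniform_weights = attn_residual(outputs, np.zeros(d_model))
print(weights.round(3), uniform_weights)
```

Pooling layers into blocks would shrink `prev_outputs` from one entry per layer to one per block, which is presumably where the efficiency gain mentioned above comes from.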

Multi-Token Prediction (MTP)

The model predicts multiple future tokens simultaneously during training.

  • Benefits: Better algorithmic planning behavior, stronger token representations, and the foundation for speculative decoding.
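As a small illustration of what the extra targets look like, the sketch below pairs each position with its next `n_future` tokens. The number of future tokens CodeMind predicts is not stated here, so `n_future=2` is an assumption, and the helper name is hypothetical.

```python
def mtp_targets(tokens, n_future):
    """Pair each position with the next n_future tokens.

    Positions that do not have n_future tokens ahead of them are dropped.
    In a real model the prediction is conditioned on the whole prefix;
    only the current token is shown here for readability.
    """
    usable = len(tokens) - n_future
    return [(tokens[t], tokens[t + 1 : t + 1 + n_future]) for t in range(usable)]


tokens = ["def", "add", "(", "a", ",", "b", ")"]
pairs = mtp_targets(tokens, n_future=2)
for current, future in pairs:
    print(f"{current!r} -> {future}")
```

With `n_future=1` this reduces to ordinary next-token prediction, so MTP can be seen as adding further-ahead targets on top of the standard objective.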

Hybrid Muon Optimizer

  • Weight Matrices (2D): Muon (Newton-Schulz orthogonalization of the update, for faster convergence).
  • Embeddings / Norms (1D): AdamW (for stable positional tracking).
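The Newton-Schulz step at the heart of Muon can be sketched in NumPy as follows. The quintic coefficients are the ones used in the public Muon reference code; the parameter-routing rule and the parameter names are hypothetical and only illustrate the split described above.

```python
import numpy as np


def newton_schulz_orthogonalize(grad, steps=5):
    """Push the singular values of a 2D matrix toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315            # quintic coefficients from the Muon reference code
    x = grad / (np.linalg.norm(grad) + 1e-7)     # Frobenius norm, so all singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                               # iterate on the wide orientation (smaller Gram matrix)
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    return x.T if transposed else x


def pick_optimizer(name, ndim):
    """Hypothetical routing rule: hidden 2D weight matrices use Muon, everything else AdamW."""
    if ndim == 2 and "embed" not in name:
        return "muon"
    return "adamw"


rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 16))
update = newton_schulz_orthogonalize(grad)
singular_values = np.linalg.svd(update, compute_uv=False)
print(singular_values.round(2))                  # clustered around 1 rather than exactly 1
```

The iteration does not converge to an exactly orthogonal matrix. It trades accuracy for speed and leaves the singular values in a band around 1.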

Training Details & Dataset

| Specification | Details |
| --- | --- |
| Parameters | ~1B total / ~400M active (per token) |
| Hardware | 1× NVIDIA A40 (48 GB VRAM) |
| Training Time | ~21 hours |
| Tokens Seen | 147 million |
| Precision | bfloat16 |
| Optimizer | Muon (lr=0.01) + AdamW (lr=3e-4) |
| Objective | Next Token Prediction (NTP) + MTP + Z-Loss |
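The Z-Loss term in the objective is not defined here. A common definition penalizes the squared log-partition function of a logit vector, which keeps logits from drifting to large magnitudes. The sketch below assumes that definition. The coefficient is a placeholder, and whether CodeMind applies the term to output logits, router logits, or both is not stated.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(value - peak) for value in logits))


def z_loss(batch_logits, coeff=1e-4):
    """Mean squared log-partition over a batch of logit vectors, scaled by coeff.

    coeff=1e-4 is a placeholder, not the value used for this checkpoint.
    """
    total = sum(logsumexp(row) ** 2 for row in batch_logits)
    return coeff * total / len(batch_logits)


small = z_loss([[0.0, 0.0]], coeff=1.0)      # log(2) ** 2
large = z_loss([[10.0, 10.0]], coeff=1.0)    # (10 + log(2)) ** 2
print(small, large)
```

Because the penalty grows with the overall scale of the logits, it acts as a soft constraint that complements the main cross-entropy terms.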

Dataset Mix

Trained on a carefully curated subset of domains:

  • 60% Code: GitHub Python (Teaching coding syntax and structure)
  • 15% Math: OpenWebMath (Teaching logical reasoning)
  • 25% General: FineWeb-Edu (Teaching general language and knowledge)
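For a sense of scale, the mix can be turned into approximate per-domain token budgets. This assumes the percentages are measured in tokens and applied to the 147 million tokens seen, which is an assumption rather than something stated above.

```python
TOTAL_TOKENS = 147_000_000
MIX_PERCENT = {"code": 60, "math": 15, "general": 25}


def token_budget(total, mix_percent):
    """Split a token total across domains according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, count in budget.items():
    print(f"{name:>8}: {count:,} tokens")
```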

Training Metrics

The model converged on all custom objectives without divergence, which supports the stability of the hybrid optimizer and custom architecture at this scale.

[Figures: Training Loss & Tokens Seen; MTP Loss & Load Balancing Loss; Z-Loss & Muon Learning Rate]

  • Initial Loss: ~10.5 → Final Loss: ~3.1
  • MoE Load Balancing: Stable at ~0.35, indicating that the routed experts are sharing the workload.
  • Throughput: Sustained ~1,940 Tok/s on a single A40.
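As a quick consistency check, the reported throughput and token count can be combined to estimate wall-clock time, assuming throughput stayed roughly constant for the whole run.

```python
TOKENS_SEEN = 147_000_000
THROUGHPUT_TOK_PER_S = 1_940

seconds = TOKENS_SEEN / THROUGHPUT_TOK_PER_S
hours = seconds / 3600
print(f"{hours:.1f} hours")  # 21.0 hours
```

This agrees with the ~21 hours of training time listed in the specification table.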

Current Capabilities & Limitations

This checkpoint can:

  • Generate valid Python syntax and complete basic functions.
  • Understand indentation, classes, and structure.
  • Follow standard code patterns.

This checkpoint cannot:

  • Perform robust logic (147M tokens is enough for syntax, not reasoning).
  • Answer conversational questions (it is not instruction-tuned).
  • Avoid hallucinations on complex algorithms.

How to Load the Model

Because CodeMind uses a heavily customized architecture, it cannot be loaded via standard Hugging Face AutoModel classes. You must use the repository source code directly.

import torch
from safetensors.torch import load_file
from config.model_config import CodeMindConfig
from tokenizer.tokenizer import CodeMindTokenizer
from model.codemind import CodeMindSLM

# 1. Initialize config and tokenizer
config = CodeMindConfig()
tokenizer = CodeMindTokenizer()

# 2. Build the architecture
model = CodeMindSLM(config).to("cuda").to(torch.bfloat16)

# 3. Load weights
state_dict = load_file("model.safetensors")

# 4. Remove torch.compile prefixes if present
cleaned_dict = {
    k.replace("_orig_mod.", ""): v
    for k, v in state_dict.items()
}

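# strict=False tolerates key mismatches, so report them instead of ignoring them
result = model.load_state_dict(cleaned_dict, strict=False)
if result.missing_keys or result.unexpected_keys:
    print(f"Missing keys: {result.missing_keys}")
    print(f"Unexpected keys: {result.unexpected_keys}")

model.eval()  # inference mode
print("✅ CodeMind loaded successfully!")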

Acknowledgements & References

This architecture was built entirely from scratch, drawing heavy inspiration from the following groundbreaking papers:

  1. DeepSeek-V4 Technical Report
  2. DeepSeek-V3 Technical Report
  3. Attention Residuals (Moonshot AI / Kimi)
  4. Muon Optimizer Scalability