
# MiniMax-M2.7-JANGTQ_K
MiniMax M2.7 — 74 GB on disk (down from ~230 GB FP8 source) — mixed-bit JANGTQ_K quantization in JANGTQ-PRESTACK layout.
- Source: MiniMaxAI/MiniMax-M2.7 (62 layers, 256 routed experts top-8, 196K context)
- Quantization: mixed-bit MXTQ on routed experts (see the config sketch after this list):
  - down_proj: 4-bit (output enters the residual stream, more sensitive)
  - gate_proj: 2-bit (gated activation, less sensitive)
  - up_proj: 2-bit (gated activation)
- attention / shared expert / embed / lm_head: 8-bit affine
- norms / router gate / expert_bias: fp16 / fp32 passthrough
- Routed-expert layout: pre-stacked along axis 0 per the JANGTQ-PRESTACK standard — instant cold load, no runtime sidecar.
- Bundle size: ~74 GB on-disk (~3-bit avg routed)
- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
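
For illustration, a minimal sketch of the bit assignment above expressed as a path → bits rule. The helper name and substring matching are hypothetical and not part of jang-tools:

```python
# Hypothetical sketch of the per-tensor bit assignment described above.
# The function name and path-matching rules are illustrative, not the jang-tools API.
def bits_for(path: str) -> int | None:
    """Return quantization bits for a weight path, or None for fp passthrough."""
    if "norm" in path or "router" in path or "expert_bias" in path:
        return None                  # fp16 / fp32 passthrough
    if ".experts." in path:          # routed experts (256 per layer, top-8)
        if "down_proj" in path:
            return 4                 # residual-stream output: more sensitive
        return 2                     # gate_proj / up_proj: gated, less sensitive
    return 8                         # attention, shared expert, embed, lm_head

assert bits_for("layers.0.mlp.experts.down_proj.weight") == 4
assert bits_for("layers.0.mlp.experts.gate_proj.weight") == 2
assert bits_for("layers.0.self_attn.q_proj.weight") == 8
```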
## Benchmarks
| Metric | Value | Setup |
|---|---|---|
| MMLU-200 | 93.5% (187/200) | thinking ON, q_per_subject=20, 10 subjects |
| Median speed | ~37 tok/s | M4 Max 128 GB, MLX 0.31 |
| GPU memory at load | ~75 GB | warm |
MMLU eval used the standard mmlu_jangtq_resume.py runner with the model's
default chat template (`enable_thinking` left unset → thinking ON; the
M2.7 template then auto-opens with `<think>\n` after the assistant prefix).
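
A quick way to confirm that default (a sketch assuming the tokenizer from the Loading section below; the expected tail string follows the card's description of the template):

```python
# With enable_thinking left unset, the M2.7 template opens a think block.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt.endswith("<think>\n"))  # expected: True, per the template behavior above
```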
## Why mixed-bit?
down_proj's output enters the residual stream and accumulates across
62 layers — quantization noise compounds. gate_proj and up_proj
enter through SwiGLU's multiplicative gate (silu(gate) × up) which
dampens noise. Spending 4 bits on down and 2 bits on gate/up gives
quality close to full-4-bit (~115 GB) at 64% the size.
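
As a rough illustration (a toy NumPy sketch with generic shapes, not the model's actual kernels), here is where each projection's quantization error lands in a SwiGLU expert, plus the arithmetic behind the ~3-bit routed average:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_expert(x, W_gate, W_up, W_down):
    # h = silu(gate_proj(x)) * up_proj(x): noise in W_gate / W_up is damped
    # by the multiplicative gate before it reaches the output projection.
    h = silu(x @ W_gate) * (x @ W_up)
    # down_proj's output is added to the residual stream, so its quantization
    # noise is carried forward and summed across all 62 layers.
    return x + h @ W_down

# gate_proj, up_proj and down_proj have equal parameter counts per expert,
# so mixed 2/2/4 bits average (2 + 2 + 4) / 3 ≈ 2.7 bits (the "~3-bit avg
# routed" figure above), and 74 GB / 115 GB ≈ 64% of the full-4-bit size.
```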
## Variants in the MiniMax-M2.7 line

| Variant | Routed bits (avg) | Size | MMLU-200 | Use case |
|---|---|---|---|---|
| MiniMax-M2.7-JANGTQ | 2-bit | 56 GB | 91.5% | smallest, best for tight RAM |
| MiniMax-M2.7-JANGTQ_K (this) | ~3-bit (mixed 2/4) | 74 GB | 93.5% | +2.0pp MMLU vs JANGTQ for +18 GB |
## Loading

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-JANGTQ_K")
```
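
A quick smoke test, assuming the returned model and tokenizer are drop-in compatible with mlx-lm's `generate` (the card installs mlx-lm but doesn't show this step, so treat it as a sketch):

```python
from mlx_lm import generate

# Build a prompt with the model's own chat template (thinking ON by default).
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-bit quantization in one sentence."}],
    add_generation_prompt=True,
    tokenize=False,
)
# Assumes load_jangtq_model's outputs work with mlx_lm.generate.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```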
## Reasoning + tools

- Default: thinking ON (the chat template inserts `<think>\n` after the assistant prefix)
- Disable reasoning:

  ```python
  messages = [{"role": "user", "content": "..."}]
  inp = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, enable_thinking=False
  )
  ```

- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks; see the sketch after this list)
- Tool parser: `minimax`
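
For reference, a minimal regex sketch of what a qwen3-style reasoning parser does with the generated text; illustrative only, not the parser any engine actually ships:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate a <think>...</think> block from the final answer (hypothetical helper)."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text
    reasoning = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>\n2 + 2 = 4\n</think>The answer is 4.")
print(answer)  # The answer is 4.
```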
The chat template ships with the `enable_thinking` switch correctly wired
both as a standalone `chat_template.jinja` and inlined into
`tokenizer_config.json["chat_template"]`, for engines that only read the
inline field (vMLX, Swift swift-transformers).
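
If you want to verify the two copies stay in sync after download, a small check like this works (hypothetical snippet; adjust the path to wherever the repo snapshot lives locally):

```python
import json
from pathlib import Path

repo_dir = Path("MiniMax-M2.7-JANGTQ_K")  # local snapshot of the repo
standalone = (repo_dir / "chat_template.jinja").read_text()
inlined = json.loads((repo_dir / "tokenizer_config.json").read_text())["chat_template"]
print("templates match:", standalone.strip() == inlined.strip())
```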
## Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Base model: MiniMaxAI — M2.7 architecture