
# MiniMax-M2.7-JANGTQ_K
MiniMax M2.7 — 74 GB on disk (down from ~230 GB FP8 source) — mixed-bit JANGTQ_K quantization in JANGTQ-PRESTACK layout.
- Source: MiniMaxAI/MiniMax-M2.7 (62 layers, 256 routed experts top-8, 196K context)
- Quantization: mixed-bit MXTQ on routed experts (see the config sketch after this list):
  - down_proj: 4-bit (output enters the residual stream, more sensitive)
  - gate_proj: 2-bit (gated activation, less sensitive)
  - up_proj: 2-bit (gated activation)
- attention / shared expert / embed / lm_head: 8-bit affine
- norms / router gate / expert_bias: fp16 / fp32 passthrough
- Routed-expert layout: pre-stacked along axis 0 per the JANGTQ-PRESTACK standard — instant cold load, no runtime sidecar.
- Bundle size: ~74 GB on-disk (~3-bit avg routed)
- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
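
For illustration, a minimal sketch of the bit assignment above expressed as a path → bits rule. The helper name and substring matching are hypothetical and not part of jang-tools:

```python
# Hypothetical sketch of the per-tensor bit assignment described above.
# The function name and path-matching rules are illustrative, not the jang-tools API.
def bits_for(path: str) -> int | None:
    """Return quantization bits for a weight path, or None for fp passthrough."""
    if "norm" in path or "router" in path or "expert_bias" in path:
        return None                  # fp16 / fp32 passthrough
    if ".experts." in path:          # routed experts (256 per layer, top-8)
        if "down_proj" in path:
            return 4                 # residual-stream output: more sensitive
        return 2                     # gate_proj / up_proj: gated, less sensitive
    return 8                         # attention, shared expert, embed, lm_head

assert bits_for("layers.0.mlp.experts.down_proj.weight") == 4
assert bits_for("layers.0.mlp.experts.gate_proj.weight") == 2
assert bits_for("layers.0.self_attn.q_proj.weight") == 8
```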
## Benchmarks
| Metric | Value | Setup |
|---|---|---|
| MMLU-200 | 93.5% (187/200) | thinking ON, q_per_subject=20, 10 subjects |
| Median speed | ~37 tok/s | M4 Max 128 GB, MLX 0.31 |
| GPU memory at load | ~75 GB | warm |
MMLU eval used the standard mmlu_jangtq_resume.py runner with the model's
default chat template (`enable_thinking` left unset → thinking ON; the
M2.7 template then auto-opens with `<think>\n` after the assistant prefix).
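
A quick way to confirm that default (a sketch assuming the tokenizer from the Loading section below; the expected tail string follows the card's description of the template):

```python
# With enable_thinking left unset, the M2.7 template opens a think block.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt.endswith("<think>\n"))  # expected: True, per the template behavior above
```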
## Why mixed-bit?
down_proj's output enters the residual stream and accumulates across
62 layers — quantization noise compounds. gate_proj and up_proj
enter through SwiGLU's multiplicative gate (silu(gate) × up) which
dampens noise. Spending 4 bits on down and 2 bits on gate/up gives
quality close to full-4-bit (~115 GB) at 64% the size.
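
As a rough illustration (a toy NumPy sketch with generic shapes, not the model's actual kernels), here is where each projection's quantization error lands in a SwiGLU expert, plus the arithmetic behind the ~3-bit routed average:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_expert(x, W_gate, W_up, W_down):
    # h = silu(gate_proj(x)) * up_proj(x): noise in W_gate / W_up is damped
    # by the multiplicative gate before it reaches the output projection.
    h = silu(x @ W_gate) * (x @ W_up)
    # down_proj's output is added to the residual stream, so its quantization
    # noise is carried forward and summed across all 62 layers.
    return x + h @ W_down

# gate_proj, up_proj and down_proj have equal parameter counts per expert,
# so mixed 2/2/4 bits average (2 + 2 + 4) / 3 ≈ 2.7 bits (the "~3-bit avg
# routed" figure above), and 74 GB / 115 GB ≈ 64% of the full-4-bit size.
```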
## Variants in the MiniMax-M2.7 line

| Variant | Routed bits (avg) | Size | MMLU-200 | Use case |
|---|---|---|---|---|
| MiniMax-M2.7-JANGTQ | 2-bit | 56 GB | 91.5% | smallest, best for tight RAM |
| MiniMax-M2.7-JANGTQ_K (this) | ~3-bit (mixed 2/4) | 74 GB | 93.5% | +2.0pp MMLU vs JANGTQ for +18 GB |
## Loading

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-JANGTQ_K")
```
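
A quick smoke test, assuming the returned model and tokenizer are drop-in compatible with mlx-lm's `generate` (the card installs mlx-lm but doesn't show this step, so treat it as a sketch):

```python
from mlx_lm import generate

# Build a prompt with the model's own chat template (thinking ON by default).
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-bit quantization in one sentence."}],
    add_generation_prompt=True,
    tokenize=False,
)
# Assumes load_jangtq_model's outputs work with mlx_lm.generate.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```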
## Reasoning + tools

- Default: thinking ON (the chat template inserts `<think>\n` after the assistant prefix)
- Disable reasoning:

  ```python
  messages = [{"role": "user", "content": "..."}]
  inp = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, enable_thinking=False
  )
  ```

- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks; see the sketch after this list)
- Tool parser: `minimax`
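
For reference, a minimal regex sketch of what a qwen3-style reasoning parser does with the generated text; illustrative only, not the parser any engine actually ships:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate a <think>...</think> block from the final answer (hypothetical helper)."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text
    reasoning = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>\n2 + 2 = 4\n</think>The answer is 4.")
print(answer)  # The answer is 4.
```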
The chat template ships with the `enable_thinking` switch correctly wired
both as a standalone `chat_template.jinja` and inlined into
`tokenizer_config.json["chat_template"]`, for engines that only read the
inline field (vMLX, Swift swift-transformers).
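
If you want to verify the two copies stay in sync after download, a small check like this works (hypothetical snippet; adjust the path to wherever the repo snapshot lives locally):

```python
import json
from pathlib import Path

repo_dir = Path("MiniMax-M2.7-JANGTQ_K")  # local snapshot of the repo
standalone = (repo_dir / "chat_template.jinja").read_text()
inlined = json.loads((repo_dir / "tokenizer_config.json").read_text())["chat_template"]
print("templates match:", standalone.strip() == inlined.strip())
```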
## Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Base model: MiniMaxAI — M2.7 architecture