---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- MiniMaxAI/MiniMax-M2.5
- MiniMaxAI/MiniMax-M2.7
tags:
- merge
- slerp
- moe
- fp8
- minimax
- minimax_m2
- code
- reasoning
- agents
model_type: minimax_m2
pipeline_tag: text-generation
library_name: transformers
---

# MiniMax-SLURPY
**A mathematically unique blend of [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) and [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) — neither parent, entirely its own model.**
SLURPY inherits M2.5's architect-first coding style and permissive modified-MIT licensing, and absorbs M2.7's RL-tuned precision on multi-agent collaboration and real-world engineering, all without a single training step. It outscores both parents on HumanEval pass@5 (89.6% vs. M2.5's 85.4%).
Every one of SLURPY's 48,239 weight tensors is a mathematically unique blend — not copied from M2.5, not copied from M2.7, belonging entirely to neither parent.
---
## What SLURPY inherits
SLURPY's weights are a forensically driven interpolation of two complementary parents. The merge schedule comes from a full-model scan of all 96,103 tensor pairs, tying each tensor's interpolation ratio to the empirically measured delta between the parents on that tensor.
### From M2.5 — the architect
M2.5 is the foundation-builder: strong on greenfield engineering, deep reasoning, and research-grade benchmarks.
| Benchmark | M2.5 Published |
|---|---|
| SWE-Bench Verified | **80.2%** |
| BrowseComp (with context mgmt) | **76.3%** |
| Multi-SWE-Bench | 51.3% |
| AIME 2025 | 86.3 |
| GPQA Diamond | 85.2 |
| SciCode | 44.4 |
| IFBench | 70.0 |
| HLE (w/o tools) | 19.4 |
| GDPval-MM (office work) | 59.0% avg win rate |
### From M2.7 — the operator
M2.7 is the execution specialist: RL-tuned for multi-step tool use, terminal ops, agentic scaffolding, and production-grade software engineering.
| Benchmark | M2.7 Published |
|---|---|
| SWE-Pro | **56.2%** (matches GPT-5.3-Codex) |
| SWE Multilingual | **76.5%** |
| Multi-SWE-Bench | 52.7% |
| MLE Bench Lite | **66.6%** medal rate (22 ML competitions) |
| VIBE-Pro | **55.6%** (near Opus 4.6) |
| TerminalBench 2 | **57.0%** |
| NL2Repo | 39.8% |
| GDPval-AA ELO | **1495** (highest open-weight) |
| Toolathon | 46.3% accuracy |
| MM Claw (skill compliance) | **97%** across 40+ skills |
| MM Claw (end-to-end) | 62.7% (near Sonnet 4.6) |
### SLURPY — best of both
SLURPY's merge schedule preserves M2.5's deep reasoning character in the early-to-mid layers (where the two models barely differ) while absorbing M2.7's agentic improvements in the late layers (where M2.7's training signal concentrates). The result is a model that carries both parents' strengths without the training cost of either.
---
## Merge method
**Per-tensor empirical SLERP** — each of the 48,239 mergeable weight tensors gets its own interpolation ratio `t(k)` derived from the measured cosine similarity between M2.5 and M2.7 on that specific tensor:
```
delta(k) = 1 - cos(M2.5_k, M2.7_k)
delta_norm(k) = clip(delta(k) / delta_p99, 0, 1)
t(k) = 0.50 + 0.35 * delta_norm(k)
```
- **Tensors that barely changed** (cos ~ 1.0): `t ~ 0.50` — neutral midpoint, preserving both parents
- **Tensors that changed the most** (layer 61 MoE experts): `t = 0.85` — absorbing M2.7's concentrated training signal
- **FP8 weights**: dequantized to BF16 before SLERP, re-quantized with fresh block-wise scales
- **No scale_inv pass-through**: forensics confirmed 0% bit-identical scales between parents — all 47,864 FP8 scale tensors are recomputed, not copied
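The schedule above can be sketched in plain Python (illustrative only, not the production merge pipeline; real tensors are processed shard-by-shard in PyTorch, and `delta_p99` comes from the full-model scan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two flattened weight tensors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def t_for(a, b, delta_p99):
    """Empirical ratio t(k) = 0.50 + 0.35 * delta_norm(k), so t is in [0.50, 0.85]."""
    delta = 1.0 - cosine(a, b)
    delta_norm = min(max(delta / delta_p99, 0.0), 1.0)
    return 0.50 + 0.35 * delta_norm

def slerp(a, b, t, eps=1e-7):
    """Spherical interpolation between two flattened tensors."""
    cos = max(-1.0, min(1.0, cosine(a, b)))
    omega = math.acos(cos)
    if abs(omega) < eps:  # near-parallel tensors: LERP is numerically safer
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]
```

Note that identical tensors yield `t = 0.50` (the neutral midpoint) and pass through unchanged, while a tensor whose delta reaches the p99 clip gets the full `t = 0.85` pull toward M2.7.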
### Forensic highlights
- **99.18%** of tensors sit in a tight cosine cluster around 0.9946 — most weights barely moved between M2.5 and M2.7
- **Layer 61 MoE experts** {76, 74, 61, 30, 43, 138, 226, 126, 58, 159} have deltas 2-5x baseline — this is where M2.7's RL training signal concentrates
- **lm_head.weight** (cos=0.9905, rel_l2=0.139) carries M2.7's vocabulary-level improvements
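The per-tensor metrics quoted above (relative L2 distance and the p99 delta normalizer) can be sketched as follows; tensors are flattened to plain lists here, and nearest-rank percentile is an assumption — the actual scan may use an interpolating percentile:

```python
import math

def rel_l2(a, b):
    """Relative L2 distance ||a - b|| / ||a|| between two flattened tensors."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    den = math.sqrt(sum(x * x for x in a))
    return num / den

def p99(values):
    """99th percentile by nearest rank, used to normalize per-tensor deltas."""
    s = sorted(values)
    idx = min(len(s) - 1, math.ceil(0.99 * len(s)) - 1)
    return s[idx]
```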
---
## Architecture
Identical to MiniMax-M2.5 / M2.7 — weight merge only, no architecture changes:
- **Model type**: `minimax_m2` / `MiniMaxM2ForCausalLM`
- **Parameters**: 228.7B total, ~10B active (MoE)
- **Layers**: 62
- **Hidden size**: 3072
- **MoE**: 256 experts, top-8, sigmoid routing + learned bias
- **Attention**: 48 query / 8 KV heads (GQA 6:1), head_dim=128
- **Quantization**: FP8 (`float8_e4m3fn`), block size [128, 128]
- **Vocab**: 200,064 tokens
- **Context**: up to 196,608 tokens
- **Thinking**: Interleaved `<think>...</think>` blocks (always-on)
- **`trust_remote_code=True` required**
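One practical consequence of the GQA layout: the KV cache stores only the 8 KV heads per layer, not all 48 query heads. A back-of-the-envelope estimate, assuming a BF16 KV cache (vLLM may use a different cache dtype):

```python
layers = 62
kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # BF16 KV cache (assumption)

# K and V each store kv_heads * head_dim values per layer per token
kv_bytes_per_token = layers * kv_heads * head_dim * 2 * bytes_per_elem
full_context_gib = kv_bytes_per_token * 196_608 / 2**30
print(f"{kv_bytes_per_token} B/token, {full_context_gib:.1f} GiB at full context")
# → 253952 B/token, 46.5 GiB at full context
```

At the 6:1 GQA ratio this is a sixth of what a full 48-head cache would cost, which is what makes the 196,608-token context servable at all.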
---
## Serving with vLLM
Recommended command (8x H100 80GB):
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
Ex0bit/MiniMax-SLURPY --trust-remote-code \
--enable-expert-parallel --tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enforce-eager
```
For 4x GPU (no expert parallel):
```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
Ex0bit/MiniMax-SLURPY --trust-remote-code \
--tensor-parallel-size 4 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
```
If you encounter CUDA memory errors, add:
```bash
--compilation-config '{"cudagraph_mode": "PIECEWISE"}'
```
### Recommended sampling parameters
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 40 |
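Against the vLLM server started above, these defaults map onto an OpenAI-compatible request like the following sketch (`top_k` is not part of the OpenAI schema, so it travels in vLLM's `extra_body`; the prompt is illustrative):

```python
# Recommended sampling defaults shaped as kwargs for the `openai` client
# pointed at vLLM (e.g. base_url="http://localhost:8000/v1").
request_kwargs = {
    "model": "Ex0bit/MiniMax-SLURPY",
    "messages": [{"role": "user", "content": "Explain SLERP in one sentence."}],
    "temperature": 1.0,
    "top_p": 0.95,
    # top_k is a vLLM extension; the openai client forwards it via extra_body
    "extra_body": {"top_k": 40},
}
```

Pass these as `client.chat.completions.create(**request_kwargs)`.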
### Important: preserve thinking in conversation history
MiniMax-M2 uses interleaved thinking. The model emits `<think>...</think>` blocks during generation. **You must pass these back verbatim in conversation history.** Removing them degrades performance.
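For example, when assembling the next turn's history by hand (a sketch; the message content is illustrative, the point is that the `<think>` block stays in the assistant turn untouched):

```python
# A previous assistant reply, thinking block included
assistant_reply = (
    "<think>The user wants a one-line answer; recall the SLERP definition.</think>"
    "SLERP interpolates along the great-circle arc between two vectors."
)

history = [
    {"role": "user", "content": "What is SLERP?"},
    # Pass the reply back verbatim — do NOT strip the <think>...</think> block
    {"role": "assistant", "content": assistant_reply},
    {"role": "user", "content": "Now give the formula."},
]
```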
---
## Tool calling
Same format as MiniMax-M2.7. Tool calls are wrapped in `<minimax:tool_call>` XML tags:
```xml
<minimax:tool_call>
<invoke name="get_weather">
<parameter name="location">San Francisco</parameter>
</invoke>
</minimax:tool_call>
```
Enable with `--enable-auto-tool-choice --tool-call-parser minimax_m2` in vLLM.
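If you parse raw completions yourself instead of relying on the vLLM parser, a minimal extraction sketch (the `get_weather` tool name is illustrative; verify the exact tag layout against the shipped `chat_template.jinja`):

```python
import re

def parse_tool_calls(text: str):
    """Extract (tool_name, {param: value}) pairs from M2.7-style XML tool calls.

    Assumes the wrapper format
    <minimax:tool_call><invoke name="..."><parameter name="...">value</parameter>
    ...</invoke></minimax:tool_call>.
    """
    calls = []
    for name, body in re.findall(r'<invoke name="([^"]+)">(.*?)</invoke>', text, re.S):
        params = dict(
            re.findall(r'<parameter name="([^"]+)">(.*?)</parameter>', body, re.S)
        )
        calls.append((name, params))
    return calls
```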
---
## Using with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/MiniMax-SLURPY",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        top_k=40,
    )

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```
---
## Config notes
- `use_mtp` is set to `False` in config.json (MTP tensors don't exist in the checkpoint)
- `quantization_config` is preserved — native FP8
- Chat template and tokenizer are sourced from M2.7
## Files
- 43 safetensors shards (~5 GB each, 214.3 GB total)
- Native FP8 (`float8_e4m3fn`) with block-wise `[128, 128]` scale factors
- `chat_template.jinja` — M2.7's chat template with tool calling support
- `modeling_minimax_m2.py` / `configuration_minimax_m2.py` — custom model code
---
## License
Modified MIT — same as MiniMax-M2.5. See [LICENSE](LICENSE) for full text.
The only modification to the standard MIT license: if the Software (or any derivative works) is used for commercial products or services with more than 100 million monthly active users or more than $30M annual recurring revenue, you must prominently display "MiniMax M2" on the user interface.
---
## Citation
```
@misc{minimax-slurpy-2026,
  title={MiniMax-SLURPY: Per-tensor empirical SLERP merge of MiniMax-M2.5 and M2.7},
  author={Ex0bit},
  year={2026},
  url={https://huggingface.co/Ex0bit/MiniMax-SLURPY}
}
```
## Acknowledgments
- [MiniMax](https://www.minimaxi.com/) for the M2.5 and M2.7 base models
- Merge infrastructure adapted from the PRISM abliteration pipeline