# MiniMax-M2.7-L3H5-DFlash
DFlash speculative-decoding drafter for `cyankiwi/MiniMax-M2.7-AWQ-4bit`.
⚠️ Highly experimental. Trained on a small (~2200-shard) on-policy corpus. Eval m_accept ≈ 1.38: useful for spec-decode infrastructure validation, but below the break-even point on Strix Halo TP=4 (which needs roughly m_accept ≈ 3 to match no-spec throughput). Inference will currently be slower than no-spec on that hardware.
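For intuition on that break-even figure, here is a toy throughput model (an illustration, not a measured profile of this setup): each speculative round yields the accepted draft tokens plus one verified token from the target, and costs some multiple of a plain decode step.

```python
def spec_decode_speedup(m_accept: float, round_cost: float) -> float:
    """Toy model of speculative-decoding throughput.

    Each round yields m_accept accepted draft tokens plus 1 verified
    token from the target, and costs the equivalent of `round_cost`
    plain decode steps (drafting + block verification). round_cost is
    a hypothetical knob here, not a measured number.
    """
    return (m_accept + 1) / round_cost

# The quoted break-even of m_accept ~ 3 implies round_cost ~ 4 on
# Strix Halo TP=4 under this toy model:
print(spec_decode_speedup(3.00, 4.0))  # ~1.0x, break-even
print(spec_decode_speedup(1.38, 4.0))  # ~0.6x, slower than no-spec
```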
## Architecture
- 3 drafter layers, `hidden_size=3072`, 0.38B params (drafter-only; `embed_tokens` + `lm_head` loaded from target at inference)
- target taps: layers `[2, 16, 30, 43, 57]` of MiniMax-M2.7's 62-layer target
- `block_size=16`
- all `full_attention` (target uses no SWA)
- `num_attention_heads=24`, `num_key_value_heads=8` (GQA), `head_dim=128`
- `vocab_size=200064`, `mask_token_id=200063` (the full shape is collected into a sketch below)
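Collected into one place, a hypothetical config sketch built only from the numbers above (field names follow common Hugging Face conventions and are assumptions; the repo's actual `config.json` may differ):

```python
# Hypothetical summary of the drafter's shape; not the actual config.json.
drafter_config = {
    "num_hidden_layers": 3,                    # the "L3" in L3H5
    "hidden_size": 3072,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,                  # GQA
    "head_dim": 128,
    "vocab_size": 200064,
    "mask_token_id": 200063,
    "block_size": 16,                          # tokens drafted per round
    "target_layer_ids": [2, 16, 30, 43, 57],   # the 5 target taps ("H5")
}
```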
## Eval
Greedy-verification proxy on a 246-shard held-out set (1400 blocks), drawn from the same on-policy corpus mix as training (agent-sessions + nemotron + codealpaca).
| step | m_accept | k=1 | k=2 | k=3 | k=4 | val_loss |
|---|---|---|---|---|---|---|
| 26000 | 1.38 | 63.8% | 37.3% | 19.1% | 8.7% | 4.95 |
- `m_accept` = mean length of the leading run of greedy top-1 hits per block (max possible 15)
- `k=N` cumulative = % of blocks where positions 1..N all hit top-1 (both metrics are sketched in code below)
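A minimal sketch of how both metrics fall out of a per-block hit matrix, assuming greedy top-1 comparison against the target (illustrative only; not the repo's eval code):

```python
import numpy as np

def spec_metrics(hits: np.ndarray, max_k: int = 4) -> tuple[float, np.ndarray]:
    """hits: (num_blocks, 15) bool array; True where the drafter's greedy
    top-1 token matches the target's greedy token at that block position."""
    # Length of the leading run of hits per block: index of the first
    # miss, or 15 (all positions) if the whole block was accepted.
    first_miss = np.where(hits.all(axis=1), hits.shape[1], hits.argmin(axis=1))
    m_accept = first_miss.mean()
    # k=N cumulative: fraction of blocks whose first N positions all hit.
    k_cum = np.array([(first_miss >= k).mean() for k in range(1, max_k + 1)])
    return float(m_accept), k_cum
```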
## Use with vLLM
```bash
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 \
  --speculative-config '{"method":"dflash","model":"MirecX/MiniMax-M2.7-L3H5-DFlash","num_speculative_tokens":4}'
```
`num_speculative_tokens=4` is a reasonable choice for this drafter: an m_accept of 1.38 puts the ideal speculative depth at ≈ 1.5–2× that, landing at 3–4. Larger values waste drafter compute on positions that rarely accept (k=4 acceptance is 8.7%, k=8 is < 1%).
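Once the server is up, speculative decoding is transparent to clients; the drafter only changes throughput. A minimal query through vLLM's OpenAI-compatible endpoint, assuming the default port 8000:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is required but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="cyankiwi/MiniMax-M2.7-AWQ-4bit",  # the target model, not the drafter
    messages=[{"role": "user", "content": "Summarize speculative decoding."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```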
## Training recipe (paper-faithful)
- 2211 on-policy training shards (mixed agent_sessions + nemotron + codealpaca prompts; target = MiniMax-M2.7-AWQ-4bit), 246 held-out shards
- 30000 optimizer steps, batch_size=1, grad_accum=2 (effective bs=2)
- `anchors_per_seq=6`, `loss_decay=0.85` (sketched below), uncapped context window
- `block_size=16`, `mask_token_id=200063`
- frozen `embed_tokens` + `lm_head` (loaded from target's bf16 weights)
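A sketch of what the `loss_decay=0.85` weighting plausibly does within each 16-token block: geometrically down-weight later positions, which are drafted with less context and accepted less often. The actual DFlash objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def decayed_block_loss(logits: torch.Tensor, targets: torch.Tensor,
                       loss_decay: float = 0.85) -> torch.Tensor:
    """logits: (block_size, vocab_size); targets: (block_size,).

    Per-position cross-entropy, down-weighted by loss_decay**i so that
    early block positions (the ones that drive m_accept) dominate.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # (block_size,)
    weights = loss_decay ** torch.arange(
        logits.size(0), dtype=ce.dtype, device=ce.device)
    return (weights * ce).sum() / weights.sum()
```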
## Caveats
- This is a relatively early checkpoint compared to z-lab's reference drafters (those use ~800K samples; we use ~2K). Expect substantial gains from continued training on more data.
- Tested only on the calibration distribution. Real-world prompts (long contexts, code, multi-turn) will likely show lower acceptance.
- The 5-tap pattern targets layers spaced uniformly across MiniMax-M2.7's 62-layer body (taps at ~3%, 26%, 50%, 71%, 94%); confirmed that M2.5 and M2.7 share an identical architecture (62 hidden layers, `hidden_size=3072`).
## Companion variants
- `MirecX/MiniMax-M2.7-L5H5-DFlash`: 5-layer (0.60B), slightly higher m_accept at this data scale, ~35% slower per round
- `MirecX/MiniMax-M2.7-L4H6-DFlash`: 4-layer, 6 taps (untrained shell)
Built using the DFlash framework.