
MiniMax-M2.7-L3H5-DFlash

DFlash speculative-decoding drafter for cyankiwi/MiniMax-M2.7-AWQ-4bit.

โš ๏ธ Highly experimental. Trained on a small (~2200-shard) on-policy corpus. Eval m_accept โ‰ˆ 1.38 โ€” useful for spec-decode infrastructure validation, below the break-even point on Strix Halo TP=4 (which needs roughly m_accept โ‰ˆ 3 to match no-spec throughput). Inference will currently be slower than no-spec on that hardware.

Architecture

  • 3 drafter layers, hidden_size=3072, 0.38B params (drafter-only; embed + lm_head loaded from target at inference)
  • target taps: layers [2, 16, 30, 43, 57] of MiniMax-M2.7's 62-layer target
  • block_size=16
  • all full_attention (target uses no SWA)
  • num_attention_heads=24, num_key_value_heads=8 (GQA), head_dim=128
  • vocab_size=200064, mask_token_id=200063
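
For reference, the geometry above collected in one place. This is a sketch; the field names follow common HF-style conventions and are assumptions, not necessarily the shipped config.json:

```python
# Drafter geometry from the list above (field names assumed, values from
# this card).
drafter_config = {
    "num_hidden_layers": 3,
    "hidden_size": 3072,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,   # GQA: 3 query heads per KV head
    "head_dim": 128,
    "vocab_size": 200064,
    "mask_token_id": 200063,
    "block_size": 16,
    "target_layer_taps": [2, 16, 30, 43, 57],  # hidden states read from target
}
```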

Eval

Greedy-verification proxy on a 246-shard held-out set (1400 blocks), drawn from the same on-policy corpus mix as training (agent-sessions + nemotron + codealpaca).

| step  | m_accept | k=1   | k=2   | k=3   | k=4  | val_loss |
|-------|----------|-------|-------|-------|------|----------|
| 26000 | 1.38     | 63.8% | 37.3% | 19.1% | 8.7% | 4.95     |

m_accept = mean leading run of greedy top-1 hits per block (max possible 15). k=N cumulative = % of blocks where positions 1..N all hit top-1.
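
Both metrics reduce to the per-position top-1 hit mask of each block. A minimal sketch of how they are computed (the hits array is an assumed input, not part of this repo):

```python
import numpy as np

# hits[b, p] = 1 if the drafter's greedy token at position p of block b
# matched the target's greedy token (15 drafted positions per 16-token
# block; position 0 is the already-committed token).
def eval_metrics(hits: np.ndarray, ks=(1, 2, 3, 4)):
    # Leading run of 1s per block: first miss index, or full length if none.
    miss = hits == 0
    first_miss = np.where(miss.any(axis=1), miss.argmax(axis=1), hits.shape[1])
    m_accept = first_miss.mean()  # max possible = 15
    # k=N cumulative: fraction of blocks whose first N positions all hit.
    k_cum = {k: (first_miss >= k).mean() for k in ks}
    return m_accept, k_cum
```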

Use with vLLM

```bash
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 \
  --speculative-config '{"method":"dflash","model":"MirecX/MiniMax-M2.7-L3H5-DFlash","num_speculative_tokens":4}'
```

num_speculative_tokens=4 is a reasonable choice for this drafter: m_accept of 1.38 puts the ideal speculative depth at roughly 1.5–2× that, i.e. 3–4. Larger values waste drafter compute on positions that rarely accept (k=4 acceptance is 8.7%, k=8 is < 1%).
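
The diminishing returns are easy to read off the eval table: the expected number of accepted drafts at depth d is the sum of the k=1..d cumulative rates (E[min(run, d)] for the leading-run distribution):

```python
# Expected accepted drafts at depth d = sum of k=1..d cumulative rates,
# using the values from the eval table above.
k_cum = {1: 0.638, 2: 0.373, 3: 0.191, 4: 0.087}

for d in range(1, 5):
    expected = sum(k_cum[n] for n in range(1, d + 1))
    print(f"depth {d}: ~{expected:.2f} accepted drafts per block")
# depth 3 -> ~1.20, depth 4 -> ~1.29: the 4th draft adds <0.09 tokens/step.
```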

Training recipe (paper-faithful)

  • 2211 on-policy training shards (mixed agent_sessions + nemotron + codealpaca prompts; target = MiniMax-M2.7-AWQ-4bit), 246 held-out shards
  • 30000 optimizer steps, batch_size=1, grad_accum=2 (effective bs=2)
  • anchors_per_seq=6, loss_decay=0.85, uncapped context window
  • block_size=16, mask_token_id=200063
  • frozen embed_tokens + lm_head (loaded from target's bf16 weights)
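
loss_decay is interpreted here as a per-position weight inside each drafted block; the sketch below is an assumption about the training objective, not a verbatim excerpt of the training code:

```python
import torch
import torch.nn.functional as F

def block_loss(logits, targets, loss_decay=0.85):
    """Position-weighted CE over one drafted block (assumed objective).

    logits:  (block_size-1, vocab) drafter predictions for positions 1..15
    targets: (block_size-1,) target-model greedy tokens for those positions
    Later positions accept less often, so their loss is down-weighted by
    loss_decay**position (assumed reading of loss_decay=0.85).
    """
    ce = F.cross_entropy(logits, targets, reduction="none")
    weights = loss_decay ** torch.arange(ce.shape[0], dtype=ce.dtype)
    return (weights * ce).sum() / weights.sum()
```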

Caveats

  • This is a relatively early checkpoint compared to z-lab's reference drafters (those use ~800K samples; we use ~2K). Expect substantial gains from continued training on more data.
  • Tested only on the calibration distribution. Real-world prompts (long contexts, code, multi-turn) will likely show lower acceptance.
  • The 5-tap pattern targets layers spaced uniformly across MiniMax-M2.7's 60-layer body (taps at ~3%, 26%, 50%, 71%, 94% depth); confirmed against M2.5/M2.7, which share an identical architecture (62 hidden layers, hidden_size=3072).

Companion variants

Built using the DFlash framework.
