k-l-lambda/kimi-k2.7-code-eagle3-mla



Updated README

1458b6c verified 3 days ago


	---
	license: mit
	base_model: moonshotai/Kimi-K2.7-Code
	tags:
	- speculative-decoding
	- eagle3
	- eagle3-mla
	- draft-model
	- vllm
	language:
	- en
	---

	# Kimi-K2.7-Code Eagle3-MLA Draft

	Eagle3-MLA speculative-decoding draft model for Kimi-K2.7-Code, trained natively on
	K2.7-Code data. Pairs with the Kimi-K2.7-Code verifier under vLLM speculative decoding.

	## What this is

	- Algorithm: EAGLE-3 with MLA (multi-head latent attention), single draft decoder layer.
	- Verifier: `Kimi-K2.7-Code` (DeepSeek-V3-class architecture; the architecture is identical
	across K2.5, K2.6, and K2.7). The draft reuses the verifier's frozen embedding, `lm_head`, and norm.
	- Training data: real K2.7-Code serving traffic (agentic / coding / tool, oversampled 5x)
	mixed with kimi-mtp prompts re-answered by K2.7-Code.
	- Recipe: `ttt_steps=4`, `ttt_step_loss_decay=1.0`, off-policy tokens, `l2sp_lambda=1e-4`,
	cosine LR 2e-5, `seq_length` 8192, `max_steps` 120000.
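
One way to read the recipe values above is sketched below. This is not the training code. It assumes that `ttt_step_loss_decay` weights the loss of training-time-test step `k` by `decay**k`, that the cosine schedule decays from the peak LR to zero with no warmup, and that `l2sp_lambda` scales an L2-SP style penalty pulling parameters toward their initial values. All function names are hypothetical.

```python
import math

# Values taken from the recipe line above.
TTT_STEPS = 4
TTT_STEP_LOSS_DECAY = 1.0
PEAK_LR = 2e-5
MAX_STEPS = 120_000
L2SP_LAMBDA = 1e-4


def ttt_step_weights(steps=TTT_STEPS, decay=TTT_STEP_LOSS_DECAY):
    """Assumed per-step loss weights: decay**k for step k (hypothetical helper)."""
    return [decay ** k for k in range(steps)]


def cosine_lr(step, peak=PEAK_LR, total=MAX_STEPS):
    """Assumed schedule: cosine decay from `peak` to 0 with no warmup."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


def l2sp_penalty(params, init_params, lam=L2SP_LAMBDA):
    """Assumed L2-SP style term: lam * squared distance from the initial weights."""
    return lam * sum((p - p0) ** 2 for p, p0 in zip(params, init_params))


if __name__ == "__main__":
    print("TTT step weights:", ttt_step_weights())
    print("LR at step 118800:", cosine_lr(118_800))
```

Under these assumptions a decay of 1.0 weights all four draft steps equally, and the learning rate at step 118800 is already close to zero, which would be consistent with stopping slightly before `max_steps`.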

	## Evaluation

	Speculative-decoding evaluation of the final checkpoint against the Kimi-K2.7-Code verifier
	(vLLM 0.20.0, TP=8, `num_speculative_tokens=3`, c=4, greedy decoding). The table reports
	mean accepted-token length:

	| Draft | Real K2.7-Code traffic | K2.6-distribution held-out |
	|---|---|---|
	| This model (final) | 2.345 | 2.246 |
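
To put these numbers in context, the sketch below converts mean accepted-token length into the fraction of proposed draft tokens that were accepted. It assumes the metric counts the one token the verifier always emits per step plus the accepted draft tokens, which is the convention used in vLLM's speculative-decoding metrics. If the metric was computed differently, the derived fractions do not apply. The function name is hypothetical.

```python
NUM_SPECULATIVE_TOKENS = 3  # from the eval setup above


def draft_acceptance_fraction(mean_accept_len, k=NUM_SPECULATIVE_TOKENS):
    """Fraction of the k proposed draft tokens accepted per step.

    Assumes mean_accept_len = 1 + (accepted draft tokens per step).
    """
    return (mean_accept_len - 1.0) / k


if __name__ == "__main__":
    results = {
        "Real K2.7-Code traffic": 2.345,
        "K2.6-distribution held-out": 2.246,
    }
    for name, length in results.items():
        frac = draft_acceptance_fraction(length)
        print(f"{name}: {frac:.3f} of draft tokens accepted")
```

Under that assumption, roughly 45% of draft tokens are accepted on real traffic and roughly 42% on the held-out set, out of a maximum possible accepted length of 4 tokens per step.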

	## Usage (vLLM)

	```bash
	vllm serve /path/to/Kimi-K2.7-Code \
	--tensor-parallel-size 8 \
	--speculative-config '{"model": "k-l-lambda/kimi-k2.7-code-eagle3-mla", "num_speculative_tokens": 3, "method": "eagle3"}'
	```
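
When launching from a script, hand-quoting the JSON in `--speculative-config` is easy to get wrong. The sketch below builds the same command as above with `json.dumps` and `shlex`; it is a convenience wrapper only, and `build_serve_command` is a hypothetical helper, not part of vLLM.

```python
import json
import shlex


def build_serve_command(verifier_path, draft_model,
                        num_speculative_tokens=3, tensor_parallel_size=8):
    """Return the `vllm serve` argv list with a JSON-encoded speculative config."""
    spec = {
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
    }
    return [
        "vllm", "serve", verifier_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--speculative-config", json.dumps(spec),
    ]


cmd = build_serve_command(
    "/path/to/Kimi-K2.7-Code",               # placeholder path from the README
    "k-l-lambda/kimi-k2.7-code-eagle3-mla",
)

if __name__ == "__main__":
    # Prints a shell-safe command line; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```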

	## Checkpoint

	This is the final checkpoint of the K2.7-native run (step 118800; val_loss had plateaued, so
	the run was stopped just short of the 120000-step budget). Among the retained checkpoints it
	is the best by validation full-sequence accept rate, and it is the eval winner on real K2.7
	traffic above.