Speculation head checkpoints

Pre-trained speculation head weights for speculative pipeline decoding. Each .pt file is a single checkpoint produced by training; use it only with the same base model architecture it was trained on, which is recorded in config["base_model_path"] inside the file.

For inference, evaluation, and training examples, see the official repo:
https://github.com/yuyijiong/speculative_pipeline_decoding

Filename format

Files are named:

{model}_s{num_stages}_l{num_spec_layers}.pt

| Part | Meaning |
| --- | --- |
| `{model}` | Base model tag from the training config (e.g. `Qwen3.5-4B`, `Qwen3.5-9B`) |
| `s{...}` | `num_stages`: pipeline depth (number of target-model stages) |
| `l{...}` | `num_spec_layers`: number of Transformer layers in the speculation module |

Example: Qwen3.5-9B_s16_l2.pt → Qwen3.5-9B base, 16 stages, 2 spec layers.
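
The naming scheme can be parsed mechanically. The following is a minimal sketch and not part of the official repo: the helper name `parse_ckpt_name` and the `CkptName` tuple are hypothetical, and the pattern assumes the three parts appear exactly as in the format above.

```python
import re
from pathlib import Path
from typing import NamedTuple


class CkptName(NamedTuple):
    model: str
    num_stages: int
    num_spec_layers: int


# {model}_s{num_stages}_l{num_spec_layers}.pt
# The greedy (.+) keeps any underscores that belong to the model tag itself.
_PATTERN = re.compile(r"^(?P<model>.+)_s(?P<stages>\d+)_l(?P<layers>\d+)\.pt$")


def parse_ckpt_name(path: str) -> CkptName:
    """Split a checkpoint filename into its three documented parts."""
    name = Path(path).name  # drop any directory prefix
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a speculation head checkpoint name: {name!r}")
    return CkptName(match["model"], int(match["stages"]), int(match["layers"]))


print(parse_ckpt_name("/path/to/Qwen3.5-9B_s16_l2.pt"))
# CkptName(model='Qwen3.5-9B', num_stages=16, num_spec_layers=2)
```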

Checkpoint contents

Each file is a PyTorch archive with two top-level keys:

{
    "state_dict": ...,  # weights of the speculation module
    "config": { ... },  # hyperparameters and metadata
}
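
A file with this layout can be opened with plain `torch.load`. The sketch below is an illustration rather than the repo's own loader: the function names are hypothetical, and `weights_only=True` assumes the checkpoint contains only tensors and plain Python containers, which this page does not confirm.

```python
from typing import Any, Dict, Tuple


def split_checkpoint(ckpt: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (state_dict, config) after checking the two documented keys."""
    missing = {"state_dict", "config"} - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing top-level keys: {sorted(missing)}")
    return ckpt["state_dict"], ckpt["config"]


def load_spec_head(path: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Load a .pt file on CPU and split it into weights and config."""
    import torch  # imported lazily so split_checkpoint works without PyTorch

    # weights_only=True restricts unpickling to tensors and plain containers.
    # ASSUMPTION: the checkpoint stores nothing beyond those types.
    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    return split_checkpoint(ckpt)
```

Calling `load_spec_head("/path/to/Qwen3.5-4B_s4_l2.pt")` would then return the speculation module weights and the `config` dictionary described in the next sections.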

`config` fields (always present)

| Field | Description |
| --- | --- |
| `base_model_path` | Base model path recorded at training time (often a machine-local path; override it at load time, see below) |
| `hidden_size` | Hidden size (matches base model) |
| `vocab_size` | Base model vocabulary size |
| `draft_vocab_size` | Draft head output size (full vocab or draft subset) |
| `num_stages` | Pipeline depth (same as `s` in filename) |
| `num_spec_layers` | Speculation module depth (same as `l` in filename) |
| `version` | Checkpoint format version (`10`) |
| `trained_with_use_deepest` | Whether training used deepest-layer features |
| `shallow_hidden_layer_indices` | Which base layers feed the speculation module |
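
Because `num_stages` and `num_spec_layers` appear both in the filename and in `config`, a loaded checkpoint can be sanity-checked before use. This is a minimal sketch under stated assumptions: `check_config` is a hypothetical helper, and the `draft_vocab_size <= vocab_size` check is inferred from the "full vocab or draft subset" description rather than taken from the repo.

```python
from typing import Any, Dict, List

# Field names copied from the "always present" table.
REQUIRED_FIELDS = (
    "base_model_path", "hidden_size", "vocab_size", "draft_vocab_size",
    "num_stages", "num_spec_layers", "version",
    "trained_with_use_deepest", "shallow_hidden_layer_indices",
)


def check_config(config: Dict[str, Any], num_stages: int, num_spec_layers: int) -> List[str]:
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    if problems:
        return problems  # later checks need every field to be present
    if config["num_stages"] != num_stages:
        problems.append("num_stages does not match the s{...} part of the filename")
    if config["num_spec_layers"] != num_spec_layers:
        problems.append("num_spec_layers does not match the l{...} part of the filename")
    # Inferred check: the draft head outputs the full vocab or a subset of it.
    if config["draft_vocab_size"] > config["vocab_size"]:
        problems.append("draft_vocab_size exceeds vocab_size")
    return problems
```

For `Qwen3.5-4B_s4_l2.pt`, the call would be `check_config(config, num_stages=4, num_spec_layers=2)`.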

`config` fields (optional)

| Field | Description |
| --- | --- |
| `spec_init_from_base_layers` | Base layers used to initialize the spec module (if any) |
| `draft_token_ids` | Draft vocabulary token ids (only when trained with a draft vocab subset) |
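
Optional fields should be read with `dict.get` so that checkpoints without them still load. The sketch below also shows one plausible use of `draft_token_ids`. Both helper names are hypothetical, and the mapping rests on an assumption not stated on this page: that `draft_token_ids[i]` is the full-vocabulary token id for draft head output index `i`.

```python
from typing import Any, Dict


def uses_draft_subset(config: Dict[str, Any]) -> bool:
    """True when the optional draft_token_ids field is present."""
    return config.get("draft_token_ids") is not None


def to_full_vocab_id(config: Dict[str, Any], draft_index: int) -> int:
    """Map a draft head output index to a base-model token id.

    ASSUMPTION: draft_token_ids[i] is the full-vocabulary id of draft output i.
    """
    if not uses_draft_subset(config):
        return draft_index  # full-vocab head: indices are already token ids
    ids = config["draft_token_ids"]
    if not 0 <= draft_index < len(ids):
        raise IndexError(f"draft index {draft_index} outside 0..{len(ids) - 1}")
    return int(ids[draft_index])
```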

Loading checkpoints

config["base_model_path"] is often a local path from the training machine (e.g. /share/models/Qwen3.5-4B). On your machine, pass the correct Hugging Face id or local directory via --base_model_path; it overrides the path stored in the checkpoint:

python pipeline_inference.py \
  --spec_head_ckpt /path/to/Qwen3.5-4B_s4_l2.pt \
  --base_model_path Qwen/Qwen3.5-4B

python eval.py \
  --spec_head_ckpt /path/to/Qwen3.5-4B_s4_l2.pt \
  --base_model_path /your/local/Qwen3.5-4B \
  --data_dir eval_data \
  --output_dir ./eval_output

If --base_model_path is omitted, the value from config["base_model_path"] is used as-is.
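
The override rule amounts to a one-line fallback. The sketch below illustrates the documented behavior and is not the repo's actual argument handling: `resolve_base_model_path` and `parse_args` are hypothetical, and only the two flags shown above are modeled.

```python
import argparse
from typing import Any, Dict, Optional, Sequence


def resolve_base_model_path(config: Dict[str, Any], override: Optional[str]) -> str:
    """Prefer the command-line value; fall back to the path stored in the checkpoint."""
    return override if override is not None else config["base_model_path"]


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--spec_head_ckpt", required=True)
    parser.add_argument("--base_model_path", default=None)  # None means "not given"
    return parser.parse_args(argv)


args = parse_args([
    "--spec_head_ckpt", "/path/to/Qwen3.5-4B_s4_l2.pt",
    "--base_model_path", "Qwen/Qwen3.5-4B",
])
config = {"base_model_path": "/share/models/Qwen3.5-4B"}  # as stored at training time
print(resolve_base_model_path(config, args.base_model_path))  # prints Qwen/Qwen3.5-4B
```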

For more usage details, see the speculative_pipeline_decoding repo: https://github.com/yuyijiong/speculative_pipeline_decoding

Citation

If you use this repo, please cite our paper:

@misc{yu2026speculativepipelinedecodinghigheraccruacy,
      title={Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism},
      author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
      year={2026},
      eprint={2605.30852},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.30852}, 
}
