proof-pilot deploy bundle (private)

Model bundle for internal deployment. Every subfolder belongs to the Olmo3SinkForCausalLM family (olmo3_sink, DeepSeek-V4-Flash transplant tokenizer, vocab 129280), and all of them share one tokenizer.

Subfolder	Model	Description
`soft-distill-7b-deploy/`	stage1-v2-7b soft-distill v2 (deploy)	Finished off-policy soft distillation (bf16), with tokenizer + chat template
`soft-distill-32b-deploy/`	stage1-v2-32b soft-distill v2 (deploy)	32B off-policy soft distillation (bf16, rope-legacy deploy config), with tokenizer + chat template
`opd-32b-deploy/`	32B agentic OPD v2: v33 / job 135076 / step_200	On-policy distillation (teacher = DeepSeek-V4-Flash); student lineage = stage1-v2-32b-softdistill-v2test (GQA-8, YaRN). bf16 deploy (rope-legacy config + hybrid-SWA) + tokenizer + chat template. Current primary deployment checkpoint. Scores 4.48/7 on the IMO-ProofBench v2 agentic loop (prove→verify→refine→select); grader = flash, for internal relative comparison only. For the version history, see the OPD 32B version section below.
`opd-32b-v33-s150/`	32B agentic OPD v2: v33 / job 135076 / step_150	An earlier checkpoint from the same healthy run as `opd-32b-deploy` (step_200), usable for s150 vs s200 comparison. Same deploy format (bf16 / rope-legacy / hybrid-SWA) + tokenizer + chat template.
`opd-32b-v33-s200-gptq-w4a16/`	GPTQ-w4a16 quantization of `opd-32b-deploy` (step_200)	int4 weight-only (compressed-tensors, int4 sym group-128 GPTQ) + calibrated fp8 KV scales. 18.74GB (bf16 65GB → int4, 4 shards + index). Calibration matches inference: sink-on + long-ctx (10240) + factor-32 YaRN. serve = sglang `olmo2_sink` + triton (sm120) / fa3 (H200) + `--kv-cache-dtype fp8_e4m3`. See the GPTQ section below.
`opd-32b-v33-s150-gptq-w4a16/`	GPTQ-w4a16 quantization of `opd-32b-v33-s150` (step_150)	Same recipe as the s200 gptq (int4 sym group-128 GPTQ + sink-on calib + long-ctx 10240 + factor-32 YaRN + calibrated fp8 KV scales). 18.74GB (4 shards + index). Served the same way as s200. Usable for comparing the s150 and s200 quantized versions.
`dflash-7b-draft/`	DFlash draft for 7B target	SGLang-deployable draft (speculative decoding)
`dflash-32b-draft/`	DFlash draft for stage1-v2-32b target (s5317)	SGLang-deployable draft; matched to the old 32B deploy target
`dflash-32b-draft-v2test/`	DFlash draft for stage1-v2-32b-softdistill-v2test: phase-1 warm-up (not final)	Curriculum phase-1 short-context warm-up (step-10000 snapshot). SWA512 / block_size 11 / 8L / GQA-8. For deployment, use `dflash-32b-draft-v2test-phaseL` below.
`dflash-32b-draft-v2test-phaseL/`	DFlash draft for stage1-v2-32b-softdistill-v2test: phase-2 final (recommended for deployment)	Curriculum phase-2 long-context specialization (job 140680, warm-started from phase-1): train data = the real long-proof deployment distribution (OPD 32B rollouts, finish_reason=length filtered + dsflash-v2-test teacher proofs, micro 65536), GAMMA 20. Ran to completion at step_3000, acc 0.605 / greedy mean_prefix_len 4.90. SWA512 / block_size 11 / 8L / GQA-8. Serving accept ~3.1–4.1 (single stream, dev H200).
`dflash-32b-draft-v2test-phaseL-int4mlp/`	int4-MLP quantization of the phase-L draft	The MLP (gate/up/down) of `dflash-32b-draft-v2test-phaseL` quantized to compressed-tensors int4 (RTN W4A16 g128); qkv/o + sink/fc/mask_embed stay in bf16 (preserves DFlash fused-KV). 4.82→2.30GB (−55%). Measured under sglang DFLASH deployment: loads as int4 (weight mem 2.16GB), `fused KV materialization ENABLED`, accept 3.1–4.1 (== bf16 draft), single-stream +2~15% tok/s over the bf16 draft. Serving needs the patched `dflash_sink.py` (threads quant_config→MLP) + `--speculative-draft-model-quantization compressed-tensors`. See `docs/quantization.md §13`.
`dflash-32b-draft-v2test-phaseL-int4mlp-gptq/`	int4-MLP GPTQ version of the phase-L draft	Same int4-MLP as the row above, but using GPTQ (full-rank target-hidden Hessian, 26130 rows) in place of RTN, which is strictly more accurate (weighted-err −69% vs RTN). Deployment is identical (compressed-tensors int4, `fused-KV ENABLED`, accept ~3.0–3.8 == RTN/bf16), 2.30GB. Note: draft accept already sits at the bf16 lossless-verify ceiling, so τ is statistically equivalent for RTN and GPTQ; this version is for those who want maximum weight fidelity. See `docs/quantization.md §13`.

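Each DFlash draft contains only config.json + model.safetensors (already resharded into multiple shards) and must be used together with its matching target model. Note that the v2test drafts pair with the v2test target, which is different from the target of the old dflash-32b-draft.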

DFlash 32B v2test curriculum (phase-1 → phase-2)

Drafts for the v2test target follow a two-stage curriculum:

phase-1 (dflash-32b-draft-v2test/): short-context warm-up (DATA=l4-g2-ml4096, micro 8192), a cheap warm-up reaching acc ~0.64. Not a final for deployment, only a curriculum warm-up snapshot.
phase-2 (dflash-32b-draft-v2test-phaseL/): warm-started from phase-1 and specialized for long context on the real long-proof deployment distribution (OPD 32B rollouts + dsflash teacher proofs, micro 65536). This is the draft to deploy against the OPD/soft-distill 32B targets.

OPD 32B version notes (which is which)

The agentic semi-on-policy OPD 32B (student = stage1-v2-32b-softdistill-v2test, teacher = DeepSeek-V4-Flash) was trained twice in total:

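V32 = job 134244: collapsed around step ~148–158 from length self-amplification (eos 90%→13%, cap-hit→87%) and was then stopped; its last checkpoint is step_150. The weights from this run are not included in this bundle.
V33 = job 135076: added a cap-hit admission filter (+ fast sharded save + topology rebalance), ran healthily to step 237, and was then stopped by the user, with step_150 and step_200 saved. Both opd-32b-deploy (= step_200) and opd-32b-v33-s150 (= step_150) in this bundle come from this run.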
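The eos and cap-hit percentages quoted for the V32 collapse are per-batch rates over rollout finish reasons. The card does not describe how the V33 cap-hit admission filter is implemented, so the following is only a minimal sketch of the idea under stated assumptions: the `"stop"` / `"length"` finish-reason labels, the rollout dict layout, and the `admit` helper are all hypothetical, and the real filter may admit or reject rollouts by a different rule.

```python
from collections import Counter

def finish_rates(finish_reasons):
    """Return (eos_rate, cap_hit_rate) for a batch of rollout finish reasons."""
    total = len(finish_reasons)
    if total == 0:
        return 0.0, 0.0
    counts = Counter(finish_reasons)
    # Assumption: "stop" marks a natural EOS, "length" marks a token-cap hit.
    return counts["stop"] / total, counts["length"] / total

def admit(rollouts):
    """Hypothetical admission filter: keep only rollouts that did not hit the cap."""
    return [r for r in rollouts if r["finish_reason"] != "length"]

# Synthetic batch: every fourth rollout runs into the token cap.
batch = [{"id": i, "finish_reason": "length" if i % 4 == 0 else "stop"} for i in range(8)]
eos_rate, cap_hit_rate = finish_rates([r["finish_reason"] for r in batch])
admitted = admit(batch)
print(f"eos={eos_rate:.2f} cap_hit={cap_hit_rate:.2f} admitted={len(admitted)}/{len(batch)}")
```

Tracking both rates per step makes a length self-amplification trend visible early, since the eos rate falls as the cap-hit rate rises.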

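⚠️ An earlier version of this card wrongly attached "stopped at step ~158" to step_200. That was the V32 (134244) collapse and has nothing to do with this bundle's step_200 (from V33/135076); this has been corrected.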

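⚠️ V33's cap-hit filter mitigates length self-amplification on the training side so that the run does not collapse. The distilled student still carries the looping tendency of OPD reverse-KL at inference time (for example, thinking in the refine stage can fall into a repetitive attractor and run all the way to the token cap). This is a general property of OPD reverse-KL and is unrelated to which step the run stopped at.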

How the deploy weights are produced (same recipe for s150 / s200)

# step_NNN training checkpoint (DCP + consolidated hf/) → serve-ready deploy dir
python deploy/make_olmo3sink_deploy.py \
  --src training/opd_v2/runs/agentic_32b_lc140k_v33/checkpoints/step_000NNN/hf \
  --dst outputs/agentic_32b_lc140k_v33-sNNN-deploy
python deploy_kaggle/enable_swa_config.py outputs/agentic_32b_lc140k_v33-sNNN-deploy
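
Since s150 and s200 share the recipe, the two commands can be generated per step. The sketch below only builds and prints the command lines; it does not run them. It assumes that `step_000NNN` means the step number zero-padded to six digits, and the `deploy_commands` helper is a hypothetical name introduced here for illustration.

```python
import shlex

RUN_DIR = "training/opd_v2/runs/agentic_32b_lc140k_v33"

def deploy_commands(step):
    """Build the two deploy commands for one checkpoint step (not executed here)."""
    # Assumption: checkpoint directories are zero-padded to six digits (step_000NNN).
    src = f"{RUN_DIR}/checkpoints/step_{step:06d}/hf"
    dst = f"outputs/agentic_32b_lc140k_v33-s{step}-deploy"
    return [
        ["python", "deploy/make_olmo3sink_deploy.py", "--src", src, "--dst", dst],
        ["python", "deploy_kaggle/enable_swa_config.py", dst],
    ]

for step in (150, 200):
    for cmd in deploy_commands(step):
        print(shlex.join(cmd))
```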

GPTQ quantized version (`opd-32b-v33-s200-gptq-w4a16`)

The 4-bit deployment version of opd-32b-deploy (step_200 bf16), intended for VRAM-constrained environments such as the Kaggle RTX 6000 Pro (sm120). 18.74GB (bf16 65GB → int4, 4 shards + index). It is also uploaded as the Kaggle private model threerabbits/opd-32b-v33-s200-gptq-w4a16.

Quantization recipe (deliberately matched to the serving distribution to remove calib/infer mismatch): llm-compressor GPTQ, scheme W4A16 (int4 / symmetric / group_size 128 / Hessian error compensation), with lm_head/embed/norm/sink kept in bf16. Calibration:

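sink-on: the gpt-oss attention sink is included in the eager calib forward (= the computation used on the sglang serving side; the trained sink logit mean is ~+6.7, which is not negligible, so calibration must include the sink)
long-context seqlen 10240: > YaRN original_max 8192 and > sliding window 4096 → covers the long-range / high-position post-YaRN K distribution
factor-32 YaRN rope (taken unchanged from the deploy config)
calib data = L4 training bins (same 129280 transplant vocab), n=64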
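
To make the W4A16 storage format concrete, here is a minimal pure-Python sketch of symmetric int4 quantization with one scale per group of 128 weights. It uses plain round-to-nearest, not GPTQ: the Hessian error compensation that the recipe relies on is deliberately left out. Mapping the group maximum to +7 is an assumption made for illustration and is not necessarily the exact scale formula llm-compressor uses; the function names are hypothetical.

```python
def quantize_group_sym_int4(weights, group_size=128):
    """Round-to-nearest symmetric int4, one scale per group (no Hessian compensation)."""
    qmax = 7  # assumption: map each group's max |w| to +7; signed int4 spans [-8, 7]
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize(codes, scales, group_size=128):
    """Recover approximate weights: each code times the scale of its group."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]

# Synthetic weight row in [-1, 1]; two groups of 128.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
codes, scales = quantize_group_sym_int4(row)
restored = dequantize(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(scales)} max_abs_err={max_err:.4f}")
```

With round-to-nearest, the per-weight error is bounded by half a scale step. GPTQ keeps the same storage format but chooses the codes so that the layer's output error on the calibration data is reduced.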

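KV cache: ships calibrated fp8 per-tensor static k_scale / v_scale (kv_cache_scheme is written into the config), replacing the uncalibrated scale=1.0. Measured k_scale 0.032–0.169 (mean 0.061), v_scale 0.0069–0.436 (mean 0.161; V in the deep layers is markedly larger than in the shallow layers).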
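
A per-tensor static scale can be read as "largest magnitude seen in calibration divided by the fp8 maximum". The sketch below illustrates that relationship and what happens to a value beyond the calibrated range. It is a simplified model under stated assumptions: it uses 448 (the largest finite e4m3 value) as the range limit, ignores fp8 mantissa rounding, and uses synthetic activations rather than the measured scales quoted above. The function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3 format

def calibrate_scale(samples):
    """Per-tensor static scale from the largest magnitude seen during calibration."""
    return max(abs(x) for x in samples) / FP8_E4M3_MAX

def fake_quant(x, scale):
    """Scale, clamp to the fp8 range, and rescale (mantissa rounding is ignored)."""
    q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return q * scale

calib_k = [-20.0, 3.5, 27.0, -11.25]  # synthetic K activations, not measured values
k_scale = calibrate_scale(calib_k)
in_range = fake_quant(10.0, k_scale)
clipped = fake_quant(40.0, k_scale)   # exceeds the calibrated amax, so it is clamped
print(f"k_scale={k_scale:.4f} in_range={in_range:.2f} clipped={clipped:.2f}")
```

A value inside the calibrated range survives the round trip, while a value above the calibrated amax is silently clamped to it instead of raising an error.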

Quality (sink-on, serving regime, teacher-forced @8192, vs bf16 reference): ppl **+0.13%**, top1_agree 0.975, KL(bf16‖q) 0.011; the weights are nearly lossless.
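
For reference, the two agreement metrics can be computed from paired logits at each token position: top1_agree is the fraction of positions where both models pick the same argmax, and KL(bf16‖q) is the mean KL divergence from the reference distribution to the quantized one. This is a minimal sketch on toy logits, assuming a mean over positions; it is not the evaluation script behind the numbers above, and `compare` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (top1_agree, mean KL(ref || quant)) over token positions."""
    agree, kl = 0, 0.0
    for ref, quant in zip(ref_logits, quant_logits):
        agree += ref.index(max(ref)) == quant.index(max(quant))
        kl += kl_divergence(softmax(ref), softmax(quant))
    n = len(ref_logits)
    return agree / n, kl / n

# Toy logits for two positions over a three-token vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.2]]
top1_agree, mean_kl = compare(ref, quant)
print(f"top1_agree={top1_agree:.3f} mean_kl={mean_kl:.4f}")
```

Teacher forcing matters here: both models must score the same token sequence, otherwise the positions are not comparable.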

Serving notes:

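sglang, bind-mount deploy/target/olmo2_sink.py (in-kernel sink), --attention-backend triton (the only sink-correct backend on sm120) or fa3 (H200). **--kv-cache-dtype fp8_e4m3** is required for the calibrated k/v_scale to be loaded. sglang auto-detects compressed-tensors from quantization_config in the config, so --quantization does not need to be passed.
⚠️ Calibrated (non-unit) KV scales are mutually exclusive with the DFlash fused-KV ring (the DFlash path disables fused-KV when it sees a non-unit k/v_scale). For DFlash spec-decode, use unit-scale fp8 KV instead.
actorder=static, but the checkpoint has 0 g_idx tensors (the reordering is already baked in, so there is no permutation to apply) → both the marlin W4A16 and humming W4A8 paths are safe.
⚠️ The KV scales were calibrated at calib length 10240. When serving up to 256k, K at very high positions will saturate in fp8 (not raise an error) if it slightly exceeds the calibrated amax.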
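
The flags named in the serving notes can be assembled as follows. This sketch only builds and prints a launch command and does not start a server. It assumes the model is available at a local path of your choosing, it leaves the olmo2_sink.py bind-mount to the container setup, and it omits every flag the card does not mention (parallelism, context length, and so on). The `launch_args` helper and the GPU keys are hypothetical names.

```python
import shlex

# Attention backend per GPU, as stated in the serving notes.
BACKEND_BY_GPU = {"sm120": "triton", "h200": "fa3"}

def launch_args(model_path, gpu):
    """Build an sglang launch command for the GPTQ checkpoint (not executed here)."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--attention-backend", BACKEND_BY_GPU[gpu],
        "--kv-cache-dtype", "fp8_e4m3",  # needed to load the calibrated k/v_scale
        # No --quantization flag: compressed-tensors is detected from the config.
    ]

print(shlex.join(launch_args("opd-32b-v33-s200-gptq-w4a16", "sm120")))
```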
