# DFlash-LoRA Evaluation: Accepted Length & Accuracy

End-to-end procedure: launch an SGLang server **bound to the machine's intranet IP `10.1.1.72`**, then measure the **accepted length** and **accuracy** of the trained `qwen3-8b-sft-32gpu` checkpoint on three benchmarks: **HumanEval, MT-Bench, and GSM8K**.

---

## Basic Information

| Item | Path / Value |
|---|---|
| conda environment | `sglang` |
| Base model (target) | `/workspace/models/Qwen3-8B` |
| Training output (final ckpt) | `/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-sft-32gpu/epoch_1_step_6000` |
| Merged draft model | `/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-sft-32gpu-merged` |
| Benchmark script directory | `/workspace/hanrui/syxin_old/Specforge/benchmarks/` |
| Local datasets | `/workspace/hanrui/datasets/{humaneval,mtbench,gsm8k}` |
| Results directory | `/workspace/hanrui/syxin_old/Specforge/benchmarks/results/` |
| **Intranet IP** | **`10.1.1.72`** (confirm with `hostname -I`) |
| GPU | 8 × H100 80GB |

---
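
## Step 1: Merge the LoRA Weights

DFlash-LoRA training saves only the adapter weights, but SGLang's STANDALONE speculative decoding needs a **complete, standalone model** as the draft model, so merge the adapter into the base model first.

```bash
conda activate sglang
python3 /workspace/hanrui/syxin_old/merge_lora.py
```

> Takes about 3–5 minutes and uses ≈ 16 GB of CPU memory. The merge is skipped automatically if the merged model already exists.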
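
Before pointing the server at the merged directory, it can help to confirm that it really holds a full model and not an adapter. The sketch below is a hypothetical helper (`classify_model_dir` is not part of `merge_lora.py`). It assumes PEFT's usual layout, where an adapter directory contains `adapter_config.json`, and a merged model saved with `save_pretrained` contains `config.json` plus weight files.

```python
from pathlib import Path


def classify_model_dir(path):
    """Return 'adapter', 'full', or 'unknown' based on the files in `path`."""
    p = Path(path)
    if not p.is_dir():
        return "unknown"
    # PEFT adapter checkpoints carry an adapter_config.json file.
    if (p / "adapter_config.json").exists():
        return "adapter"
    has_config = (p / "config.json").exists()
    # Merged weights are saved as one or more .safetensors (or .bin) files.
    has_weights = any(p.glob("*.safetensors")) or any(p.glob("*.bin"))
    if has_config and has_weights:
        return "full"
    return "unknown"
```

For the draft model path in this guide, the expected answer is `"full"`; `"adapter"` would mean the merge step has not been applied to that directory.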

---

## Step 2: Launch the SGLang Server (Intranet + STANDALONE Speculative Decoding)

**Open a new terminal (terminal A)** and run the commands below. The server keeps running in the foreground; do not close this terminal.

```bash
conda activate sglang
bash /workspace/hanrui/syxin_old/start_server.sh 8
```

> The default is tp=8, which uses all 8 H100s. For tp=4, run `start_server.sh 4` instead.
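
### Parameters

| Parameter | Description |
|---|---|
| `--host 10.1.1.72` | **Must bind to the intranet IP**; do not use `127.0.0.1` or `0.0.0.0` |
| `--speculative-algorithm STANDALONE` | Uses a standalone draft model for speculative decoding; required for measuring accepted length |
| `--speculative-draft-model-path` | The merged DFlash-LoRA model (draft); it shares the same GPUs as the target |
| `--speculative-num-steps 4` | The draft model proposes 4 candidate tokens per round (tunable, 3–8) |
| `--speculative-eagle-topk 1` | Keeps only the single most probable candidate at each step (greedy), so the accepted length metric is accurate |
| `--speculative-num-draft-tokens 4` | Verifies 4 draft tokens per round |
| `--tp-size 8` | Tensor-parallel degree, set by the `start_server.sh` argument (`8` in the command above; the one-click script uses `4`); target and draft share the same GPUs |
| `--mem-fraction-static 0.80` | Fraction of each GPU's memory reserved for static allocation (model weights and the KV cache pool) |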
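
For intuition about how the number of draft steps relates to accepted length, here is a small sketch under a simplifying assumption: every draft token is accepted independently with the same probability `alpha`. This is a textbook model of chain-style speculative decoding, not how SGLang computes `accept_length`; the function name and `alpha` are illustrative only.

```python
def expected_accept_length(alpha, k):
    """Expected tokens emitted per verify step under i.i.d. acceptance.

    alpha: probability that a single draft token is accepted (assumed constant).
    k: number of draft tokens proposed per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft token is accepted, plus the token from the target itself.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha**2 + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 0.8):
        print(alpha, expected_accept_length(alpha, 4))
```

In this model the value is 1.0 when no draft token is ever accepted and grows with `alpha`, with diminishing returns from extra draft steps when `alpha` is low.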

### Verify the Server Is Ready (Terminal B)

```bash
curl http://10.1.1.72:30000/v1/models
```

A JSON response containing the model name means the server is ready; continue to Step 3.
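
The same readiness check can be scripted. The sketch below polls an HTTP endpoint until it answers with status 200 or the attempts run out. `server_is_up` and `wait_until_ready` are hypothetical helper names; the defaults of 24 attempts with 5 seconds between them mirror the 120-second wait used elsewhere in this guide.

```python
import time
import urllib.request


def server_is_up(url, timeout_s=3.0):
    """Return True if `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and timeouts all derive from OSError
        return False


def wait_until_ready(probe, attempts=24, interval_s=5.0, sleep=time.sleep):
    """Call `probe()` up to `attempts` times; return True on the first success."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)  # no sleep after the final failed attempt
    return False
```

For this setup the probe would be `lambda: server_is_up("http://10.1.1.72:30000/v1/models")`. Passing the probe as a callable keeps the retry logic testable without a running server.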

---

## Step 3: Run the Benchmarks

**Run these in terminal B** (keep the server in terminal A running).

### Run All Three Benchmarks at Once (Recommended)

```bash
conda activate sglang
bash /workspace/hanrui/syxin_old/run_bench.sh
```

### Run a Single Benchmark

```bash
conda activate sglang
bash /workspace/hanrui/syxin_old/run_bench.sh humaneval          # HumanEval only
bash /workspace/hanrui/syxin_old/run_bench.sh mtbench            # MT-Bench only
bash /workspace/hanrui/syxin_old/run_bench.sh gsm8k              # GSM8K only
bash /workspace/hanrui/syxin_old/run_bench.sh humaneval gsm8k    # any combination
```

Result logs and jsonl files are saved to `/workspace/hanrui/syxin_old/Specforge/benchmarks/results/`.

---
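
## Step 4 (Optional): Compare Against the Baseline (Original Qwen3-8B, No LoRA)

Stop the Step 2 server and start a simpler baseline server without speculative decoding. Because `accept_length` is always 1.0 in this setup, the baseline is the reference for accuracy and throughput, not for accepted length:

```bash
# Terminal A: start the baseline server (no speculative decoding)
conda activate sglang

python3 -m sglang.launch_server \
    --model-path /workspace/models/Qwen3-8B \
    --tp-size 4 \
    --mem-fraction-static 0.85 \
    --trust-remote-code \
    --host 10.1.1.72 \
    --port 30000 \
    --dtype bfloat16
```

```bash
# Terminal B: run the baseline benchmarks
conda activate sglang
cd /workspace/hanrui/syxin_old/Specforge/benchmarks

# Same values as in the one-click script
BASE_MODEL=/workspace/models/Qwen3-8B
INTRANET_IP=10.1.1.72
PORT=30000
RESULT_DIR=/workspace/hanrui/syxin_old/Specforge/benchmarks/results

python3 bench_eagle3.py \
    --model-path "$BASE_MODEL" \
    --host "$INTRANET_IP" \
    --port "$PORT" \
    --config-list "1,0,0,0" \
    --benchmark-list "humaneval:164" "mtbench:80" "gsm8k:1319" \
    --output-dir "$RESULT_DIR" \
    --name baseline_qwen3_8b \
    --skip-launch-server
```

> `"1,0,0,0"` means batch size 1 with no speculative decoding (steps=0), so `accept_length` is fixed at 1.0.
> Use this run to check whether accuracy has dropped as a result of LoRA training.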
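
Once both runs exist, the comparison can be reduced to two numbers per benchmark. The sketch below assumes each run provides a metrics dictionary with the `output_throughput` and `accuracy` fields used in this guide's result files. `compare_runs` is a hypothetical helper, and the numbers in the usage lines are placeholders, not measured results.

```python
def compare_runs(spec, base):
    """Compare a speculative-decoding metrics dict against a baseline one."""
    speedup = spec["output_throughput"] / base["output_throughput"]
    if spec.get("accuracy") is None or base.get("accuracy") is None:
        acc_delta = None  # e.g. MT-Bench reports no accuracy
    else:
        acc_delta = spec["accuracy"] - base["accuracy"]
    return {"throughput_speedup": speedup, "accuracy_delta": acc_delta}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    spec = {"output_throughput": 300.0, "accuracy": 0.75}
    base = {"output_throughput": 150.0, "accuracy": 0.75}
    print(compare_runs(spec, base))
```

A speedup above 1.0 with an accuracy delta near zero is the outcome to look for.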

---

## Result Files

Results are saved under `$RESULT_DIR/`. Example file name:

```
dflash_lora_all_results_20260307_123456.jsonl
```

Key fields:

```json
{
  "humaneval": [{
    "batch_size": 1, "steps": 4, "topk": 1, "num_draft_tokens": 4,
    "metrics": [{
      "latency": 45.2,
      "output_throughput": 312.5,
      "accept_length": 2.73,
      "accuracy": 0.756,
      "num_questions": 164
    }]
  }],
  "mtbench": [ ... ],
  "gsm8k": [ ... ]
}
```

| Field | Meaning |
|---|---|
| `accept_length` | Average number of tokens accepted per verify step. Higher is better: `> 1.0` means the draft model is helping, while `1.0` means speculation has no effect |
| `accuracy` | HumanEval: pass@1; GSM8K: numeric-answer accuracy; MT-Bench: `null` |
| `output_throughput` | Tokens/s (including the speculative speedup) |
| `latency` | Total time for the whole benchmark, in seconds |
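
To pull the two headline metrics out of a results file, a short script is usually enough. The sketch below assumes the record layout shown in this section and that the `.jsonl` file holds one JSON object per non-empty line. `summarize` and `summarize_file` are hypothetical names, not part of the benchmark scripts.

```python
import json


def summarize(record):
    """Flatten one result record into (benchmark, accept_length, accuracy) rows."""
    rows = []
    for bench, configs in record.items():
        for cfg in configs:
            for m in cfg.get("metrics", []):
                rows.append((bench, m.get("accept_length"), m.get("accuracy")))
    return rows


def summarize_file(path):
    """Summarize every record in a .jsonl file (one JSON object per line)."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                rows.extend(summarize(json.loads(line)))
    return rows
```

Each row can then be printed or loaded into a table; MT-Bench rows carry `None` for accuracy.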

---

## One-Click Script (Merge + Server + Benchmarks + Server Shutdown)

Save the following as `/workspace/hanrui/syxin_old/run_eval.sh`:

```bash
#!/bin/bash
set -eo pipefail

# ===== Configuration =====
INTRANET_IP=10.1.1.72
PORT=30000
BASE_MODEL=/workspace/models/Qwen3-8B
CKPT=epoch_1_step_6000
ADAPTER=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-sft-32gpu/${CKPT}
MERGED=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-sft-32gpu-merged
BENCH_DIR=/workspace/hanrui/syxin_old/Specforge/benchmarks
RESULT_DIR=$BENCH_DIR/results
TP=4
# =========================

# `conda activate` needs conda's shell hook in a non-interactive script
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate sglang
export PYTHONPATH=/workspace/hanrui/syxin_old/Specforge:${PYTHONPATH:-}
mkdir -p "$RESULT_DIR"

# ---- Step 1: merge LoRA ----
if [ ! -d "$MERGED" ]; then
    echo ">>> Merging LoRA ..."
    python3 - <<PYEOF
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os
model = AutoModelForCausalLM.from_pretrained("$BASE_MODEL", torch_dtype=torch.bfloat16, device_map="cpu")
model = PeftModel.from_pretrained(model, "$ADAPTER").merge_and_unload()
os.makedirs("$MERGED", exist_ok=True)
model.save_pretrained("$MERGED", safe_serialization=True)
AutoTokenizer.from_pretrained("$BASE_MODEL").save_pretrained("$MERGED")
print("Merge done.")
PYEOF
else
    echo ">>> Merged model exists, skip merge."
fi

# ---- Step 2: launch server ----
echo ">>> Starting SGLang server on $INTRANET_IP:$PORT ..."
# Log to a file instead of piping through tee: with a pipeline, $! would be
# the PID of tee, and the later kill would leave the server running.
python3 -m sglang.launch_server \
    --model-path "$BASE_MODEL" \
    --speculative-algorithm STANDALONE \
    --speculative-draft-model-path "$MERGED" \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --tp-size "$TP" \
    --mem-fraction-static 0.80 \
    --trust-remote-code \
    --host "$INTRANET_IP" \
    --port "$PORT" \
    --dtype bfloat16 \
    > "$RESULT_DIR/server.log" 2>&1 &
SERVER_PID=$!
# Stop the server even if a later step fails
trap 'kill "$SERVER_PID" 2>/dev/null || true' EXIT

echo ">>> Waiting for server (up to 120s) ..."
READY=0
for i in $(seq 1 24); do
    if curl -sf "http://$INTRANET_IP:$PORT/v1/models" > /dev/null 2>&1; then
        echo ">>> Server ready!"
        READY=1
        break
    fi
    sleep 5
done
if [ "$READY" -ne 1 ]; then
    echo ">>> Server not ready after 120s, see $RESULT_DIR/server.log" >&2
    exit 1
fi

# ---- Step 3: benchmarks ----
cd "$BENCH_DIR"

echo ">>> HumanEval ..."
python3 bench_eagle3.py \
    --model-path "$BASE_MODEL" \
    --speculative-draft-model-path "$MERGED" \
    --host "$INTRANET_IP" --port "$PORT" \
    --config-list "1,4,1,4" \
    --benchmark-list "humaneval:164" \
    --output-dir "$RESULT_DIR" --name "${CKPT}_humaneval" \
    --skip-launch-server 2>&1 | tee "$RESULT_DIR/humaneval.log"

echo ">>> MT-Bench ..."
python3 bench_eagle3.py \
    --model-path "$BASE_MODEL" \
    --speculative-draft-model-path "$MERGED" \
    --host "$INTRANET_IP" --port "$PORT" \
    --config-list "1,4,1,4" \
    --benchmark-list "mtbench:80" \
    --output-dir "$RESULT_DIR" --name "${CKPT}_mtbench" \
    --skip-launch-server 2>&1 | tee "$RESULT_DIR/mtbench.log"

echo ">>> GSM8K ..."
python3 bench_eagle3.py \
    --model-path "$BASE_MODEL" \
    --speculative-draft-model-path "$MERGED" \
    --host "$INTRANET_IP" --port "$PORT" \
    --config-list "1,4,1,4" \
    --benchmark-list "gsm8k:1319" \
    --output-dir "$RESULT_DIR" --name "${CKPT}_gsm8k" \
    --skip-launch-server 2>&1 | tee "$RESULT_DIR/gsm8k.log"

# ---- Step 4: shutdown ----
echo ">>> Shutting down server (PID $SERVER_PID) ..."
kill "$SERVER_PID" 2>/dev/null || true
wait "$SERVER_PID" 2>/dev/null || true
echo ">>> All done. Results in $RESULT_DIR"
ls -lh "$RESULT_DIR"/*.jsonl 2>/dev/null || true
```

Run it:

```bash
chmod +x /workspace/hanrui/syxin_old/run_eval.sh
bash /workspace/hanrui/syxin_old/run_eval.sh 2>&1 | tee /workspace/hanrui/syxin_old/eval.log
```

---

## FAQ

### Q1: `accept_length` is always 1.0

Speculative decoding is not enabled on the server. Confirm that the server was launched with `--speculative-algorithm STANDALONE` and that `--speculative-draft-model-path` points to the **merged full model** (not the adapter directory).

### Q2: Connection refused / connection timeout

- Confirm the server's `--host` is `10.1.1.72` (not `127.0.0.1` or `0.0.0.0`)
- Confirm the `--host` in the benchmark command is also `10.1.1.72`
- `bench_eagle3.py` has been fixed to use `base_url = f"http://{args.host}:{args.port}"` (it previously hard-coded `localhost`)

### Q3: Dataset download fails (no external network access)

All three benchmarkers have been changed to read local files first:

| Benchmark | Local file |
|---|---|
| GSM8K | `/workspace/hanrui/datasets/gsm8k/test.jsonl` |
| MT-Bench | `/workspace/hanrui/datasets/mtbench/question.jsonl` |
| HumanEval | `/workspace/hanrui/datasets/humaneval/test.jsonl` |
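
The local-first pattern can be sketched as follows. This is not the benchmarkers' actual code: `load_jsonl` and `load_questions` are hypothetical names, and `download` stands in for whatever remote loader a benchmarker would otherwise call.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_questions(local_path, download):
    """Prefer `local_path`; call `download()` only if the file is missing."""
    p = Path(local_path)
    if p.is_file():
        return load_jsonl(p)
    return download()
```

With the files from the table in place, `download` is never called, so the benchmark runs without external network access.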
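
### Q4: OOM

- Lower `--mem-fraction-static` (try `0.70`)
- Raise `--tp-size` if GPUs are free (for example `4` → `8`) so that each GPU holds a smaller shard of the weights; lowering it (for example to `2`) increases per-GPU memory use and is also slower
- Lower `--speculative-num-steps` (try `3`)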
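
A back-of-the-envelope budget helps when choosing these values. The sketch below is a rough estimate only. It assumes bf16 weights (2 bytes per parameter), about 8 billion parameters each for the target and the draft, weights split evenly across tensor-parallel ranks, and both models counting against the static pool. It ignores activations, CUDA graphs, and framework overhead, and `per_gpu_budget_gb` is a hypothetical helper.

```python
def per_gpu_budget_gb(gpu_gb, mem_fraction_static, tp_size, model_params_b, bytes_per_param=2):
    """Rough per-GPU split of the static memory pool, in GB.

    model_params_b: parameter counts in billions, one entry per model on the GPUs.
    """
    static_pool = gpu_gb * mem_fraction_static
    # 1 billion parameters at N bytes each is about N GB, sharded across tp_size GPUs.
    weights = sum(n * bytes_per_param for n in model_params_b) / tp_size
    return {
        "static_pool": static_pool,
        "weights": weights,
        "kv_cache": static_pool - weights,
    }


if __name__ == "__main__":
    # Target + draft, both assumed to be ~8B parameters, on 80 GB GPUs.
    for tp in (2, 4, 8):
        print(tp, per_gpu_budget_gb(80, 0.80, tp, [8, 8]))
```

Under these assumptions, per-GPU weight memory scales inversely with `--tp-size`, while `--mem-fraction-static` sets how much of the remaining memory is left outside the static pool.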

### Q5: How to evaluate another checkpoint

Change the `CKPT` variable and merge again, saving the result to a different directory:

```bash
CKPT=epoch_2_step_15000
ADAPTER=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-sft-32gpu/${CKPT}
MERGED=/workspace/hanrui/syxin_old/Specforge/outputs/qwen3-8b-sft-32gpu-merged-${CKPT}
# Merge again, then restart the server
```

---

*Intranet IP: `10.1.1.72` | Base model: `/workspace/models/Qwen3-8B` | Final ckpt: `epoch_1_step_6000`*