Now I have a complete picture. Here's the full analysis and proposed plan.
---

Current Logic of train_dflash.py

This script trains a small, standalone DFlash draft model (1–10 transformer layers) that learns to predict token blocks in parallel, guided by hidden states extracted from a frozen target model.

Architecture (two separate models)
1. Target model (frozen, e.g. Qwen3-8B/32B) – runs a full forward pass to produce hidden states at selected layers.
2. Draft model (DFlashDraftModel, ~1–10 layers) – a lightweight Qwen3-based decoder that takes noise embeddings + target hidden states and predicts the block tokens.
3. Target embed_tokens + lm_head – loaded separately via TargetEmbeddingsAndHead to avoid duplicating the full target model in memory.
Key locations

| Component                | File                               | Lines                                                 |
|--------------------------|------------------------------------|-------------------------------------------------------|
| Model init               | scripts/train_dflash.py            | build_models() L254–311                               |
| Target hidden extraction | scripts/train_dflash.py            | L644–647 (target_model.generate_dflash_data)          |
| Forward pass             | specforge/core/dflash.py           | OnlineDFlashModel.forward() L243–332                  |
| Loss calculation         | specforge/core/dflash.py           | _full_lm_loss() L382–417, _chunked_lm_loss() L419–478 |
| Loss mask                | specforge/core/dflash.py           | create_dflash_loss_mask() L481–509                    |
| Draft model architecture | specforge/modeling/draft/dflash.py | DFlashDraftModel L212–266                             |
| DFlash attention         | specforge/modeling/draft/dflash.py | Qwen3DFlashAttention L42–134                          |
Forward pass flow (per training step)

input_ids, attention_mask, loss_mask ← target_model.generate_dflash_data()
                   ↓
      hidden_states (from target layers [1,9,17,25,33])
                   ↓
      OnlineDFlashModel.forward():
        1. Truncate to block boundary
        2. prepare_noise_input(): anchor tokens kept, rest → MASK
        3. embed_tokens(noise_input_ids) → noise_embedding
        4. Build DFlash attention mask (flex_attention or additive)
        5. draft_model(noise_embedding, target_hidden, mask)
        6. lm_head(hidden) → logits
        7. CE loss on non-anchor positions (weighted by loss_mask × decay)
The draft model's custom Qwen3DFlashAttention concatenates [context_hidden, noise_hidden] as KV, with queries only from noise tokens. The attention mask enforces: block tokens see all preceding blocks' context + bidirectional attention within their own block.
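As a rough illustration of that visibility rule, here is a minimal additive-mask sketch, not the repo's implementation; it assumes context positions are assigned to blocks by position // block_size:

import torch

def build_dflash_attention_mask(ctx_len, num_blocks, block_size, dtype=torch.float32):
    # Queries come from the noise tokens only; keys are [context_hidden, noise_hidden].
    # A noise token in block b may attend to context positions of strictly earlier
    # blocks and to every noise token inside its own block (bidirectional).
    noise_len = num_blocks * block_size
    q_block = torch.arange(noise_len) // block_size            # block id per query

    ctx_block = torch.arange(ctx_len) // block_size            # assumed block id per context key
    ctx_visible = q_block.unsqueeze(1) > ctx_block.unsqueeze(0)

    noise_block = torch.arange(noise_len) // block_size
    noise_visible = q_block.unsqueeze(1) == noise_block.unsqueeze(0)

    allowed = torch.cat([ctx_visible, noise_visible], dim=1)   # [noise_len, ctx_len + noise_len]
    mask = torch.full(allowed.shape, torch.finfo(dtype).min, dtype=dtype)
    mask[allowed] = 0.0
    return mask  # broadcastable to [bsz, num_heads, noise_len, ctx_len + noise_len]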
---

What already exists: train_dflash_lora.py

Interestingly, the repo already has a LoRA variant at scripts/train_dflash_lora.py with its own model (DFlashLoRADraftModel) and wrapper (OnlineDFlashLoRAModel). This is exactly the approach you described – Qwen3-8B + LoRA, no separate target model, 1-step diffusion training. The key differences from train_dflash.py:
| Aspect           | train_dflash.py                                                  | train_dflash_lora.py                                    |
|------------------|------------------------------------------------------------------|---------------------------------------------------------|
| Draft model      | Small custom DFlashDraftModel (1–10 layers)                      | Full Qwen3-8B + LoRA adapters                           |
| Target model     | Separate frozen model for hidden state extraction                | None – model uses its own representations              |
| Attention        | Custom Qwen3DFlashAttention (Q from noise, KV from [ctx, noise]) | Standard HF attention with 4D additive DFlash mask      |
| Forward          | draft_model(noise_emb, target_hidden, mask)                      | model(noise_input_ids, 4d_mask, position_ids) → logits  |
| Trainable params | All draft model params                                           | Only LoRA (q/k/v/o_proj), base frozen                   |
| FSDP strategy    | SHARD_GRAD_OP                                                    | FULL_SHARD                                              |
---

Proposed Modification Plan

Since train_dflash_lora.py already implements the core idea, the plan focuses on what's missing or needs improvement to make it a proper "1-step dLLM draft model" for your research:

Phase 1: Validate and extend the existing LoRA pipeline
1. Add MLP modules to the LoRA targets – the current config only targets q_proj, k_proj, v_proj, o_proj. For stronger 1-step diffusion capability, add gate_proj, up_proj, down_proj to lora_target_modules. This gives the model more capacity to learn the non-autoregressive distribution shift (see the config sketch after this phase's items).
2. Add multi-step noise schedule support – currently training is strictly 1-step (all non-anchors → MASK). For a proper diffusion/AR fusion, add an option for a noise schedule in which a fraction of the block tokens is revealed (not just the anchor), controlled by a noise_ratio parameter. This would modify prepare_noise_input() in OnlineDFlashLoRAModel (see the sketch after this phase's items):
# Instead of: all non-anchor → MASK
# Allow: randomly keep each non-anchor token with probability (1 - noise_ratio)
3. Add a configurable context_len strategy – currently context_len=0 treats the whole sequence as blocks. Add a --context-ratio arg that dynamically sets context_len as a fraction of the sequence length, so the model learns to condition on varying amounts of AR-decoded prefix.
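For item 1, a minimal sketch of the extended adapter configuration, written as a PEFT LoraConfig for illustration; the actual configs/qwen3-8b-dflash-lora.json schema and the rank/alpha values are assumptions, not taken from the repo:

from peft import LoraConfig

# Hypothetical values; only the target_modules list reflects the proposal above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # current attention-only targets
        "gate_proj", "up_proj", "down_proj",      # proposed MLP additions
    ],
    task_type="CAUSAL_LM",
)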
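For item 2, a minimal sketch of a noise-ratio variant of prepare_noise_input(); the signature and the "anchor = first token of each block" convention are assumptions about the repo's interface:

import torch

def prepare_noise_input(input_ids, block_size, mask_token_id, noise_ratio=1.0):
    # noise_ratio=1.0 reproduces the current 1-step behaviour (mask every
    # non-anchor token); lower values keep each non-anchor token with
    # probability (1 - noise_ratio). Assumes the anchor is the first token
    # of each block.
    noise_input_ids = input_ids.clone()
    positions = torch.arange(input_ids.shape[-1], device=input_ids.device)
    is_anchor = positions % block_size == 0
    drop = torch.rand(input_ids.shape, device=input_ids.device) < noise_ratio
    noise_input_ids[drop & ~is_anchor] = mask_token_id
    return noise_input_ids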
Phase 2: Training logic improvements
4. Add a KL divergence loss – in addition to the CE loss against ground truth, add an optional KL loss against the base model's AR distribution (teacher forcing). This regularizes the LoRA model to stay close to the original Qwen3-8B distribution. Modify OnlineDFlashLoRAModel.forward() (see the sketch after this phase's items):
# Compute base model logits (no_grad, no LoRA) as teacher
# KL(draft_logits || teacher_logits) on block positions
# total_loss = ce_loss + kl_weight * kl_loss
5. Add evaluation with speculative decoding metrics – the current accuracy metric is block-wise acceptance rate. Add an eval loop that actually runs speculative decoding (draft → verify) to measure real speedup, using the LoRA model as the drafter and the base model (with LoRA disabled) as the verifier (see the acceptance-length sketch after this phase's items).
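For item 4, a minimal sketch of the KL term, assuming draft_logits and teacher_logits have already been computed for the same positions and that block_mask marks the non-anchor block positions; the names and the temperature argument are illustrative:

import torch
import torch.nn.functional as F

def kl_to_teacher(draft_logits, teacher_logits, block_mask, temperature=1.0):
    # KL(draft || teacher) averaged over block positions. teacher_logits would be
    # produced under torch.no_grad() with the LoRA adapter disabled (e.g. via
    # peft's disable_adapter() context manager).
    draft_logp = F.log_softmax(draft_logits / temperature, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    kl = (draft_logp.exp() * (draft_logp - teacher_logp)).sum(dim=-1)  # [bsz, seq_len]
    return (kl * block_mask).sum() / block_mask.sum().clamp(min=1)

# total_loss = ce_loss + kl_weight * kl_to_teacher(draft_logits, teacher_logits, block_mask)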
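For item 5, a sketch of the greedy acceptance rule the eval loop could use to score one drafted block; the surrounding draft/verify loop and batching are omitted:

import torch

@torch.no_grad()
def accepted_prefix_length(draft_tokens, verifier_logits):
    # draft_tokens:    [block_size] token ids proposed by the LoRA drafter
    # verifier_logits: [block_size, vocab] base-model (LoRA disabled) logits at
    #                  the same positions; tokens are accepted until the first
    #                  mismatch with the verifier's argmax prediction.
    verifier_tokens = verifier_logits.argmax(dim=-1)
    mismatch = draft_tokens != verifier_tokens
    if not mismatch.any():
        return draft_tokens.numel()
    return int(mismatch.int().argmax().item())

Averaging this over many blocks gives an expected accepted length per draft step, which translates directly into a speculative decoding speedup estimate.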
Phase 3: Integration with train_dflash.py style features
6. Port random anchor sampling – train_dflash.py has --random-anchor (L147–156), which samples diverse anchor positions per sequence. This is missing from the LoRA variant and would improve training diversity. Port _sample_anchor_positions and _build_blocks_from_anchors from OnlineDFlashModel to OnlineDFlashLoRAModel.
7. Port the chunked lm_head loss – the LoRA variant materializes the full [bsz, seq_len, vocab_size] logits tensor. For Qwen3-8B's ~152K vocab, that is roughly 1.2 GB per sample at seq_len=2048 in fp32. Port _chunked_lm_loss from OnlineDFlashModel for memory efficiency (see the sketch after this phase's items).
8. Add tensor parallelism support – the LoRA script currently forces tp_size=1. For Qwen3-8B on multi-GPU setups, add TP support to shard the base model across GPUs while keeping the LoRA parameters on each rank.
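For item 7, a minimal sketch of a chunked lm_head + CE loss (not the repo's _chunked_lm_loss); it uses activation checkpointing per chunk so the full-vocab logits are recomputed in the backward pass rather than kept alive:

import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def chunked_lm_loss(hidden, lm_head, labels, loss_weight, chunk_size=1024):
    # hidden:      [bsz, seq_len, hidden_size]  draft model outputs
    # labels:      [bsz, seq_len]               target token ids
    # loss_weight: [bsz, seq_len]               loss mask × decay weights
    bsz, seq_len, _ = hidden.shape

    def chunk_loss(h, y, w):
        logits = lm_head(h)                                    # [bsz, chunk, vocab]
        ce = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            y.reshape(-1),
            reduction="none",
        ).view(h.shape[0], -1)
        return (ce * w).sum()

    total = hidden.new_zeros(())
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        total = total + checkpoint(
            chunk_loss,
            hidden[:, start:end], labels[:, start:end], loss_weight[:, start:end],
            use_reentrant=False,
        )
    return total / loss_weight.sum().clamp(min=1e-6)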
Minimal starting point

If you want to just get training running immediately, the existing train_dflash_lora.py + configs/qwen3-8b-dflash-lora.json should work out of the box:
torchrun --nproc_per_node=4 scripts/train_dflash_lora.py \
  --model-path Qwen/Qwen3-8B \
  --lora-config configs/qwen3-8b-dflash-lora.json \
  --train-data-path /path/to/data.jsonl \
  --output-dir ./outputs/qwen3-8b-dflash-lora \
  --block-size 16 \
  --gradient-checkpointing \
  --learning-rate 2e-4 \
  --batch-size 1 \
  --max-length 2048 \
  --num-epochs 3

Want me to start implementing any of these phases? I'd suggest starting with Phase 1 (noise schedule + MLP targets), since that's the most impactful for the diffusion/AR fusion research angle.