Upload clean block-causal and packed pilot checkpoints
- README.md +77 -215
- checkpoints/search_step_2000.compare_retrieval.json +127 -0
- checkpoints/search_step_2000.pt +3 -0
- checkpoints_block_d128/search_step_1000.compare_retrieval.json +31 -0
- checkpoints_block_d128/search_step_1000.k_sweep_exact.json +74 -0
- checkpoints_block_d128/search_step_1000.pt +3 -0
- checkpoints_block_d128/search_step_1000.quest_page16.json +72 -0
- checkpoints_block_d128/search_step_200.pt +3 -0
- checkpoints_block_d128/search_step_400.pt +3 -0
- checkpoints_block_d128/search_step_600.pt +3 -0
- checkpoints_block_d128/search_step_800.pt +3 -0
- checkpoints_d64/search_step_200.pt +3 -0
- checkpoints_d64/search_step_400.pt +3 -0
- checkpoints_d64/search_step_600.pt +3 -0
- checkpoints_packed_d128/search_step_1000.compare_retrieval.json +127 -0
- checkpoints_packed_d128/search_step_1000.k_sweep.json +74 -0
- checkpoints_packed_d128/search_step_1000.k_sweep_exact.json +74 -0
- checkpoints_packed_d128/search_step_1000.k_sweep_exact_skip16.json +74 -0
- checkpoints_packed_d128/search_step_1000.pt +3 -0
- checkpoints_packed_d128/search_step_200.pt +3 -0
- checkpoints_packed_d128/search_step_400.pt +3 -0
- checkpoints_packed_d128/search_step_600.pt +3 -0
- checkpoints_packed_d128/search_step_800.pt +3 -0
- checkpoints_packed_d256/search_step_1000.compare_retrieval.json +127 -0
- checkpoints_packed_d256/search_step_1000.pt +3 -0
- checkpoints_packed_d256/search_step_200.pt +3 -0
- checkpoints_packed_d256/search_step_400.pt +3 -0
- checkpoints_packed_d256/search_step_600.pt +3 -0
- checkpoints_packed_d256/search_step_800.pt +3 -0
- checkpoints_packed_d64/search_step_1000.compare_retrieval.json +127 -0
- checkpoints_packed_d64/search_step_1000.pt +3 -0
- checkpoints_packed_d64/search_step_200.pt +3 -0
- checkpoints_packed_d64/search_step_400.pt +3 -0
- checkpoints_packed_d64/search_step_600.pt +3 -0
- checkpoints_packed_d64/search_step_800.pt +3 -0
README.md
CHANGED
@@ -1,225 +1,87 @@

Removed content:

---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- sparse-attention
- ann
library_name: pytorch
---

#

[`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

## Current

| K | Recall@K | PPL_ANN | PPL gap |
|---|---|---|---|
| 16 | 24.9% | 10.71 | +7.51% |
| 32 | 22.8% | 10.41 | +4.51% |
| 64 | 23.1% | 10.20 | +2.42% |
| 128 | 26.0% | 10.04 | +0.82% |
| 256 | 31.6% | 9.88 | **−0.79%** |
| 512 | 40.8% | 9.67 | **−2.89%** |

On this small WikiText slice, K ≥ 256 produced lower measured PPL than the full-attention reference. Sparse-softmax denoising is one plausible explanation, but with only 12 eval batches, sample noise, packed-boundary artifacts (the pilot trained with packing on; the repo default is now off), and partial-layer substitution acting like a regularizer are also candidates. We treat it as a hypothesis to confirm via an exact-topK oracle (full QK^T → top-K → restricted attention) at the same K, which separates "denoising from any sparsity" from "denoising from learned projections."

Code-level sanity checks pass: the same input sequences for `ppl_full` vs `ppl_ann`, an intact causal mask in retrieval, and a single softmax over the retrieved K with no wrapper leakage between iterations.

### Compute / quality knobs (FLOP-counted)

`L = 4096`. The compute reduction applies to the attention scoring step, ≈ `L / K`. These are FLOP estimates, not measured wall-clock; the FAISS path in this repo is a research prototype that does CPU index builds and GPU↔CPU transfers, so it is not the right thing to time.

| K | PPL gap | Attention scoring reduction |
|---|---|---|
| 512 | −2.89% | ~8× |
| 256 | −0.79% | ~16× |
| 128 | +0.82% | ~32× |
| 64 | +2.42% | ~64× |
| 32 | +4.51% | ~128× |
| 16 | +7.51% | ~256× |

Eval scope: 12 sequences × 4K tokens of WikiText-103 validation (~50K tokens). Read these as "what we observed on this slice", not as population-level estimates.

The K-sweep recall numbers (24–41%) and the in-training `evaluate()` recall (50.9% at K=128) come from different sampled subsets of the streaming split and shouldn't be directly compared. The repo also reports `mass@K` (the sum of teacher attention probability captured by the search top-K), which is the more direct retrieval-quality metric when the softmax is sharp.

### Per-layer recall (pilot)

| Layer | Recall@K=128 | Recall@K=512 |
|---|---|---|
| 4 | 15.8% | 34.7% |
| 8 | 22.2% | 38.7% |
| 12 | 23.4% | 39.1% |
| 16 | 31.9% | 45.2% |
| 20 | 31.4% | 42.6% |
| 24 | 31.1% | 44.4% |

Early layers are harder for content-addressable retrieval; their attention is more local/positional than semantic. The pattern is consistent across K, so it is a property of the layer rather than noise.

### Caveats / what's next

- **Packing**: pilot training and eval ran with sequence packing on (no segment-level causal mask, since transformers' default forward doesn't build one). The relative PPL gap between full and ANN is internally consistent under this confound, but the negative gap at K ≥ 256 has at least three candidate explanations we haven't disentangled: (a) sparse-softmax denoising, (b) ANN happening to filter cross-document keys that full attention attends to, (c) sample noise on a small eval. The default config now has packing off, so the next run isolates (a).
- **Exact-topK oracle**: a four-way Pareto comparison (full vs. exact top-K vs. search-topK exact vs. search-ANN) is the natural follow-up to separate "denoising from any sparsity" from "denoising from learned projections."
- **Wall-clock**: not measured. The FAISS path in the repo is a CPU-side research prototype, not a deployable runtime. A GPU-resident top-k kernel is the next engineering step.
- **34-layer headline**: queued (`make_headline_config()` is wired); its checkpoints will be mirrored here when it runs.

## Files

| File | What |
|---|---|
| `search_step_1000.pt` | Mid-training checkpoint (step 1000, 0.68% PPL gap) |
| `search_step_2000.pt` | Final pilot checkpoint (step 2000, 0.71% PPL gap) |

Each contains `{step, search_module: state_dict, optimizer, scheduler, config}`.

## Loading

```python
import torch
from transformers import AutoModelForCausalLM
# The search module class lives in the GitHub repo (model.py)
from model import SearchProjectionModule

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

search = SearchProjectionModule(
    d_model=2560, d_search=64,
    layer_indices=[4, 8, 12, 16, 20, 24],
    use_mlp=False,
).to(base.device).to(torch.bfloat16)

ckpt = torch.load("search_step_2000.pt", map_location="cpu", weights_only=False)
search.load_state_dict(ckpt["search_module"])
```

Use `inference.install_ann_attention(...)` (in the GitHub repo) to monkey-patch the trained layers and run with FAISS HNSW retrieval at inference time.

## Training recipe

- Frozen base: Qwen3-4B-Instruct-2507 (36 layers, hidden 2560, GQA 32:8).
- Data: WikiText-103 raw, 4K-token sequences (packing was on at training time; the repo default is now off; see Caveats).
- 2000 steps, batch 8, lr 1e-4 (cosine, 100-step warmup), AdamW.
- `α=β=1` (contrastive + KL distillation, both layers averaged).
- bf16 weights, fp32 loss math.
- SDPA attention (B200, no flash-attn package needed).
- Liger fused RMSNorm/SwiGLU/RoPE on the frozen base.
- Total wall-clock: ~25 min on a single B200.

## License

The search projections are released under Apache-2.0 (matching the base model).
Added content:

---
license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- sparse-attention
- ann
- qwen3
- retrieval
- research-artifact
---

# ANN Sparse Attention Checkpoints

A research artifact: distillation-trained, ANN-friendly search projections for sparse attention.

Base model: `Qwen/Qwen3-4B-Instruct-2507`.

## Current Clean Result

The clean methodology is the packed block-causal d128 run in `checkpoints_block_d128/`. Packed examples are isolated with per-document `segment_ids`, reset `position_ids`, and a 4D block-causal attention mask; retrieval, loss masking, mass@K, and recall@K all use the same segment-causal eligibility mask.
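The segment isolation described above can be sketched directly from `segment_ids`. A minimal sketch (the function names here are illustrative, not the repo's API), building the boolean block-causal mask and the per-document position ids:

```python
import torch


def block_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """(batch, 1, seq, seq) bool mask: query i may attend to key j
    iff j <= i and both positions belong to the same packed document."""
    seq_len = segment_ids.shape[-1]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=segment_ids.device))
    same_doc = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)
    return (causal & same_doc).unsqueeze(1)


def reset_position_ids(segment_ids: torch.Tensor) -> torch.Tensor:
    """Position ids that restart at 0 at every document boundary."""
    pos = torch.arange(segment_ids.shape[-1],
                       device=segment_ids.device).expand_as(segment_ids)
    diff = segment_ids[:, 1:] != segment_ids[:, :-1]
    boundary = torch.cat([torch.zeros_like(diff[:, :1]), diff], dim=-1)
    # carry each segment's start index forward, then subtract it
    seg_start = torch.where(boundary, pos,
                            torch.zeros_like(pos)).cummax(dim=-1).values
    return pos - seg_start
```

A 4D boolean mask of this shape can be fed to SDPA-style attention so tokens never attend across packed document boundaries.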

Clean block-causal d128 checkpoint:

- `checkpoints_block_d128/search_step_1000.pt`
- 6 trained layers: `[4, 8, 12, 16, 20, 24]`
- `d_search=128`, 3.93M trainable parameters
- K=128 exact learned search: PPL gap `+0.07%`, mass@K `0.787`, recall@K `0.744`
- K=256 exact learned search: PPL gap `+0.01%`, mass@K `0.953`, recall@K `0.879`

Interpretation: clean block-causal evaluation shows full-attention parity, not a clean denoising/improvement claim.
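"Exact learned search" in the bullets above means: score queries against keys in the low-dimensional learned search space, take the exact top-K there (no ANN index), then run full-precision softmax attention restricted to those keys. A single-head sketch under that reading; all names are illustrative, not the repo's API:

```python
import torch
import torch.nn.functional as F


def restricted_attention(q, k, v, w_q, w_k, top_k, eligible):
    """Attention restricted to the exact top-K keys under the learned
    search projections. q, k, v: (seq, d); w_q, w_k: (d, d_search);
    eligible: (seq, seq) bool block-causal eligibility mask."""
    # 1) score in the low-dimensional search space
    search_scores = (q @ w_q) @ (k @ w_k).T
    search_scores = search_scores.masked_fill(~eligible, float("-inf"))
    idx = search_scores.topk(min(top_k, k.shape[0]), dim=-1).indices
    # 2) full-precision attention over the retrieved keys only
    keep = torch.zeros_like(eligible).scatter(1, idx, True) & eligible
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With identity projections and K equal to the sequence length this reduces to plain masked attention, which makes the parity numbers a meaningful reference point.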

## Clean Per-layer Retrieval, K=128

From `checkpoints_block_d128/search_step_1000.compare_retrieval.json`:

| Layer | raw-QK oracle mass | learned d128 mass |
|---|---:|---:|
| 4 | 0.956 | 0.950 |
| 8 | 0.977 | 0.976 |
| 12 | 0.970 | 0.977 |
| 16 | 0.964 | 0.970 |
| 20 | 0.970 | 0.983 |
| 24 | 0.978 | 0.984 |
| avg | 0.969 | 0.973 |

With segment isolation, the early trained layers are not uniquely diffuse or hard: all six trained layers have high oracle mass, and the learned projections match or slightly exceed raw-QK retrieval.
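The mass and recall numbers in these tables can be computed from the teacher's full-attention rows and the retrieved indices: mass@K is the teacher probability captured by the retrieved set, recall@K is its overlap with the teacher's true top-K. A sketch with hypothetical names (`teacher_probs` assumed already restricted to eligible keys):

```python
import torch


def retrieval_metrics(teacher_probs, retrieved, K):
    """teacher_probs: (queries, keys) full-attention rows (each sums to 1
    over eligible keys); retrieved: (queries, K) indices from the search."""
    # mass@K: teacher probability landing on the retrieved keys
    mass = teacher_probs.gather(-1, retrieved).sum(-1).mean()
    # recall@K: fraction of the teacher's true top-K that was retrieved
    true_topk = teacher_probs.topk(K, dim=-1).indices
    hit = (retrieved.unsqueeze(-1) == true_topk.unsqueeze(-2)).any(-1)
    recall = hit.float().sum(-1).mean() / K
    return mass.item(), recall.item()
```

When the softmax is sharp, mass@K is the more direct quality signal, since missing low-probability members of the top-K barely changes the attention output.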

## Quest-style Page Baseline

From `checkpoints_block_d128/search_step_1000.quest_page16.json`, using page size 16, native post-RoPE Q/K min/max page summaries, and the same block-causal eligibility mask:

| Method | K | Recall@K | mass@K | PPL | PPL gap |
|---|---:|---:|---:|---:|---:|
| learned search exact | 128 | 0.744 | 0.787 | 30.47 | +0.07% |
| Quest-style page | 128 | 0.669 | 0.727 | 30.41 | -0.11% |
| learned search exact | 256 | 0.879 | 0.953 | 30.45 | +0.01% |
| Quest-style page | 256 | 0.838 | 0.909 | 30.45 | +0.03% |

Both methods sit at effective full-attention parity on PPL. The learned projections recover more teacher attention mass at the same token budget, especially at K=128, but do not yet show a clean PPL advantage over Quest on this slice.
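The Quest-style baseline summarizes each page of keys by its elementwise min/max and scores pages with a per-dimension upper bound on `q·k`. A single-query sketch (illustrative names, not the repo's implementation):

```python
import torch


def quest_select(q, k, page_size, n_pages_keep):
    """Return token indices of the pages whose upper-bound score is best.
    q: (d,) one query; k: (seq, d) keys."""
    n_pages = k.shape[0] // page_size
    pages = k[: n_pages * page_size].view(n_pages, page_size, -1)
    kmin = pages.min(dim=1).values   # (n_pages, d) per-page summaries
    kmax = pages.max(dim=1).values
    # per-dimension upper bound on q·k for any key in the page:
    # max(q_d * min_d, q_d * max_d), summed over dimensions
    bound = torch.maximum(q * kmin, q * kmax).sum(-1)
    top_pages = bound.topk(min(n_pages_keep, n_pages)).indices
    token_idx = top_pages[:, None] * page_size + torch.arange(page_size)
    return token_idx.reshape(-1)
```

All tokens in a selected page enter the attention, so a page hit on one hot key drags along 15 neighbors; that is why page selection trails the learned per-token search on mass@K at the same budget.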

## Packed Leakage-confounded Ablations

The packed d64/d128/d256 runs are included for capacity-scaling history and should not be used for clean quality claims, because packed examples could attend across document boundaries.

Packed d_search ablation at K=128:

| d_search | learned mass@K=128 | raw-QK oracle | learned/oracle | final PPL gap |
|---|---:|---:|---:|---:|
| 64 | 0.492 | 0.488 | 1.01x | +2.39% |
| 128 | 0.503 | 0.488 | 1.03x | -1.81% |
| 256 | 0.509 | 0.488 | 1.04x | -1.85% |

The large negative packed K-sweep gaps were leakage-confounded and should be treated as historical/debugging evidence only, not as the headline.

## Folder Guide

- `checkpoints_block_d128/`: clean block-causal d128 checkpoint and JSON eval artifacts. Use this for current claims.
- `checkpoints_packed_d64/`, `checkpoints_packed_d128/`, `checkpoints_packed_d256/`: leakage-confounded packed ablation checkpoints.
- `checkpoints_d64/`: earlier unpacked d64 checkpoints.
- `checkpoints/`: original pilot checkpoint and compare JSON.

## Code

Source repo: https://github.com/unixsysdev/ann-sparseattention

The repo README contains the current methodology notes and reproduction commands.
checkpoints/search_step_2000.compare_retrieval.json
ADDED
@@ -0,0 +1,127 @@

```json
{
  "model": "Qwen/Qwen3-4B-Instruct-2507",
  "ckpt": "/tmp/checkpoints/search_step_2000.pt",
  "by_K": {
    "16": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4115331669648488,
          "8": 0.2802686393260956,
          "12": 0.1508728675544262,
          "16": 0.07771651136378448,
          "20": 0.053202067812283836,
          "24": 0.10271163408954938
        },
        "avg": 0.1793841478518314
      },
      "learned": {
        "per_layer": {
          "4": 0.1072007929906249,
          "8": 0.23697869976361594,
          "12": 0.11224483884871006,
          "16": 0.07637482260664304,
          "20": 0.06903641019016504,
          "24": 0.10397349204868078
        },
        "avg": 0.11763484274140662
      }
    },
    "32": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4668281575043996,
          "8": 0.3293568876882394,
          "12": 0.19662011042237282,
          "16": 0.11227216385304928,
          "20": 0.07988839099804561,
          "24": 0.144585732370615
        },
        "avg": 0.22159190713945362
      },
      "learned": {
        "per_layer": {
          "4": 0.13327929687996706,
          "8": 0.2569987513124943,
          "12": 0.1446449818710486,
          "16": 0.10432848272224267,
          "20": 0.09580977975080411,
          "24": 0.13778831561406454
        },
        "avg": 0.14547493469177022
      }
    },
    "64": {
      "raw_qk": {
        "per_layer": {
          "4": 0.5168702056010565,
          "8": 0.390024371445179,
          "12": 0.25787363573908806,
          "16": 0.15796820322672525,
          "20": 0.11998403631150723,
          "24": 0.2020296814541022
        },
        "avg": 0.2741250222962764
      },
      "learned": {
        "per_layer": {
          "4": 0.1732035626967748,
          "8": 0.2893482334911823,
          "12": 0.19321986908713976,
          "16": 0.14695298795898756,
          "20": 0.13634028658270836,
          "24": 0.1862101349979639
        },
        "avg": 0.18754584580245945
      }
    },
    "128": {
      "raw_qk": {
        "per_layer": {
          "4": 0.571592112382253,
          "8": 0.463805615901947,
          "12": 0.33948806673288345,
          "16": 0.22183777391910553,
          "20": 0.18010229741533598,
          "24": 0.27901028965910274
        },
        "avg": 0.34263935933510464
      },
      "learned": {
        "per_layer": {
          "4": 0.235384251922369,
          "8": 0.3402557211617629,
          "12": 0.2643987759947777,
          "16": 0.21102494125564894,
          "20": 0.19678996006647745,
          "24": 0.25442706421017647
        },
        "avg": 0.25038011910186875
      }
    },
    "256": {
      "raw_qk": {
        "per_layer": {
          "4": 0.6327291478713354,
          "8": 0.5521376008788744,
          "12": 0.43828647087017697,
          "16": 0.31400871525208157,
          "20": 0.2687250425418218,
          "24": 0.37980743249257404
        },
        "avg": 0.4309490683178107
      },
      "learned": {
        "per_layer": {
          "4": 0.3288092017173767,
          "8": 0.41752680391073227,
          "12": 0.36623430997133255,
          "16": 0.30524607251087826,
          "20": 0.2853706416984399,
          "24": 0.34850213179985684
        },
        "avg": 0.3419481936014361
      }
    }
  },
  "learned_over_raw_K128": 0.7307395145371917
}
```
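The trailing `learned_over_raw_K128` field is consistent with dividing the learned average mass by the raw-QK average mass at K=128. A sketch of that check, assuming the report JSON has already been loaded into a dict (e.g. with `json.load`):

```python
def learned_over_raw(report: dict, K: str = "128") -> float:
    """Ratio of learned-projection to raw-QK average retrieved mass at K."""
    entry = report["by_K"][K]
    return entry["learned"]["avg"] / entry["raw_qk"]["avg"]
```

For the pilot checkpoint above this reproduces the stored value (learned projections capture about 73% of what raw-QK retrieval captures at K=128).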
checkpoints/search_step_2000.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:9e30257eb414fa7595580d4a3efbb247e02d1bef6875f60fbab90edeb568012d
size 11814585
```
checkpoints_block_d128/search_step_1000.compare_retrieval.json
ADDED
@@ -0,0 +1,31 @@

```json
{
  "model": "Qwen/Qwen3-4B-Instruct-2507",
  "ckpt": "/tmp/checkpoints_block_d128/search_step_1000.pt",
  "by_K": {
    "128": {
      "raw_qk": {
        "per_layer": {
          "4": 0.95585672929883,
          "8": 0.977135993540287,
          "12": 0.969620868563652,
          "16": 0.9640911668539047,
          "20": 0.9696248471736908,
          "24": 0.9784364253282547
        },
        "avg": 0.9691276717931032
      },
      "learned": {
        "per_layer": {
          "4": 0.949747696518898,
          "8": 0.9758064672350883,
          "12": 0.9767678007483482,
          "16": 0.9701211974024773,
          "20": 0.9826441332697868,
          "24": 0.9841717407107353
        },
        "avg": 0.9732098393142223
      }
    }
  },
  "learned_over_raw_K128": 1.0042122081949907
}
```
checkpoints_block_d128/search_step_1000.k_sweep_exact.json
ADDED
@@ -0,0 +1,74 @@

```json
{
  "ppl_full": 30.444138765335083,
  "by_K": {
    "128": {
      "recall_avg": 0.7435683117319801,
      "recall_per_layer": {
        "4": 0.7372374021825518,
        "8": 0.7400615944244054,
        "12": 0.7399612933300728,
        "16": 0.7452810723237757,
        "20": 0.7493442218927676,
        "24": 0.7495242862383078
      },
      "mass_avg": 0.7874226044715501,
      "mass_per_layer": {
        "4": 0.7964790942667894,
        "8": 0.760353729720074,
        "12": 0.7879844721965367,
        "16": 0.8230430181841107,
        "20": 0.7992641063936694,
        "24": 0.7574112060681207
      },
      "ppl_ann": 30.465327858924866,
      "ppl_gap_relative": 0.0006959991134290015,
      "faiss_diag": {}
    },
    "256": {
      "recall_avg": 0.8794096146506826,
      "recall_per_layer": {
        "4": 0.8783885035021551,
        "8": 0.878973599137931,
        "12": 0.8767847521551724,
        "16": 0.877142544450431,
        "20": 0.88153076171875,
        "24": 0.8836375269396551
      },
      "mass_avg": 0.9531509574802443,
      "mass_per_layer": {
        "4": 0.9445537698679957,
        "8": 0.9476728768184267,
        "12": 0.963769320783944,
        "16": 0.9684469288793104,
        "20": 0.9556469095164332,
        "24": 0.9388159390153556
      },
      "ppl_ann": 30.448541164398193,
      "ppl_gap_relative": 0.0001446058007107463,
      "faiss_diag": {}
    },
    "512": {
      "recall_avg": 0.0,
      "recall_per_layer": {
        "4": 0.0,
        "8": 0.0,
        "12": 0.0,
        "16": 0.0,
        "20": 0.0,
        "24": 0.0
      },
      "mass_avg": 0.0,
      "mass_per_layer": {
        "4": 0.0,
        "8": 0.0,
        "12": 0.0,
        "16": 0.0,
        "20": 0.0,
        "24": 0.0
      },
      "ppl_ann": 30.447556257247925,
      "ppl_gap_relative": 0.00011225451109601071,
      "faiss_diag": {}
    }
  }
}
```
checkpoints_block_d128/search_step_1000.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:c384da8d5a0022126ad37dff6d31cf48a7ea5a9fbd35b02d998349cd15fd32cd
size 23611193
```
checkpoints_block_d128/search_step_1000.quest_page16.json
ADDED
@@ -0,0 +1,72 @@

```json
{
  "ppl_full": 30.444138765335083,
  "page_size": 16,
  "by_K": {
    "128": {
      "mass_avg": 0.7268716323509558,
      "mass_per_layer": {
        "4": 0.7368688553639114,
        "8": 0.7033998429093813,
        "12": 0.7266774880402885,
        "16": 0.7590455168754203,
        "20": 0.742104841151724,
        "24": 0.6931332497650089
      },
      "recall_avg": 0.6693170907452658,
      "recall_per_layer": {
        "4": 0.6675438136673136,
        "8": 0.6733688743494758,
        "12": 0.669839395260821,
        "16": 0.6732250795960827,
        "20": 0.6668304967626628,
        "24": 0.6650948848352387
      },
      "ppl_quest": 30.409221529960632,
      "ppl_gap_relative": -0.0011469280061950332
    },
    "256": {
      "mass_avg": 0.9088168527888155,
      "mass_per_layer": {
        "4": 0.9362414129849138,
        "8": 0.8892738079202587,
        "12": 0.8947217217807112,
        "16": 0.92559814453125,
        "20": 0.9104866817079741,
        "24": 0.8965793478077856
      },
      "recall_avg": 0.8383452755281295,
      "recall_per_layer": {
        "4": 0.8439739490377491,
        "8": 0.8412483478414601,
        "12": 0.8357682721368198,
        "16": 0.8378993067248114,
        "20": 0.8356821125951307,
        "24": 0.8354996648328058
      },
      "ppl_quest": 30.45472514629364,
      "ppl_gap_relative": 0.0003477313331198978
    },
    "512": {
      "mass_avg": 0.0,
      "mass_per_layer": {
        "4": NaN,
        "8": NaN,
        "12": NaN,
        "16": NaN,
        "20": NaN,
        "24": NaN
      },
      "recall_avg": 0.0,
      "recall_per_layer": {
        "4": NaN,
        "8": NaN,
        "12": NaN,
        "16": NaN,
        "20": NaN,
        "24": NaN
      },
      "ppl_quest": 30.45515537261963,
      "ppl_gap_relative": 0.0003618629966662039
    }
  }
}
```
checkpoints_block_d128/search_step_200.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:12773073681e6c897150afbfb35d29d62f61916fae404b72449d27e1834a1df6
size 23611075
```
checkpoints_block_d128/search_step_400.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:b53e45792882ea76d16f616f7669b31d071586aee4a229b0ea2b9f74c2646e61
size 23611139
```
checkpoints_block_d128/search_step_600.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:c4a9441f38ba4de142b31e295a7069fa48f46444dd97ae1cb459563c6d5d7fbd
size 23611139
```
checkpoints_block_d128/search_step_800.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:8f64aaa18268c8ca4dde60fa81dbb75c3bc8bc4361f8ef73eb0b8bd217998c4c
size 23611139
```
checkpoints_d64/search_step_200.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:d25e49d3c884670ced90685bc76de51fa0390e4b98dd5ac573c4d6f3cff679c5
size 11814595
```
checkpoints_d64/search_step_400.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:9fbdbf7f19caf959babc916be0da9a209d8fa7afe1fb19ae0cb89718917ae7e5
size 11814595
```
checkpoints_d64/search_step_600.pt
ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:a891d7560d4c40acee1658d1e25079a37714966cf096a073cd5177c3e272d4c3
size 11814595
```
checkpoints_packed_d128/search_step_1000.compare_retrieval.json
ADDED
|
@@ -0,0 +1,127 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"model": "Qwen/Qwen3-4B-Instruct-2507",
|
| 3 |
+
"ckpt": "/tmp/checkpoints_packed_d128/search_step_1000.pt",
|
| 4 |
+
"by_K": {
|
| 5 |
+
"16": {
|
| 6 |
+
"raw_qk": {
|
| 7 |
+
"per_layer": {
|
| 8 |
+
"4": 0.31979141881068546,
|
| 9 |
+
"8": 0.3186202198266983,
|
| 10 |
+
          "12": 0.14174955089886984,
          "16": 0.20945234596729279,
          "20": 0.2801314915219943,
          "24": 0.49455158909161884
        },
        "avg": 0.2940494360195266
      },
      "learned": {
        "per_layer": {
          "4": 0.29460810621579486,
          "8": 0.2879024048646291,
          "12": 0.38745074967543286,
          "16": 0.2947010373075803,
          "20": 0.3400069276491801,
          "24": 0.5236761594812075
        },
        "avg": 0.3547242308656375
      }
    },
    "32": {
      "raw_qk": {
        "per_layer": {
          "4": 0.341281255086263,
          "8": 0.36785539239645004,
          "12": 0.21019610886772475,
          "16": 0.28323352212707203,
          "20": 0.3406280030806859,
          "24": 0.5262841482957205
        },
        "avg": 0.3449130716423194
      },
      "learned": {
        "per_layer": {
          "4": 0.31440528233846027,
          "8": 0.318613442281882,
          "12": 0.42397742718458176,
          "16": 0.33971526473760605,
          "20": 0.3927851542830467,
          "24": 0.5538897663354874
        },
        "avg": 0.390564389526844
      }
    },
    "64": {
      "raw_qk": {
        "per_layer": {
          "4": 0.37279334167639416,
          "8": 0.4341067870457967,
          "12": 0.295872134466966,
          "16": 0.37213313827912015,
          "20": 0.4122797225912412,
          "24": 0.5643380333979925
        },
        "avg": 0.40858719290958506
      },
      "learned": {
        "per_layer": {
          "4": 0.34206587572892505,
          "8": 0.3614834249019623,
          "12": 0.47101640701293945,
          "16": 0.40079283465941745,
          "20": 0.46224164466063183,
          "24": 0.5945162226756414
        },
        "avg": 0.4386860682732529
      }
    },
    "128": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4215788046518962,
          "8": 0.5176494866609573,
          "12": 0.4037476380666097,
          "16": 0.47523776690165204,
          "20": 0.49859579155842465,
          "24": 0.6136594414710999
        },
        "avg": 0.48841148821844
      },
      "learned": {
        "per_layer": {
          "4": 0.38235953201850253,
          "8": 0.4212384819984436,
          "12": 0.5328463464975357,
          "16": 0.4814741685986519,
          "20": 0.551252673069636,
          "24": 0.6483117590347925
        },
        "avg": 0.5029138268695937
      }
    },
    "256": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4964489738146464,
          "8": 0.6146760632594427,
          "12": 0.5339744637409846,
          "16": 0.5874835352102915,
          "20": 0.602066790064176,
          "24": 0.6779818137486776
        },
        "avg": 0.5854386066397032
      },
      "learned": {
        "per_layer": {
          "4": 0.4403117299079895,
          "8": 0.5022278105219206,
          "12": 0.6128821273644766,
          "16": 0.582685723900795,
          "20": 0.65766608218352,
          "24": 0.716396709283193
        },
        "avg": 0.5853616971936492
      }
    }
  },
  "learned_over_raw_K128": 1.0296928696416485
}
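A note on reading these metrics: in the `compare_retrieval.json` files, the top-level `learned_over_raw_K128` field is consistent with being the ratio of the learned retriever's average recall to the raw-QK average recall at K=128. A minimal sketch, assuming that field semantics (the variable names below are illustrative, not part of the repo):

```python
# Values copied from by_K["128"] in search_step_1000.compare_retrieval.json.
learned_avg_k128 = 0.5029138268695937  # by_K["128"]["learned"]["avg"]
raw_qk_avg_k128 = 0.48841148821844     # by_K["128"]["raw_qk"]["avg"]

# Assumed definition: learned_over_raw_K128 = learned avg / raw_qk avg at K=128.
ratio = learned_avg_k128 / raw_qk_avg_k128
print(ratio)  # matches the stored "learned_over_raw_K128" of ~1.0297
```

The same relationship holds for the d64 and d256 variants of this file, which is why the interpretation seems safe.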
checkpoints_packed_d128/search_step_1000.k_sweep.json
ADDED
@@ -0,0 +1,74 @@
{
  "ppl_full": 224.6417384147644,
  "by_K": {
    "128": {
      "recall_avg": 0.16599421347341228,
      "recall_per_layer": {
        "4": 0.1287512010143649,
        "8": 0.14480467765561997,
        "12": 0.1611505323840726,
        "16": 0.17931722825573337,
        "20": 0.204620607437626,
        "24": 0.17732103409305697
      },
      "mass_avg": 0.2555904160904628,
      "mass_per_layer": {
        "4": 0.13555656902251706,
        "8": 0.22613397721321352,
        "12": 0.2812447086457283,
        "16": 0.2538035146651729,
        "20": 0.27192030414458246,
        "24": 0.3648834228515625
      },
      "ppl_ann": 203.62537622451782,
      "ppl_gap_relative": -0.09355501937686794,
      "faiss_diag": {}
    },
    "256": {
      "recall_avg": 0.23324623107910156,
      "recall_per_layer": {
        "4": 0.19934444427490233,
        "8": 0.2069737116495768,
        "12": 0.2247191111246745,
        "16": 0.24465274810791016,
        "20": 0.27422657012939455,
        "24": 0.24956080118815105
      },
      "mass_avg": 0.3174514651298523,
      "mass_per_layer": {
        "4": 0.20090028444925945,
        "8": 0.2829151630401611,
        "12": 0.3357037226359049,
        "16": 0.3171180884043376,
        "20": 0.3458467245101929,
        "24": 0.4222248077392578
      },
      "ppl_ann": 207.06066703796387,
      "ppl_gap_relative": -0.07826271066483625,
      "faiss_diag": {}
    },
    "512": {
      "recall_avg": 0.33908390431177043,
      "recall_per_layer": {
        "4": 0.30769617216927664,
        "8": 0.3079519953046526,
        "12": 0.3272709846496582,
        "16": 0.34647955213274273,
        "20": 0.38336784499032156,
        "24": 0.36173687662397114
      },
      "mass_avg": 0.4086176788523084,
      "mass_per_layer": {
        "4": 0.30153139574187143,
        "8": 0.36809664964675903,
        "12": 0.4162691150392805,
        "16": 0.4079152686255319,
        "20": 0.45096262863704134,
        "24": 0.506931015423366
      },
      "ppl_ann": 211.92854118347168,
      "ppl_gap_relative": -0.056593210687409634,
      "faiss_diag": {}
    }
  }
}
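In the `k_sweep*.json` files, each `ppl_gap_relative` value is consistent with the relative perplexity change of the ANN run against the full-attention baseline. A minimal sketch, assuming that definition (variable names below are illustrative):

```python
# Values copied from search_step_1000.k_sweep.json, K=128 entry.
ppl_full = 224.6417384147644   # top-level "ppl_full"
ppl_ann = 203.62537622451782   # by_K["128"]["ppl_ann"]

# Assumed definition: relative gap = (ppl_ann - ppl_full) / ppl_full,
# so negative values mean ANN retrieval scored *lower* perplexity here.
gap = (ppl_ann - ppl_full) / ppl_full
print(gap)  # matches the stored "ppl_gap_relative" of ~-0.0936
```

The K=256 and K=512 entries reproduce under the same formula, which supports this reading.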
checkpoints_packed_d128/search_step_1000.k_sweep_exact.json
ADDED
@@ -0,0 +1,74 @@
{
  "ppl_full": 224.6417384147644,
  "by_K": {
    "128": {
      "recall_avg": 0.16599421347341228,
      "recall_per_layer": {
        "4": 0.1287512010143649,
        "8": 0.14480467765561997,
        "12": 0.1611505323840726,
        "16": 0.17931722825573337,
        "20": 0.204620607437626,
        "24": 0.17732103409305697
      },
      "mass_avg": 0.2555904160904628,
      "mass_per_layer": {
        "4": 0.13555656902251706,
        "8": 0.22613397721321352,
        "12": 0.2812447086457283,
        "16": 0.2538035146651729,
        "20": 0.27192030414458246,
        "24": 0.3648834228515625
      },
      "ppl_ann": 203.62537622451782,
      "ppl_gap_relative": -0.09355501937686794,
      "faiss_diag": {}
    },
    "256": {
      "recall_avg": 0.23324623107910156,
      "recall_per_layer": {
        "4": 0.19934444427490233,
        "8": 0.2069737116495768,
        "12": 0.2247191111246745,
        "16": 0.24465274810791016,
        "20": 0.27422657012939455,
        "24": 0.24956080118815105
      },
      "mass_avg": 0.3174514651298523,
      "mass_per_layer": {
        "4": 0.20090028444925945,
        "8": 0.2829151630401611,
        "12": 0.3357037226359049,
        "16": 0.3171180884043376,
        "20": 0.3458467245101929,
        "24": 0.4222248077392578
      },
      "ppl_ann": 207.06066703796387,
      "ppl_gap_relative": -0.07826271066483625,
      "faiss_diag": {}
    },
    "512": {
      "recall_avg": 0.33908390431177043,
      "recall_per_layer": {
        "4": 0.30769617216927664,
        "8": 0.3079519953046526,
        "12": 0.3272709846496582,
        "16": 0.34647955213274273,
        "20": 0.38336784499032156,
        "24": 0.36173687662397114
      },
      "mass_avg": 0.4086176788523084,
      "mass_per_layer": {
        "4": 0.30153139574187143,
        "8": 0.36809664964675903,
        "12": 0.4162691150392805,
        "16": 0.4079152686255319,
        "20": 0.45096262863704134,
        "24": 0.506931015423366
      },
      "ppl_ann": 211.92854118347168,
      "ppl_gap_relative": -0.056593210687409634,
      "faiss_diag": {}
    }
  }
}
checkpoints_packed_d128/search_step_1000.k_sweep_exact_skip16.json
ADDED
@@ -0,0 +1,74 @@
{
  "ppl_full": 193.67203998565674,
  "by_K": {
    "128": {
      "recall_avg": 0.1631414557016024,
      "recall_per_layer": {
        "4": 0.1270386480516003,
        "8": 0.14388988864037297,
        "12": 0.15978523992723034,
        "16": 0.18374190791960684,
        "20": 0.20087420555853075,
        "24": 0.1635188441122732
      },
      "mass_avg": 0.341150162521229,
      "mass_per_layer": {
        "4": 0.13802351394007284,
        "8": 0.31434367164488763,
        "12": 0.39897729504492974,
        "16": 0.32906081599573933,
        "20": 0.34529122229545345,
        "24": 0.521204456206291
      },
      "ppl_ann": 176.66846084594727,
      "ppl_gap_relative": -0.0877957352076673,
      "faiss_diag": {}
    },
    "256": {
      "recall_avg": 0.23127511342366536,
      "recall_per_layer": {
        "4": 0.19980506896972655,
        "8": 0.20826168060302735,
        "12": 0.22342586517333984,
        "16": 0.25203742980957033,
        "20": 0.270965576171875,
        "24": 0.23315505981445311
      },
      "mass_avg": 0.39741740624109906,
      "mass_per_layer": {
        "4": 0.2033478021621704,
        "8": 0.364719812075297,
        "12": 0.4482226053873698,
        "16": 0.3924812396367391,
        "20": 0.4137035528818766,
        "24": 0.5620294253031413
      },
      "ppl_ann": 178.96899604797363,
      "ppl_gap_relative": -0.07591722552605945,
      "faiss_diag": {}
    },
    "512": {
      "recall_avg": 0.33962979203178767,
      "recall_per_layer": {
        "4": 0.31196675981794086,
        "8": 0.3126195158277239,
        "12": 0.32580297333853586,
        "16": 0.36052775382995605,
        "20": 0.3826852866581508,
        "24": 0.3441764627184187
      },
      "mass_avg": 0.4830506351732072,
      "mass_per_layer": {
        "4": 0.30267453619412016,
        "8": 0.44439977407455444,
        "12": 0.5229217324938092,
        "16": 0.4867537532533918,
        "20": 0.5154515334538051,
        "24": 0.6261024815695626
      },
      "ppl_ann": 181.6412000656128,
      "ppl_gap_relative": -0.06211965300171849,
      "faiss_diag": {}
    }
  }
}
checkpoints_packed_d128/search_step_1000.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c831fcdae22879b34815b7af47215dcb51772cf0f27fd59f15e9b1eb4f0d2137
size 23611129
checkpoints_packed_d128/search_step_200.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:631120827ea8141d461c10ed0b24317643395d66ca707a548f4553e5f38646ac
size 23611075
checkpoints_packed_d128/search_step_400.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45c8d82cd98746fb98f266066431357c7b2662deed2ebb5d34d6a1c85969b073
size 23611075
checkpoints_packed_d128/search_step_600.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dcf85676cb91a94c07dfc27db483bd4be83a83374d0f48a79f38118cd6edb8fb
size 23611075
checkpoints_packed_d128/search_step_800.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5502cff4ec669daa37c1b989f2344dd5648572a1855e2de32f33d6fd9fcb1a0c
size 23611075
checkpoints_packed_d256/search_step_1000.compare_retrieval.json
ADDED
@@ -0,0 +1,127 @@
{
  "model": "Qwen/Qwen3-4B-Instruct-2507",
  "ckpt": "/tmp/checkpoints_packed_d256/search_step_1000.pt",
  "by_K": {
    "16": {
      "raw_qk": {
        "per_layer": {
          "4": 0.31979141881068546,
          "8": 0.3186202198266983,
          "12": 0.14174955089886984,
          "16": 0.20945234596729279,
          "20": 0.2801314915219943,
          "24": 0.49455158909161884
        },
        "avg": 0.2940494360195266
      },
      "learned": {
        "per_layer": {
          "4": 0.2933313399553299,
          "8": 0.29298429439465207,
          "12": 0.39142733812332153,
          "16": 0.2983766943216324,
          "20": 0.3438563868403435,
          "24": 0.525188535451889
        },
        "avg": 0.35752743151452804
      }
    },
    "32": {
      "raw_qk": {
        "per_layer": {
          "4": 0.341281255086263,
          "8": 0.36785539239645004,
          "12": 0.21019610886772475,
          "16": 0.28323352212707203,
          "20": 0.3406280030806859,
          "24": 0.5262841482957205
        },
        "avg": 0.3449130716423194
      },
      "learned": {
        "per_layer": {
          "4": 0.3145047202706337,
          "8": 0.3253612567981084,
          "12": 0.4279355009396871,
          "16": 0.344377006093661,
          "20": 0.3977118283510208,
          "24": 0.556083157658577
        },
        "avg": 0.39432891168528134
      }
    },
    "64": {
      "raw_qk": {
        "per_layer": {
          "4": 0.37279334167639416,
          "8": 0.4341067870457967,
          "12": 0.295872134466966,
          "16": 0.37213313827912015,
          "20": 0.4122797225912412,
          "24": 0.5643380333979925
        },
        "avg": 0.40858719290958506
      },
      "learned": {
        "per_layer": {
          "4": 0.34352220843235654,
          "8": 0.37063097457091015,
          "12": 0.4752006282409032,
          "16": 0.4067305897672971,
          "20": 0.46815096338589984,
          "24": 0.5977186262607574
        },
        "avg": 0.44365899844302076
      }
    },
    "128": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4215788046518962,
          "8": 0.5176494866609573,
          "12": 0.4037476380666097,
          "16": 0.47523776690165204,
          "20": 0.49859579155842465,
          "24": 0.6136594414710999
        },
        "avg": 0.48841148821844
      },
      "learned": {
        "per_layer": {
          "4": 0.3849342664082845,
          "8": 0.43360541264216107,
          "12": 0.5375104447205862,
          "16": 0.488710917532444,
          "20": 0.5576231330633163,
          "24": 0.6525773753722509
        },
        "avg": 0.5091602582898406
      }
    },
    "256": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4964489738146464,
          "8": 0.6146760632594427,
          "12": 0.5339744637409846,
          "16": 0.5874835352102915,
          "20": 0.602066790064176,
          "24": 0.6779818137486776
        },
        "avg": 0.5854386066397032
      },
      "learned": {
        "per_layer": {
          "4": 0.44419410079717636,
          "8": 0.5183951507012049,
          "12": 0.6179872999588648,
          "16": 0.5908408363660177,
          "20": 0.6638876696427664,
          "24": 0.7216567993164062
        },
        "avg": 0.592826976130406
      }
    }
  },
  "learned_over_raw_K128": 1.0424821499328059
}
checkpoints_packed_d256/search_step_1000.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f93073485115c5ac669cce2f483179310be7e1caa88e53091cc36fd1b31435e3
size 47204153
checkpoints_packed_d256/search_step_200.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8f763b0057300594fd26916b2137bdde8f004b077ba21fb760fa950b3d1058a9
size 47204099
checkpoints_packed_d256/search_step_400.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0e14e97ca1f49b5ac54273be7fdf57218e644d9d062edea3a7afb225e76e195c
size 47204099
checkpoints_packed_d256/search_step_600.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:85a94bcd5b978022c4010e19ee7c4aca2d267a2d7d7896fb7025595198d0ee8f
size 47204099
checkpoints_packed_d256/search_step_800.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a7db9c635cc462e3b08d402ad047cfd33a90fe5e6e57bf8576172e271b67255
size 47204099
checkpoints_packed_d64/search_step_1000.compare_retrieval.json
ADDED
@@ -0,0 +1,127 @@
{
  "model": "Qwen/Qwen3-4B-Instruct-2507",
  "ckpt": "/tmp/checkpoints_packed_d64/search_step_1000.pt",
  "by_K": {
    "16": {
      "raw_qk": {
        "per_layer": {
          "4": 0.31979141881068546,
          "8": 0.3186202198266983,
          "12": 0.14174955089886984,
          "16": 0.20945234596729279,
          "20": 0.2801314915219943,
          "24": 0.49455158909161884
        },
        "avg": 0.2940494360195266
      },
      "learned": {
        "per_layer": {
          "4": 0.29329264909029007,
          "8": 0.28124504536390305,
          "12": 0.37904051691293716,
          "16": 0.2890646557013194,
          "20": 0.3339161326487859,
          "24": 0.5209685415029526
        },
        "avg": 0.34958792353669804
      }
    },
    "32": {
      "raw_qk": {
        "per_layer": {
          "4": 0.341281255086263,
          "8": 0.36785539239645004,
          "12": 0.21019610886772475,
          "16": 0.28323352212707203,
          "20": 0.3406280030806859,
          "24": 0.5262841482957205
        },
        "avg": 0.3449130716423194
      },
      "learned": {
        "per_layer": {
          "4": 0.31185250480969745,
          "8": 0.3099167247613271,
          "12": 0.41439520567655563,
          "16": 0.33257966736952466,
          "20": 0.3849489390850067,
          "24": 0.5501216500997543
        },
        "avg": 0.383969115300311
      }
    },
    "64": {
      "raw_qk": {
        "per_layer": {
          "4": 0.37279334167639416,
          "8": 0.4341067870457967,
          "12": 0.295872134466966,
          "16": 0.37213313827912015,
          "20": 0.4122797225912412,
          "24": 0.5643380333979925
        },
        "avg": 0.40858719290958506
      },
      "learned": {
        "per_layer": {
          "4": 0.3371334026257197,
          "8": 0.3503660187125206,
          "12": 0.4602071891228358,
          "16": 0.3917141556739807,
          "20": 0.45263657718896866,
          "24": 0.5895731498797735
        },
        "avg": 0.43027174886729985
      }
    },
    "128": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4215788046518962,
          "8": 0.5176494866609573,
          "12": 0.4037476380666097,
          "16": 0.47523776690165204,
          "20": 0.49859579155842465,
          "24": 0.6136594414710999
        },
        "avg": 0.48841148821844
      },
      "learned": {
        "per_layer": {
          "4": 0.37442514797051746,
          "8": 0.40738723675409955,
          "12": 0.5206492592891058,
          "16": 0.4700211783250173,
          "20": 0.5400509238243103,
          "24": 0.6421113312244415
        },
        "avg": 0.4924408462312486
      }
    },
    "256": {
      "raw_qk": {
        "per_layer": {
          "4": 0.4964489738146464,
          "8": 0.6146760632594427,
          "12": 0.5339744637409846,
          "16": 0.5874835352102915,
          "20": 0.602066790064176,
          "24": 0.6779818137486776
        },
        "avg": 0.5854386066397032
      },
      "learned": {
        "per_layer": {
          "4": 0.4301423355937004,
          "8": 0.48568252722422284,
          "12": 0.599395309885343,
          "16": 0.5693490405877432,
          "20": 0.6461250483989716,
          "24": 0.709612175822258
        },
        "avg": 0.5733844062520398
      }
    }
  },
  "learned_over_raw_K128": 1.0082499247253711
}
checkpoints_packed_d64/search_step_1000.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5c3c1e631d9a7a3838dbba1274d598f143b4e9e1dffca474b9f27f292470962a
size 11814649
checkpoints_packed_d64/search_step_200.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:63fe19831fae43c71698480b3fbee53712a69375a3526b54d78e454124ec120c
size 11814595
checkpoints_packed_d64/search_step_400.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:993c324c3b9e1a74ea5600e90662db176db7e33d027609447c6277354fa35ae0
size 11814595
checkpoints_packed_d64/search_step_600.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8a229f9353532df19b4af03e668938205e4ff4576fa53095bf632bf945e9f4ff
size 11814595
checkpoints_packed_d64/search_step_800.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b6e49834e2902223c4c3f512ef2affa2920c03905162c6a53ccb0749bd1606c1
size 11814595