datasysdev committed
Commit 57c6b5b · verified · 1 Parent(s): 66efa56

Upload clean block-causal and packed pilot checkpoints

Files changed (35)
  1. README.md +77 -215
  2. checkpoints/search_step_2000.compare_retrieval.json +127 -0
  3. checkpoints/search_step_2000.pt +3 -0
  4. checkpoints_block_d128/search_step_1000.compare_retrieval.json +31 -0
  5. checkpoints_block_d128/search_step_1000.k_sweep_exact.json +74 -0
  6. checkpoints_block_d128/search_step_1000.pt +3 -0
  7. checkpoints_block_d128/search_step_1000.quest_page16.json +72 -0
  8. checkpoints_block_d128/search_step_200.pt +3 -0
  9. checkpoints_block_d128/search_step_400.pt +3 -0
  10. checkpoints_block_d128/search_step_600.pt +3 -0
  11. checkpoints_block_d128/search_step_800.pt +3 -0
  12. checkpoints_d64/search_step_200.pt +3 -0
  13. checkpoints_d64/search_step_400.pt +3 -0
  14. checkpoints_d64/search_step_600.pt +3 -0
  15. checkpoints_packed_d128/search_step_1000.compare_retrieval.json +127 -0
  16. checkpoints_packed_d128/search_step_1000.k_sweep.json +74 -0
  17. checkpoints_packed_d128/search_step_1000.k_sweep_exact.json +74 -0
  18. checkpoints_packed_d128/search_step_1000.k_sweep_exact_skip16.json +74 -0
  19. checkpoints_packed_d128/search_step_1000.pt +3 -0
  20. checkpoints_packed_d128/search_step_200.pt +3 -0
  21. checkpoints_packed_d128/search_step_400.pt +3 -0
  22. checkpoints_packed_d128/search_step_600.pt +3 -0
  23. checkpoints_packed_d128/search_step_800.pt +3 -0
  24. checkpoints_packed_d256/search_step_1000.compare_retrieval.json +127 -0
  25. checkpoints_packed_d256/search_step_1000.pt +3 -0
  26. checkpoints_packed_d256/search_step_200.pt +3 -0
  27. checkpoints_packed_d256/search_step_400.pt +3 -0
  28. checkpoints_packed_d256/search_step_600.pt +3 -0
  29. checkpoints_packed_d256/search_step_800.pt +3 -0
  30. checkpoints_packed_d64/search_step_1000.compare_retrieval.json +127 -0
  31. checkpoints_packed_d64/search_step_1000.pt +3 -0
  32. checkpoints_packed_d64/search_step_200.pt +3 -0
  33. checkpoints_packed_d64/search_step_400.pt +3 -0
  34. checkpoints_packed_d64/search_step_600.pt +3 -0
  35. checkpoints_packed_d64/search_step_800.pt +3 -0
README.md CHANGED
@@ -1,225 +1,87 @@
  ---
- language:
- - en
- license: apache-2.0
  base_model: Qwen/Qwen3-4B-Instruct-2507
  tags:
  - sparse-attention
- - ann-attention
- - distillation
- - search-projection
- - inference-optimization
- library_name: pytorch
  ---

- # ann-sparseattention

- Search projections for ANN-substituted attention on
- [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

- Code: [github.com/unixsysdev/ann-sparseattention](https://github.com/unixsysdev/ann-sparseattention)

- ## Current status

- Research prototype. The trained projections work, the runtime is a correctness
- prototype, and the eval envelope is narrow. Treat reported numbers as preliminary.
-
- **Validated:** 6-layer pilot on Qwen3-4B-Instruct-2507; WikiText-103 PPL
- preserved at K=128 (gap ≈ +0.7%); learned projections retrieve attention-
- relevant keys.
-
- **Not yet validated:** 34-layer / whole-model substitution; long-context
- tasks (LongBench, RULER, needle); wall-clock speedup vs FlashAttention/SDPA;
- KV-cache decode-mode integration; GPU-resident ANN kernel.
-
- **Runtime caveat:** the FAISS path here builds CPU indexes per batch and
- the gather step uses dense-style tensor expansion. Compute-reduction
- numbers below are *algorithmic scoring reductions, not measured wall-clock
- speedups.*
-
- ## Relation to RetrievalAttention
-
- RetrievalAttention (Liu et al., 2024) shows that **vanilla ANN over the
- model's native Q, K vectors fails** because Q and K live in mismatched
- distributions: they were never trained to be each other's nearest
- neighbors, only to score via dot product. Their fix is at *index time*:
- an attention-aware graph construction (RoarGraph-style).
-
- This work attacks the same problem from the opposite direction. We
- **train a tiny shared projection** (`W_Qs, W_Ks → R^64`) so that
- `q_search` and `k_search` live in the same distribution by construction.
- Off-the-shelf FAISS HNSW with default parameters then suffices.
-
- | | Search space | Index | Trainable |
- |---|---|---|---|
- | Raw Q/K + vanilla ANN | original Q/K | off-the-shelf | no; fails (Q/K OOD) |
- | RetrievalAttention | original Q/K | attention-aware graph | no |
- | **This work** | **learned Q\_s / K\_s** | **off-the-shelf** | **yes (~2-11M params)** |
-
- Contribution: *eliminate Q/K mismatch at index-build time via distillation,
- instead of patching it at search time.* The clean validating experiment
- (vanilla FAISS over raw Q/K vs. learned Q\_s/K\_s vs. exact teacher top-K)
- is the next planned run.
-
- ## What's in this repo
-
- Per-layer linear search projections `(W_Qs, W_Ks)` of shape `[2560, 64]`,
- trained against the frozen base model's attention via contrastive +
- distillation losses. At inference these produce 64-d "search vectors" that
- let an off-the-shelf FAISS HNSW index pick the top-K keys to attend to,
- replacing dense `O(L²)` attention with `O(L·K)` ANN-substituted attention.
-
- Layers covered (pilot): `[4, 8, 12, 16, 20, 24]` (6 of 36 layers, ~2M trainable params).
-
- ## Pilot results (final, 2K steps on WikiText-103)
-
- | Step | Recall@K=128 | PPL gap (full vs ANN) |
- |---|---|---|
- | 500 | 47.4% | 1.21% |
- | 1000 | 50.7% | 0.68% |
- | 1500 | 50.9% | 0.68% |
- | **2000 (final)** | **50.9%** | **0.71%** |
-
- PPL gap is the primary signal: at <1% relative gap, the model's output
- quality is preserved under ANN substitution. Recall plateaus around step 1000
- because the softmax-relevant keys concentrate in the top ~30; disagreement
- on positions 30-128 is on the near-zero-weight tail and doesn't affect output.
-
- ### K-retrieve Pareto (pilot step 2000, FAISS HNSW)
-
- `PPL_full = 9.958`
-
- | K | Recall@K | PPL_ANN | PPL gap |
- |---|---|---|---|
- | 16 | 24.9% | 10.71 | +7.51% |
- | 32 | 22.8% | 10.41 | +4.51% |
- | 64 | 23.1% | 10.20 | +2.42% |
- | 128 | 26.0% | 10.04 | +0.82% |
- | 256 | 31.6% | 9.88 | **−0.79%** |
- | 512 | 40.8% | 9.67 | **−2.89%** |
-
- On this small WikiText slice, K ≥ 256 produced lower measured PPL than
- the full-attention reference. A plausible explanation is sparse-softmax
- denoising, but sample noise (only 12 eval batches), packed-boundary artifacts
- (the pilot trained with packing on; the repo default is now off), and
- partial-layer substitution acting like regularization are also candidates.
- We treat it as a hypothesis to confirm via an exact-topK oracle (full QK^T
- → top-K → restricted attention) at the same K, which separates "denoising
- from any sparsity" from "denoising from learned projections."
-
- Code-level sanity checks pass: same input sequences for `ppl_full` vs
- `ppl_ann`, an intact causal mask in retrieval, and a single softmax over the
- retrieved K with no wrapper leakage between iterations.
-
- ### Compute / quality knobs (FLOP-counted)
-
- `L = 4096`. Compute reduction applies to the attention scoring step, ≈ `L / K`.
- These are FLOP estimates, not measured wall-clock: the FAISS path in this
- repo is a research prototype that does CPU index builds and GPU↔CPU
- transfers, so it is not the right thing to time.
-
- | K | PPL gap | Attention scoring reduction |
- |---|---|---|
- | 512 | −2.89% | ~8× |
- | 256 | −0.79% | ~16× |
- | 128 | +0.82% | ~32× |
- | 64 | +2.42% | ~64× |
- | 32 | +4.51% | ~128× |
- | 16 | +7.51% | ~256× |
-
- Eval scope: 12 sequences × 4K tokens of WikiText-103 validation (~50K
- tokens). Read these as "what we observed on this slice", not population-
- level estimates.
-
- The K-sweep recall numbers (24–41%) and the in-training `evaluate()` recall
- (50.9% at K=128) come from different sampled subsets of the streaming split
- and shouldn't be directly compared. The repo also reports `mass@K` (the sum of
- teacher attention probability captured by the search top-K), which is the
- more direct retrieval-quality metric when the softmax is sharp.
-
- ### Per-layer recall (pilot)
-
- | Layer | Recall@K=128 | Recall@K=512 |
- |---|---|---|
- | 4 | 15.8% | 34.7% |
- | 8 | 22.2% | 38.7% |
- | 12 | 23.4% | 39.1% |
- | 16 | 31.9% | 45.2% |
- | 20 | 31.4% | 42.6% |
- | 24 | 31.1% | 44.4% |
-
- Early layers are harder for content-addressable retrieval: their attention
- is more local/positional than semantic. The pattern is consistent across K,
- so it's a property of the layer rather than noise.
-
- ### Caveats / what's next
-
- - **Packing**: pilot training and eval ran with sequence packing on (no
- segment-level causal mask, since transformers' default forward doesn't
- build one). The relative PPL gap between full and ANN is internally
- consistent under this confound, but the negative gap at K≥256 has at
- least three candidate explanations we haven't disentangled:
- (a) sparse-softmax denoising, (b) ANN happening to filter cross-document
- keys that full attention attends to, (c) sample noise on a small eval.
- The default config now has packing off so the next run isolates (a).
- - **Exact-topK oracle**: a four-way Pareto (full vs. exact top-K vs.
- search-topK exact vs. search-ANN) is the natural follow-up to separate
- "denoising from any sparsity" from "denoising from learned projections."
- - **Wall-clock**: not measured. The FAISS path in the repo is a CPU-side
- research prototype, not a deployable runtime. A GPU-resident top-k kernel
- is the next engineering step.
- - A **34-layer headline run** was queued (`make_headline_config()` is wired) and
- will mirror its checkpoints here when it runs.
-
- ## Files
-
- | File | What |
- |---|---|
- | `search_step_1000.pt` | Mid-training checkpoint (step 1000, 0.68% PPL gap) |
- | `search_step_2000.pt` | Final pilot checkpoint (step 2000, 0.71% PPL gap) |
-
- Each contains `{step, search_module: state_dict, optimizer, scheduler, config}`.
-
- ## Loading
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM
- # Search module class is in the GitHub repo (model.py)
- from model import SearchProjectionModule
-
- base = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen3-4B-Instruct-2507",
-     dtype=torch.bfloat16,
-     device_map="auto",
-     attn_implementation="sdpa",
- )
-
- search = SearchProjectionModule(
-     d_model=2560, d_search=64,
-     layer_indices=[4, 8, 12, 16, 20, 24],
-     use_mlp=False,
- ).to(base.device).to(torch.bfloat16)
-
- ckpt = torch.load("search_step_2000.pt", map_location="cpu", weights_only=False)
- search.load_state_dict(ckpt["search_module"])
- ```
-
- Use `inference.install_ann_attention(...)` (in the GitHub repo) to monkey-patch
- the trained layers and run with FAISS HNSW retrieval at inference time.
-
- ## Training recipe
-
- - Frozen base: Qwen3-4B-Instruct-2507 (36 layers, hidden 2560, GQA 32:8).
- - Data: WikiText-103 raw, 4K-token sequences (packing was on at training
- time; the repo default is now off, see Caveats).
- - 2000 steps, batch 8, lr 1e-4 (cosine, 100-step warmup), AdamW.
- - `α=β=1` (contrastive + KL distillation, losses averaged over layers).
- - bf16 weights, fp32 loss math.
- - SDPA attention (B200, no flash-attn package needed).
- - Liger fused RMSNorm/SwiGLU/RoPE on the frozen base.
- - Total wall-clock: ~25 min on a single B200.
-
- ## License
-
- The search projections are released under Apache-2.0 (matching the base model).
  ---
+ license: mit
  base_model: Qwen/Qwen3-4B-Instruct-2507
  tags:
  - sparse-attention
+ - ann
+ - qwen3
+ - retrieval
+ - research-artifact
  ---

+ # ANN Sparse Attention Checkpoints

+ Research artifact: distillation-trained, ANN-friendly search projections for sparse attention.

+ Base model: `Qwen/Qwen3-4B-Instruct-2507`.

+ ## Current Clean Result

+ The clean methodology is the packed block-causal d128 run in `checkpoints_block_d128/`.
+ Packed examples are isolated with per-document `segment_ids`, reset `position_ids`, and a 4D block-causal attention mask; retrieval, loss masking, mass@K, and recall@K all use the same segment-causal eligibility mask (see the sketch below).
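+ A minimal sketch of that eligibility mask, assuming a boolean `[batch, 1, seq, seq]` layout that SDPA can consume (the helper name and shapes are illustrative, not the repo's actual API):
+
+ ```python
+ import torch
+
+ def block_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
+     """segment_ids: [batch, seq], one integer id per packed document.
+     Returns [batch, 1, seq, seq] bool; True = query may attend to key."""
+     seq = segment_ids.shape[-1]
+     causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool,
+                                    device=segment_ids.device))
+     same_doc = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)
+     return (causal & same_doc).unsqueeze(1)  # broadcast over heads
+
+ # Two packed documents of lengths 3 and 2: token 3 opens a new document,
+ # so it may not attend to tokens 0-2 even though they precede it.
+ mask = block_causal_mask(torch.tensor([[0, 0, 0, 1, 1]]))
+ ```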
+
+ Clean block-causal d128 checkpoint:
+
+ - `checkpoints_block_d128/search_step_1000.pt`
+ - 6 trained layers: `[4, 8, 12, 16, 20, 24]`
+ - `d_search=128`, 3.93M trainable parameters
+ - K=128 exact learned search: PPL gap `+0.07%`, mass@K `0.787`, recall@K `0.744`
+ - K=256 exact learned search: PPL gap `+0.01%`, mass@K `0.953`, recall@K `0.879`
+
+ Interpretation: clean block-causal evaluation shows full-attention parity, not a clean denoising/improvement claim. A loading sketch for this checkpoint follows.
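+ A loading sketch, adapted from the snippet in the previous revision of this README. It assumes the same `SearchProjectionModule` signature from the GitHub repo's `model.py` and the same checkpoint dict keys, with `d_search` raised to 128 to match this checkpoint:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+ # Search module class is in the GitHub repo (model.py)
+ from model import SearchProjectionModule
+
+ base = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen3-4B-Instruct-2507",
+     dtype=torch.bfloat16,
+     device_map="auto",
+     attn_implementation="sdpa",
+ )
+
+ search = SearchProjectionModule(
+     d_model=2560, d_search=128,  # the pilot checkpoints used d_search=64
+     layer_indices=[4, 8, 12, 16, 20, 24],
+     use_mlp=False,
+ ).to(base.device).to(torch.bfloat16)
+
+ ckpt = torch.load("checkpoints_block_d128/search_step_1000.pt",
+                   map_location="cpu", weights_only=False)
+ search.load_state_dict(ckpt["search_module"])
+ ```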
+
+ ## Clean Per-layer Retrieval, K=128
+
+ From `checkpoints_block_d128/search_step_1000.compare_retrieval.json`:
+
+ | Layer | raw-QK oracle mass | learned d128 mass |
+ |---|---:|---:|
+ | 4 | 0.956 | 0.950 |
+ | 8 | 0.977 | 0.976 |
+ | 12 | 0.970 | 0.977 |
+ | 16 | 0.964 | 0.970 |
+ | 20 | 0.970 | 0.983 |
+ | 24 | 0.978 | 0.984 |
+ | avg | 0.969 | 0.973 |
+
+ With segment isolation, the early trained layers are not uniquely diffuse or hard: all six trained layers have high oracle mass, and the learned projections match or slightly exceed raw-QK retrieval. The mass@K metric itself is sketched below.
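+ mass@K is the fraction of teacher attention probability captured by the retrieved top-K (per the previous README revision); recall@K is overlap with the exact teacher top-K. A minimal sketch with illustrative shapes (not the repo's evaluation code):
+
+ ```python
+ import torch
+
+ def retrieval_metrics(teacher_probs, retrieved_idx, teacher_topk_idx):
+     """teacher_probs: [queries, keys] full-attention probs, ineligible keys zeroed.
+     retrieved_idx, teacher_topk_idx: [queries, K] retrieved vs. exact top-K ids."""
+     mass = teacher_probs.gather(-1, retrieved_idx).sum(-1).mean()
+     hits = (retrieved_idx.unsqueeze(-1) == teacher_topk_idx.unsqueeze(-2)).any(-1)
+     return mass, hits.float().mean()  # (mass@K, recall@K)
+ ```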
+
+ ## Quest-style Page Baseline
+
+ From `checkpoints_block_d128/search_step_1000.quest_page16.json`, using page size 16, native post-RoPE Q/K min/max page summaries, and the same block-causal eligibility mask:
+
+ | Method | K | Recall@K | mass@K | PPL | PPL gap |
+ |---|---:|---:|---:|---:|---:|
+ | learned search exact | 128 | 0.744 | 0.787 | 30.47 | +0.07% |
+ | Quest-style page | 128 | 0.669 | 0.727 | 30.41 | -0.11% |
+ | learned search exact | 256 | 0.879 | 0.953 | 30.45 | +0.01% |
+ | Quest-style page | 256 | 0.838 | 0.909 | 30.45 | +0.03% |
+
+ Both methods sit at effectively full-attention parity on PPL. Learned projections recover more teacher attention mass at the same token budget, especially at K=128, but do not yet show a clean PPL advantage over Quest on this slice. The page-scoring rule is sketched below.
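+ The page baseline scores each page with a query-aware upper bound built from per-channel min/max summaries of its keys, in the style of Quest. A minimal sketch of that scoring rule as I read it (illustrative helper, not the repo's code):
+
+ ```python
+ import torch
+
+ def quest_page_scores(q, keys, page_size=16):
+     """q: [d] post-RoPE query; keys: [seq, d] post-RoPE keys.
+     Returns one upper-bound score per full page."""
+     n = keys.shape[0] // page_size * page_size
+     pages = keys[:n].view(-1, page_size, keys.shape[-1])
+     k_min = pages.min(dim=1).values  # [n_pages, d]
+     k_max = pages.max(dim=1).values
+     # Per-channel bound on q @ k for any key in the page, summed over channels
+     return torch.maximum(q * k_min, q * k_max).sum(-1)
+
+ # Selection: keep the top (K // page_size) eligible pages per query,
+ # then attend to every token inside the selected pages.
+ ```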
+
+ ## Packed Leakage-confounded Ablations
+
+ The packed d64/d128/d256 runs are included for capacity-scaling history and should not be used for clean quality claims, because packed examples could attend across document boundaries.
+
+ Packed d_search ablation at K=128:
+
+ | d_search | learned mass@K=128 | raw-QK oracle | learned/oracle | final PPL gap |
+ |---|---:|---:|---:|---:|
+ | 64 | 0.492 | 0.488 | 1.01x | +2.39% |
+ | 128 | 0.503 | 0.488 | 1.03x | -1.81% |
+ | 256 | 0.509 | 0.488 | 1.04x | -1.85% |
+
+ The large negative packed K-sweep gaps were leakage-confounded and should be treated as historical/debugging evidence only, not as the headline.
+
+ ## Folder Guide
+
+ - `checkpoints_block_d128/`: clean block-causal d128 checkpoint and JSON eval artifacts. Use this for current claims.
+ - `checkpoints_packed_d64/`, `checkpoints_packed_d128/`, `checkpoints_packed_d256/`: leakage-confounded packed ablation checkpoints.
+ - `checkpoints_d64/`: earlier unpacked d64 checkpoints.
+ - `checkpoints/`: original pilot checkpoint and compare JSON.
+
+ ## Code
+
+ Source repo: https://github.com/unixsysdev/ann-sparseattention
+
+ The repo README contains the current methodology notes and reproduction commands.
 
checkpoints/search_step_2000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints/search_step_2000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4115331669648488,
+           "8": 0.2802686393260956,
+           "12": 0.1508728675544262,
+           "16": 0.07771651136378448,
+           "20": 0.053202067812283836,
+           "24": 0.10271163408954938
+         },
+         "avg": 0.1793841478518314
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.1072007929906249,
+           "8": 0.23697869976361594,
+           "12": 0.11224483884871006,
+           "16": 0.07637482260664304,
+           "20": 0.06903641019016504,
+           "24": 0.10397349204868078
+         },
+         "avg": 0.11763484274140662
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4668281575043996,
+           "8": 0.3293568876882394,
+           "12": 0.19662011042237282,
+           "16": 0.11227216385304928,
+           "20": 0.07988839099804561,
+           "24": 0.144585732370615
+         },
+         "avg": 0.22159190713945362
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.13327929687996706,
+           "8": 0.2569987513124943,
+           "12": 0.1446449818710486,
+           "16": 0.10432848272224267,
+           "20": 0.09580977975080411,
+           "24": 0.13778831561406454
+         },
+         "avg": 0.14547493469177022
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.5168702056010565,
+           "8": 0.390024371445179,
+           "12": 0.25787363573908806,
+           "16": 0.15796820322672525,
+           "20": 0.11998403631150723,
+           "24": 0.2020296814541022
+         },
+         "avg": 0.2741250222962764
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.1732035626967748,
+           "8": 0.2893482334911823,
+           "12": 0.19321986908713976,
+           "16": 0.14695298795898756,
+           "20": 0.13634028658270836,
+           "24": 0.1862101349979639
+         },
+         "avg": 0.18754584580245945
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.571592112382253,
+           "8": 0.463805615901947,
+           "12": 0.33948806673288345,
+           "16": 0.22183777391910553,
+           "20": 0.18010229741533598,
+           "24": 0.27901028965910274
+         },
+         "avg": 0.34263935933510464
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.235384251922369,
+           "8": 0.3402557211617629,
+           "12": 0.2643987759947777,
+           "16": 0.21102494125564894,
+           "20": 0.19678996006647745,
+           "24": 0.25442706421017647
+         },
+         "avg": 0.25038011910186875
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.6327291478713354,
+           "8": 0.5521376008788744,
+           "12": 0.43828647087017697,
+           "16": 0.31400871525208157,
+           "20": 0.2687250425418218,
+           "24": 0.37980743249257404
+         },
+         "avg": 0.4309490683178107
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3288092017173767,
+           "8": 0.41752680391073227,
+           "12": 0.36623430997133255,
+           "16": 0.30524607251087826,
+           "20": 0.2853706416984399,
+           "24": 0.34850213179985684
+         },
+         "avg": 0.3419481936014361
+       }
+     }
+   },
+   "learned_over_raw_K128": 0.7307395145371917
+ }
checkpoints/search_step_2000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e30257eb414fa7595580d4a3efbb247e02d1bef6875f60fbab90edeb568012d
+ size 11814585
checkpoints_block_d128/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_block_d128/search_step_1000.pt",
+   "by_K": {
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.95585672929883,
+           "8": 0.977135993540287,
+           "12": 0.969620868563652,
+           "16": 0.9640911668539047,
+           "20": 0.9696248471736908,
+           "24": 0.9784364253282547
+         },
+         "avg": 0.9691276717931032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.949747696518898,
+           "8": 0.9758064672350883,
+           "12": 0.9767678007483482,
+           "16": 0.9701211974024773,
+           "20": 0.9826441332697868,
+           "24": 0.9841717407107353
+         },
+         "avg": 0.9732098393142223
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0042122081949907
+ }
checkpoints_block_d128/search_step_1000.k_sweep_exact.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 30.444138765335083,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.7435683117319801,
+       "recall_per_layer": {
+         "4": 0.7372374021825518,
+         "8": 0.7400615944244054,
+         "12": 0.7399612933300728,
+         "16": 0.7452810723237757,
+         "20": 0.7493442218927676,
+         "24": 0.7495242862383078
+       },
+       "mass_avg": 0.7874226044715501,
+       "mass_per_layer": {
+         "4": 0.7964790942667894,
+         "8": 0.760353729720074,
+         "12": 0.7879844721965367,
+         "16": 0.8230430181841107,
+         "20": 0.7992641063936694,
+         "24": 0.7574112060681207
+       },
+       "ppl_ann": 30.465327858924866,
+       "ppl_gap_relative": 0.0006959991134290015,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.8794096146506826,
+       "recall_per_layer": {
+         "4": 0.8783885035021551,
+         "8": 0.878973599137931,
+         "12": 0.8767847521551724,
+         "16": 0.877142544450431,
+         "20": 0.88153076171875,
+         "24": 0.8836375269396551
+       },
+       "mass_avg": 0.9531509574802443,
+       "mass_per_layer": {
+         "4": 0.9445537698679957,
+         "8": 0.9476728768184267,
+         "12": 0.963769320783944,
+         "16": 0.9684469288793104,
+         "20": 0.9556469095164332,
+         "24": 0.9388159390153556
+       },
+       "ppl_ann": 30.448541164398193,
+       "ppl_gap_relative": 0.0001446058007107463,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.0,
+       "recall_per_layer": {
+         "4": 0.0,
+         "8": 0.0,
+         "12": 0.0,
+         "16": 0.0,
+         "20": 0.0,
+         "24": 0.0
+       },
+       "mass_avg": 0.0,
+       "mass_per_layer": {
+         "4": 0.0,
+         "8": 0.0,
+         "12": 0.0,
+         "16": 0.0,
+         "20": 0.0,
+         "24": 0.0
+       },
+       "ppl_ann": 30.447556257247925,
+       "ppl_gap_relative": 0.00011225451109601071,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_block_d128/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c384da8d5a0022126ad37dff6d31cf48a7ea5a9fbd35b02d998349cd15fd32cd
+ size 23611193
checkpoints_block_d128/search_step_1000.quest_page16.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "ppl_full": 30.444138765335083,
+   "page_size": 16,
+   "by_K": {
+     "128": {
+       "mass_avg": 0.7268716323509558,
+       "mass_per_layer": {
+         "4": 0.7368688553639114,
+         "8": 0.7033998429093813,
+         "12": 0.7266774880402885,
+         "16": 0.7590455168754203,
+         "20": 0.742104841151724,
+         "24": 0.6931332497650089
+       },
+       "recall_avg": 0.6693170907452658,
+       "recall_per_layer": {
+         "4": 0.6675438136673136,
+         "8": 0.6733688743494758,
+         "12": 0.669839395260821,
+         "16": 0.6732250795960827,
+         "20": 0.6668304967626628,
+         "24": 0.6650948848352387
+       },
+       "ppl_quest": 30.409221529960632,
+       "ppl_gap_relative": -0.0011469280061950332
+     },
+     "256": {
+       "mass_avg": 0.9088168527888155,
+       "mass_per_layer": {
+         "4": 0.9362414129849138,
+         "8": 0.8892738079202587,
+         "12": 0.8947217217807112,
+         "16": 0.92559814453125,
+         "20": 0.9104866817079741,
+         "24": 0.8965793478077856
+       },
+       "recall_avg": 0.8383452755281295,
+       "recall_per_layer": {
+         "4": 0.8439739490377491,
+         "8": 0.8412483478414601,
+         "12": 0.8357682721368198,
+         "16": 0.8378993067248114,
+         "20": 0.8356821125951307,
+         "24": 0.8354996648328058
+       },
+       "ppl_quest": 30.45472514629364,
+       "ppl_gap_relative": 0.0003477313331198978
+     },
+     "512": {
+       "mass_avg": 0.0,
+       "mass_per_layer": {
+         "4": NaN,
+         "8": NaN,
+         "12": NaN,
+         "16": NaN,
+         "20": NaN,
+         "24": NaN
+       },
+       "recall_avg": 0.0,
+       "recall_per_layer": {
+         "4": NaN,
+         "8": NaN,
+         "12": NaN,
+         "16": NaN,
+         "20": NaN,
+         "24": NaN
+       },
+       "ppl_quest": 30.45515537261963,
+       "ppl_gap_relative": 0.0003618629966662039
+     }
+   }
+ }
checkpoints_block_d128/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:12773073681e6c897150afbfb35d29d62f61916fae404b72449d27e1834a1df6
+ size 23611075
checkpoints_block_d128/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b53e45792882ea76d16f616f7669b31d071586aee4a229b0ea2b9f74c2646e61
+ size 23611139
checkpoints_block_d128/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c4a9441f38ba4de142b31e295a7069fa48f46444dd97ae1cb459563c6d5d7fbd
+ size 23611139
checkpoints_block_d128/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f64aaa18268c8ca4dde60fa81dbb75c3bc8bc4361f8ef73eb0b8bd217998c4c
+ size 23611139
checkpoints_d64/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d25e49d3c884670ced90685bc76de51fa0390e4b98dd5ac573c4d6f3cff679c5
+ size 11814595
checkpoints_d64/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9fbdbf7f19caf959babc916be0da9a209d8fa7afe1fb19ae0cb89718917ae7e5
+ size 11814595
checkpoints_d64/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a891d7560d4c40acee1658d1e25079a37714966cf096a073cd5177c3e272d4c3
+ size 11814595
checkpoints_packed_d128/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_packed_d128/search_step_1000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.31979141881068546,
+           "8": 0.3186202198266983,
+           "12": 0.14174955089886984,
+           "16": 0.20945234596729279,
+           "20": 0.2801314915219943,
+           "24": 0.49455158909161884
+         },
+         "avg": 0.2940494360195266
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.29460810621579486,
+           "8": 0.2879024048646291,
+           "12": 0.38745074967543286,
+           "16": 0.2947010373075803,
+           "20": 0.3400069276491801,
+           "24": 0.5236761594812075
+         },
+         "avg": 0.3547242308656375
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.341281255086263,
+           "8": 0.36785539239645004,
+           "12": 0.21019610886772475,
+           "16": 0.28323352212707203,
+           "20": 0.3406280030806859,
+           "24": 0.5262841482957205
+         },
+         "avg": 0.3449130716423194
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.31440528233846027,
+           "8": 0.318613442281882,
+           "12": 0.42397742718458176,
+           "16": 0.33971526473760605,
+           "20": 0.3927851542830467,
+           "24": 0.5538897663354874
+         },
+         "avg": 0.390564389526844
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.37279334167639416,
+           "8": 0.4341067870457967,
+           "12": 0.295872134466966,
+           "16": 0.37213313827912015,
+           "20": 0.4122797225912412,
+           "24": 0.5643380333979925
+         },
+         "avg": 0.40858719290958506
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.34206587572892505,
+           "8": 0.3614834249019623,
+           "12": 0.47101640701293945,
+           "16": 0.40079283465941745,
+           "20": 0.46224164466063183,
+           "24": 0.5945162226756414
+         },
+         "avg": 0.4386860682732529
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4215788046518962,
+           "8": 0.5176494866609573,
+           "12": 0.4037476380666097,
+           "16": 0.47523776690165204,
+           "20": 0.49859579155842465,
+           "24": 0.6136594414710999
+         },
+         "avg": 0.48841148821844
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.38235953201850253,
+           "8": 0.4212384819984436,
+           "12": 0.5328463464975357,
+           "16": 0.4814741685986519,
+           "20": 0.551252673069636,
+           "24": 0.6483117590347925
+         },
+         "avg": 0.5029138268695937
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4964489738146464,
+           "8": 0.6146760632594427,
+           "12": 0.5339744637409846,
+           "16": 0.5874835352102915,
+           "20": 0.602066790064176,
+           "24": 0.6779818137486776
+         },
+         "avg": 0.5854386066397032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.4403117299079895,
+           "8": 0.5022278105219206,
+           "12": 0.6128821273644766,
+           "16": 0.582685723900795,
+           "20": 0.65766608218352,
+           "24": 0.716396709283193
+         },
+         "avg": 0.5853616971936492
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0296928696416485
+ }
checkpoints_packed_d128/search_step_1000.k_sweep.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 224.6417384147644,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.16599421347341228,
+       "recall_per_layer": {
+         "4": 0.1287512010143649,
+         "8": 0.14480467765561997,
+         "12": 0.1611505323840726,
+         "16": 0.17931722825573337,
+         "20": 0.204620607437626,
+         "24": 0.17732103409305697
+       },
+       "mass_avg": 0.2555904160904628,
+       "mass_per_layer": {
+         "4": 0.13555656902251706,
+         "8": 0.22613397721321352,
+         "12": 0.2812447086457283,
+         "16": 0.2538035146651729,
+         "20": 0.27192030414458246,
+         "24": 0.3648834228515625
+       },
+       "ppl_ann": 203.62537622451782,
+       "ppl_gap_relative": -0.09355501937686794,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.23324623107910156,
+       "recall_per_layer": {
+         "4": 0.19934444427490233,
+         "8": 0.2069737116495768,
+         "12": 0.2247191111246745,
+         "16": 0.24465274810791016,
+         "20": 0.27422657012939455,
+         "24": 0.24956080118815105
+       },
+       "mass_avg": 0.3174514651298523,
+       "mass_per_layer": {
+         "4": 0.20090028444925945,
+         "8": 0.2829151630401611,
+         "12": 0.3357037226359049,
+         "16": 0.3171180884043376,
+         "20": 0.3458467245101929,
+         "24": 0.4222248077392578
+       },
+       "ppl_ann": 207.06066703796387,
+       "ppl_gap_relative": -0.07826271066483625,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.33908390431177043,
+       "recall_per_layer": {
+         "4": 0.30769617216927664,
+         "8": 0.3079519953046526,
+         "12": 0.3272709846496582,
+         "16": 0.34647955213274273,
+         "20": 0.38336784499032156,
+         "24": 0.36173687662397114
+       },
+       "mass_avg": 0.4086176788523084,
+       "mass_per_layer": {
+         "4": 0.30153139574187143,
+         "8": 0.36809664964675903,
+         "12": 0.4162691150392805,
+         "16": 0.4079152686255319,
+         "20": 0.45096262863704134,
+         "24": 0.506931015423366
+       },
+       "ppl_ann": 211.92854118347168,
+       "ppl_gap_relative": -0.056593210687409634,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_packed_d128/search_step_1000.k_sweep_exact.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 224.6417384147644,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.16599421347341228,
+       "recall_per_layer": {
+         "4": 0.1287512010143649,
+         "8": 0.14480467765561997,
+         "12": 0.1611505323840726,
+         "16": 0.17931722825573337,
+         "20": 0.204620607437626,
+         "24": 0.17732103409305697
+       },
+       "mass_avg": 0.2555904160904628,
+       "mass_per_layer": {
+         "4": 0.13555656902251706,
+         "8": 0.22613397721321352,
+         "12": 0.2812447086457283,
+         "16": 0.2538035146651729,
+         "20": 0.27192030414458246,
+         "24": 0.3648834228515625
+       },
+       "ppl_ann": 203.62537622451782,
+       "ppl_gap_relative": -0.09355501937686794,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.23324623107910156,
+       "recall_per_layer": {
+         "4": 0.19934444427490233,
+         "8": 0.2069737116495768,
+         "12": 0.2247191111246745,
+         "16": 0.24465274810791016,
+         "20": 0.27422657012939455,
+         "24": 0.24956080118815105
+       },
+       "mass_avg": 0.3174514651298523,
+       "mass_per_layer": {
+         "4": 0.20090028444925945,
+         "8": 0.2829151630401611,
+         "12": 0.3357037226359049,
+         "16": 0.3171180884043376,
+         "20": 0.3458467245101929,
+         "24": 0.4222248077392578
+       },
+       "ppl_ann": 207.06066703796387,
+       "ppl_gap_relative": -0.07826271066483625,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.33908390431177043,
+       "recall_per_layer": {
+         "4": 0.30769617216927664,
+         "8": 0.3079519953046526,
+         "12": 0.3272709846496582,
+         "16": 0.34647955213274273,
+         "20": 0.38336784499032156,
+         "24": 0.36173687662397114
+       },
+       "mass_avg": 0.4086176788523084,
+       "mass_per_layer": {
+         "4": 0.30153139574187143,
+         "8": 0.36809664964675903,
+         "12": 0.4162691150392805,
+         "16": 0.4079152686255319,
+         "20": 0.45096262863704134,
+         "24": 0.506931015423366
+       },
+       "ppl_ann": 211.92854118347168,
+       "ppl_gap_relative": -0.056593210687409634,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_packed_d128/search_step_1000.k_sweep_exact_skip16.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 193.67203998565674,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.1631414557016024,
+       "recall_per_layer": {
+         "4": 0.1270386480516003,
+         "8": 0.14388988864037297,
+         "12": 0.15978523992723034,
+         "16": 0.18374190791960684,
+         "20": 0.20087420555853075,
+         "24": 0.1635188441122732
+       },
+       "mass_avg": 0.341150162521229,
+       "mass_per_layer": {
+         "4": 0.13802351394007284,
+         "8": 0.31434367164488763,
+         "12": 0.39897729504492974,
+         "16": 0.32906081599573933,
+         "20": 0.34529122229545345,
+         "24": 0.521204456206291
+       },
+       "ppl_ann": 176.66846084594727,
+       "ppl_gap_relative": -0.0877957352076673,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.23127511342366536,
+       "recall_per_layer": {
+         "4": 0.19980506896972655,
+         "8": 0.20826168060302735,
+         "12": 0.22342586517333984,
+         "16": 0.25203742980957033,
+         "20": 0.270965576171875,
+         "24": 0.23315505981445311
+       },
+       "mass_avg": 0.39741740624109906,
+       "mass_per_layer": {
+         "4": 0.2033478021621704,
+         "8": 0.364719812075297,
+         "12": 0.4482226053873698,
+         "16": 0.3924812396367391,
+         "20": 0.4137035528818766,
+         "24": 0.5620294253031413
+       },
+       "ppl_ann": 178.96899604797363,
+       "ppl_gap_relative": -0.07591722552605945,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.33962979203178767,
+       "recall_per_layer": {
+         "4": 0.31196675981794086,
+         "8": 0.3126195158277239,
+         "12": 0.32580297333853586,
+         "16": 0.36052775382995605,
+         "20": 0.3826852866581508,
+         "24": 0.3441764627184187
+       },
+       "mass_avg": 0.4830506351732072,
+       "mass_per_layer": {
+         "4": 0.30267453619412016,
+         "8": 0.44439977407455444,
+         "12": 0.5229217324938092,
+         "16": 0.4867537532533918,
+         "20": 0.5154515334538051,
+         "24": 0.6261024815695626
+       },
+       "ppl_ann": 181.6412000656128,
+       "ppl_gap_relative": -0.06211965300171849,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_packed_d128/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c831fcdae22879b34815b7af47215dcb51772cf0f27fd59f15e9b1eb4f0d2137
+ size 23611129
checkpoints_packed_d128/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:631120827ea8141d461c10ed0b24317643395d66ca707a548f4553e5f38646ac
+ size 23611075
checkpoints_packed_d128/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:45c8d82cd98746fb98f266066431357c7b2662deed2ebb5d34d6a1c85969b073
+ size 23611075
checkpoints_packed_d128/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dcf85676cb91a94c07dfc27db483bd4be83a83374d0f48a79f38118cd6edb8fb
+ size 23611075
checkpoints_packed_d128/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5502cff4ec669daa37c1b989f2344dd5648572a1855e2de32f33d6fd9fcb1a0c
+ size 23611075
checkpoints_packed_d256/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_packed_d256/search_step_1000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.31979141881068546,
+           "8": 0.3186202198266983,
+           "12": 0.14174955089886984,
+           "16": 0.20945234596729279,
+           "20": 0.2801314915219943,
+           "24": 0.49455158909161884
+         },
+         "avg": 0.2940494360195266
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.2933313399553299,
+           "8": 0.29298429439465207,
+           "12": 0.39142733812332153,
+           "16": 0.2983766943216324,
+           "20": 0.3438563868403435,
+           "24": 0.525188535451889
+         },
+         "avg": 0.35752743151452804
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.341281255086263,
+           "8": 0.36785539239645004,
+           "12": 0.21019610886772475,
+           "16": 0.28323352212707203,
+           "20": 0.3406280030806859,
+           "24": 0.5262841482957205
+         },
+         "avg": 0.3449130716423194
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3145047202706337,
+           "8": 0.3253612567981084,
+           "12": 0.4279355009396871,
+           "16": 0.344377006093661,
+           "20": 0.3977118283510208,
+           "24": 0.556083157658577
+         },
+         "avg": 0.39432891168528134
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.37279334167639416,
+           "8": 0.4341067870457967,
+           "12": 0.295872134466966,
+           "16": 0.37213313827912015,
+           "20": 0.4122797225912412,
+           "24": 0.5643380333979925
+         },
+         "avg": 0.40858719290958506
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.34352220843235654,
+           "8": 0.37063097457091015,
+           "12": 0.4752006282409032,
+           "16": 0.4067305897672971,
+           "20": 0.46815096338589984,
+           "24": 0.5977186262607574
+         },
+         "avg": 0.44365899844302076
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4215788046518962,
+           "8": 0.5176494866609573,
+           "12": 0.4037476380666097,
+           "16": 0.47523776690165204,
+           "20": 0.49859579155842465,
+           "24": 0.6136594414710999
+         },
+         "avg": 0.48841148821844
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3849342664082845,
+           "8": 0.43360541264216107,
+           "12": 0.5375104447205862,
+           "16": 0.488710917532444,
+           "20": 0.5576231330633163,
+           "24": 0.6525773753722509
+         },
+         "avg": 0.5091602582898406
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4964489738146464,
+           "8": 0.6146760632594427,
+           "12": 0.5339744637409846,
+           "16": 0.5874835352102915,
+           "20": 0.602066790064176,
+           "24": 0.6779818137486776
+         },
+         "avg": 0.5854386066397032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.44419410079717636,
+           "8": 0.5183951507012049,
+           "12": 0.6179872999588648,
+           "16": 0.5908408363660177,
+           "20": 0.6638876696427664,
+           "24": 0.7216567993164062
+         },
+         "avg": 0.592826976130406
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0424821499328059
+ }
checkpoints_packed_d256/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f93073485115c5ac669cce2f483179310be7e1caa88e53091cc36fd1b31435e3
+ size 47204153
checkpoints_packed_d256/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f763b0057300594fd26916b2137bdde8f004b077ba21fb760fa950b3d1058a9
+ size 47204099
checkpoints_packed_d256/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e14e97ca1f49b5ac54273be7fdf57218e644d9d062edea3a7afb225e76e195c
+ size 47204099
checkpoints_packed_d256/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:85a94bcd5b978022c4010e19ee7c4aca2d267a2d7d7896fb7025595198d0ee8f
+ size 47204099
checkpoints_packed_d256/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a7db9c635cc462e3b08d402ad047cfd33a90fe5e6e57bf8576172e271b67255
+ size 47204099
checkpoints_packed_d64/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_packed_d64/search_step_1000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.31979141881068546,
+           "8": 0.3186202198266983,
+           "12": 0.14174955089886984,
+           "16": 0.20945234596729279,
+           "20": 0.2801314915219943,
+           "24": 0.49455158909161884
+         },
+         "avg": 0.2940494360195266
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.29329264909029007,
+           "8": 0.28124504536390305,
+           "12": 0.37904051691293716,
+           "16": 0.2890646557013194,
+           "20": 0.3339161326487859,
+           "24": 0.5209685415029526
+         },
+         "avg": 0.34958792353669804
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.341281255086263,
+           "8": 0.36785539239645004,
+           "12": 0.21019610886772475,
+           "16": 0.28323352212707203,
+           "20": 0.3406280030806859,
+           "24": 0.5262841482957205
+         },
+         "avg": 0.3449130716423194
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.31185250480969745,
+           "8": 0.3099167247613271,
+           "12": 0.41439520567655563,
+           "16": 0.33257966736952466,
+           "20": 0.3849489390850067,
+           "24": 0.5501216500997543
+         },
+         "avg": 0.383969115300311
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.37279334167639416,
+           "8": 0.4341067870457967,
+           "12": 0.295872134466966,
+           "16": 0.37213313827912015,
+           "20": 0.4122797225912412,
+           "24": 0.5643380333979925
+         },
+         "avg": 0.40858719290958506
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3371334026257197,
+           "8": 0.3503660187125206,
+           "12": 0.4602071891228358,
+           "16": 0.3917141556739807,
+           "20": 0.45263657718896866,
+           "24": 0.5895731498797735
+         },
+         "avg": 0.43027174886729985
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4215788046518962,
+           "8": 0.5176494866609573,
+           "12": 0.4037476380666097,
+           "16": 0.47523776690165204,
+           "20": 0.49859579155842465,
+           "24": 0.6136594414710999
+         },
+         "avg": 0.48841148821844
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.37442514797051746,
+           "8": 0.40738723675409955,
+           "12": 0.5206492592891058,
+           "16": 0.4700211783250173,
+           "20": 0.5400509238243103,
+           "24": 0.6421113312244415
+         },
+         "avg": 0.4924408462312486
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4964489738146464,
+           "8": 0.6146760632594427,
+           "12": 0.5339744637409846,
+           "16": 0.5874835352102915,
+           "20": 0.602066790064176,
+           "24": 0.6779818137486776
+         },
+         "avg": 0.5854386066397032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.4301423355937004,
+           "8": 0.48568252722422284,
+           "12": 0.599395309885343,
+           "16": 0.5693490405877432,
+           "20": 0.6461250483989716,
+           "24": 0.709612175822258
+         },
+         "avg": 0.5733844062520398
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0082499247253711
+ }
checkpoints_packed_d64/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c3c1e631d9a7a3838dbba1274d598f143b4e9e1dffca474b9f27f292470962a
+ size 11814649
checkpoints_packed_d64/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:63fe19831fae43c71698480b3fbee53712a69375a3526b54d78e454124ec120c
+ size 11814595
checkpoints_packed_d64/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:993c324c3b9e1a74ea5600e90662db176db7e33d027609447c6277354fa35ae0
+ size 11814595
checkpoints_packed_d64/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8a229f9353532df19b4af03e668938205e4ff4576fa53095bf632bf945e9f4ff
+ size 11814595
checkpoints_packed_d64/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b6e49834e2902223c4c3f512ef2affa2920c03905162c6a53ccb0749bd1606c1
+ size 11814595