datasysdev committed
Commit 57c6b5b · verified · 1 Parent(s): 66efa56

Upload clean block-causal and packed pilot checkpoints

Files changed (35)
  1. README.md +77 -215
  2. checkpoints/search_step_2000.compare_retrieval.json +127 -0
  3. checkpoints/search_step_2000.pt +3 -0
  4. checkpoints_block_d128/search_step_1000.compare_retrieval.json +31 -0
  5. checkpoints_block_d128/search_step_1000.k_sweep_exact.json +74 -0
  6. checkpoints_block_d128/search_step_1000.pt +3 -0
  7. checkpoints_block_d128/search_step_1000.quest_page16.json +72 -0
  8. checkpoints_block_d128/search_step_200.pt +3 -0
  9. checkpoints_block_d128/search_step_400.pt +3 -0
  10. checkpoints_block_d128/search_step_600.pt +3 -0
  11. checkpoints_block_d128/search_step_800.pt +3 -0
  12. checkpoints_d64/search_step_200.pt +3 -0
  13. checkpoints_d64/search_step_400.pt +3 -0
  14. checkpoints_d64/search_step_600.pt +3 -0
  15. checkpoints_packed_d128/search_step_1000.compare_retrieval.json +127 -0
  16. checkpoints_packed_d128/search_step_1000.k_sweep.json +74 -0
  17. checkpoints_packed_d128/search_step_1000.k_sweep_exact.json +74 -0
  18. checkpoints_packed_d128/search_step_1000.k_sweep_exact_skip16.json +74 -0
  19. checkpoints_packed_d128/search_step_1000.pt +3 -0
  20. checkpoints_packed_d128/search_step_200.pt +3 -0
  21. checkpoints_packed_d128/search_step_400.pt +3 -0
  22. checkpoints_packed_d128/search_step_600.pt +3 -0
  23. checkpoints_packed_d128/search_step_800.pt +3 -0
  24. checkpoints_packed_d256/search_step_1000.compare_retrieval.json +127 -0
  25. checkpoints_packed_d256/search_step_1000.pt +3 -0
  26. checkpoints_packed_d256/search_step_200.pt +3 -0
  27. checkpoints_packed_d256/search_step_400.pt +3 -0
  28. checkpoints_packed_d256/search_step_600.pt +3 -0
  29. checkpoints_packed_d256/search_step_800.pt +3 -0
  30. checkpoints_packed_d64/search_step_1000.compare_retrieval.json +127 -0
  31. checkpoints_packed_d64/search_step_1000.pt +3 -0
  32. checkpoints_packed_d64/search_step_200.pt +3 -0
  33. checkpoints_packed_d64/search_step_400.pt +3 -0
  34. checkpoints_packed_d64/search_step_600.pt +3 -0
  35. checkpoints_packed_d64/search_step_800.pt +3 -0
README.md CHANGED
@@ -1,225 +1,87 @@
  ---
- language:
- - en
- license: apache-2.0
  base_model: Qwen/Qwen3-4B-Instruct-2507
  tags:
  - sparse-attention
- - ann-attention
- - distillation
- - search-projection
- - inference-optimization
- library_name: pytorch
  ---

- # ann-sparseattention

- Search projections for ANN-substituted attention on
- [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

- Code: [github.com/unixsysdev/ann-sparseattention](https://github.com/unixsysdev/ann-sparseattention)

- ## Current status

- Research prototype. The trained projections work, the runtime is a correctness
- prototype, and the eval envelope is narrow. Treat reported numbers as preliminary.
-
- **Validated:** 6-layer pilot on Qwen3-4B-Instruct-2507; WikiText-103 PPL
- preserved at K=128 (gap ≈ +0.7%); learned projections retrieve attention-
- relevant keys.
-
- **Not yet validated:** 34-layer / whole-model substitution; long-context
- tasks (LongBench, RULER, needle); wall-clock speedup vs FlashAttention/SDPA;
- KV-cache decode-mode integration; GPU-resident ANN kernel.
-
- **Runtime caveat:** the FAISS path here builds CPU indexes per batch and
- the gather step uses dense-style tensor expansion. Compute-reduction
- numbers below are *algorithmic scoring reductions, not measured wall-clock
- speedups.*
-
- ## Relation to RetrievalAttention
-
- RetrievalAttention (Liu et al., 2024) shows that **vanilla ANN over the
- model's native Q, K vectors fails** because Q and K live in mismatched
- distributions: they were never trained to be each other's nearest
- neighbors, only to score via dot product. Their fix is at *index time*:
- an attention-aware graph construction (RoarGraph-style).
-
- This work attacks the same problem from the opposite direction. We
- **train a tiny shared projection** (`W_Qs, W_Ks → R^64`) so that
- `q_search` and `k_search` live in the same distribution by construction.
- Off-the-shelf FAISS HNSW with default parameters then suffices.
-
- | | Search space | Index | Trainable |
- |---|---|---|---|
- | Raw Q/K + vanilla ANN | original Q/K | off-the-shelf | no; fails (Q/K OOD) |
- | RetrievalAttention | original Q/K | attention-aware graph | no |
- | **This work** | **learned Q\_s / K\_s** | **off-the-shelf** | **yes (~2-11M params)** |
-
- Contribution: *eliminate Q/K mismatch at index-build time via distillation,
- instead of patching it at search time.* The clean validating experiment
- (vanilla FAISS over raw Q/K vs. learned Q\_s/K\_s vs. exact teacher top-K)
- is the next planned run.
-
- ## What's in this repo
-
- Per-layer linear search projections `(W_Qs, W_Ks)` of shape `[2560, 64]`,
- trained against the frozen base model's attention via contrastive +
- distillation losses. At inference these produce 64-d "search vectors" that
- let an off-the-shelf FAISS HNSW index pick the top-K keys to attend to,
- replacing dense `O(L²)` attention with `O(L·K)` ANN-substituted attention.
-
- Layers covered (pilot): `[4, 8, 12, 16, 20, 24]` (6 of 36 layers, ~2M trainable params).
-
- ## Pilot results (final, 2K steps on WikiText-103)
-
- | Step | Recall@K=128 | PPL gap (full vs ANN) |
- |---|---|---|
- | 500 | 47.4% | 1.21% |
- | 1000 | 50.7% | 0.68% |
- | 1500 | 50.9% | 0.68% |
- | **2000 (final)** | **50.9%** | **0.71%** |
-
- PPL gap is the primary signal: at <1% relative gap, the model's output
- quality is preserved under ANN substitution. Recall plateaus around step 1000
- because the softmax-relevant keys concentrate in the top ~30; disagreement
- on positions 30-128 is on the near-zero-weight tail and doesn't affect output.
-
- ### K-retrieve Pareto (pilot step 2000, FAISS HNSW)
-
- `PPL_full = 9.958`
-
- | K | Recall@K | PPL_ANN | PPL gap |
- |---|---|---|---|
- | 16 | 24.9% | 10.71 | +7.51% |
- | 32 | 22.8% | 10.41 | +4.51% |
- | 64 | 23.1% | 10.20 | +2.42% |
- | 128 | 26.0% | 10.04 | +0.82% |
- | 256 | 31.6% | 9.88 | **−0.79%** |
- | 512 | 40.8% | 9.67 | **−2.89%** |
-
- On this small WikiText slice, K ≥ 256 produced lower measured PPL than
- the full-attention reference. A plausible explanation is sparse-softmax
- denoising, but sample noise (only 12 eval batches), packed-boundary artifacts
- (the pilot trained with packing on; the repo default is now off), and
- partial-layer substitution acting like regularization are also candidates.
- We treat it as a hypothesis to confirm via an exact-topK oracle (full QK^T
- → top-K → restricted attention) at the same K, which separates "denoising
- from any sparsity" from "denoising from learned projections."
-
- Code-level sanity checks pass: same input sequences for `ppl_full` vs
- `ppl_ann`, an intact causal mask in retrieval, and a single softmax over the
- retrieved K with no wrapper leakage between iterations.
-
- ### Compute / quality knobs (FLOP-counted)
-
- `L = 4096`. Compute reduction applies to the attention scoring step, ≈ `L / K`.
- These are FLOP estimates, not measured wall-clock: the FAISS path in this
- repo is a research prototype that does CPU index builds and GPU↔CPU
- transfers, so it is not the right thing to time.
-
- | K | PPL gap | Attention scoring reduction |
- |---|---|---|
- | 512 | −2.89% | ~8× |
- | 256 | −0.79% | ~16× |
- | 128 | +0.82% | ~32× |
- | 64 | +2.42% | ~64× |
- | 32 | +4.51% | ~128× |
- | 16 | +7.51% | ~256× |
-
- Eval scope: 12 sequences × 4K tokens of WikiText-103 validation (~50K
- tokens). Read these as "what we observed on this slice", not population-
- level estimates.
-
- The K-sweep recall numbers (24–41%) and the in-training `evaluate()` recall
- (50.9% at K=128) come from different sampled subsets of the streaming split
- and shouldn't be directly compared. The repo also reports `mass@K` (the sum of
- teacher attention probability captured by the search top-K), which is the
- more direct retrieval-quality metric when the softmax is sharp.
-
- ### Per-layer recall (pilot)
-
- | Layer | Recall@K=128 | Recall@K=512 |
- |---|---|---|
- | 4 | 15.8% | 34.7% |
- | 8 | 22.2% | 38.7% |
- | 12 | 23.4% | 39.1% |
- | 16 | 31.9% | 45.2% |
- | 20 | 31.4% | 42.6% |
- | 24 | 31.1% | 44.4% |
-
- Early layers are harder for content-addressable retrieval: their attention
- is more local/positional than semantic. The pattern is consistent across K,
- so it's a property of the layer rather than noise.
-
- ### Caveats / what's next
-
- - **Packing**: pilot training and eval ran with sequence packing on (no
- segment-level causal mask, since transformers' default forward doesn't
- build one). The relative PPL gap between full and ANN is internally
- consistent under this confound, but the negative gap at K≥256 has at
- least three candidate explanations we haven't disentangled:
- (a) sparse-softmax denoising, (b) ANN happening to filter cross-document
- keys that full attention attends to, (c) sample noise on a small eval.
- The default config now has packing off so the next run isolates (a).
- - **Exact-topK oracle**: a four-way Pareto (full vs. exact top-K vs.
- search-topK exact vs. search-ANN) is the natural follow-up to separate
- "denoising from any sparsity" from "denoising from learned projections."
- - **Wall-clock**: not measured. The FAISS path in the repo is a CPU-side
- research prototype, not a deployable runtime. A GPU-resident top-k kernel
- is the next engineering step.
- - A **34-layer headline run** was queued (`make_headline_config()` is wired) and
- will mirror its checkpoints here when it runs.
-
- ## Files
-
- | File | What |
- |---|---|
- | `search_step_1000.pt` | Mid-training checkpoint (step 1000, 0.68% PPL gap) |
- | `search_step_2000.pt` | Final pilot checkpoint (step 2000, 0.71% PPL gap) |
-
- Each contains `{step, search_module: state_dict, optimizer, scheduler, config}`.
-
- ## Loading
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM
- # Search module class is in the GitHub repo (model.py)
- from model import SearchProjectionModule
-
- base = AutoModelForCausalLM.from_pretrained(
-     "Qwen/Qwen3-4B-Instruct-2507",
-     dtype=torch.bfloat16,
-     device_map="auto",
-     attn_implementation="sdpa",
- )
-
- search = SearchProjectionModule(
-     d_model=2560, d_search=64,
-     layer_indices=[4, 8, 12, 16, 20, 24],
-     use_mlp=False,
- ).to(base.device).to(torch.bfloat16)
-
- ckpt = torch.load("search_step_2000.pt", map_location="cpu", weights_only=False)
- search.load_state_dict(ckpt["search_module"])
- ```
-
- Use `inference.install_ann_attention(...)` (in the GitHub repo) to monkey-patch
- the trained layers and run with FAISS HNSW retrieval at inference time.
-
- ## Training recipe
-
- - Frozen base: Qwen3-4B-Instruct-2507 (36 layers, hidden 2560, GQA 32:8).
- - Data: WikiText-103 raw, 4K-token sequences (packing was on at training
- time; the repo default is now off, see Caveats).
- - 2000 steps, batch 8, lr 1e-4 (cosine, 100-step warmup), AdamW.
- - `α=β=1` (contrastive + KL distillation, losses averaged over layers).
- - bf16 weights, fp32 loss math.
- - SDPA attention (B200, no flash-attn package needed).
- - Liger fused RMSNorm/SwiGLU/RoPE on the frozen base.
- - Total wall-clock: ~25 min on a single B200.
-
- ## License
-
- The search projections are released under Apache-2.0 (matching the base model).
  ---
+ license: mit
  base_model: Qwen/Qwen3-4B-Instruct-2507
  tags:
  - sparse-attention
+ - ann
+ - qwen3
+ - retrieval
+ - research-artifact
  ---

+ # ANN Sparse Attention Checkpoints

+ Research artifact: distillation-trained, ANN-friendly search projections for sparse attention.

+ Base model: `Qwen/Qwen3-4B-Instruct-2507`.

+ ## Current Clean Result

+ The clean methodology is the packed block-causal d128 run in `checkpoints_block_d128/`.
+ Packed examples are isolated with per-document `segment_ids`, reset `position_ids`, and a 4D block-causal attention mask; retrieval, loss masking, mass@K, and recall@K all use the same segment-causal eligibility mask (see the sketch below).
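+ A minimal sketch of that eligibility mask, assuming a boolean `[batch, 1, seq, seq]` layout that SDPA can consume (the helper name and shapes are illustrative, not the repo's actual API):
+
+ ```python
+ import torch
+
+ def block_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
+     """segment_ids: [batch, seq], one integer id per packed document.
+     Returns [batch, 1, seq, seq] bool; True = query may attend to key."""
+     seq = segment_ids.shape[-1]
+     causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool,
+                                    device=segment_ids.device))
+     same_doc = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)
+     return (causal & same_doc).unsqueeze(1)  # broadcast over heads
+
+ # Two packed documents of lengths 3 and 2: token 3 opens a new document,
+ # so it may not attend to tokens 0-2 even though they precede it.
+ mask = block_causal_mask(torch.tensor([[0, 0, 0, 1, 1]]))
+ ```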
+
+ Clean block-causal d128 checkpoint:
+
+ - `checkpoints_block_d128/search_step_1000.pt`
+ - 6 trained layers: `[4, 8, 12, 16, 20, 24]`
+ - `d_search=128`, 3.93M trainable parameters
+ - K=128 exact learned search: PPL gap `+0.07%`, mass@K `0.787`, recall@K `0.744`
+ - K=256 exact learned search: PPL gap `+0.01%`, mass@K `0.953`, recall@K `0.879`
+
+ Interpretation: clean block-causal evaluation shows full-attention parity, not a clean denoising/improvement claim. A loading sketch for this checkpoint follows.
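+ A loading sketch, adapted from the snippet in the previous revision of this README. It assumes the same `SearchProjectionModule` signature from the GitHub repo's `model.py` and the same checkpoint dict keys, with `d_search` raised to 128 to match this checkpoint:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+ # Search module class is in the GitHub repo (model.py)
+ from model import SearchProjectionModule
+
+ base = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen3-4B-Instruct-2507",
+     dtype=torch.bfloat16,
+     device_map="auto",
+     attn_implementation="sdpa",
+ )
+
+ search = SearchProjectionModule(
+     d_model=2560, d_search=128,  # the pilot checkpoints used d_search=64
+     layer_indices=[4, 8, 12, 16, 20, 24],
+     use_mlp=False,
+ ).to(base.device).to(torch.bfloat16)
+
+ ckpt = torch.load("checkpoints_block_d128/search_step_1000.pt",
+                   map_location="cpu", weights_only=False)
+ search.load_state_dict(ckpt["search_module"])
+ ```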
+
+ ## Clean Per-layer Retrieval, K=128
+
+ From `checkpoints_block_d128/search_step_1000.compare_retrieval.json`:
+
+ | Layer | raw-QK oracle mass | learned d128 mass |
+ |---|---:|---:|
+ | 4 | 0.956 | 0.950 |
+ | 8 | 0.977 | 0.976 |
+ | 12 | 0.970 | 0.977 |
+ | 16 | 0.964 | 0.970 |
+ | 20 | 0.970 | 0.983 |
+ | 24 | 0.978 | 0.984 |
+ | avg | 0.969 | 0.973 |
+
+ With segment isolation, the early trained layers are not uniquely diffuse or hard: all six trained layers have high oracle mass, and the learned projections match or slightly exceed raw-QK retrieval. The mass@K metric itself is sketched below.
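+ mass@K is the fraction of teacher attention probability captured by the retrieved top-K (per the previous README revision); recall@K is overlap with the exact teacher top-K. A minimal sketch with illustrative shapes (not the repo's evaluation code):
+
+ ```python
+ import torch
+
+ def retrieval_metrics(teacher_probs, retrieved_idx, teacher_topk_idx):
+     """teacher_probs: [queries, keys] full-attention probs, ineligible keys zeroed.
+     retrieved_idx, teacher_topk_idx: [queries, K] retrieved vs. exact top-K ids."""
+     mass = teacher_probs.gather(-1, retrieved_idx).sum(-1).mean()
+     hits = (retrieved_idx.unsqueeze(-1) == teacher_topk_idx.unsqueeze(-2)).any(-1)
+     return mass, hits.float().mean()  # (mass@K, recall@K)
+ ```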
+
+ ## Quest-style Page Baseline
+
+ From `checkpoints_block_d128/search_step_1000.quest_page16.json`, using page size 16, native post-RoPE Q/K min/max page summaries, and the same block-causal eligibility mask:
+
+ | Method | K | Recall@K | mass@K | PPL | PPL gap |
+ |---|---:|---:|---:|---:|---:|
+ | learned search exact | 128 | 0.744 | 0.787 | 30.47 | +0.07% |
+ | Quest-style page | 128 | 0.669 | 0.727 | 30.41 | -0.11% |
+ | learned search exact | 256 | 0.879 | 0.953 | 30.45 | +0.01% |
+ | Quest-style page | 256 | 0.838 | 0.909 | 30.45 | +0.03% |
+
+ Both methods sit at effectively full-attention parity on PPL. Learned projections recover more teacher attention mass at the same token budget, especially at K=128, but do not yet show a clean PPL advantage over Quest on this slice. The page-scoring rule is sketched below.
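+ The page baseline scores each page with a query-aware upper bound built from per-channel min/max summaries of its keys, in the style of Quest. A minimal sketch of that scoring rule as I read it (illustrative helper, not the repo's code):
+
+ ```python
+ import torch
+
+ def quest_page_scores(q, keys, page_size=16):
+     """q: [d] post-RoPE query; keys: [seq, d] post-RoPE keys.
+     Returns one upper-bound score per full page."""
+     n = keys.shape[0] // page_size * page_size
+     pages = keys[:n].view(-1, page_size, keys.shape[-1])
+     k_min = pages.min(dim=1).values  # [n_pages, d]
+     k_max = pages.max(dim=1).values
+     # Per-channel bound on q @ k for any key in the page, summed over channels
+     return torch.maximum(q * k_min, q * k_max).sum(-1)
+
+ # Selection: keep the top (K // page_size) eligible pages per query,
+ # then attend to every token inside the selected pages.
+ ```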
+
+ ## Packed Leakage-confounded Ablations
+
+ The packed d64/d128/d256 runs are included for capacity-scaling history and should not be used for clean quality claims, because packed examples could attend across document boundaries.
+
+ Packed d_search ablation at K=128:
+
+ | d_search | learned mass@K=128 | raw-QK oracle | learned/oracle | final PPL gap |
+ |---|---:|---:|---:|---:|
+ | 64 | 0.492 | 0.488 | 1.01x | +2.39% |
+ | 128 | 0.503 | 0.488 | 1.03x | -1.81% |
+ | 256 | 0.509 | 0.488 | 1.04x | -1.85% |
+
+ The large negative packed K-sweep gaps were leakage-confounded and should be treated as historical/debugging evidence only, not as the headline.
+
+ ## Folder Guide
+
+ - `checkpoints_block_d128/`: clean block-causal d128 checkpoint and JSON eval artifacts. Use this for current claims.
+ - `checkpoints_packed_d64/`, `checkpoints_packed_d128/`, `checkpoints_packed_d256/`: leakage-confounded packed ablation checkpoints.
+ - `checkpoints_d64/`: earlier unpacked d64 checkpoints.
+ - `checkpoints/`: original pilot checkpoint and compare JSON.
+
+ ## Code
+
+ Source repo: https://github.com/unixsysdev/ann-sparseattention
+
+ The repo README contains the current methodology notes and reproduction commands.
 
checkpoints/search_step_2000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints/search_step_2000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4115331669648488,
+           "8": 0.2802686393260956,
+           "12": 0.1508728675544262,
+           "16": 0.07771651136378448,
+           "20": 0.053202067812283836,
+           "24": 0.10271163408954938
+         },
+         "avg": 0.1793841478518314
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.1072007929906249,
+           "8": 0.23697869976361594,
+           "12": 0.11224483884871006,
+           "16": 0.07637482260664304,
+           "20": 0.06903641019016504,
+           "24": 0.10397349204868078
+         },
+         "avg": 0.11763484274140662
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4668281575043996,
+           "8": 0.3293568876882394,
+           "12": 0.19662011042237282,
+           "16": 0.11227216385304928,
+           "20": 0.07988839099804561,
+           "24": 0.144585732370615
+         },
+         "avg": 0.22159190713945362
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.13327929687996706,
+           "8": 0.2569987513124943,
+           "12": 0.1446449818710486,
+           "16": 0.10432848272224267,
+           "20": 0.09580977975080411,
+           "24": 0.13778831561406454
+         },
+         "avg": 0.14547493469177022
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.5168702056010565,
+           "8": 0.390024371445179,
+           "12": 0.25787363573908806,
+           "16": 0.15796820322672525,
+           "20": 0.11998403631150723,
+           "24": 0.2020296814541022
+         },
+         "avg": 0.2741250222962764
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.1732035626967748,
+           "8": 0.2893482334911823,
+           "12": 0.19321986908713976,
+           "16": 0.14695298795898756,
+           "20": 0.13634028658270836,
+           "24": 0.1862101349979639
+         },
+         "avg": 0.18754584580245945
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.571592112382253,
+           "8": 0.463805615901947,
+           "12": 0.33948806673288345,
+           "16": 0.22183777391910553,
+           "20": 0.18010229741533598,
+           "24": 0.27901028965910274
+         },
+         "avg": 0.34263935933510464
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.235384251922369,
+           "8": 0.3402557211617629,
+           "12": 0.2643987759947777,
+           "16": 0.21102494125564894,
+           "20": 0.19678996006647745,
+           "24": 0.25442706421017647
+         },
+         "avg": 0.25038011910186875
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.6327291478713354,
+           "8": 0.5521376008788744,
+           "12": 0.43828647087017697,
+           "16": 0.31400871525208157,
+           "20": 0.2687250425418218,
+           "24": 0.37980743249257404
+         },
+         "avg": 0.4309490683178107
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3288092017173767,
+           "8": 0.41752680391073227,
+           "12": 0.36623430997133255,
+           "16": 0.30524607251087826,
+           "20": 0.2853706416984399,
+           "24": 0.34850213179985684
+         },
+         "avg": 0.3419481936014361
+       }
+     }
+   },
+   "learned_over_raw_K128": 0.7307395145371917
+ }
checkpoints/search_step_2000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e30257eb414fa7595580d4a3efbb247e02d1bef6875f60fbab90edeb568012d
+ size 11814585
checkpoints_block_d128/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_block_d128/search_step_1000.pt",
+   "by_K": {
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.95585672929883,
+           "8": 0.977135993540287,
+           "12": 0.969620868563652,
+           "16": 0.9640911668539047,
+           "20": 0.9696248471736908,
+           "24": 0.9784364253282547
+         },
+         "avg": 0.9691276717931032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.949747696518898,
+           "8": 0.9758064672350883,
+           "12": 0.9767678007483482,
+           "16": 0.9701211974024773,
+           "20": 0.9826441332697868,
+           "24": 0.9841717407107353
+         },
+         "avg": 0.9732098393142223
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0042122081949907
+ }
checkpoints_block_d128/search_step_1000.k_sweep_exact.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 30.444138765335083,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.7435683117319801,
+       "recall_per_layer": {
+         "4": 0.7372374021825518,
+         "8": 0.7400615944244054,
+         "12": 0.7399612933300728,
+         "16": 0.7452810723237757,
+         "20": 0.7493442218927676,
+         "24": 0.7495242862383078
+       },
+       "mass_avg": 0.7874226044715501,
+       "mass_per_layer": {
+         "4": 0.7964790942667894,
+         "8": 0.760353729720074,
+         "12": 0.7879844721965367,
+         "16": 0.8230430181841107,
+         "20": 0.7992641063936694,
+         "24": 0.7574112060681207
+       },
+       "ppl_ann": 30.465327858924866,
+       "ppl_gap_relative": 0.0006959991134290015,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.8794096146506826,
+       "recall_per_layer": {
+         "4": 0.8783885035021551,
+         "8": 0.878973599137931,
+         "12": 0.8767847521551724,
+         "16": 0.877142544450431,
+         "20": 0.88153076171875,
+         "24": 0.8836375269396551
+       },
+       "mass_avg": 0.9531509574802443,
+       "mass_per_layer": {
+         "4": 0.9445537698679957,
+         "8": 0.9476728768184267,
+         "12": 0.963769320783944,
+         "16": 0.9684469288793104,
+         "20": 0.9556469095164332,
+         "24": 0.9388159390153556
+       },
+       "ppl_ann": 30.448541164398193,
+       "ppl_gap_relative": 0.0001446058007107463,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.0,
+       "recall_per_layer": {
+         "4": 0.0,
+         "8": 0.0,
+         "12": 0.0,
+         "16": 0.0,
+         "20": 0.0,
+         "24": 0.0
+       },
+       "mass_avg": 0.0,
+       "mass_per_layer": {
+         "4": 0.0,
+         "8": 0.0,
+         "12": 0.0,
+         "16": 0.0,
+         "20": 0.0,
+         "24": 0.0
+       },
+       "ppl_ann": 30.447556257247925,
+       "ppl_gap_relative": 0.00011225451109601071,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_block_d128/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c384da8d5a0022126ad37dff6d31cf48a7ea5a9fbd35b02d998349cd15fd32cd
+ size 23611193
checkpoints_block_d128/search_step_1000.quest_page16.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "ppl_full": 30.444138765335083,
+   "page_size": 16,
+   "by_K": {
+     "128": {
+       "mass_avg": 0.7268716323509558,
+       "mass_per_layer": {
+         "4": 0.7368688553639114,
+         "8": 0.7033998429093813,
+         "12": 0.7266774880402885,
+         "16": 0.7590455168754203,
+         "20": 0.742104841151724,
+         "24": 0.6931332497650089
+       },
+       "recall_avg": 0.6693170907452658,
+       "recall_per_layer": {
+         "4": 0.6675438136673136,
+         "8": 0.6733688743494758,
+         "12": 0.669839395260821,
+         "16": 0.6732250795960827,
+         "20": 0.6668304967626628,
+         "24": 0.6650948848352387
+       },
+       "ppl_quest": 30.409221529960632,
+       "ppl_gap_relative": -0.0011469280061950332
+     },
+     "256": {
+       "mass_avg": 0.9088168527888155,
+       "mass_per_layer": {
+         "4": 0.9362414129849138,
+         "8": 0.8892738079202587,
+         "12": 0.8947217217807112,
+         "16": 0.92559814453125,
+         "20": 0.9104866817079741,
+         "24": 0.8965793478077856
+       },
+       "recall_avg": 0.8383452755281295,
+       "recall_per_layer": {
+         "4": 0.8439739490377491,
+         "8": 0.8412483478414601,
+         "12": 0.8357682721368198,
+         "16": 0.8378993067248114,
+         "20": 0.8356821125951307,
+         "24": 0.8354996648328058
+       },
+       "ppl_quest": 30.45472514629364,
+       "ppl_gap_relative": 0.0003477313331198978
+     },
+     "512": {
+       "mass_avg": 0.0,
+       "mass_per_layer": {
+         "4": NaN,
+         "8": NaN,
+         "12": NaN,
+         "16": NaN,
+         "20": NaN,
+         "24": NaN
+       },
+       "recall_avg": 0.0,
+       "recall_per_layer": {
+         "4": NaN,
+         "8": NaN,
+         "12": NaN,
+         "16": NaN,
+         "20": NaN,
+         "24": NaN
+       },
+       "ppl_quest": 30.45515537261963,
+       "ppl_gap_relative": 0.0003618629966662039
+     }
+   }
+ }
checkpoints_block_d128/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:12773073681e6c897150afbfb35d29d62f61916fae404b72449d27e1834a1df6
+ size 23611075
checkpoints_block_d128/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b53e45792882ea76d16f616f7669b31d071586aee4a229b0ea2b9f74c2646e61
+ size 23611139
checkpoints_block_d128/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c4a9441f38ba4de142b31e295a7069fa48f46444dd97ae1cb459563c6d5d7fbd
+ size 23611139
checkpoints_block_d128/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f64aaa18268c8ca4dde60fa81dbb75c3bc8bc4361f8ef73eb0b8bd217998c4c
+ size 23611139
checkpoints_d64/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d25e49d3c884670ced90685bc76de51fa0390e4b98dd5ac573c4d6f3cff679c5
+ size 11814595
checkpoints_d64/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9fbdbf7f19caf959babc916be0da9a209d8fa7afe1fb19ae0cb89718917ae7e5
+ size 11814595
checkpoints_d64/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a891d7560d4c40acee1658d1e25079a37714966cf096a073cd5177c3e272d4c3
+ size 11814595
checkpoints_packed_d128/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_packed_d128/search_step_1000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.31979141881068546,
+           "8": 0.3186202198266983,
+           "12": 0.14174955089886984,
+           "16": 0.20945234596729279,
+           "20": 0.2801314915219943,
+           "24": 0.49455158909161884
+         },
+         "avg": 0.2940494360195266
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.29460810621579486,
+           "8": 0.2879024048646291,
+           "12": 0.38745074967543286,
+           "16": 0.2947010373075803,
+           "20": 0.3400069276491801,
+           "24": 0.5236761594812075
+         },
+         "avg": 0.3547242308656375
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.341281255086263,
+           "8": 0.36785539239645004,
+           "12": 0.21019610886772475,
+           "16": 0.28323352212707203,
+           "20": 0.3406280030806859,
+           "24": 0.5262841482957205
+         },
+         "avg": 0.3449130716423194
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.31440528233846027,
+           "8": 0.318613442281882,
+           "12": 0.42397742718458176,
+           "16": 0.33971526473760605,
+           "20": 0.3927851542830467,
+           "24": 0.5538897663354874
+         },
+         "avg": 0.390564389526844
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.37279334167639416,
+           "8": 0.4341067870457967,
+           "12": 0.295872134466966,
+           "16": 0.37213313827912015,
+           "20": 0.4122797225912412,
+           "24": 0.5643380333979925
+         },
+         "avg": 0.40858719290958506
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.34206587572892505,
+           "8": 0.3614834249019623,
+           "12": 0.47101640701293945,
+           "16": 0.40079283465941745,
+           "20": 0.46224164466063183,
+           "24": 0.5945162226756414
+         },
+         "avg": 0.4386860682732529
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4215788046518962,
+           "8": 0.5176494866609573,
+           "12": 0.4037476380666097,
+           "16": 0.47523776690165204,
+           "20": 0.49859579155842465,
+           "24": 0.6136594414710999
+         },
+         "avg": 0.48841148821844
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.38235953201850253,
+           "8": 0.4212384819984436,
+           "12": 0.5328463464975357,
+           "16": 0.4814741685986519,
+           "20": 0.551252673069636,
+           "24": 0.6483117590347925
+         },
+         "avg": 0.5029138268695937
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4964489738146464,
+           "8": 0.6146760632594427,
+           "12": 0.5339744637409846,
+           "16": 0.5874835352102915,
+           "20": 0.602066790064176,
+           "24": 0.6779818137486776
+         },
+         "avg": 0.5854386066397032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.4403117299079895,
+           "8": 0.5022278105219206,
+           "12": 0.6128821273644766,
+           "16": 0.582685723900795,
+           "20": 0.65766608218352,
+           "24": 0.716396709283193
+         },
+         "avg": 0.5853616971936492
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0296928696416485
+ }
checkpoints_packed_d128/search_step_1000.k_sweep.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 224.6417384147644,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.16599421347341228,
+       "recall_per_layer": {
+         "4": 0.1287512010143649,
+         "8": 0.14480467765561997,
+         "12": 0.1611505323840726,
+         "16": 0.17931722825573337,
+         "20": 0.204620607437626,
+         "24": 0.17732103409305697
+       },
+       "mass_avg": 0.2555904160904628,
+       "mass_per_layer": {
+         "4": 0.13555656902251706,
+         "8": 0.22613397721321352,
+         "12": 0.2812447086457283,
+         "16": 0.2538035146651729,
+         "20": 0.27192030414458246,
+         "24": 0.3648834228515625
+       },
+       "ppl_ann": 203.62537622451782,
+       "ppl_gap_relative": -0.09355501937686794,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.23324623107910156,
+       "recall_per_layer": {
+         "4": 0.19934444427490233,
+         "8": 0.2069737116495768,
+         "12": 0.2247191111246745,
+         "16": 0.24465274810791016,
+         "20": 0.27422657012939455,
+         "24": 0.24956080118815105
+       },
+       "mass_avg": 0.3174514651298523,
+       "mass_per_layer": {
+         "4": 0.20090028444925945,
+         "8": 0.2829151630401611,
+         "12": 0.3357037226359049,
+         "16": 0.3171180884043376,
+         "20": 0.3458467245101929,
+         "24": 0.4222248077392578
+       },
+       "ppl_ann": 207.06066703796387,
+       "ppl_gap_relative": -0.07826271066483625,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.33908390431177043,
+       "recall_per_layer": {
+         "4": 0.30769617216927664,
+         "8": 0.3079519953046526,
+         "12": 0.3272709846496582,
+         "16": 0.34647955213274273,
+         "20": 0.38336784499032156,
+         "24": 0.36173687662397114
+       },
+       "mass_avg": 0.4086176788523084,
+       "mass_per_layer": {
+         "4": 0.30153139574187143,
+         "8": 0.36809664964675903,
+         "12": 0.4162691150392805,
+         "16": 0.4079152686255319,
+         "20": 0.45096262863704134,
+         "24": 0.506931015423366
+       },
+       "ppl_ann": 211.92854118347168,
+       "ppl_gap_relative": -0.056593210687409634,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_packed_d128/search_step_1000.k_sweep_exact.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 224.6417384147644,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.16599421347341228,
+       "recall_per_layer": {
+         "4": 0.1287512010143649,
+         "8": 0.14480467765561997,
+         "12": 0.1611505323840726,
+         "16": 0.17931722825573337,
+         "20": 0.204620607437626,
+         "24": 0.17732103409305697
+       },
+       "mass_avg": 0.2555904160904628,
+       "mass_per_layer": {
+         "4": 0.13555656902251706,
+         "8": 0.22613397721321352,
+         "12": 0.2812447086457283,
+         "16": 0.2538035146651729,
+         "20": 0.27192030414458246,
+         "24": 0.3648834228515625
+       },
+       "ppl_ann": 203.62537622451782,
+       "ppl_gap_relative": -0.09355501937686794,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.23324623107910156,
+       "recall_per_layer": {
+         "4": 0.19934444427490233,
+         "8": 0.2069737116495768,
+         "12": 0.2247191111246745,
+         "16": 0.24465274810791016,
+         "20": 0.27422657012939455,
+         "24": 0.24956080118815105
+       },
+       "mass_avg": 0.3174514651298523,
+       "mass_per_layer": {
+         "4": 0.20090028444925945,
+         "8": 0.2829151630401611,
+         "12": 0.3357037226359049,
+         "16": 0.3171180884043376,
+         "20": 0.3458467245101929,
+         "24": 0.4222248077392578
+       },
+       "ppl_ann": 207.06066703796387,
+       "ppl_gap_relative": -0.07826271066483625,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.33908390431177043,
+       "recall_per_layer": {
+         "4": 0.30769617216927664,
+         "8": 0.3079519953046526,
+         "12": 0.3272709846496582,
+         "16": 0.34647955213274273,
+         "20": 0.38336784499032156,
+         "24": 0.36173687662397114
+       },
+       "mass_avg": 0.4086176788523084,
+       "mass_per_layer": {
+         "4": 0.30153139574187143,
+         "8": 0.36809664964675903,
+         "12": 0.4162691150392805,
+         "16": 0.4079152686255319,
+         "20": 0.45096262863704134,
+         "24": 0.506931015423366
+       },
+       "ppl_ann": 211.92854118347168,
+       "ppl_gap_relative": -0.056593210687409634,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_packed_d128/search_step_1000.k_sweep_exact_skip16.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "ppl_full": 193.67203998565674,
+   "by_K": {
+     "128": {
+       "recall_avg": 0.1631414557016024,
+       "recall_per_layer": {
+         "4": 0.1270386480516003,
+         "8": 0.14388988864037297,
+         "12": 0.15978523992723034,
+         "16": 0.18374190791960684,
+         "20": 0.20087420555853075,
+         "24": 0.1635188441122732
+       },
+       "mass_avg": 0.341150162521229,
+       "mass_per_layer": {
+         "4": 0.13802351394007284,
+         "8": 0.31434367164488763,
+         "12": 0.39897729504492974,
+         "16": 0.32906081599573933,
+         "20": 0.34529122229545345,
+         "24": 0.521204456206291
+       },
+       "ppl_ann": 176.66846084594727,
+       "ppl_gap_relative": -0.0877957352076673,
+       "faiss_diag": {}
+     },
+     "256": {
+       "recall_avg": 0.23127511342366536,
+       "recall_per_layer": {
+         "4": 0.19980506896972655,
+         "8": 0.20826168060302735,
+         "12": 0.22342586517333984,
+         "16": 0.25203742980957033,
+         "20": 0.270965576171875,
+         "24": 0.23315505981445311
+       },
+       "mass_avg": 0.39741740624109906,
+       "mass_per_layer": {
+         "4": 0.2033478021621704,
+         "8": 0.364719812075297,
+         "12": 0.4482226053873698,
+         "16": 0.3924812396367391,
+         "20": 0.4137035528818766,
+         "24": 0.5620294253031413
+       },
+       "ppl_ann": 178.96899604797363,
+       "ppl_gap_relative": -0.07591722552605945,
+       "faiss_diag": {}
+     },
+     "512": {
+       "recall_avg": 0.33962979203178767,
+       "recall_per_layer": {
+         "4": 0.31196675981794086,
+         "8": 0.3126195158277239,
+         "12": 0.32580297333853586,
+         "16": 0.36052775382995605,
+         "20": 0.3826852866581508,
+         "24": 0.3441764627184187
+       },
+       "mass_avg": 0.4830506351732072,
+       "mass_per_layer": {
+         "4": 0.30267453619412016,
+         "8": 0.44439977407455444,
+         "12": 0.5229217324938092,
+         "16": 0.4867537532533918,
+         "20": 0.5154515334538051,
+         "24": 0.6261024815695626
+       },
+       "ppl_ann": 181.6412000656128,
+       "ppl_gap_relative": -0.06211965300171849,
+       "faiss_diag": {}
+     }
+   }
+ }
checkpoints_packed_d128/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c831fcdae22879b34815b7af47215dcb51772cf0f27fd59f15e9b1eb4f0d2137
+ size 23611129
checkpoints_packed_d128/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:631120827ea8141d461c10ed0b24317643395d66ca707a548f4553e5f38646ac
+ size 23611075
checkpoints_packed_d128/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:45c8d82cd98746fb98f266066431357c7b2662deed2ebb5d34d6a1c85969b073
+ size 23611075
checkpoints_packed_d128/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dcf85676cb91a94c07dfc27db483bd4be83a83374d0f48a79f38118cd6edb8fb
+ size 23611075
checkpoints_packed_d128/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5502cff4ec669daa37c1b989f2344dd5648572a1855e2de32f33d6fd9fcb1a0c
+ size 23611075
checkpoints_packed_d256/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_packed_d256/search_step_1000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.31979141881068546,
+           "8": 0.3186202198266983,
+           "12": 0.14174955089886984,
+           "16": 0.20945234596729279,
+           "20": 0.2801314915219943,
+           "24": 0.49455158909161884
+         },
+         "avg": 0.2940494360195266
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.2933313399553299,
+           "8": 0.29298429439465207,
+           "12": 0.39142733812332153,
+           "16": 0.2983766943216324,
+           "20": 0.3438563868403435,
+           "24": 0.525188535451889
+         },
+         "avg": 0.35752743151452804
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.341281255086263,
+           "8": 0.36785539239645004,
+           "12": 0.21019610886772475,
+           "16": 0.28323352212707203,
+           "20": 0.3406280030806859,
+           "24": 0.5262841482957205
+         },
+         "avg": 0.3449130716423194
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3145047202706337,
+           "8": 0.3253612567981084,
+           "12": 0.4279355009396871,
+           "16": 0.344377006093661,
+           "20": 0.3977118283510208,
+           "24": 0.556083157658577
+         },
+         "avg": 0.39432891168528134
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.37279334167639416,
+           "8": 0.4341067870457967,
+           "12": 0.295872134466966,
+           "16": 0.37213313827912015,
+           "20": 0.4122797225912412,
+           "24": 0.5643380333979925
+         },
+         "avg": 0.40858719290958506
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.34352220843235654,
+           "8": 0.37063097457091015,
+           "12": 0.4752006282409032,
+           "16": 0.4067305897672971,
+           "20": 0.46815096338589984,
+           "24": 0.5977186262607574
+         },
+         "avg": 0.44365899844302076
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4215788046518962,
+           "8": 0.5176494866609573,
+           "12": 0.4037476380666097,
+           "16": 0.47523776690165204,
+           "20": 0.49859579155842465,
+           "24": 0.6136594414710999
+         },
+         "avg": 0.48841148821844
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3849342664082845,
+           "8": 0.43360541264216107,
+           "12": 0.5375104447205862,
+           "16": 0.488710917532444,
+           "20": 0.5576231330633163,
+           "24": 0.6525773753722509
+         },
+         "avg": 0.5091602582898406
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4964489738146464,
+           "8": 0.6146760632594427,
+           "12": 0.5339744637409846,
+           "16": 0.5874835352102915,
+           "20": 0.602066790064176,
+           "24": 0.6779818137486776
+         },
+         "avg": 0.5854386066397032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.44419410079717636,
+           "8": 0.5183951507012049,
+           "12": 0.6179872999588648,
+           "16": 0.5908408363660177,
+           "20": 0.6638876696427664,
+           "24": 0.7216567993164062
+         },
+         "avg": 0.592826976130406
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0424821499328059
+ }
checkpoints_packed_d256/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f93073485115c5ac669cce2f483179310be7e1caa88e53091cc36fd1b31435e3
+ size 47204153
checkpoints_packed_d256/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f763b0057300594fd26916b2137bdde8f004b077ba21fb760fa950b3d1058a9
+ size 47204099
checkpoints_packed_d256/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e14e97ca1f49b5ac54273be7fdf57218e644d9d062edea3a7afb225e76e195c
+ size 47204099
checkpoints_packed_d256/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:85a94bcd5b978022c4010e19ee7c4aca2d267a2d7d7896fb7025595198d0ee8f
+ size 47204099
checkpoints_packed_d256/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a7db9c635cc462e3b08d402ad047cfd33a90fe5e6e57bf8576172e271b67255
+ size 47204099
checkpoints_packed_d64/search_step_1000.compare_retrieval.json ADDED
@@ -0,0 +1,127 @@
+ {
+   "model": "Qwen/Qwen3-4B-Instruct-2507",
+   "ckpt": "/tmp/checkpoints_packed_d64/search_step_1000.pt",
+   "by_K": {
+     "16": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.31979141881068546,
+           "8": 0.3186202198266983,
+           "12": 0.14174955089886984,
+           "16": 0.20945234596729279,
+           "20": 0.2801314915219943,
+           "24": 0.49455158909161884
+         },
+         "avg": 0.2940494360195266
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.29329264909029007,
+           "8": 0.28124504536390305,
+           "12": 0.37904051691293716,
+           "16": 0.2890646557013194,
+           "20": 0.3339161326487859,
+           "24": 0.5209685415029526
+         },
+         "avg": 0.34958792353669804
+       }
+     },
+     "32": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.341281255086263,
+           "8": 0.36785539239645004,
+           "12": 0.21019610886772475,
+           "16": 0.28323352212707203,
+           "20": 0.3406280030806859,
+           "24": 0.5262841482957205
+         },
+         "avg": 0.3449130716423194
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.31185250480969745,
+           "8": 0.3099167247613271,
+           "12": 0.41439520567655563,
+           "16": 0.33257966736952466,
+           "20": 0.3849489390850067,
+           "24": 0.5501216500997543
+         },
+         "avg": 0.383969115300311
+       }
+     },
+     "64": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.37279334167639416,
+           "8": 0.4341067870457967,
+           "12": 0.295872134466966,
+           "16": 0.37213313827912015,
+           "20": 0.4122797225912412,
+           "24": 0.5643380333979925
+         },
+         "avg": 0.40858719290958506
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.3371334026257197,
+           "8": 0.3503660187125206,
+           "12": 0.4602071891228358,
+           "16": 0.3917141556739807,
+           "20": 0.45263657718896866,
+           "24": 0.5895731498797735
+         },
+         "avg": 0.43027174886729985
+       }
+     },
+     "128": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4215788046518962,
+           "8": 0.5176494866609573,
+           "12": 0.4037476380666097,
+           "16": 0.47523776690165204,
+           "20": 0.49859579155842465,
+           "24": 0.6136594414710999
+         },
+         "avg": 0.48841148821844
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.37442514797051746,
+           "8": 0.40738723675409955,
+           "12": 0.5206492592891058,
+           "16": 0.4700211783250173,
+           "20": 0.5400509238243103,
+           "24": 0.6421113312244415
+         },
+         "avg": 0.4924408462312486
+       }
+     },
+     "256": {
+       "raw_qk": {
+         "per_layer": {
+           "4": 0.4964489738146464,
+           "8": 0.6146760632594427,
+           "12": 0.5339744637409846,
+           "16": 0.5874835352102915,
+           "20": 0.602066790064176,
+           "24": 0.6779818137486776
+         },
+         "avg": 0.5854386066397032
+       },
+       "learned": {
+         "per_layer": {
+           "4": 0.4301423355937004,
+           "8": 0.48568252722422284,
+           "12": 0.599395309885343,
+           "16": 0.5693490405877432,
+           "20": 0.6461250483989716,
+           "24": 0.709612175822258
+         },
+         "avg": 0.5733844062520398
+       }
+     }
+   },
+   "learned_over_raw_K128": 1.0082499247253711
+ }
checkpoints_packed_d64/search_step_1000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c3c1e631d9a7a3838dbba1274d598f143b4e9e1dffca474b9f27f292470962a
+ size 11814649
checkpoints_packed_d64/search_step_200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:63fe19831fae43c71698480b3fbee53712a69375a3526b54d78e454124ec120c
+ size 11814595
checkpoints_packed_d64/search_step_400.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:993c324c3b9e1a74ea5600e90662db176db7e33d027609447c6277354fa35ae0
+ size 11814595
checkpoints_packed_d64/search_step_600.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8a229f9353532df19b4af03e668938205e4ff4576fa53095bf632bf945e9f4ff
+ size 11814595
checkpoints_packed_d64/search_step_800.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b6e49834e2902223c4c3f512ef2affa2920c03905162c6a53ccb0749bd1606c1
+ size 11814595