datasysdev committed
Commit 5e73334 · verified · 1 Parent(s): 8b19534

Update README.md

Files changed (1):
  1. README.md +59 -13

README.md CHANGED
@@ -46,8 +46,11 @@ Broad-layer experiments:
  - All-36 step 500: recall@K=0.816, PPL gap +3.23%.
  - All-36 step 750 regressed to +3.96% despite stable recall.
  - Per-layer mass@K identified L00/L01/L02 as the weak early layers.
- - A follow-up all32 run reserves full attention on `[0, 1, 2, 35]` and trains
- layers `3..34`; checkpoints will be mirrored here as they become useful.
+ - The all32 reserved-edge run keeps full attention on `[0, 1, 2, 35]` and
+ trains layers `3..34`. Final step 1000: recall@K=0.825, +1.746% PPL gap in
+ training eval, and 20.97M trained search-projection parameters. The exact
+ K-sweep gives a +0.590% PPL gap at K=128 and -0.062% at K=256 on a small
+ clean block-causal slice.

  ## Important results

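A note on the convention behind the gap figures above and in the tables below: the percentages are consistent with the usual relative definition, gap = PPL_sparse / PPL_full - 1. The formula is inferred from the reported numbers, not quoted from the repo; a one-line check:

```python
# Assumed convention: PPL gap = PPL_sparse / PPL_full - 1, in percent.
ppl_full = 20.5349         # PPL_full from the exact K-sweep table below
ppl_sparse_k64 = 20.94     # sparse PPL at K=64 in the same table
gap = (ppl_sparse_k64 / ppl_full - 1) * 100
print(f"{gap:+.2f}%")      # +1.97%, matching the reported +1.974% up to rounding
```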
 
@@ -98,7 +101,7 @@ This validates that the learned search vectors are compatible with
  off-the-shelf ANN. It is not a wall-clock result: the prototype uses CPU FAISS
  and per-forward index construction.

- ### All-36 result so far
+ ### All-36 and all32 broad-layer results

  | Step | Recall@K eval | PPL gap |
  |---:|---:|---:|
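The context lines above note that the ANN prototype uses CPU FAISS with per-forward index construction. A minimal sketch of that pattern, assuming a flat inner-product index over d=128 search vectors; the shapes, names, and index type are illustrative, not the repo's actual code:

```python
import numpy as np
import faiss  # CPU FAISS, as in the prototype

def retrieve_topk(key_vecs: np.ndarray, query_vecs: np.ndarray, k: int) -> np.ndarray:
    """Build a fresh flat inner-product index over the low-dim key search
    vectors (rebuilt every forward pass), then return top-k key ids per query."""
    index = faiss.IndexFlatIP(key_vecs.shape[1])  # exact search, no training step
    index.add(key_vecs)
    _, ids = index.search(query_vecs, k)          # ids: (num_queries, k)
    return ids

# Hypothetical shapes for one layer: 4096 keys and 8 queries at d=128.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 128)).astype(np.float32)
queries = rng.standard_normal((8, 128)).astype(np.float32)
print(retrieve_topk(keys, queries, k=128).shape)  # (8, 128)
```

Rebuilding the index inside every forward pass is part of why this is not a wall-clock result; a persistent or incrementally updated index would be the natural next step.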
@@ -129,15 +132,53 @@ Per-layer step-500 mass@K at K=128:
  | L35 | 0.980 | 0.967 | -0.013 |
  | avg | 0.966 | 0.960 | -0.006 |

- The next run reserves `[0, 1, 2, 35]` and trains layers `3..34`.
+ This diagnostic motivated reserving `[0, 1, 2, 35]` as full-attention layers
+ and training only layers `3..34`.

- First diagnostic from the active all32 run:
+ Final all32 reserved-edge training trajectory:

  | Step | Recall@K eval | PPL gap | Read |
  |---:|---:|---:|---|
- | 250 | 0.812 | +2.28% | already better than all36 best training eval |
-
- This is not a final result; the run is continuing toward step 1000.
+ | 250 | 0.812 | +2.283% | already better than all36 best training eval |
+ | 500 | 0.823 | +1.753% | converged to final quality band |
+ | 750 | 0.825 | +1.943% | small eval fluctuation |
+ | 1000 | 0.825 | +1.746% | final checkpoint; essentially tied with step 500 |
+
+ The all32 checkpoint is the current broad-substitution result. It is not
+ full-attention parity at K=128 in training eval, but it reduces the all36
+ quality cost while still substituting 32 of 36 layers. Post-hoc
+ `compare_retrieval` on step 1000 shows learned retrieval matches raw-QK mass
+ on the substituted layers: at K=128, learned mass is 0.971 vs raw-QK 0.969;
+ at K=256, learned mass is 0.993 vs raw-QK 0.994.
+
+ Exact K-sweep on the final all32 checkpoint, on a 2-batch clean block-causal
+ slice (`PPL_full = 20.5349`):
+
+ | K | mass@K | Recall@K | sparse PPL | PPL gap |
+ |---:|---:|---:|---:|---:|
+ | 16 | 0.546 | 0.518 | 24.86 | +21.064% |
+ | 32 | 0.627 | 0.572 | 21.85 | +6.422% |
+ | 64 | 0.722 | 0.652 | 20.94 | +1.974% |
+ | 128 | 0.807 | 0.746 | 20.66 | +0.590% |
+ | 256 | 0.902 | 0.876 | 20.52 | -0.062% |
+
+ K=512 is intentionally omitted from this table. The current script produced
+ a valid sparse-attention PPL line for K=512 but zero mass/recall, an
+ edge-case bug in the metric path when K exceeds the number of valid causal
+ keys for most same-segment queries. The sweep should be rerun after fixing
+ the metric handling; the publishable range for now is K <= 256.
+
+ Coverage now looks like a real deployment knob:
+
+ | Configuration | Layers substituted | Coverage | PPL gap | Read |
+ |---|---:|---:|---:|---|
+ | Clean six-layer pilot | 6/36 | 17% | +0.07% at K=128 | quality-preserving pilot |
+ | all32 reserved-edge | 32/36 | 89% | +1.746% train eval; +0.590% exact sweep | near-parity broad substitution |
+ | all36 | 36/36 | 100% | +3.23% best observed | full substitution costs quality |
+
+ This is not yet enough to claim an optimal coverage ratio, but it suggests
+ the best deployment point is intermediate rather than "sparsify everything."
+ A 12/18/20-layer coverage sweep is the next clean experiment.

  ## Positioning against related methods

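On the K=512 metric bug flagged in the hunk above: a minimal sketch of mass@K/recall@K with the per-query clamp the fix would need. It assumes post-softmax weights with invalid causal positions already zeroed and FAISS-style -1 padding on short retrieval rows; the names and shapes are hypothetical, not the repo's actual metric path:

```python
import torch

def mass_recall_at_k(attn: torch.Tensor, retrieved: torch.Tensor, k: int):
    """attn: (Q, N) post-softmax weights, zeros at invalid (non-causal) keys.
    retrieved: (Q, k) key ids from learned search, -1 in padded slots.
    Clamps k per query to the valid causal key count, so K > valid keys
    degrades gracefully instead of reporting zero mass/recall."""
    masses, recalls = [], []
    for q in range(attn.shape[0]):
        n_valid = int((attn[q] > 0).sum())
        k_eff = min(k, n_valid)                  # the missing clamp
        ids = retrieved[q]
        ids = ids[ids >= 0][:k_eff]              # drop -1 padding
        true_top = attn[q].topk(k_eff).indices
        masses.append(attn[q][ids].sum().item())  # mass@K: captured probability
        hits = len(set(ids.tolist()) & set(true_top.tolist()))
        recalls.append(hits / max(k_eff, 1))      # recall@K vs exact top-k_eff
    return sum(masses) / len(masses), sum(recalls) / len(recalls)
```

With a clamp along these lines, a rerun of the sweep should let K=512 report meaningful mass/recall instead of being dropped from the table.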
 
@@ -156,9 +197,10 @@ closest in practical baseline behavior to Quest.
  | This work | trained low-dim retrieval | yes | yes | O(N log N) | over retrieved set |

  This is a design-positioning table, not a claim of completed production
- superiority. The clean result proves the approach for the six-layer pilot; the
- active all32 reserved-layer run tests whether broad near-whole-model
- substitution can preserve that quality.
+ superiority. The clean result proves the approach for the six-layer pilot,
+ and the all32 reserved-edge run shows broad substitution can get close to
+ parity when weak edge layers keep full attention. All36 full substitution is
+ still not at parity.

  This method targets a different deployment scenario than native
  sliding-window/state-space/hybrid architectures such as Mistral-style sliding
@@ -175,7 +217,9 @@ Important checkpoint paths in this HF repo:
  - `checkpoints_block_d128/search_step_1000.pt`: clean six-layer d128 parity checkpoint.
  - `checkpoints_all36_d128_block/protected/search_step_500_keep.pt`: best observed all-36 checkpoint so far.
  - `checkpoints_all36_d128_block/search_step_800.pt`: latest all-36 checkpoint before stopping for analysis.
- - `checkpoints_all32_d128_block_reserve_0_1_2_35/`: active follow-up, uploaded as useful checkpoints are saved.
+ - `checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.pt`: final all32 reserved-edge checkpoint.
+ - `checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.compare_retrieval.json`: all32 per-layer retrieval comparison.
+ - `checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.k_sweep_exact.json`: all32 exact K-sweep.

  These checkpoints contain the trained search projection module and optimizer
  state. They do not contain or modify the base Qwen model weights.
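A minimal sketch of inspecting one of these checkpoints locally; the top-level key names are assumptions, since only the presence of the search projection module and optimizer state is documented:

```python
import torch

# Hypothetical local copy of the final all32 reserved-edge checkpoint.
ckpt = torch.load(
    "checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.pt",
    map_location="cpu",
)
print(sorted(ckpt.keys()))  # expected: search module state dict + optimizer state

# Assuming a 'search_state_dict' key holds the trained projection weights;
# the run above reports 20.97M trained search-projection parameters.
n_params = sum(t.numel() for t in ckpt["search_state_dict"].values())
print(f"{n_params / 1e6:.2f}M search-projection parameters")
```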
@@ -187,6 +231,8 @@ state. They do not contain or modify the base Qwen model weights.
  - No autoregressive KV-cache integration yet.
  - Dynamic indexing is currently supported only by a retrieval-mass proxy.
  - Main clean results are single-model and mostly single-seed.
- - All-36 broad substitution is not full-attention parity yet.
+ - All-36 broad substitution is not full-attention parity.
+ - The all32 result is near parity on a small slice but still needs larger
+ eval slices, task benchmarks, and a coverage Pareto sweep.

  Use the GitHub repository for runnable code, scripts, and the LaTeX paper draft.