richiejp committed on
Commit
deecf20
·
verified ·
1 Parent(s): 5760d09

Sync model card with upstream GitHub inference README

Files changed (1)
  1. README.md +141 -354
README.md CHANGED
@@ -15,366 +15,155 @@ license: apache-2.0
15
  [![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
16
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
17
 
18
- **Local Voice Quality Enhancement** — a compact neural model for joint
19
- acoustic echo cancellation (AEC), noise suppression, and dereverberation of
20
- 16 kHz speech, designed to run on commodity CPUs in real time.
21
-
22
- - Two sizes — choose by CPU budget:
23
- - **v1.3 (current)** — 4.8 M parameters (~19 MB F32), ~3.2 ms per 16 ms
24
- frame on Zen4 (4 threads), **≈5× realtime**, ~34 MiB peak RSS.
25
- - **v1.2** — 1.3 M parameters (~5 MB F32), ~1.6 ms per 16 ms frame on
26
- Zen4 (4 threads), **≈10× realtime**, ~20 MiB peak RSS.
27
- - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
28
- - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
29
- PyTorch reference included for verification and research
30
-
31
- Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.
32
-
33
- This page is the Hugging Face model card — it hosts the published weights.
34
- Source code, build system, tests, and training pipeline live in the GitHub
35
- repository: <https://github.com/localai-org/LocalVQE>.
36
-
37
- The current release is **v1.3**. It widens the encoder/decoder
38
- (mic channels `[2,112,32,104,96,152]`, far `[2,64,32]`, bottleneck
39
- 256) and trains from scratch under a noise-floor-aware loss recipe.
40
- On doubletalk it filters noise better than v1.2; on far-end-only
41
- echo it cancels harder but the residual rates rougher in AECMOS;
42
- some users will prefer v1.2's gentler trade-off on FE-ST scenes.
43
- v1.2 stays available as the small/fast option (~1/4 the per-hop
44
- cost). Both reuse v1.2's 1024 ms echo-search window.
45
-
46
- **Authors:**
47
- - Richard Palethorpe ([richiejp](https://github.com/richiejp))
48
- - Claude (Anthropic)
49
-
50
- LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023,
51
- *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
52
- Cancellation, Noise Suppression and Dereverberation*,
53
- [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)): smaller, GGML-native,
54
- and tuned for streaming CPU inference. The architecture is documented in
55
- the technical report linked above.
56
-
57
- ## A concrete example
58
-
59
- Picture a video call from a laptop. Your microphone picks up three things
60
- alongside your voice:
61
-
62
- 1. The remote participant's voice, played back through your speakers and
63
- caught again by your mic — this is the **echo**. Without cancellation
64
- they hear themselves a fraction of a second later.
65
- 2. Your own voice bouncing off walls, desk, and monitor before reaching
66
- the mic — this is **reverberation**, the "tunnel" or "bathroom" sound
67
- that makes you feel far away from the listener.
68
- 3. A fan, keyboard clatter, a dog barking, or traffic outside — plain
69
- **background noise**.
70
-
71
- LocalVQE removes all three in a single causal pass, frame by frame, on
72
- the CPU, so only your voice reaches the far end.
73
-
74
- ## Why this, and not a classical AEC/NS stack?
75
-
76
- Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
77
- NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
78
- frame and remain a strong baseline when the acoustic path is benign. LocalVQE
79
- is interesting when you want:
80
-
81
- - **Robustness to non-linear echo paths** (small loudspeakers, handheld
82
- devices, plastic laptop chassis) where linear AEC leaves residual echo.
83
- - **Non-stationary noise suppression** (babble, keyboards, fans changing
84
- speed) that energy-based noise estimators struggle with.
85
- - **One model, many conditions** — no per-device tuning of step sizes,
86
- forgetting factors, or VAD thresholds.
87
- - **A single deterministic causal pass** — no double-talk detector, no
88
- adaptation state that can diverge.
89
-
90
- The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
91
- ~1–2 ms/frame. On anything larger than a microcontroller that's still a
92
- small fraction of a real-time budget.
93
-
94
- ## Why this, and not DeepVQE?
95
-
96
- Microsoft never released DeepVQE — no weights, no reference
97
- implementation, no streaming runtime. We re-implemented it from the
98
- paper as a GGML graph at
99
- [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
100
- (the full-width ~7.5 M-parameter version) before starting LocalVQE.
101
- LocalVQE is the same idea rebuilt for streaming CPU inference, and
102
- published in two sizes: a 1.3 M-parameter compact build (v1.2,
103
- ~5 MB F32) for tight CPU budgets, and a 4.8 M-parameter wider build
104
- (v1.3, ~19 MB F32) that filters noise better on some clips at ~2×
105
- the per-hop cost. Both are small enough to run real time on
106
- commodity CPUs.
107
 
108
  ## Files in this repository
109
 
110
- | File | Size | Description |
111
  |---|---|---|
112
- | `localvqe-v1.3-4.8M.pt` | 55 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune, wider arch + noise-floor-aware loss. **Current release.** |
113
- | `localvqe-v1.3-4.8M-f32.gguf` | 19 MB | GGML F32 export of the current release — what the C++ inference engine loads. |
114
- | `localvqe-v1.2-1.3M.pt` | 11 MB | Compact alternative — same arch family as v1.3 (`arch_version=3`), ~1/4 the cost per hop. |
115
- | `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | GGML F32 export of the compact variant. |
116
- | `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | Older release (F32 GGUF). |
117
- | `localvqe-v1-1.3M-f32.gguf` | 5 MB | Original release. |
 
 
118
 
119
- Only F32 GGUF is published today. A `quantize` tool is included in the
120
- C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
121
- released.
122
 
123
- ## Validation Results
124
 
125
  Full 800-clip eval on the
126
  [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
127
- (real recordings, not synthetic mixes).
128
-
129
- **v1.3** (current, 4.8 M):
130
-
131
- | Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
132
- |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
133
- | doubletalk | 115 | 4.73 | **2.62** | 8.5 dB | 2.89 |
134
- | doubletalk-with-movement | 185 | 4.67 | **2.43** | 8.3 dB | 2.85 |
135
- | farend-singletalk | 107 | 3.69 | 4.83 | **50.9 dB** | 1.94 |
136
- | farend-singletalk-with-movement | 193 | 3.88 | 4.98 | **49.9 dB** | 1.96 |
137
- | nearend-singletalk | 200 | 5.00 | 4.18 | 2.4 dB | 3.17 |
138
-
139
- **v1.2** (compact alternative, 1.3 M):
140
-
141
- | Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
142
- |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
143
- | doubletalk | 115 | 4.72 | 2.37 | 8.4 dB | 2.83 |
144
- | doubletalk-with-movement | 185 | 4.65 | 2.30 | 8.1 dB | 2.79 |
145
- | farend-singletalk | 107 | 3.78 | 4.91 | 45.7 dB | 1.80 |
146
- | farend-singletalk-with-movement | 193 | 4.12 | 4.96 | 40.6 dB | 1.75 |
147
- | nearend-singletalk | 200 | 5.00 | 4.16 | 2.1 dB | 3.17 |
148
-
149
- v1.3 vs v1.2 deltas (same 800-clip set, same eval pipeline):
150
-
151
- - **Doubletalk deg MOS +0.25**, dt-with-movement deg MOS +0.13: the
152
- wider model + noise-floor-aware loss recipe noticeably reduces
153
- perceived speech degradation when both talkers are active. This is
154
- the primary v1.3 release goal.
155
- - **FE-ST-with-movement ERLE +9.3 dB**, FE-ST ERLE +5.2 dB — v1.3
156
- cancels far-end echo substantially harder. **AECMOS echo MOS drops
157
- −0.24 / −0.09** at the same time: the residual after cancellation
158
- rates rougher on AECMOS's perceptual scale even though there's
159
- numerically less of it. Some users will prefer v1.2's gentler
160
- trade-off on far-end-only scenes.
161
- - **Nearend-singletalk identical** within noise (deg +0.02,
162
- OVRL +0.00): wider capacity doesn't help (or hurt) when there's
163
- nothing to cancel.
164
- - DNSMOS OVRL is up 0.04–0.21 across all scenarios: the wider
165
- model produces consistently cleaner-rated output by DNS metrics.
166
-
167
- For the original v1.2 vs v1.1 deltas (the previous release's
168
- headline numbers), see the [v1.2 release notes on
169
- GitHub](https://github.com/localai-org/LocalVQE).
170
-
171
- - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
172
- quality predictor. "Echo" rates how well echo was removed; "degradation"
173
- rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
174
- - **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
175
- far-end single-talk where the input is echo-only; on scenes with active
176
- near-end speech it understates echo removal because both numerator and
177
- denominator are dominated by speech.
178
-
179
- ## Building the C++ Inference Engine
180
-
181
- Source, build system, and tests live at
182
- <https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
183
- compiler. A [Nix](https://nixos.org/) flake is provided:
 
184
 
185
  ```bash
186
- git clone --recursive https://github.com/localai-org/LocalVQE.git
187
- cd LocalVQE
188
-
189
- # With Nix:
190
- nix develop
191
- cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
192
- cmake --build ggml/build -j$(nproc)
193
-
194
- # Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
195
- cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
196
- cmake --build ggml/build -j$(nproc)
197
- ```
198
-
199
- Binaries land in `ggml/build/bin/`. The CPU build produces multiple
200
- `libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
201
- Keep the binaries and `.so` files together.
202
-
203
- ### Vulkan backend (embedded / integrated-GPU targets)
204
-
205
- Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
206
- CPU build — an additional `libggml-vulkan.so` is produced in
207
- `ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
208
- present, otherwise it falls back to the CPU variants.
209
-
210
- ```bash
211
- cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
212
- cmake --build ggml/build -j$(nproc)
213
  ```
214
 
215
- The Nix flake's dev shell already includes `vulkan-loader`,
216
- `vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
217
- from your distro (Debian: `libvulkan-dev vulkan-headers
218
- glslc`/`shaderc`).
219
-
220
- ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
221
-
222
- Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
223
- full `ggml_backend_graph_compute`.
224
-
225
- **v1.3** (current — 4.8 M, wider encoder/decoder, bn 256):
226
-
227
- | Backend | Threads | p50 | p99 | RT factor |
228
- |-----------------------------|--------:|--------:|--------:|----------:|
229
- | CPU | 1 | 9.73 ms | 14.48 ms | 1.58× |
230
- | CPU | 2 | 5.41 ms | 5.62 ms | 2.95× |
231
- | CPU | 4 | 3.21 ms | 3.42 ms | 4.97× |
232
- | CPU | 8 | 3.47 ms | 3.80 ms | 4.59× |
233
- | CPU | 16 | 3.79 ms | 4.06 ms | 4.19× |
234
- | Vulkan — AMD iGPU (RADV) | — | 8.71 ms | 9.15 ms | 1.83× |
235
- | Vulkan — NVIDIA RTX 5070 Ti | — | 2.57 ms | 4.21 ms | 6.07× |
236
-
237
- The wider v1.3 model is ~2× the per-hop cost of v1.2 in matching
238
- configurations. The dGPU (RTX 5070 Ti) ends up the fastest option
239
- by ~1.25× vs 4-thread CPU. The 1-thread case is the worst, still
240
- real-time (RT 1.58×) but with little margin, so a
241
- low-core / power-constrained device should use v1.2 instead.
242
-
243
- **v1.2** (compact alternative — 1.3 M, 1024 ms echo-search window):
244
-
245
- | Backend | Threads | p50 | p99 | RT factor |
246
- |-----------------------------|--------:|--------:|--------:|----------:|
247
- | CPU | 1 | 4.28 ms | 4.85 ms | 3.72× |
248
- | CPU | 2 | 2.59 ms | 3.80 ms | 6.09× |
249
- | CPU | 4 | 1.65 ms | 2.91 ms | 8.90× |
250
- | CPU | 8 | 1.93 ms | 2.41 ms | 8.22× |
251
- | CPU | 16 | 2.09 ms | 2.22 ms | 7.69× |
252
- | Vulkan — AMD iGPU (RADV) | — | 6.10 ms | 6.53 ms | 2.61× |
253
- | Vulkan — NVIDIA RTX 5070 Ti | — | 1.96 ms | 3.64 ms | 7.85× |
254
-
255
- Beyond ≈4 threads both models are small enough that thread-launch
256
- and synchronisation overhead dominate; **four threads is the sweet
257
- spot on Zen4** for both v1.2 and v1.3.
258
-
259
- **v1.1** (older, 512 ms echo-search window) for comparison:
260
-
261
- | Backend | Threads | p50 | p99 | max |
262
- |-----------------------------|--------:|--------:|--------:|--------:|
263
- | CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
264
- | CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
265
- | CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
266
- | Vulkan — AMD iGPU (RADV) | — | 4.43 ms | 4.62 ms | 5.07 ms |
267
- | Vulkan — NVIDIA RTX 5070 Ti | — | 1.79 ms | 3.41 ms | 4.14 ms |
268
-
269
- Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
270
- shared desktop is sensitive to external GPU clients (display
271
- compositor, browser). On a dedicated embedded device with no
272
- compositor contending for the queue, expect the quieter end of the
273
- range.
274
-
275
- ### Memory footprint (CPU)
276
-
277
- Process RSS from `bench` on Zen4, measured via `/proc/self/status`.
278
- Same numbers under every thread count from 1 to 16 — the runtime has
279
- no per-thread arenas of meaningful size, so peak RSS is set by
280
- weights + activations + history scratch.
281
-
282
- | Model | Post-load delta ¹ | Peak RSS (VmHWM) ² |
283
- |---------------------|------------------:|-------------------:|
284
- | **v1.3** (4.8 M) | +24.4 MiB | 34.1 MiB |
285
- | **v1.2** (1.3 M) | +10.0 MiB | 19.6 MiB |
286
-
287
- ¹ RSS added by loading the model + initialising the CPU backend, on
288
- top of a ~7 MiB binary-and-libs baseline. This is the portable
289
- "working set the model brings" number; the absolute peak will depend
290
- on your host process baseline.
291
-
292
- ² Steady-state ceiling after warmup + sustained streaming. v1.3 is
293
- ~1.75× v1.2 in RSS terms despite carrying ~3.7× more parameters —
294
- activation/history buffers don't scale with channel width. GPU
295
- backends are not reflected here (VRAM doesn't appear in
296
- `/proc/self/status`); for those, `bench --profile` prints the
297
- backend-internal weight/activation buffer sizes.
298
-
299
- ## Running Inference
300
-
301
- Download a GGUF from the file list above — `localvqe-v1.3-4.8M-f32.gguf`
302
- for the current default, or `localvqe-v1.2-1.3M-f32.gguf` for the
303
- smaller / faster option — via `huggingface-cli`, the Hub web UI, or
304
- `hf_hub_download` from `huggingface_hub`. The CLI flags are the same
305
- either way; the examples below use v1.2 so the snippets are shorter
306
- to type. Swap the filename in to run v1.3.
307
-
308
- ### CLI
309
-
310
- ```bash
311
- ./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
312
- --in-wav mic.wav ref.wav \
313
- --out-wav enhanced.wav
314
- ```
315
-
316
- Expects 16 kHz mono PCM for both mic and far-end reference.
317
-
318
- ### Benchmark
319
-
320
- ```bash
321
- ./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
322
- --in-wav mic.wav ref.wav --iters 10 --profile
323
- ```
324
 
325
- ### Shared Library (C API)
326
-
327
- ```bash
328
- cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
329
- cmake --build ggml/build -j$(nproc)
330
- ```
331
-
332
- Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
333
- `ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
334
- integration.
335
-
336
- ### Quantizing (experimental)
337
-
338
- Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
339
- tool in the C++ build can produce GGUF variants from the F32 reference
340
- for experimentation:
341
-
342
- ```bash
343
- ./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
344
- ```
345
-
346
- Expect end-to-end quality loss until proper per-tensor selection and
347
- calibration have been worked through.
348
-
349
- ## PyTorch Reference
350
-
351
- `localvqe-v1.3-4.8M.pt` (current) and `localvqe-v1.2-1.3M.pt`
352
- (compact alternative) are the PyTorch checkpoints used to produce
353
- the GGUF exports. They are provided for verification, ablation, and
354
- downstream research — not for end-user inference, which should go
355
- through the GGML build above. Both share `arch_version=3` (pre-norm
356
- CausalGroupNorm + SiLU + STFT-256) and differ only in width
357
- (`mic_channels`, `far_channels`, `bottleneck_hidden`), which the
358
- loader reads from the saved `model_config` field. The model
359
- definition lives under `pytorch/` in the
360
- [GitHub repo](https://github.com/localai-org/LocalVQE):
361
-
362
- ```bash
363
- git clone https://github.com/localai-org/LocalVQE.git
364
- cd LocalVQE/pytorch
365
- pip install -r requirements.txt
366
- ```
367
 
368
- ## Citing LocalVQE
 
 
 
369
 
370
- If you use LocalVQE in academic work, please cite the repository via the
371
- `CITATION.cff` at <https://github.com/localai-org/LocalVQE> — GitHub renders
372
- a "Cite this repository" button that produces APA and BibTeX entries
373
- automatically.
374
 
375
- For a DOI, we recommend citing a specific release via
376
- [Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
377
- also cite the upstream DeepVQE paper:
378
 
379
  ```bibtex
380
  @inproceedings{indenbom2023deepvqe,
@@ -382,25 +171,23 @@ also cite the upstream DeepVQE paper:
382
  Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
383
  author = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
384
  and Chernov, Mykola and Aichner, Robert},
385
- booktitle = {Interspeech},
386
- year = {2023},
387
  doi = {10.21437/Interspeech.2023-2176}
388
  }
389
  ```
390
 
391
- ## Dataset Attribution
392
 
393
- Published weights are trained on data from the
394
- [ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
395
  (Microsoft, CC BY 4.0) and fine-tuned on the
396
- [ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
397
 
398
- ## Safety Note
399
 
400
- Training data was filtered by DNSMOS perceived-quality scores, which can
401
- misclassify distressed speech (screaming, crying) as noise. LocalVQE may
402
- attenuate or distort such signals and must not be relied upon for emergency
403
- call or safety-critical applications.
404
 
405
  ## License
406
 
 
15
  [![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
16
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
17
 
18
+ **Local Voice Quality Enhancement** — compact neural models for acoustic echo
19
+ cancellation (AEC), noise suppression (NS), and dereverberation of 16 kHz
20
+ speech, running on commodity CPUs in real time. Causal and streaming
21
+ (256-sample hop, 16 ms latency).
22
+
23
+ - **Try it:** <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>
24
+ - **Source, build system, tests:** <https://github.com/localai-org/LocalVQE>
25
+
26
+ This page hosts the published weights. Inference runs the GGML C++ engine on
27
+ the GGUF files directly (build instructions on GitHub).
28
+
29
+ **Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
30
+ Claude (Anthropic). LocalVQE is a streaming, CPU-tuned derivative of **DeepVQE**
31
+ ([Indenbom et al., Interspeech 2023](https://arxiv.org/abs/2306.03177)).
32
+
33
+ ## Models
34
+
35
+ Speed is per 16 ms hop on a Ryzen 9 7900 (Zen4), 4 threads; RT = realtime
36
+ factor (higher is faster than realtime).
37
+
38
+ | Version | Does | Params | Size (F32) | Speed | Pick it when |
39
+ |---|---|---:|---:|---|---|
40
+ | **v1.3** *(current)* | AEC + NS + dereverb | 4.8 M | ~19 MB | 3.2 ms · 5.0× RT | best joint quality, CPU budget available |
41
+ | **v1.2** | AEC + NS + dereverb | 1.3 M | ~5 MB | 1.7 ms · 8.9× RT | tight CPU / low-power devices |
42
+ | **v1.4-AEC** | echo only (keeps voice, noise, room) | 203 K | ~3 MB | 0.83 ms · 19× RT | NS is handled elsewhere, or you want the room kept |
43
+ | **v1.4-AEC 2.7K** | echo only, linear filter (no mask) | 2.7 K | ~17 KB | 0.36 ms · 44× RT | lightest echo canceller; echo isn't heavily reverberant |
44
+ | v1.1 / v1 | AEC + NS + dereverb | 1.3 M | ~5 MB | — | superseded by v1.2 |
45
+
46
+ - **Joint models (v1.2 / v1.3)** clean echo, noise, and reverb in one pass.
47
+ v1.3 is wider and filters noise better; v1.2 runs in roughly half the per-hop time.
48
+ - **v1.4-AEC** removes only the far-end echo and passes voice, room, and
49
+ background through unchanged. It's a classical adaptive filter followed by a
50
+ small neural mask. The **2.7K** build is that filter alone: cheaper and
51
+ gentler, but it can't remove heavily reverberant echo the way the mask can.
52
+ - Every model needs a far-end **reference** signal (a loopback of what your
53
+ speakers play) in addition to the mic.
54
+ - `bf16` GGUFs are ~12 % smaller with identical quality and speed; pick `f32`
55
+ unless download size matters.
 
56
 
57
  ## Files in this repository
58
 
59
+ | File | Size | Model |
60
  |---|---|---|
61
+ | `localvqe-v1.4-aec-200K-f32.gguf` | 3 MB | v1.4-AEC (echo only) |
62
+ | `localvqe-v1.4-aec-200K-bf16.gguf` | 2.6 MB | v1.4-AEC, conv weights in BF16 |
63
+ | `localvqe-v1.4-aec-2.7K-f32.gguf` | 17 KB | v1.4-AEC front-end only (adaptive filter, no mask) |
64
+ | `localvqe-v1.3-4.8M-f32.gguf` | 19 MB | v1.3 joint: GGUF the engine loads |
65
+ | `localvqe-v1.3-4.8M.pt` | 55 MB | v1.3 joint: PyTorch checkpoint (research) |
66
+ | `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | v1.2 joint — GGUF |
67
+ | `localvqe-v1.2-1.3M.pt` | 11 MB | v1.2 joint — PyTorch checkpoint |
68
+ | `localvqe-v1.1-1.3M-f32.gguf`, `localvqe-v1-1.3M-f32.gguf` | 5 MB | older releases |
69
 
70
+ v1.4-AEC is GGUF-only (no `.pt`). GGUF integrity is checked at load time against
71
+ a built-in SHA256 allowlist in the engine.
 
72
 
73
+ ## Performance
74
 
75
  Full 800-clip eval on the
76
  [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
77
+ (real recordings). AECMOS echo / deg are 1–5 (higher = more echo removed /
78
+ cleaner speech); blind ERLE is `10·log10(E[mic²]/E[enh²])`, only meaningful on
79
+ far-end-only clips. Unprocessed-mic echo MOS is 2.67 / 2.56 / 1.90 / 2.13 / 5.00
80
+ across the five scenarios.
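
As a rough illustration of the blind ERLE definition above, the following Python sketch computes it for two equal-length sample sequences. The function name `blind_erle_db` and the small `eps` guard against division by zero are assumptions made for this example; they are not taken from the LocalVQE evaluation pipeline.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10 * log10(E[mic^2] / E[enh^2]).

    `mic` is the unprocessed microphone signal and `enh` the enhanced
    output, given as equal-length sequences of float samples.
    `eps` (an assumption of this sketch) avoids dividing by zero on silence.
    """
    if len(mic) != len(enh) or len(mic) == 0:
        raise ValueError("mic and enh must be non-empty and the same length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Toy far-end-only clip: the "enhanced" signal is the mic scaled to 10 %
# of its amplitude, i.e. 1 % of its power, which corresponds to 20 dB.
mic = [0.5, -0.5, 0.25, -0.25] * 64
enh = [0.1 * x for x in mic]
print(f"{blind_erle_db(mic, enh):.1f} dB")  # prints "20.0 dB"
```

Because the ratio compares total energy before and after processing, any near-end speech left in `enh` pulls the value toward 0 dB, which is why the figure is only reported on far-end-only clips.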
81
+
82
+ **v1.4-AEC** — keeps background noise and room by design, so its ERLE and
83
+ far-end DNSMOS are intentionally lower than the joint models (it isn't deleting
84
+ the ambience):
85
+
86
+ | Scenario | n | echo ↑ | deg ↑ | ERLE | OVRL |
87
+ |---|--:|--:|--:|--:|--:|
88
+ | doubletalk | 115 | 4.20 | 2.45 | — | 2.59 |
89
+ | doubletalk-with-movement | 185 | 4.19 | 2.45 | — | 2.55 |
90
+ | farend-singletalk | 107 | 3.80 | 4.99 | 14.6 dB | 1.37 |
91
+ | farend-singletalk-with-movement | 193 | 3.86 | 4.95 | 11.1 dB | 1.31 |
92
+ | nearend-singletalk | 200 | 4.99 | 3.99 | — | 3.08 |
93
+
94
+ **v1.4-AEC 2.7K** (front-end only) matches or beats the full model's
95
+ perceptual far-end echo at 1/74 the parameters; the mask's extra work shows up
96
+ as higher ERLE above, not higher echo MOS:
97
+
98
+ | Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
99
+ |---|--:|--:|--:|--:|--:|
100
+ | doubletalk | 115 | 4.00 | 2.79 | — | 2.46 |
101
+ | doubletalk-with-movement | 185 | 3.90 | 2.92 | — | 2.42 |
102
+ | farend-singletalk | 107 | 4.06 | 5.00 | 6.5 dB | 1.24 |
103
+ | farend-singletalk-with-movement | 193 | 4.05 | 4.97 | 3.9 dB | 1.22 |
104
+ | nearend-singletalk | 200 | 4.98 | 3.77 | — | 3.03 |
105
+
106
+ **v1.3** (joint) and **v1.2** (joint): these also delete the background, so
107
+ their far-end ERLE is much higher and not comparable to v1.4-AEC's:
108
+
109
+ | Scenario | n | v1.3 echo / deg / ERLE / OVRL | v1.2 echo / deg / ERLE / OVRL |
110
+ |---|--:|---|---|
111
+ | doubletalk | 115 | 4.73 / 2.62 / 8.5 dB / 2.89 | 4.72 / 2.37 / 8.4 dB / 2.83 |
112
+ | doubletalk-with-movement | 185 | 4.67 / 2.43 / 8.3 dB / 2.85 | 4.65 / 2.30 / 8.1 dB / 2.79 |
113
+ | farend-singletalk | 107 | 3.69 / 4.83 / 50.9 dB / 1.94 | 3.78 / 4.91 / 45.7 dB / 1.80 |
114
+ | farend-singletalk-with-movement | 193 | 3.88 / 4.98 / 49.9 dB / 1.96 | 4.12 / 4.96 / 40.6 dB / 1.75 |
115
+ | nearend-singletalk | 200 | 5.00 / 4.18 / 2.4 dB / 3.17 | 5.00 / 4.16 / 2.1 dB / 3.17 |
116
+
117
+ ### Latency
118
+
119
+ Per-hop p50 / RT factor on a Ryzen 9 7900 (Zen4). 16 kHz, 256-sample hop.
120
+
121
+ | Model | 1 thread | 4 threads | dGPU (RTX 5070 Ti, Vulkan) |
122
+ |---|---|---|---|
123
+ | v1.4-AEC (203 K) | 1.29 ms · 12.2× | 0.83 ms · 18.6× | run on CPU¹ |
124
+ | v1.4-AEC 2.7K | 0.36 ms · 44× (single-threaded) | — | run on CPU¹ |
125
+ | v1.3 (4.8 M) | 9.73 ms · 1.58× | 3.21 ms · 4.97× | 2.57 ms · 6.07× |
126
+ | v1.2 (1.3 M) | 4.28 ms · 3.72× | 1.65 ms · 8.90× | 1.96 ms · 7.85× |
127
+
128
+ ¹ v1.4-AEC's adaptive front-end always runs on CPU and the neural stage is too
129
+ small for GPU offload to pay off. Four threads is the sweet spot on Zen4 for all
130
+ models; the library defaults to `min(4, available CPUs)`.
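
The hop budget and the thread default mentioned above can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, `os.cpu_count()` stands in for whatever the library treats as "available CPUs", and the realtime factor here is simply budget divided by per-hop time. The published RT column need not match 16 ms divided by p50 exactly (for example 16 / 9.73 ≈ 1.64 against the listed 1.58×), so it is presumably derived from a different statistic.

```python
import os

SAMPLE_RATE_HZ = 16_000  # from the model card: 16 kHz speech
HOP_SAMPLES = 256        # from the model card: 256-sample hop


def hop_budget_ms(hop_samples=HOP_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Wall-clock time covered by one hop, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(per_hop_ms, budget_ms=None):
    """How many times faster than realtime a given per-hop time is."""
    if per_hop_ms <= 0:
        raise ValueError("per_hop_ms must be positive")
    if budget_ms is None:
        budget_ms = hop_budget_ms()
    return budget_ms / per_hop_ms


def default_threads(available=None, cap=4):
    """Hypothetical mirror of the stated default, min(4, available CPUs)."""
    if available is None:
        available = os.cpu_count() or 1  # assumption: logical CPU count
    return max(1, min(cap, available))


print(hop_budget_ms())      # 16.0
print(default_threads(16))  # 4
```

A per-hop time above the 16 ms budget gives a factor below 1, meaning the stream cannot keep up.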
131
+
132
+ ### Memory (CPU)
133
+
134
+ Working set the model adds on top of the ~7 MiB binary baseline:
135
+
136
+ | Model | Post-load delta | Peak RSS |
137
+ |---|--:|--:|
138
+ | v1.3 (4.8 M) | +24.4 MiB | 34.1 MiB |
139
+ | v1.2 (1.3 M) | +10.0 MiB | 19.6 MiB |
140
+ | v1.4-AEC (203 K) | +6.7 MiB | 17.0 MiB |
141
+
142
+ ## Running inference
143
+
144
+ Download a GGUF (web UI, `huggingface-cli`, or `hf_hub_download`) and run the
145
+ GGML CLI — same command for every model, just swap the file:
146
 
147
  ```bash
148
+ ./localvqe localvqe-v1.3-4.8M-f32.gguf --in-wav mic.wav ref.wav --out-wav out.wav
 
149
  ```
150
 
151
+ 16 kHz mono PCM for both the mic and the far-end reference. Building the engine,
152
+ the C API (`liblocalvqe.so`), and the OBS Studio plugin are documented in the
153
+ [GitHub repository](https://github.com/localai-org/LocalVQE).
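
For scripted use, a small Python wrapper can check the input format before invoking the CLI. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, the binary path `./localvqe` is copied from the command above, and `repo_id` is left as a parameter because this page does not state the model repository id. Only `hf_hub_download(repo_id=..., filename=...)` from `huggingface_hub` and the standard-library `wave` and `subprocess` modules are used.

```python
import subprocess
import wave


def is_16k_mono_pcm(path):
    """Return True if the WAV header reports 16 kHz, one channel, plain PCM."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getframerate() == 16_000
                and wav.getnchannels() == 1
                and wav.getcomptype() == "NONE"
            )
    except (wave.Error, EOFError):
        # Not a WAV file the stdlib reader understands (e.g. float or compressed).
        return False


def build_command(model, mic, ref, out, binary="./localvqe"):
    """Argument list matching the CLI invocation shown in the model card."""
    return [binary, model, "--in-wav", mic, ref, "--out-wav", out]


def fetch_model(repo_id, filename):
    """Download a GGUF from the Hub; repo_id must be supplied by the caller."""
    from huggingface_hub import hf_hub_download  # third-party, imported lazily
    return hf_hub_download(repo_id=repo_id, filename=filename)


def enhance(model, mic, ref, out):
    """Validate both inputs, then run the CLI and raise if it fails."""
    for path in (mic, ref):
        if not is_16k_mono_pcm(path):
            raise ValueError(f"{path} is not 16 kHz mono PCM")
    subprocess.run(build_command(model, mic, ref, out), check=True)
```

A call such as `enhance(fetch_model(repo_id, "localvqe-v1.3-4.8M-f32.gguf"), "mic.wav", "ref.wav", "out.wav")` would then download the weights once (the Hub client caches them) and process one pair of files.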
 
154
 
155
+ ## PyTorch reference
 
156
 
157
+ `localvqe-v1.3-4.8M.pt` and `localvqe-v1.2-1.3M.pt` are the checkpoints used to
158
+ produce the GGUF exports — for verification, ablation, and research, not
159
+ end-user inference (use the GGML build). The model definition lives under
160
+ `pytorch/` in the [GitHub repo](https://github.com/localai-org/LocalVQE).
161
 
162
+ ## Citing
 
 
 
163
 
164
+ Cite the repository via `CITATION.cff` at
165
+ <https://github.com/localai-org/LocalVQE> (GitHub's "Cite this repository"
166
+ button produces APA / BibTeX), and the upstream DeepVQE paper:
167
 
168
  ```bibtex
169
  @inproceedings{indenbom2023deepvqe,
 
171
  Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
172
  author = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
173
  and Chernov, Mykola and Aichner, Robert},
174
+ booktitle = {Interspeech}, year = {2023},
 
175
  doi = {10.21437/Interspeech.2023-2176}
176
  }
177
  ```
178
 
179
+ ## Dataset attribution
180
 
181
+ Weights are trained on the
182
+ [ICASSP 2023 DNS Challenge](https://github.com/microsoft/DNS-Challenge)
183
  (Microsoft, CC BY 4.0) and fine-tuned on the
184
+ [ICASSP 2022/2023 AEC Challenge](https://github.com/microsoft/AEC-Challenge).
185
 
186
+ ## Safety
187
 
188
+ Training data was filtered by DNSMOS, which can misclassify distressed speech
189
+ (screaming, crying) as noise. LocalVQE may attenuate such signals and must not
190
+ be relied upon for emergency or safety-critical applications.
 
191
 
192
  ## License
193