richiejp committed
Commit 1c5254a (verified)
1 Parent(s): 699976d

Sync model card with upstream GitHub inference README

Files changed (1)
  1. README.md +288 -33
README.md CHANGED
@@ -9,46 +9,151 @@ tags:
9
  license: apache-2.0
10
  ---
11
 
12
- # LocalVQE: Local Voice Quality Enhancement
13
 
14
- Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
15
- dereverberation for 16 kHz speech. A from-scratch derivative of **DeepVQE**
16
- (Indenbom et al., Interspeech 2023, *DeepVQE: Real Time Deep Voice Quality
17
- Enhancement*, [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned
18
- for quantization-aware local CPU/GPU inference. The DCT-II analysis/synthesis
19
- (replacing STFT), S4D bottleneck, GGML streaming graph, and training pipeline
20
- are work of this project; no paper yet.
21
 
22
- **Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
23
- Claude (Anthropic).
24
 
25
- Project source: <https://github.com/richiejp/LocalVQE>
26
 
27
- ## Files
28
 
29
  | File | Size | Description |
30
  |---|---|---|
31
- | `localvqe-v1.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + AEC Challenge fine-tune. |
32
- | `localvqe-v1-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). |
33
 
34
- ## Usage (GGML / C++ / Go)
35
 
36
- ```bash
37
- # Build the ggml binary
38
- cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
39
 
40
- # Run inference on a 16 kHz WAV pair
41
- ./build/bin/localvqe localvqe-v1-f32.gguf \
42
- --in-wav mic.wav ref.wav --out-wav enhanced.wav
43
- ```
44
 
45
- Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
46
- 16 kHz / 256-sample hop).
47
 
48
  ## Architecture
49
 
50
  | Component | Value |
51
- |-----------|-------|
52
  | Sample rate | 16 kHz |
53
  | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
54
  | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
@@ -60,16 +165,166 @@ Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
60
  | Kernel | (4, 4) time × freq, causal padding |
61
  | Parameters | ~0.9 M |
62
 
63
- ## Upstream citation (DeepVQE)
64
 
65
  ```bibtex
66
  @inproceedings{indenbom2023deepvqe,
67
- title={{DeepVQE}: Real Time Deep Voice Quality Enhancement for Joint Acoustic
68
- Echo Cancellation, Noise Suppression and Dereverberation},
69
- author={Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u a}t{\u a}lin and
70
- Chernov, Mykola and Aichner, Robert},
71
- booktitle={Interspeech},
72
- year={2023},
73
- doi={10.21437/Interspeech.2023-2176}
74
  }
75
  ```
9
  license: apache-2.0
10
  ---
11
 
12
+ # LocalVQE
13
 
14
+ **Local Voice Quality Enhancement**: a compact neural model for joint
15
+ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
16
+ 16 kHz speech, designed to run on commodity CPUs in real time.
17
 
18
+ - ~0.9 M parameters (~3.5 MB F32)
19
+ - ~1.66 ms per 16 ms frame on Zen4 (24 threads): **≈9.6× realtime**
20
+ - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
21
+ - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
22
+ PyTorch reference included for verification and research
23
+ - Quantization-friendly by design (power-of-2 channel widths, kernel area 16)
24
+ to support future Q4_K / Q8_0 native inference
25
+ - Apache 2.0
26
 
27
+ This page is the Hugging Face model card; it hosts the published weights.
28
+ Source code, build system, tests, and training pipeline live in the GitHub
29
+ repository: <https://github.com/LocalAI-io/LocalVQE>.
30
 
31
+ **Authors:**
32
+ - Richard Palethorpe ([richiejp](https://github.com/richiejp))
33
+ - Claude (Anthropic)
34
+
35
+ LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023,
36
+ *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
37
+ Cancellation, Noise Suppression and Dereverberation*,
38
+ [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It keeps DeepVQE's
39
+ overall topology (mic/far-end encoders, soft-delay cross attention, decoder
40
+ with sub-pixel upsampling, complex convolving mask) but replaces the STFT
41
+ with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal
42
+ state-space model (S4D), and is ~9× smaller than the reference DeepVQE.
43
+ Everything specific to LocalVQE is original to this repository; there is
44
+ no LocalVQE paper.
45
+
46
+ ## A concrete example
47
+
48
+ Picture a video call from a laptop. Your microphone picks up three things
49
+ alongside your voice:
50
+
51
+ 1. The remote participant's voice, played back through your speakers and
52
+ caught again by your mic: this is the **echo**. Without cancellation
53
+ they hear themselves a fraction of a second later.
54
+ 2. Your own voice bouncing off walls, desk, and monitor before reaching
55
+ the mic: this is **reverberation**, the "tunnel" or "bathroom" sound
56
+ that makes you feel far away from the listener.
57
+ 3. A fan, keyboard clatter, a dog barking, or traffic outside: plain
58
+ **background noise**.
59
+
60
+ LocalVQE removes all three in a single causal pass, frame by frame, on
61
+ the CPU, so only your voice reaches the far end.
62
+
63
+ ## Why this, and not a classical AEC/NS stack?
64
+
65
+ Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
66
+ NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
67
+ frame and remain a strong baseline when the acoustic path is benign. LocalVQE
68
+ is interesting when you want:
69
+
70
+ - **Robustness to non-linear echo paths** (small loudspeakers, handheld
71
+ devices, plastic laptop chassis) where linear AEC leaves residual echo.
72
+ - **Non-stationary noise suppression** (babble, keyboards, fans changing
73
+ speed) that energy-based noise estimators struggle with.
74
+ - **One model, many conditions**: no per-device tuning of step sizes,
75
+ forgetting factors, or VAD thresholds.
76
+ - **A single deterministic causal pass**: no double-talk detector, no
77
+ adaptation state that can diverge.
78
+
79
+ The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
80
+ ~1–2 ms/frame. On anything larger than a microcontroller that's still a
81
+ small fraction of a real-time budget.
82
+
83
+ ## Why this, and not DeepVQE?
84
+
85
+ Microsoft never released DeepVQE: no weights, no reference implementation,
86
+ no streaming runtime. We re-implemented it from the paper as a GGML graph
87
+ at [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml) (the
88
+ full-width ~7.5 M-parameter version) before starting LocalVQE. Comparing
89
+ that implementation to this one:
90
+
91
+ | | DeepVQE (our re-implementation) | LocalVQE |
92
+ |---|---|---|
93
+ | Parameters | ~7.5 M | ~0.9 M |
94
+ | Weights (F32) | ~30 MB | ~3.5 MB |
95
+ | Analysis | STFT (complex FFT) | DCT-II (real, in-graph) |
96
+ | Bottleneck | GRU | S4D (diagonal state space) |
97
+ | CCM arithmetic | Complex | Real-valued (GGML-friendly) |
98
+ | Streaming inference | Yes, separate repo | Yes, in this repo |
99
+
100
+ The smaller parameter count comes from iterative channel pruning of the
101
+ full-width reference, not from distillation; S4D halves the bottleneck
102
+ parameter count vs GRU at similar quality.
103
+
104
+ ## Files in this repository
105
 
106
  | File | Size | Description |
107
  |---|---|---|
108
+ | `localvqe-v1.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
109
+ | `localvqe-v1-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). This is what the C++ inference engine loads. |
110
 
111
+ Only F32 GGUF is published today. A `quantize` tool is included in the C++
112
+ build (see below) and the architecture is designed to be Q4_K / Q8_0
113
+ friendly, but quantized weights have not yet been calibrated and released.
114
 
115
+ ## Validation Results
116
 
117
+ Numbers below are from the best checkpoint of the AEC fine-tune
118
+ (`localvqe-v1-f32.gguf`), evaluated on a 1 000-clip validation split mixing
119
+ DNS5-synthesised near/far-end scenes and ICASSP AEC Challenge synthetic
120
+ data. AECMOS scores are computed over a 100-clip sub-sample per the standard
121
+ AEC Challenge protocol.
122
+
123
+ | Metric | Overall | Single-talk far-end | Double-talk |
124
+ |---|---:|---:|---:|
125
+ | ERLE | - | **+52.2 dB** | - |
126
+ | AECMOS echo (↑, 1–5) | 4.36 | 4.46 | 4.33 |
127
+ | AECMOS degradation (↑, 1–5) | 4.83 | 5.00 | 4.78 |
128
+
129
+ - **ERLE** (Echo Return Loss Enhancement) in dB, where higher is better. Only
130
+ reported for single-talk far-end, where the mic signal is pure echo and the
131
+ ratio `10·log10(E[mic²] / E[enh²])` directly measures echo attenuation.
132
+ Overall and double-talk ERLE are omitted because near-end speech in the
133
+ mic and enhanced signals dominates the numerator/denominator and the
134
+ number stops being a clean echo-removal measurement.
135
+ - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
136
+ quality predictor. "Echo" rates how well the echo was removed; "degradation"
137
+ rates how clean the resulting speech/residual is. Both are on a 1–5 MOS
138
+ scale, higher is better.
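
As a worked illustration of the ERLE ratio described above, here is a minimal Python sketch. The function name `erle_db` and the small `eps` guard against division by zero are assumptions made for this example; this is not the evaluation code used to produce the table, only a direct reading of `10·log10(E[mic²] / E[enh²])` for two equal-length sample sequences.

```python
import math


def erle_db(mic, enhanced, eps=1e-12):
    """Echo Return Loss Enhancement in dB for a single-talk far-end clip.

    mic      -- microphone samples (pure echo in the single-talk far-end case)
    enhanced -- model output samples for the same clip
    eps      -- hypothetical guard so silent outputs do not divide by zero
    """
    if len(mic) != len(enhanced) or len(mic) == 0:
        raise ValueError("mic and enhanced must be non-empty and equal length")
    mic_power = sum(x * x for x in mic) / len(mic)            # E[mic^2]
    enh_power = sum(y * y for y in enhanced) / len(enhanced)  # E[enh^2]
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


if __name__ == "__main__":
    # Synthetic check: scaling the amplitude by 0.1 scales power by 0.01,
    # which corresponds to 20 dB of attenuation.
    echo = [math.sin(0.01 * n) for n in range(16000)]
    residual = [0.1 * x for x in echo]
    print(f"ERLE: {erle_db(echo, residual):.1f} dB")
```

Because the ratio compares total signal power, any near-end speech present in both signals would dominate it, which is why the figure is only meaningful for single-talk far-end clips.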
139
 
140
+ ### Why DNSMOS OVRL is not reported here
141
+
142
+ We track DNSMOS P.808 (`sig_bak_ovr.onnx`) in TensorBoard but are deliberately
143
+ *not* publishing OVRL numbers for this model. The scores we obtain (around 2.0
144
+ overall, 2.1 on single-talk far-end) contradict informal listening:
145
+ single-talk far-end with 52 dB of cancellation is audibly near-silent, not a
146
+ "2-out-of-5" output. We suspect our DNSMOS invocation (input normalisation,
147
+ silence handling, or ONNX model variant) is miscalibrated for AEC outputs
148
+ and in particular for near-silent clips, which are out of distribution for a
149
+ speech-quality predictor. Until we can reconcile the numbers with a
150
+ DeepVQE-matching protocol we consider our OVRL numbers untrustworthy and
151
+ omit them rather than publish misleading figures.
152
 
153
  ## Architecture
154
 
155
  | Component | Value |
156
+ |---|---|
157
  | Sample rate | 16 kHz |
158
  | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
159
  | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
 
165
  | Kernel | (4, 4) time × freq, causal padding |
166
  | Parameters | ~0.9 M |
167
 
168
+ ## Building the C++ Inference Engine
169
+
170
+ Source, build system, and tests live at
171
+ <https://github.com/LocalAI-io/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
172
+ compiler. A [Nix](https://nixos.org/) flake is provided:
173
+
174
+ ```bash
175
+ git clone --recursive https://github.com/LocalAI-io/LocalVQE.git
176
+ cd LocalVQE
177
+
178
+ # With Nix:
179
+ nix develop
180
+ cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
181
+ cmake --build ggml/build -j$(nproc)
182
+
183
+ # Without Nix (install cmake, gcc/clang, pkg-config, libsndfile first):
184
+ cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
185
+ cmake --build ggml/build -j$(nproc)
186
+ ```
187
+
188
+ Binaries land in `ggml/build/bin/`. The CPU build produces multiple
189
+ `libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
190
+ Keep the binaries and `.so` files together.
191
+
192
+ ### Vulkan backend (embedded / integrated-GPU targets)
193
+
194
+ Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
195
+ CPU build: an additional `libggml-vulkan.so` is produced in
196
+ `ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
197
+ present, otherwise it falls back to the CPU variants.
198
+
199
+ ```bash
200
+ cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
201
+ cmake --build ggml/build -j$(nproc)
202
+ ```
203
+
204
+ The Nix flake's dev shell already includes `vulkan-loader`,
205
+ `vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
206
+ from your distro (Debian: `libvulkan-dev vulkan-headers
207
+ glslc`/`shaderc`).
208
+
209
+ ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
210
+
211
+ Measured with `bench` on Zen4 desktop (Ryzen 9 7900), 30 iters × 187 hops
212
+ = 5 610 streaming hops per backend. Each hop is a full
213
+ `ggml_backend_graph_compute`.
214
+
215
+ | Backend | p50 | p99 | max (quiet) | max (with load) |
216
+ |-----------------------------|--------:|--------:|------------:|----------------:|
217
+ | CPU, 1 thread | 3.46 ms | 3.59 ms | 4.93 ms | - |
218
+ | CPU, 2 threads | 2.05 ms | 2.17 ms | 3.34 ms | - |
219
+ | CPU, 4 threads | 1.26 ms | 1.48 ms | 3.07 ms | - |
220
+ | Vulkan, AMD iGPU (RADV) | 1.68 ms | 1.77 ms | 3.40 ms | 37.50 ms |
221
+ | Vulkan, NVIDIA RTX 5070 Ti | 1.68 ms | 1.79 ms | 3.40 ms | 31.72 ms |
222
+
223
+ Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
224
+ shared desktop is sensitive to external GPU clients (display compositor,
225
+ browser). On a dedicated embedded device with no compositor contending
226
+ for the queue, the "quiet" column is what you'll see.
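
The relationship between these per-hop timings and the 16 ms budget can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper names (`hop_budget_ms`, `realtime_factor`, `meets_deadline`) are made up for illustration, and the timings are copied from the p50 column of the table above.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_budget_ms(sample_rate_hz=SAMPLE_RATE_HZ, hop_samples=HOP_SAMPLES):
    """Wall-clock time covered by one hop of audio, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(compute_ms):
    """How many times faster than real time a hop is processed."""
    return hop_budget_ms() / compute_ms


def meets_deadline(compute_ms):
    """True if a hop finishes before the next hop of audio arrives."""
    return compute_ms <= hop_budget_ms()


# p50 per-hop timings in milliseconds, copied from the table above.
P50_MS = {
    "CPU, 1 thread": 3.46,
    "CPU, 2 threads": 2.05,
    "CPU, 4 threads": 1.26,
    "Vulkan, AMD iGPU (RADV)": 1.68,
    "Vulkan, NVIDIA RTX 5070 Ti": 1.68,
}

if __name__ == "__main__":
    budget = hop_budget_ms()
    for backend, ms in P50_MS.items():
        share = 100.0 * ms / budget
        print(f"{backend}: {realtime_factor(ms):.1f}x realtime, "
              f"{share:.0f}% of the {budget:.0f} ms budget")
```

By the same arithmetic, the "max (with load)" Vulkan figures (37.50 ms and 31.72 ms) exceed the 16 ms budget, so a single hop at that latency would miss its deadline unless the application buffers extra audio.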
227
+
228
+ ## Running Inference
229
+
230
+ Download `localvqe-v1-f32.gguf` from this repository (the file list above)
231
+ via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
232
+ `huggingface_hub`. Then:
233
+
234
+ ### CLI
235
+
236
+ ```bash
237
+ ./ggml/build/bin/localvqe localvqe-v1-f32.gguf \
238
+ --in-wav mic.wav ref.wav \
239
+ --out-wav enhanced.wav
240
+ ```
241
+
242
+ Expects 16 kHz mono PCM for both mic and far-end reference.
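
Before running the CLI it can help to confirm that both inputs match that format. The following is a minimal sketch using only Python's standard `wave` module; the `check_wav` helper is hypothetical and not part of LocalVQE. Note that `wave` only reads PCM-encoded WAV files and raises `wave.Error` for other encodings.

```python
import sys
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems


if __name__ == "__main__":
    # Usage: python check_wav.py mic.wav ref.wav
    status = 0
    for wav_path in sys.argv[1:]:
        issues = check_wav(wav_path)
        if issues:
            status = 1
            print(f"{wav_path}: " + "; ".join(issues))
        else:
            print(f"{wav_path}: OK (16 kHz mono)")
    sys.exit(status)
```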
243
+
244
+ ### Benchmark
245
+
246
+ ```bash
247
+ ./ggml/build/bin/bench localvqe-v1-f32.gguf \
248
+ --in-wav mic.wav ref.wav --iters 10 --profile
249
+ ```
250
+
251
+ ### Shared Library (C API)
252
+
253
+ ```bash
254
+ cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
255
+ cmake --build ggml/build -j$(nproc)
256
+ ```
257
+
258
+ Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
259
+ `ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
260
+ integration.
261
+
262
+ ### Quantizing (experimental)
263
+
264
+ The model was designed with quantization in mind (power-of-two channel
265
+ widths, kernel area 16, GGML-friendly real-valued arithmetic), but
266
+ calibrated Q4_K / Q8_0 weights are not yet published. The `quantize` tool
267
+ in the C++ build can produce GGUF variants from the F32 reference for
268
+ experimentation:
269
+
270
+ ```bash
271
+ ./ggml/build/bin/quantize localvqe-v1-f32.gguf localvqe-v1-q8.gguf Q8_0
272
+ ```
273
+
274
+ Expect end-to-end quality loss until proper per-tensor selection and
275
+ calibration have been worked through.
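
To give a feel for where that quality loss comes from, here is a simplified Python sketch of block-wise 8-bit quantization in the style of GGML's Q8_0 as we understand it: weights are grouped into blocks of 32, each block stores one scale plus 32 signed 8-bit integers. The function names are hypothetical, the scale is kept in full precision here (GGML stores it as fp16, which adds a little extra error), and this is not the code of the `quantize` tool.

```python
import math

QK8_0 = 32                       # values per block
Q8_0_BLOCK_BYTES = 2 + QK8_0     # fp16 scale + 32 int8 values (assumed layout)
F32_BLOCK_BYTES = 4 * QK8_0      # the same 32 values stored as float32


def quantize_block_q8_0(values):
    """Quantize 32 floats to (scale, 32 ints in [-127, 127])."""
    if len(values) != QK8_0:
        raise ValueError(f"expected {QK8_0} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    inv = 1.0 / scale if scale else 0.0
    # Round half away from zero, like C's roundf.
    quants = [int(math.copysign(math.floor(abs(v * inv) + 0.5), v))
              for v in values]
    return scale, quants


def dequantize_block_q8_0(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    block = [math.sin(n) for n in range(QK8_0)]
    scale, quants = quantize_block_q8_0(block)
    restored = dequantize_block_q8_0(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.6f}, worst-case error={worst:.6f}")
    print(f"bytes per block: {Q8_0_BLOCK_BYTES} vs {F32_BLOCK_BYTES} for F32")
```

The per-weight error is bounded by half the block scale, so a block containing one large outlier loses precision on all of its small weights. That is one reason per-tensor selection and calibration matter.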
276
+
277
+ ## PyTorch Reference
278
+
279
+ `localvqe-v1.pt` is the PyTorch checkpoint used to produce the GGUF export.
280
+ It is provided for verification, ablation, and downstream research, not
281
+ for end-user inference, which should go through the GGML build above. The
282
+ model definition lives under `pytorch/` in the
283
+ [GitHub repo](https://github.com/LocalAI-io/LocalVQE):
284
+
285
+ ```bash
286
+ git clone https://github.com/LocalAI-io/LocalVQE.git
287
+ cd LocalVQE/pytorch
288
+ pip install -r requirements.txt
289
+ ```
290
+
291
+ ## Citing LocalVQE
292
+
293
+ If you use LocalVQE in academic work, please cite the repository via the
294
+ `CITATION.cff` at <https://github.com/LocalAI-io/LocalVQE>; GitHub renders
295
+ a "Cite this repository" button that produces APA and BibTeX entries
296
+ automatically.
297
+
298
+ For a DOI, we recommend citing a specific release via
299
+ [Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
300
+ also cite the upstream DeepVQE paper:
301
 
302
  ```bibtex
303
  @inproceedings{indenbom2023deepvqe,
304
+ title = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
305
+ Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
306
+ author = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
307
+ and Chernov, Mykola and Aichner, Robert},
308
+ booktitle = {Interspeech},
309
+ year = {2023},
310
+ doi = {10.21437/Interspeech.2023-2176}
311
  }
312
  ```
313
+
314
+ ## Dataset Attribution
315
+
316
+ Published weights are trained on data from the
317
+ [ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
318
+ (Microsoft, CC BY 4.0) and fine-tuned on the
319
+ [ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
320
+
321
+ ## Safety Note
322
+
323
+ Training data was filtered by DNSMOS perceived-quality scores, which can
324
+ misclassify distressed speech (screaming, crying) as noise. LocalVQE may
325
+ attenuate or distort such signals and must not be relied upon for emergency
326
+ call or safety-critical applications.
327
+
328
+ ## License
329
+
330
+ Apache License 2.0.