---
library_name: pytorch
tags:
  - audio-to-audio
  - speech-enhancement
  - acoustic-echo-cancellation
  - noise-suppression
  - ggml
license: apache-2.0
---

# LocalVQE

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo)
[![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

**Local Voice Quality Enhancement** — a compact neural model for joint
acoustic echo cancellation (AEC), noise suppression, and dereverberation of
16 kHz speech, designed to run on commodity CPUs in real time.

- 1.3 M parameters (~5 MB F32)
- ~1.56 ms per 16 ms frame on Zen4 (4 threads) — **≈10× realtime**
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research

Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.
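
The headline latency figures follow directly from the hop length; a quick sanity-check sketch (the 1.56 ms p50 is the 4-thread Zen4 measurement reported in the benchmark tables below):

```python
SAMPLE_RATE = 16_000   # Hz
HOP = 256              # samples produced per model step

# Each hop carries 256 / 16000 s = 16 ms of audio, which is the
# real-time budget per step.
frame_budget_ms = HOP / SAMPLE_RATE * 1_000

p50_ms = 1.56          # measured per-hop compute, Zen4, 4 threads
realtime_factor = frame_budget_ms / p50_ms

print(f"{frame_budget_ms:.0f} ms budget, {realtime_factor:.1f}x realtime")
# -> 16 ms budget, 10.3x realtime
```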

This page is the Hugging Face model card — it hosts the published weights.
Source code, build system, tests, and training pipeline live in the GitHub
repository: <https://github.com/localai-org/LocalVQE>.

The current release is **v1.2**. It doubles the supported echo delay
window from 512 ms to 1024 ms at a ~20 % per-hop CPU cost, and it
avoids oversuppressing voices close to the noise floor.

The technical report describing the architecture, streaming-state contract,
and streaming-causal normalisation operator is included in this repo as
[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
from an existing author in those categories — if you can endorse, please
reach out via the GitHub repo.

**Authors:**
- Richard Palethorpe ([richiejp](https://github.com/richiejp))
- Claude (Anthropic)

LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 —
*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
Cancellation, Noise Suppression and Dereverberation*,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) — smaller, GGML-native,
and tuned for streaming CPU inference. The architecture is documented in
the technical report linked above.

## A concrete example

Picture a video call from a laptop. Your microphone picks up three things
alongside your voice:

1. The remote participant's voice, played back through your speakers and
   caught again by your mic — this is the **echo**. Without cancellation
   they hear themselves a fraction of a second later.
2. Your own voice bouncing off walls, desk, and monitor before reaching
   the mic — this is **reverberation**, the "tunnel" or "bathroom" sound
   that makes you feel far away from the listener.
3. A fan, keyboard clatter, a dog barking, or traffic outside β€” plain
   **background noise**.

LocalVQE removes all three in a single causal pass, frame by frame, on
the CPU, so only your voice reaches the far end.

## Why this, and not a classical AEC/NS stack?

Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
frame and remain a strong baseline when the acoustic path is benign. LocalVQE
is interesting when you want:

- **Robustness to non-linear echo paths** (small loudspeakers, handheld
  devices, plastic laptop chassis) where linear AEC leaves residual echo.
- **Non-stationary noise suppression** (babble, keyboards, fans changing
  speed) that energy-based noise estimators struggle with.
- **One model, many conditions** — no per-device tuning of step sizes,
  forgetting factors, or VAD thresholds.
- **A single deterministic causal pass** — no double-talk detector, no
  adaptation state that can diverge.

The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
~1–2 ms/frame. On anything larger than a microcontroller, that is still a
small fraction of the 16 ms real-time budget.
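
Putting "small fraction" in numbers (a sketch; the 0.1 ms classical figure is the ballpark above, not a measurement):

```python
FRAME_MS = 16.0  # real-time budget per 256-sample hop at 16 kHz

# Share of one core consumed at each per-frame cost.
for name, cost_ms in [("classical stack", 0.1), ("LocalVQE", 1.56)]:
    print(f"{name}: {cost_ms / FRAME_MS:.1%} of one core")
```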

## Why this, and not DeepVQE?

Microsoft never released DeepVQE — no weights, no reference
implementation, no streaming runtime. We re-implemented it from the
paper as a GGML graph at
[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
(the full-width ~7.5 M-parameter version) before starting LocalVQE.
LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
(~5 MB F32), small enough to run on commodity CPUs in real time.

## Files in this repository

| File | Size | Description |
|---|---|---|
| `localvqe-v1.2-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |
| `localvqe-v1.1-1.3M.pt` | 11 MB | Previous release. |
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | Previous release (F32 GGUF). |
| `localvqe-v1-1.3M-f32.gguf` | 5 MB | Original release. |

Only F32 GGUF is published today. A `quantize` tool is included in the
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
released.

## Validation Results

Full 800-clip eval on the
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
— real recordings, not synthetic mixes.

| Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
| doubletalk                        | 115 |          4.72 |         2.37 |       8.4 dB |          2.83 |
| doubletalk-with-movement          | 185 |          4.65 |         2.30 |       8.1 dB |          2.79 |
| farend-singletalk                 | 107 |          3.78 |         4.91 |      45.7 dB |          1.80 |
| farend-singletalk-with-movement   | 193 |          4.12 |         4.96 |      40.6 dB |          1.75 |
| nearend-singletalk                | 200 |          5.00 |         4.16 |       2.1 dB |          3.17 |

v1.2 vs v1.1 deltas: AECMOS echo MOS +0.80 / +0.72 on farend-singletalk
(FE-ST) and FE-ST-with-movement (the primary release goal — these are
the scenarios where echo leaks through), near-end degradation MOS +0.11,
double-talk roughly unchanged. FE-ST-with-movement raw ERLE drops
4.4 dB; v1.2 is less aggressive when the echo path is moving, trading
raw cancellation for fewer near-end gating artefacts.

- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
  quality predictor. "Echo" rates how well echo was removed; "degradation"
  rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
  far-end single-talk where the input is echo-only; on scenes with active
  near-end speech it understates echo removal because both numerator and
  denominator are dominated by speech.
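
The metric is easy to reproduce; a minimal pure-Python sketch (helper names are ours, not part of the project's tooling):

```python
import math

def blind_erle_db(mic, enh):
    """Blind ERLE: 10*log10(E[mic^2] / E[enh^2]).

    Only meaningful when `mic` is echo-only (far-end single-talk);
    with active near-end speech both energies are speech-dominated.
    """
    e_mic = sum(x * x for x in mic) / len(mic)
    e_enh = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10(e_mic / e_enh)

# Toy example: an echo-only frame attenuated 100x in amplitude.
mic = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
enh = [x / 100.0 for x in mic]
print(f"{blind_erle_db(mic, enh):.1f} dB")  # 40.0 dB: /100 amplitude = 40 dB energy drop
```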

## Building the C++ Inference Engine

Source, build system, and tests live at
<https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
compiler. A [Nix](https://nixos.org/) flake is provided:

```bash
git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE

# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
```

Binaries land in `ggml/build/bin/`. The CPU build produces multiple
`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
Keep the binaries and `.so` files together.

### Vulkan backend (embedded / integrated-GPU targets)

Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
CPU build — an additional `libggml-vulkan.so` is produced in
`ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
present, otherwise it falls back to the CPU variants.

```bash
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)
```

The Nix flake's dev shell already includes `vulkan-loader`,
`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
from your distro (Debian: `libvulkan-dev vulkan-headers
glslc`/`shaderc`).

### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
full `ggml_backend_graph_compute`.

**v1.2** (current, 1024 ms echo-search window):

| Backend                     | Threads | p50     | p99     | max     |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU                         |       1 | 4.15 ms | 4.53 ms | 6.23 ms |
| CPU                         |       4 | 1.56 ms | 1.73 ms | 4.57 ms |
| CPU                         |       8 | 1.89 ms | 2.15 ms | 6.91 ms |
| CPU                         |      16 | 2.12 ms | 2.17 ms | 6.43 ms |
| Vulkan — AMD iGPU (RADV)    |       — | 4.88 ms | 5.06 ms | 6.24 ms |
| Vulkan — NVIDIA RTX 5070 Ti |       — | 1.79 ms | 3.42 ms | 5.42 ms |

Beyond ≈4 threads the model is small enough that thread-launch and
synchronisation overhead dominate; **four threads is the sweet spot
on Zen4**.

**v1.1** (previous, 512 ms echo-search window) for comparison:

| Backend                     | Threads | p50     | p99     | max     |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU                         |       2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU                         |       4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan — AMD iGPU (RADV)    |       — | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan — NVIDIA RTX 5070 Ti |       — | 1.79 ms | 3.41 ms | 4.14 ms |

Vulkan p50 and p99 are tight, but worst-case single-hop latency on a
shared desktop is sensitive to external GPU clients (display
compositor, browser). On a dedicated embedded device with no
compositor contending for the queue, expect the quieter end of the
range.

## Running Inference

Download `localvqe-v1.2-1.3M-f32.gguf` from this repository (see the file
list above) via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
`huggingface_hub`. Then:

### CLI

```bash
./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav \
    --out-wav enhanced.wav
```

Expects 16 kHz mono PCM for both mic and far-end reference.
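
The CLI does not resample or downmix, so it can be worth validating inputs first. A stdlib-only sketch (a helper of ours, not part of the LocalVQE tooling; it assumes 16-bit PCM, the usual WAV encoding):

```python
import wave

def check_input_wav(path: str) -> tuple[bool, str]:
    """Check a WAV file against what the `localvqe` CLI expects:
    16 kHz, mono, 16-bit PCM. Returns (ok, human-readable format)."""
    with wave.open(path, "rb") as w:
        fmt = (f"{w.getframerate()} Hz, {w.getnchannels()} ch, "
               f"{8 * w.getsampwidth()}-bit")
        ok = (w.getframerate() == 16000
              and w.getnchannels() == 1
              and w.getsampwidth() == 2)
    return ok, fmt
```

A non-conforming file can usually be converted with e.g. `ffmpeg -i in.wav -ac 1 -ar 16000 out.wav` (ffmpeg writes 16-bit PCM WAV by default).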

### Benchmark

```bash
./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav --iters 10 --profile
```

### Shared Library (C API)

```bash
cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
```

Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
integration.

### Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
tool in the C++ build can produce GGUF variants from the F32 reference
for experimentation:

```bash
./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
```

Expect end-to-end quality loss until proper per-tensor selection and
calibration have been worked through.
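
For a rough sense of what quantization would buy: GGML's Q8_0 stores 32 weights plus one f16 scale in 34 bytes (8.5 bits/weight), and Q4_K averages about 4.5 bits/weight. A back-of-envelope sketch, assuming every parameter were quantized (in practice some tensors stay at higher precision, so real files come out somewhat larger):

```python
PARAMS = 1.3e6  # approximate parameter count

def approx_mb(bits_per_weight: float) -> float:
    """Ideal file size in MB if all weights used this format."""
    return PARAMS * bits_per_weight / 8 / 1e6

for name, bpw in [("F32", 32.0), ("Q8_0", 8.5), ("Q4_K", 4.5)]:
    print(f"{name}: ~{approx_mb(bpw):.1f} MB")
# F32: ~5.2 MB (matches the published GGUF), Q8_0: ~1.4 MB, Q4_K: ~0.7 MB
```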

## PyTorch Reference

`localvqe-v1.2-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research — not
for end-user inference, which should go through the GGML build above. The
model definition lives under `pytorch/` in the
[GitHub repo](https://github.com/localai-org/LocalVQE):

```bash
git clone https://github.com/localai-org/LocalVQE.git
cd LocalVQE/pytorch
pip install -r requirements.txt
```

## Citing LocalVQE

If you use LocalVQE in academic work, please cite the repository via the
`CITATION.cff` at <https://github.com/localai-org/LocalVQE> — GitHub renders
a "Cite this repository" button that produces APA and BibTeX entries
automatically.

For a DOI, we recommend citing a specific release via
[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
also cite the upstream DeepVQE paper:

```bibtex
@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
               and Chernov, Mykola and Aichner, Robert},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}
```

## Dataset Attribution

Published weights are trained on data from the
[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
(Microsoft, CC BY 4.0) and fine-tuned on the
[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).

## Safety Note

Training data was filtered by DNSMOS perceived-quality scores, which can
misclassify distressed speech (screaming, crying) as noise. LocalVQE may
attenuate or distort such signals and must not be relied upon for
emergency-call or other safety-critical applications.

## License

Apache License 2.0.