Sync model card with upstream GitHub inference README
README.md
CHANGED
|
@@ -15,366 +15,155 @@ license: apache-2.0
|
|
| 15 |
[](https://github.com/localai-org/LocalVQE)
|
| 16 |
[](https://www.apache.org/licenses/LICENSE-2.0)
|
| 17 |
|
| 18 |
-
**Local Voice Quality Enhancement** —
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
(
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
v1.2
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
**
|
| 47 |
-
|
| 48 |
-
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
## A concrete example
|
| 58 |
-
|
| 59 |
-
Picture a video call from a laptop. Your microphone picks up three things
|
| 60 |
-
alongside your voice:
|
| 61 |
-
|
| 62 |
-
1. The remote participant's voice, played back through your speakers and
|
| 63 |
-
caught again by your mic — this is the **echo**. Without cancellation
|
| 64 |
-
they hear themselves a fraction of a second later.
|
| 65 |
-
2. Your own voice bouncing off walls, desk, and monitor before reaching
|
| 66 |
-
the mic — this is **reverberation**, the "tunnel" or "bathroom" sound
|
| 67 |
-
that makes you feel far away from the listener.
|
| 68 |
-
3. A fan, keyboard clatter, a dog barking, or traffic outside — plain
|
| 69 |
-
**background noise**.
|
| 70 |
-
|
| 71 |
-
LocalVQE removes all three in a single causal pass, frame by frame, on
|
| 72 |
-
the CPU, so only your voice reaches the far end.
|
| 73 |
-
|
| 74 |
-
## Why this, and not a classical AEC/NS stack?
|
| 75 |
-
|
| 76 |
-
Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
|
| 77 |
-
NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
|
| 78 |
-
frame and remain a strong baseline when the acoustic path is benign. LocalVQE
|
| 79 |
-
is interesting when you want:
|
| 80 |
-
|
| 81 |
-
- **Robustness to non-linear echo paths** (small loudspeakers, handheld
|
| 82 |
-
devices, plastic laptop chassis) where linear AEC leaves residual echo.
|
| 83 |
-
- **Non-stationary noise suppression** (babble, keyboards, fans changing
|
| 84 |
-
speed) that energy-based noise estimators struggle with.
|
| 85 |
-
- **One model, many conditions** — no per-device tuning of step sizes,
|
| 86 |
-
forgetting factors, or VAD thresholds.
|
| 87 |
-
- **A single deterministic causal pass** — no double-talk detector, no
|
| 88 |
-
adaptation state that can diverge.
|
| 89 |
-
|
| 90 |
-
The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
|
| 91 |
-
~1–2 ms/frame. On anything larger than a microcontroller that's still a
|
| 92 |
-
small fraction of a real-time budget.
|
| 93 |
-
|
| 94 |
-
## Why this, and not DeepVQE?
|
| 95 |
-
|
| 96 |
-
Microsoft never released DeepVQE — no weights, no reference
|
| 97 |
-
implementation, no streaming runtime. We re-implemented it from the
|
| 98 |
-
paper as a GGML graph at
|
| 99 |
-
[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
|
| 100 |
-
(the full-width ~7.5 M-parameter version) before starting LocalVQE.
|
| 101 |
-
LocalVQE is the same idea rebuilt for streaming CPU inference, and
|
| 102 |
-
published in two sizes: a 1.3 M-parameter compact build (v1.2,
|
| 103 |
-
~5 MB F32) for tight CPU budgets, and a 4.8 M-parameter wider build
|
| 104 |
-
(v1.3, ~19 MB F32) that filters noise better on some clips at ~2×
|
| 105 |
-
the per-hop cost. Both are small enough to run real time on
|
| 106 |
-
commodity CPUs.
|
| 107 |
|
| 108 |
## Files in this repository
|
| 109 |
|
| 110 |
-
| File | Size |
|
| 111 |
|---|---|---|
|
| 112 |
-
| `localvqe-v1.
|
| 113 |
-
| `localvqe-v1.
|
| 114 |
-
| `localvqe-v1.2-
|
| 115 |
-
| `localvqe-v1.
|
| 116 |
-
| `localvqe-v1.
|
| 117 |
-
| `localvqe-v1-1.3M-f32.gguf` | 5 MB |
|
|
|
|
|
|
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
released.
|
| 122 |
|
| 123 |
-
##
|
| 124 |
|
| 125 |
Full 800-clip eval on the
|
| 126 |
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
|
| 137 |
-
|
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
|
| 142 |
-
|-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
|
| 185 |
```bash
|
| 186 |
-
|
| 187 |
-
cd LocalVQE
|
| 188 |
-
|
| 189 |
-
# With Nix:
|
| 190 |
-
nix develop
|
| 191 |
-
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
|
| 192 |
-
cmake --build ggml/build -j$(nproc)
|
| 193 |
-
|
| 194 |
-
# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
|
| 195 |
-
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
|
| 196 |
-
cmake --build ggml/build -j$(nproc)
|
| 197 |
-
```
|
| 198 |
-
|
| 199 |
-
Binaries land in `ggml/build/bin/`. The CPU build produces multiple
|
| 200 |
-
`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
|
| 201 |
-
Keep the binaries and `.so` files together.
|
| 202 |
-
|
| 203 |
-
### Vulkan backend (embedded / integrated-GPU targets)
|
| 204 |
-
|
| 205 |
-
Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
|
| 206 |
-
CPU build — an additional `libggml-vulkan.so` is produced in
|
| 207 |
-
`ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
|
| 208 |
-
present, otherwise it falls back to the CPU variants.
|
| 209 |
-
|
| 210 |
-
```bash
|
| 211 |
-
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
|
| 212 |
-
cmake --build ggml/build -j$(nproc)
|
| 213 |
```
|
| 214 |
|
| 215 |
-
|
| 216 |
-
`
|
| 217 |
-
|
| 218 |
-
glslc`/`shaderc`).
|
| 219 |
-
|
| 220 |
-
### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
|
| 221 |
-
|
| 222 |
-
Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
|
| 223 |
-
full `ggml_backend_graph_compute`.
|
| 224 |
-
|
| 225 |
-
**v1.3** (current — 4.8 M, wider encoder/decoder, bn 256):
|
| 226 |
-
|
| 227 |
-
| Backend | Threads | p50 | p99 | RT factor |
|
| 228 |
-
|-----------------------------|--------:|--------:|--------:|----------:|
|
| 229 |
-
| CPU | 1 | 9.73 ms | 14.48 ms | 1.58× |
|
| 230 |
-
| CPU | 2 | 5.41 ms | 5.62 ms | 2.95× |
|
| 231 |
-
| CPU | 4 | 3.21 ms | 3.42 ms | 4.97× |
|
| 232 |
-
| CPU | 8 | 3.47 ms | 3.80 ms | 4.59× |
|
| 233 |
-
| CPU | 16 | 3.79 ms | 4.06 ms | 4.19× |
|
| 234 |
-
| Vulkan — AMD iGPU (RADV) | — | 8.71 ms | 9.15 ms | 1.83× |
|
| 235 |
-
| Vulkan — NVIDIA RTX 5070 Ti | — | 2.57 ms | 4.21 ms | 6.07× |
|
| 236 |
-
|
| 237 |
-
The wider v1.3 model is ~2× the per-hop cost of v1.2 in matching
|
| 238 |
-
configurations. The dGPU (RTX 5070 Ti) ends up the fastest option
|
| 239 |
-
by ~1.25× vs 4-thread CPU. The 1-thread case is the worst, still
|
| 240 |
-
real-time (RT 1.58×) but with little margin — running v1.3 on a
|
| 241 |
-
low-core / power-constrained device should use v1.2 instead.
|
| 242 |
-
|
| 243 |
-
**v1.2** (compact alternative — 1.3 M, 1024 ms echo-search window):
|
| 244 |
-
|
| 245 |
-
| Backend | Threads | p50 | p99 | RT factor |
|
| 246 |
-
|-----------------------------|--------:|--------:|--------:|----------:|
|
| 247 |
-
| CPU | 1 | 4.28 ms | 4.85 ms | 3.72× |
|
| 248 |
-
| CPU | 2 | 2.59 ms | 3.80 ms | 6.09× |
|
| 249 |
-
| CPU | 4 | 1.65 ms | 2.91 ms | 8.90× |
|
| 250 |
-
| CPU | 8 | 1.93 ms | 2.41 ms | 8.22× |
|
| 251 |
-
| CPU | 16 | 2.09 ms | 2.22 ms | 7.69× |
|
| 252 |
-
| Vulkan — AMD iGPU (RADV) | — | 6.10 ms | 6.53 ms | 2.61× |
|
| 253 |
-
| Vulkan — NVIDIA RTX 5070 Ti | — | 1.96 ms | 3.64 ms | 7.85× |
|
| 254 |
-
|
| 255 |
-
Beyond ≈4 threads both models are small enough that thread-launch
|
| 256 |
-
and synchronisation overhead dominate; **four threads is the sweet
|
| 257 |
-
spot on Zen4** for both v1.2 and v1.3.
|
| 258 |
-
|
| 259 |
-
**v1.1** (older, 512 ms echo-search window) for comparison:
|
| 260 |
-
|
| 261 |
-
| Backend | Threads | p50 | p99 | max |
|
| 262 |
-
|-----------------------------|--------:|--------:|--------:|--------:|
|
| 263 |
-
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
|
| 264 |
-
| CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
|
| 265 |
-
| CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
|
| 266 |
-
| Vulkan — AMD iGPU (RADV) | — | 4.43 ms | 4.62 ms | 5.07 ms |
|
| 267 |
-
| Vulkan — NVIDIA RTX 5070 Ti | — | 1.79 ms | 3.41 ms | 4.14 ms |
|
| 268 |
-
|
| 269 |
-
Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
|
| 270 |
-
shared desktop is sensitive to external GPU clients (display
|
| 271 |
-
compositor, browser). On a dedicated embedded device with no
|
| 272 |
-
compositor contending for the queue, expect the quieter end of the
|
| 273 |
-
range.
|
| 274 |
-
|
| 275 |
-
### Memory footprint (CPU)
|
| 276 |
-
|
| 277 |
-
Process RSS from `bench` on Zen4, measured via `/proc/self/status`.
|
| 278 |
-
Same numbers under every thread count from 1 to 16 — the runtime has
|
| 279 |
-
no per-thread arenas of meaningful size, so peak RSS is set by
|
| 280 |
-
weights + activations + history scratch.
|
| 281 |
-
|
| 282 |
-
| Model | Post-load delta ¹ | Peak RSS (VmHWM) ² |
|
| 283 |
-
|---------------------|------------------:|-------------------:|
|
| 284 |
-
| **v1.3** (4.8 M) | +24.4 MiB | 34.1 MiB |
|
| 285 |
-
| **v1.2** (1.3 M) | +10.0 MiB | 19.6 MiB |
|
| 286 |
-
|
| 287 |
-
¹ RSS added by loading the model + initialising the CPU backend, on
|
| 288 |
-
top of a ~7 MiB binary-and-libs baseline. This is the portable
|
| 289 |
-
"working set the model brings" number; the absolute peak will depend
|
| 290 |
-
on your host process baseline.
|
| 291 |
-
|
| 292 |
-
² Steady-state ceiling after warmup + sustained streaming. v1.3 is
|
| 293 |
-
~1.75× v1.2 in RSS terms despite carrying ~3.7× more parameters —
|
| 294 |
-
activation/history buffers don't scale with channel width. GPU
|
| 295 |
-
backends are not reflected here (VRAM doesn't appear in
|
| 296 |
-
`/proc/self/status`); for those, `bench --profile` prints the
|
| 297 |
-
backend-internal weight/activation buffer sizes.
|
| 298 |
-
|
| 299 |
-
## Running Inference
|
| 300 |
-
|
| 301 |
-
Download a GGUF from the file list above — `localvqe-v1.3-4.8M-f32.gguf`
|
| 302 |
-
for the current default, or `localvqe-v1.2-1.3M-f32.gguf` for the
|
| 303 |
-
smaller / faster option — via `huggingface-cli`, the Hub web UI, or
|
| 304 |
-
`hf_hub_download` from `huggingface_hub`. The CLI flags are the same
|
| 305 |
-
either way; the examples below use v1.2 so the snippets are shorter
|
| 306 |
-
to type. Swap the filename in to run v1.3.
|
| 307 |
-
|
| 308 |
-
### CLI
|
| 309 |
-
|
| 310 |
-
```bash
|
| 311 |
-
./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
|
| 312 |
-
--in-wav mic.wav ref.wav \
|
| 313 |
-
--out-wav enhanced.wav
|
| 314 |
-
```
|
| 315 |
-
|
| 316 |
-
Expects 16 kHz mono PCM for both mic and far-end reference.
|
| 317 |
-
|
| 318 |
-
### Benchmark
|
| 319 |
-
|
| 320 |
-
```bash
|
| 321 |
-
./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
|
| 322 |
-
--in-wav mic.wav ref.wav --iters 10 --profile
|
| 323 |
-
```
|
| 324 |
|
| 325 |
-
##
|
| 326 |
-
|
| 327 |
-
```bash
|
| 328 |
-
cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
|
| 329 |
-
cmake --build ggml/build -j$(nproc)
|
| 330 |
-
```
|
| 331 |
-
|
| 332 |
-
Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
|
| 333 |
-
`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
|
| 334 |
-
integration.
|
| 335 |
-
|
| 336 |
-
### Quantizing (experimental)
|
| 337 |
-
|
| 338 |
-
Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
|
| 339 |
-
tool in the C++ build can produce GGUF variants from the F32 reference
|
| 340 |
-
for experimentation:
|
| 341 |
-
|
| 342 |
-
```bash
|
| 343 |
-
./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
|
| 344 |
-
```
|
| 345 |
-
|
| 346 |
-
Expect end-to-end quality loss until proper per-tensor selection and
|
| 347 |
-
calibration have been worked through.
|
| 348 |
-
|
| 349 |
-
## PyTorch Reference
|
| 350 |
-
|
| 351 |
-
`localvqe-v1.3-4.8M.pt` (current) and `localvqe-v1.2-1.3M.pt`
|
| 352 |
-
(compact alternative) are the PyTorch checkpoints used to produce
|
| 353 |
-
the GGUF exports. They are provided for verification, ablation, and
|
| 354 |
-
downstream research — not for end-user inference, which should go
|
| 355 |
-
through the GGML build above. Both share `arch_version=3` (pre-norm
|
| 356 |
-
CausalGroupNorm + SiLU + STFT-256) and differ only in width
|
| 357 |
-
(`mic_channels`, `far_channels`, `bottleneck_hidden`), which the
|
| 358 |
-
loader reads from the saved `model_config` field. The model
|
| 359 |
-
definition lives under `pytorch/` in the
|
| 360 |
-
[GitHub repo](https://github.com/localai-org/LocalVQE):
|
| 361 |
-
|
| 362 |
-
```bash
|
| 363 |
-
git clone https://github.com/localai-org/LocalVQE.git
|
| 364 |
-
cd LocalVQE/pytorch
|
| 365 |
-
pip install -r requirements.txt
|
| 366 |
-
```
|
| 367 |
|
| 368 |
-
|
| 369 |
|
| 370 |
-
|
| 371 |
-
`CITATION.cff` at <https://github.com/localai-org/LocalVQE> — GitHub renders
|
| 372 |
-
a "Cite this repository" button that produces APA and BibTeX entries
|
| 373 |
-
automatically.
|
| 374 |
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
|
| 379 |
```bibtex
|
| 380 |
@inproceedings{indenbom2023deepvqe,
|
|
@@ -382,25 +171,23 @@ also cite the upstream DeepVQE paper:
|
|
| 382 |
Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
|
| 383 |
author = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
|
| 384 |
and Chernov, Mykola and Aichner, Robert},
|
| 385 |
-
booktitle = {Interspeech},
|
| 386 |
-
year = {2023},
|
| 387 |
doi = {10.21437/Interspeech.2023-2176}
|
| 388 |
}
|
| 389 |
```
|
| 390 |
|
| 391 |
-
## Dataset
|
| 392 |
|
| 393 |
-
|
| 394 |
-
[ICASSP 2023
|
| 395 |
(Microsoft, CC BY 4.0) and fine-tuned on the
|
| 396 |
-
[ICASSP 2022/2023
|
| 397 |
|
| 398 |
-
## Safety
|
| 399 |
|
| 400 |
-
Training data was filtered by DNSMOS
|
| 401 |
-
|
| 402 |
-
|
| 403 |
-
call or safety-critical applications.
|
| 404 |
|
| 405 |
## License
|
| 406 |
|
|
|
|
| 15 |
[](https://github.com/localai-org/LocalVQE)
|
| 16 |
[](https://www.apache.org/licenses/LICENSE-2.0)
|
| 17 |
|
| 18 |
+
**Local Voice Quality Enhancement** — compact neural models for acoustic echo
|
| 19 |
+
cancellation (AEC), noise suppression (NS), and dereverberation of 16 kHz
|
| 20 |
+
speech, running on commodity CPUs in real time. Causal and streaming
|
| 21 |
+
(256-sample hop, 16 ms latency).
|
| 22 |
+
|
| 23 |
+
- **Try it:** <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>
|
| 24 |
+
- **Source, build system, tests:** <https://github.com/localai-org/LocalVQE>
|
| 25 |
+
|
| 26 |
+
This page hosts the published weights. Inference runs the GGML C++ engine on
|
| 27 |
+
the GGUF files directly (build instructions on GitHub).
|
| 28 |
+
|
| 29 |
+
**Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
|
| 30 |
+
Claude (Anthropic). LocalVQE is a streaming, CPU-tuned derivative of **DeepVQE**
|
| 31 |
+
([Indenbom et al., Interspeech 2023](https://arxiv.org/abs/2306.03177)).
|
| 32 |
+
|
| 33 |
+
## Models
|
| 34 |
+
|
| 35 |
+
Speed is per 16 ms hop on a Ryzen 9 7900 (Zen4), 4 threads; RT = realtime
|
| 36 |
+
factor (above 1× is faster than realtime; higher is faster).
|
| 37 |
+
|
| 38 |
+
| Version | Does | Params | Size (F32) | Speed | Pick it when |
|
| 39 |
+
|---|---|---:|---:|---|---|
|
| 40 |
+
| **v1.3** *(current)* | AEC + NS + dereverb | 4.8 M | ~19 MB | 3.2 ms · 5.0× RT | best joint quality, CPU budget available |
|
| 41 |
+
| **v1.2** | AEC + NS + dereverb | 1.3 M | ~5 MB | 1.7 ms · 8.9× RT | tight CPU / low-power devices |
|
| 42 |
+
| **v1.4-AEC** | echo only (keeps voice, noise, room) | 203 K | ~3 MB | 0.83 ms · 19× RT | NS is handled elsewhere, or you want the room kept |
|
| 43 |
+
| **v1.4-AEC 2.7K** | echo only, linear filter (no mask) | 2.7 K | ~17 KB | 0.36 ms · 44× RT | lightest echo canceller; echo isn't heavily reverberant |
|
| 44 |
+
| v1.1 / v1 | AEC + NS + dereverb | 1.3 M | ~5 MB | — | superseded by v1.2 |
|
| 45 |
+
|
| 46 |
+
- **Joint models (v1.2 / v1.3)** clean echo, noise, and reverb in one pass.
|
| 47 |
+
v1.3 is wider and filters noise better; v1.2 is roughly half the per-hop cost.
|
| 48 |
+
- **v1.4-AEC** removes only the far-end echo and passes voice, room, and
|
| 49 |
+
background through unchanged. It's a classical adaptive filter followed by a
|
| 50 |
+
small neural mask. The **2.7K** build is that filter alone — cheaper and
|
| 51 |
+
gentler, but it can't remove heavily reverberant echo the way the mask can.
|
| 52 |
+
- Every model needs a far-end **reference** signal (a loopback of what your
|
| 53 |
+
speakers play) in addition to the mic.
|
| 54 |
+
- `bf16` GGUFs are ~12 % smaller with identical quality and speed; pick `f32`
|
| 55 |
+
unless download size matters.
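To make the "classical adaptive filter" idea concrete, here is a minimal sketch of a normalized LMS (NLMS) echo canceller in plain Python. NLMS is used purely as a textbook stand-in: the text above does not say which adaptive algorithm the v1.4-AEC front-end uses, and the function name, tap count, and step size are illustrative assumptions.

```python
def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    Generic NLMS, NOT the LocalVQE front-end. `mic` and `ref` are
    equal-length sequences of float samples at the same rate.
    Returns the residual (echo-reduced) signal as a list of floats.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        # Predicted echo: reference history filtered by the current weights.
        echo_hat = sum(w * h for w, h in zip(weights, history))
        err = d - echo_hat
        # Normalise the update by the reference energy so the step size
        # does not depend on the playback level.
        energy = sum(h * h for h in history) + eps
        step = mu * err / energy
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(err)
    return residual
```

A purely linear filter like this can only model a linear echo path shorter than its tap window, which is consistent with the note above that the filter-only build struggles with heavily reverberant echo.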
|
| 56 |
|
| 57 |
## Files in this repository
|
| 58 |
|
| 59 |
+
| File | Size | Model |
|
| 60 |
|---|---|---|
|
| 61 |
+
| `localvqe-v1.4-aec-200K-f32.gguf` | 3 MB | v1.4-AEC (echo only) |
|
| 62 |
+
| `localvqe-v1.4-aec-200K-bf16.gguf` | 2.6 MB | v1.4-AEC, conv weights in BF16 |
|
| 63 |
+
| `localvqe-v1.4-aec-2.7K-f32.gguf` | 17 KB | v1.4-AEC front-end only (adaptive filter, no mask) |
|
| 64 |
+
| `localvqe-v1.3-4.8M-f32.gguf` | 19 MB | v1.3 joint — GGUF the engine loads |
|
| 65 |
+
| `localvqe-v1.3-4.8M.pt` | 55 MB | v1.3 joint — PyTorch checkpoint (research) |
|
| 66 |
+
| `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | v1.2 joint — GGUF |
|
| 67 |
+
| `localvqe-v1.2-1.3M.pt` | 11 MB | v1.2 joint — PyTorch checkpoint |
|
| 68 |
+
| `localvqe-v1.1-1.3M-f32.gguf`, `localvqe-v1-1.3M-f32.gguf` | 5 MB | older releases |
|
| 69 |
|
| 70 |
+
v1.4-AEC is GGUF-only (no `.pt`). GGUF integrity is checked at load time against
|
| 71 |
+
a built-in SHA256 allowlist in the engine.
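The engine performs this check itself, but the same idea can be reproduced on the user side, for example to verify a download before shipping it. The sketch below assumes you supply your own set of trusted digests; the function names are hypothetical and the engine's actual allowlist values are not reproduced here.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large model files are never held
        # in memory all at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_allowed(path, allowlist):
    """True if the file's digest is in `allowlist` (a set of hex strings).

    `allowlist` is an assumption of this sketch: the caller provides it.
    """
    return sha256_of_file(path) in allowlist
```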
|
| 72 |
|
| 73 |
+
## Performance
|
| 74 |
|
| 75 |
Full 800-clip eval on the
|
| 76 |
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
|
| 77 |
+
(real recordings). AECMOS echo / deg are 1–5 (higher = more echo removed /
|
| 78 |
+
cleaner speech); blind ERLE is `10·log10(E[mic²]/E[enh²])`, only meaningful on
|
| 79 |
+
far-end-only clips. Unprocessed-mic echo MOS is 2.67 / 2.56 / 1.90 / 2.13 / 5.00
|
| 80 |
+
across the five scenarios.
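The blind ERLE formula quoted above can be sketched in a few lines. This is a minimal illustration, assuming `mic` and `enh` are sequences of float samples from the same clip; the function name and the small `eps` guard are choices made for this sketch, not part of the evaluation code behind the tables.

```python
import math

def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10 * log10(E[mic^2] / E[enh^2]).

    `eps` (an assumption of this sketch) keeps the ratio finite when
    the enhanced signal is digital silence.
    """
    if not mic or not enh:
        raise ValueError("both signals must be non-empty")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))
```

Because the measure only compares energy before and after processing, it rewards removing anything at all, which is why it is only meaningful on far-end-only clips where the whole mic signal is echo.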
|
| 81 |
+
|
| 82 |
+
**v1.4-AEC** — keeps background noise and room by design, so its ERLE and
|
| 83 |
+
far-end DNSMOS are intentionally lower than the joint models (it isn't deleting
|
| 84 |
+
the ambience):
|
| 85 |
+
|
| 86 |
+
| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
|
| 87 |
+
|---|--:|--:|--:|--:|--:|
|
| 88 |
+
| doubletalk | 115 | 4.20 | 2.45 | — | 2.59 |
|
| 89 |
+
| doubletalk-with-movement | 185 | 4.19 | 2.45 | — | 2.55 |
|
| 90 |
+
| farend-singletalk | 107 | 3.80 | 4.99 | 14.6 dB | 1.37 |
|
| 91 |
+
| farend-singletalk-with-movement | 193 | 3.86 | 4.95 | 11.1 dB | 1.31 |
|
| 92 |
+
| nearend-singletalk | 200 | 4.99 | 3.99 | — | 3.08 |
|
| 93 |
+
|
| 94 |
+
**v1.4-AEC 2.7K** (front-end only) — matches or beats the full model's
|
| 95 |
+
perceptual far-end echo at 1/74 the parameters; the mask's extra work shows up
|
| 96 |
+
as higher ERLE above, not higher echo MOS:
|
| 97 |
+
|
| 98 |
+
| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
|
| 99 |
+
|---|--:|--:|--:|--:|--:|
|
| 100 |
+
| doubletalk | 115 | 4.00 | 2.79 | — | 2.46 |
|
| 101 |
+
| doubletalk-with-movement | 185 | 3.90 | 2.92 | — | 2.42 |
|
| 102 |
+
| farend-singletalk | 107 | 4.06 | 5.00 | 6.5 dB | 1.24 |
|
| 103 |
+
| farend-singletalk-with-movement | 193 | 4.05 | 4.97 | 3.9 dB | 1.22 |
|
| 104 |
+
| nearend-singletalk | 200 | 4.98 | 3.77 | — | 3.03 |
|
| 105 |
+
|
| 106 |
+
**v1.3** (joint) and **v1.2** (joint) — these also delete the background, so
|
| 107 |
+
their far-end ERLE is much higher and not comparable to v1.4-AEC's:
|
| 108 |
+
|
| 109 |
+
| Scenario | n | v1.3 echo / deg / ERLE / OVRL | v1.2 echo / deg / ERLE / OVRL |
|
| 110 |
+
|---|--:|---|---|
|
| 111 |
+
| doubletalk | 115 | 4.73 / 2.62 / 8.5 dB / 2.89 | 4.72 / 2.37 / 8.4 dB / 2.83 |
|
| 112 |
+
| doubletalk-with-movement | 185 | 4.67 / 2.43 / 8.3 dB / 2.85 | 4.65 / 2.30 / 8.1 dB / 2.79 |
|
| 113 |
+
| farend-singletalk | 107 | 3.69 / 4.83 / 50.9 dB / 1.94 | 3.78 / 4.91 / 45.7 dB / 1.80 |
|
| 114 |
+
| farend-singletalk-with-movement | 193 | 3.88 / 4.98 / 49.9 dB / 1.96 | 4.12 / 4.96 / 40.6 dB / 1.75 |
|
| 115 |
+
| nearend-singletalk | 200 | 5.00 / 4.18 / 2.4 dB / 3.17 | 5.00 / 4.16 / 2.1 dB / 3.17 |
|
| 116 |
+
|
| 117 |
+
### Latency
|
| 118 |
+
|
| 119 |
+
Per-hop p50 / RT factor on a Ryzen 9 7900 (Zen4). 16 kHz, 256-sample hop.
|
| 120 |
+
|
| 121 |
+
| Model | 1 thread | 4 threads | dGPU (RTX 5070 Ti, Vulkan) |
|
| 122 |
+
|---|---|---|---|
|
| 123 |
+
| v1.4-AEC (203 K) | 1.29 ms · 12.2× | 0.83 ms · 18.6× | run on CPU¹ |
|
| 124 |
+
| v1.4-AEC 2.7K | 0.36 ms · 44× (single-threaded) | — | run on CPU¹ |
|
| 125 |
+
| v1.3 (4.8 M) | 9.73 ms · 1.58× | 3.21 ms · 4.97× | 2.57 ms · 6.07× |
|
| 126 |
+
| v1.2 (1.3 M) | 4.28 ms · 3.72× | 1.65 ms · 8.90× | 1.96 ms · 7.85× |
|
| 127 |
+
|
| 128 |
+
¹ v1.4-AEC's adaptive front-end always runs on CPU and the neural stage is too
|
| 129 |
+
small for GPU offload to pay off. Four threads is the sweet spot on Zen4 for all
|
| 130 |
+
models; the library defaults to `min(4, available CPUs)`.
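The arithmetic behind these figures can be sketched as follows: a 256-sample hop at 16 kHz lasts 16 ms, and a realtime factor can be read as that budget divided by the per-hop compute time. The helper names are hypothetical. The published RT column does not equal 16 ms / p50 exactly, so it is presumably derived from a different statistic; treat results from this sketch as approximate.

```python
import os

SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256

def hop_budget_ms(hop_samples=HOP_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of one hop in milliseconds (the realtime budget)."""
    return 1000.0 * hop_samples / sample_rate_hz

def realtime_factor(per_hop_ms, budget_ms=None):
    """Hop duration divided by per-hop compute time; > 1 beats realtime."""
    if per_hop_ms <= 0:
        raise ValueError("per_hop_ms must be positive")
    if budget_ms is None:
        budget_ms = hop_budget_ms()
    return budget_ms / per_hop_ms

def default_threads(limit=4):
    """Approximation of the documented default, min(4, available CPUs).

    os.cpu_count() reports logical CPUs on the machine, which can differ
    from what a process is allowed to use under affinity or cgroup limits.
    """
    return min(limit, os.cpu_count() or 1)
```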
|
| 131 |
+
|
| 132 |
+
### Memory (CPU)
|
| 133 |
+
|
| 134 |
+
Working set the model adds on top of the ~7 MiB binary baseline:
|
| 135 |
+
|
| 136 |
+
| Model | Post-load delta | Peak RSS |
|
| 137 |
+
|---|--:|--:|
|
| 138 |
+
| v1.3 (4.8 M) | +24.4 MiB | 34.1 MiB |
|
| 139 |
+
| v1.2 (1.3 M) | +10.0 MiB | 19.6 MiB |
|
| 140 |
+
| v1.4-AEC (203 K) | +6.7 MiB | 17.0 MiB |
|
| 141 |
+
|
| 142 |
+
## Running inference
|
| 143 |
+
|
| 144 |
+
Download a GGUF (web UI, `huggingface-cli`, or `hf_hub_download`) and run the
|
| 145 |
+
GGML CLI — same command for every model, just swap the file:
|
| 146 |
|
| 147 |
```bash
|
| 148 |
+
./localvqe localvqe-v1.3-4.8M-f32.gguf --in-wav mic.wav ref.wav --out-wav out.wav
|
| 149 |
```
|
| 150 |
|
| 151 |
+
16 kHz mono PCM for both the mic and the far-end reference. Building the engine,
|
| 152 |
+
the C API (`liblocalvqe.so`), and the OBS Studio plugin are documented in the
|
| 153 |
+
[GitHub repository](https://github.com/localai-org/LocalVQE).
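For scripted use, the steps above (download, check the inputs, run the CLI) might be wrapped like this. It is a sketch under stated assumptions: the helper names are hypothetical, the Hub repository id is left as a parameter because it is not stated here, and `huggingface_hub` is a third-party package imported only when a download is requested. The command layout mirrors the CLI invocation shown above.

```python
import wave

def check_wav_16k_mono(path):
    """Raise ValueError unless `path` is a 16 kHz mono PCM WAV file."""
    # The stdlib wave module only opens PCM WAV files, so a successful
    # open already covers the "PCM" part of the requirement.
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16_000:
            raise ValueError(f"{path}: expected 16000 Hz, got {wf.getframerate()}")
        if wf.getnchannels() != 1:
            raise ValueError(f"{path}: expected mono, got {wf.getnchannels()} channels")

def build_command(binary, model, mic_wav, ref_wav, out_wav):
    """Argument list matching the documented CLI call."""
    return [binary, model, "--in-wav", mic_wav, ref_wav, "--out-wav", out_wav]

def fetch_gguf(repo_id, filename):
    """Download one file from the Hub and return its local path.

    `repo_id` must be supplied by the caller; it is not given in this card.
    """
    from huggingface_hub import hf_hub_download  # third-party dependency
    return hf_hub_download(repo_id=repo_id, filename=filename)
```

The list returned by `build_command` can be passed to `subprocess.run(cmd, check=True)` once both WAV files pass `check_wav_16k_mono`.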
|
| 154 |
|
| 155 |
+
## PyTorch reference
|
| 156 |
|
| 157 |
+
`localvqe-v1.3-4.8M.pt` and `localvqe-v1.2-1.3M.pt` are the checkpoints used to
|
| 158 |
+
produce the GGUF exports — for verification, ablation, and research, not
|
| 159 |
+
end-user inference (use the GGML build). The model definition lives under
|
| 160 |
+
`pytorch/` in the [GitHub repo](https://github.com/localai-org/LocalVQE).
|
| 161 |
|
| 162 |
+
## Citing
|
| 163 |
|
| 164 |
+
Cite the repository via `CITATION.cff` at
|
| 165 |
+
<https://github.com/localai-org/LocalVQE> (GitHub's "Cite this repository"
|
| 166 |
+
button produces APA / BibTeX), and the upstream DeepVQE paper:
|
| 167 |
|
| 168 |
```bibtex
|
| 169 |
@inproceedings{indenbom2023deepvqe,
|
| 171 |
Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
|
| 172 |
author = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
|
| 173 |
and Chernov, Mykola and Aichner, Robert},
|
| 174 |
+
booktitle = {Interspeech}, year = {2023},
|
| 175 |
doi = {10.21437/Interspeech.2023-2176}
|
| 176 |
}
|
| 177 |
```
|
| 178 |
|
| 179 |
+
## Dataset attribution
|
| 180 |
|
| 181 |
+
Weights are trained on the
|
| 182 |
+
[ICASSP 2023 DNS Challenge](https://github.com/microsoft/DNS-Challenge)
|
| 183 |
(Microsoft, CC BY 4.0) and fine-tuned on the
|
| 184 |
+
[ICASSP 2022/2023 AEC Challenge](https://github.com/microsoft/AEC-Challenge).
|
| 185 |
|
| 186 |
+
## Safety
|
| 187 |
|
| 188 |
+
Training data was filtered by DNSMOS, which can misclassify distressed speech
|
| 189 |
+
(screaming, crying) as noise. LocalVQE may attenuate such signals and must not
|
| 190 |
+
be relied upon for emergency or safety-critical applications.
|
| 191 |
|
| 192 |
## License
|
| 193 |
|