@@ -9,46 +9,151 @@ tags:
 license: apache-2.0
 ---
-# LocalVQE — Local Voice Quality Enhancement
-Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
-dereverberation for 16 kHz speech. A from-scratch derivative of **DeepVQE**
-(Indenbom et al., Interspeech 2023 — *DeepVQE: Real Time Deep Voice Quality
-Enhancement*, [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned
-for quantization-aware local CPU/GPU inference. The DCT-II analysis/synthesis
-(replacing STFT), S4D bottleneck, GGML streaming graph, and training pipeline
-are work of this project — no paper yet.
-**Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
-Claude (Anthropic).
-Project source: <https://github.com/richiejp/LocalVQE>
-## Files
 | File | Size | Description |
 |---|---|---|
-| `localvqe-v1.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + AEC Challenge fine-tune. |
-| `localvqe-v1-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). |
-## Usage (GGML / C++ / Go)
-```bash
-# Build the ggml binary
-cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
-# Run inference on a 16 kHz WAV pair
-./build/bin/localvqe localvqe-v1-f32.gguf \
-    --in-wav mic.wav ref.wav --out-wav enhanced.wav
-```
-Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
-16 kHz / 256-sample hop).
 ## Architecture
 | Component | Value |
-|-----------|-------|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
@@ -60,16 +165,166 @@ Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
 | Kernel | (4, 4) time × freq, causal padding |
 | Parameters | ~0.9 M |
-## Upstream citation (DeepVQE)
 ```bibtex
 @inproceedings{indenbom2023deepvqe,
-  title={{DeepVQE}: Real Time Deep Voice Quality Enhancement for Joint Acoustic
-         Echo Cancellation, Noise Suppression and Dereverberation},
-  author={Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u a}t{\u a}lin and
-          Chernov, Mykola and Aichner, Robert},
-  booktitle={Interspeech},
-  year={2023},
-  doi={10.21437/Interspeech.2023-2176}
 }
 ```

 license: apache-2.0
 ---
+# LocalVQE
+**Local Voice Quality Enhancement** — a compact neural model for joint
+acoustic echo cancellation (AEC), noise suppression, and dereverberation of
+16 kHz speech, designed to run on commodity CPUs in real time.
+- ~0.9 M parameters (~3.5 MB F32)
+- ~1.66 ms per 16 ms frame on Zen4 (24 threads) — **≈9.6× realtime**
+- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
+- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
+  PyTorch reference included for verification and research
+- Quantization-friendly by design (power-of-2 channel widths, kernel area 16)
+  to support future Q4_K / Q8_0 native inference
+- Apache 2.0
+This page is the Hugging Face model card — it hosts the published weights.
+Source code, build system, tests, and training pipeline live in the GitHub
+repository: <https://github.com/LocalAI-io/LocalVQE>.
+**Authors:**
+- Richard Palethorpe ([richiejp](https://github.com/richiejp))
+- Claude (Anthropic)
+LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 —
+*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
+Cancellation, Noise Suppression and Dereverberation*,
+[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It keeps DeepVQE's
+overall topology (mic/far-end encoders, soft-delay cross attention, decoder
+with sub-pixel upsampling, complex convolving mask) but replaces the STFT
+with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal
+state-space model (S4D), and is ~9× smaller than the reference DeepVQE.
+Everything specific to LocalVQE is original to this repository — there is
+no LocalVQE paper.
+## A concrete example
+Picture a video call from a laptop. Your microphone picks up three things
+alongside your voice:
+1. The remote participant's voice, played back through your speakers and
+   caught again by your mic — this is the **echo**. Without cancellation
+   they hear themselves a fraction of a second later.
+2. Your own voice bouncing off walls, desk, and monitor before reaching
+   the mic — this is **reverberation**, the "tunnel" or "bathroom" sound
+   that makes you feel far away from the listener.
+3. A fan, keyboard clatter, a dog barking, or traffic outside — plain
+   **background noise**.
+LocalVQE removes all three in a single causal pass, frame by frame, on
+the CPU, so only your voice reaches the far end.
+## Why this, and not a classical AEC/NS stack?
+Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
+NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
+frame and remain a strong baseline when the acoustic path is benign. LocalVQE
+is interesting when you want:
+- **Robustness to non-linear echo paths** (small loudspeakers, handheld
+  devices, plastic laptop chassis) where linear AEC leaves residual echo.
+- **Non-stationary noise suppression** (babble, keyboards, fans changing
+  speed) that energy-based noise estimators struggle with.
+- **One model, many conditions** — no per-device tuning of step sizes,
+  forgetting factors, or VAD thresholds.
+- **A single deterministic causal pass** — no double-talk detector, no
+  adaptation state that can diverge.
+The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
+~1–2 ms/frame. On anything larger than a microcontroller that's still a
+small fraction of a real-time budget.
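The budget arithmetic behind these figures is easy to check. The following is a minimal sketch that uses only numbers quoted on this card (16 kHz, 256-sample hop, ~1.66 ms per frame); the helper names are illustrative and are not part of LocalVQE.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def frame_budget_ms(sample_rate_hz: int = SAMPLE_RATE_HZ, hop: int = HOP_SAMPLES) -> float:
    """Wall-clock time available per hop before the stream falls behind."""
    return 1000.0 * hop / sample_rate_hz


def realtime_factor(compute_ms: float, budget_ms: float) -> float:
    """How many times faster than real time a given per-hop cost is."""
    return budget_ms / compute_ms


budget = frame_budget_ms()            # 256 / 16000 s = 16 ms
rtf = realtime_factor(1.66, budget)   # per-frame figure quoted on this card
share = 100.0 * 1.66 / budget         # percentage of the hop budget consumed
print(f"budget={budget:.1f} ms, {rtf:.1f}x realtime, {share:.1f}% of budget")
```

With the quoted 1.66 ms figure the model uses roughly a tenth of each 16 ms hop, which is where the ≈9.6× realtime number comes from.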
+## Why this, and not DeepVQE?
+Microsoft never released DeepVQE — no weights, no reference implementation,
+no streaming runtime. We re-implemented it from the paper as a GGML graph
+at [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml) (the
+full-width ~7.5 M-parameter version) before starting LocalVQE. Comparing
+that implementation to this one:
+| | DeepVQE (our re-implementation) | LocalVQE |
+|---|---|---|
+| Parameters | ~7.5 M | ~0.9 M |
+| Weights (F32) | ~30 MB | ~3.5 MB |
+| Analysis | STFT (complex FFT) | DCT-II (real, in-graph) |
+| Bottleneck | GRU | S4D (diagonal state space) |
+| CCM arithmetic | Complex | Real-valued (GGML-friendly) |
+| Streaming inference | Yes, separate repo | Yes, in this repo |
+The smaller parameter count comes from iterative channel pruning of the
+full-width reference, not from distillation; S4D halves the bottleneck
+parameter count vs GRU at similar quality.
+## Files in this repository
 | File | Size | Description |
 |---|---|---|
+| `localvqe-v1.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
+| `localvqe-v1-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). This is what the C++ inference engine loads. |
+Only F32 GGUF is published today. A `quantize` tool is included in the C++
+build (see below) and the architecture is designed to be Q4_K / Q8_0
+friendly, but quantized weights have not yet been calibrated and released.
+## Validation Results
+Numbers below are from the best checkpoint of the AEC fine-tune
+(`localvqe-v1-f32.gguf`), evaluated on a 1 000-clip validation split mixing
+DNS5-synthesised near/far-end scenes and ICASSP AEC Challenge synthetic
+data. AECMOS scores are computed over a 100-clip sub-sample per the standard
+AEC Challenge protocol.
+| Metric | Overall | Single-talk far-end | Double-talk |
+|---|---:|---:|---:|
+| ERLE | — | **+52.2 dB** | — |
+| AECMOS echo (↑, 1–5) | 4.36 | 4.46 | 4.33 |
+| AECMOS degradation (↑, 1–5) | 4.83 | 5.00 | 4.78 |
+- **ERLE** (Echo Return Loss Enhancement) in dB — higher is better. Only
+  reported for single-talk far-end, where the mic signal is pure echo and the
+  ratio `10·log10(E[mic²] / E[enh²])` directly measures echo attenuation.
+  Overall and double-talk ERLE are omitted because near-end speech in the
+  mic and enhanced signals dominates the numerator/denominator and the
+  number stops being a clean echo-removal measurement.
+- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
+  quality predictor. "Echo" rates how well the echo was removed; "degradation"
+  rates how clean the resulting speech/residual is. Both are on a 1–5 MOS
+  scale, higher is better.
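As an illustration of the ERLE ratio above, here is a small pure-Python sketch. It is not the project's evaluation code: the function name, the `eps` guard, and the toy signals are assumptions made for the example.

```python
import math


def erle_db(mic, enhanced, eps=1e-12):
    """ERLE = 10*log10(E[mic^2] / E[enh^2]) over two equal-length sample sequences."""
    if len(mic) != len(enhanced) or not mic:
        raise ValueError("mic and enhanced must be non-empty and the same length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(y * y for y in enhanced) / len(enhanced)
    # eps keeps the ratio finite when the enhanced signal is digital silence.
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Toy single-talk far-end clip: the "echo" is a 440 Hz tone and the enhanced
# signal is the same tone with its amplitude divided by 100.
mic = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
enhanced = [x / 100.0 for x in mic]
print(f"ERLE = {erle_db(mic, enhanced):.1f} dB")  # amplitude / 100 is 40 dB
```

On the same scale, the reported +52.2 dB corresponds to an amplitude reduction of about 400×.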
+### Why DNSMOS OVRL is not reported here
+We track DNSMOS P.835 (`sig_bak_ovr.onnx`) in TensorBoard but are deliberately
+*not* publishing OVRL numbers for this model. The scores we obtain (around 2.0
+overall, 2.1 on single-talk far-end) contradict informal listening —
+single-talk far-end with 52 dB of cancellation is audibly near-silent, not a
+"2-out-of-5" output. We suspect our DNSMOS invocation (input normalisation,
+silence handling, or ONNX model variant) is miscalibrated for AEC outputs
+and in particular for near-silent clips, which are out of distribution for a
+speech-quality predictor. Until we can reconcile the numbers with a
+DeepVQE-matching protocol we consider our OVRL numbers untrustworthy and
+omit them rather than publish misleading figures.
 ## Architecture
 | Component | Value |
+|---|---|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
 | Kernel | (4, 4) time × freq, causal padding |
 | Parameters | ~0.9 M |
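To make the "Analysis basis" row concrete, the sketch below builds a DCT-II filterbank and applies it with a stride, which is what a frozen Conv1d with cosine kernels computes. It is an illustration under assumptions the table does not state: the kernel length is taken to equal the number of filters, the basis is orthonormally scaled, and no window function is applied. The function names are hypothetical.

```python
import math


def dct2_basis(n: int):
    """Orthonormal DCT-II basis: row k is the k-th cosine filter of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def analyze(signal, n_filters: int = 512, hop: int = 256):
    """Strided filterbank: one vector of n_filters real coefficients per hop."""
    basis = dct2_basis(n_filters)
    frames = []
    for start in range(0, len(signal) - n_filters + 1, hop):
        window = signal[start:start + n_filters]
        frames.append([sum(b * x for b, x in zip(row, window)) for row in basis])
    return frames


# Small sizes keep the pure-Python demo fast; the card's values are 512 / 256.
coeffs = analyze([float(i % 7) for i in range(64)], n_filters=16, hop=8)
print(len(coeffs), "frames of", len(coeffs[0]), "coefficients")
```

Because every coefficient is real, the rest of the graph can avoid complex arithmetic, which matches the "Real-valued (GGML-friendly)" entry in the comparison table.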
+## Building the C++ Inference Engine
+Source, build system, and tests live at
+<https://github.com/LocalAI-io/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
+compiler. A [Nix](https://nixos.org/) flake is provided:
+```bash
+git clone --recursive https://github.com/LocalAI-io/LocalVQE.git
+cd LocalVQE
+# With Nix:
+nix develop
+cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
+cmake --build ggml/build -j$(nproc)
+# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
+cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
+cmake --build ggml/build -j$(nproc)
+```
+Binaries land in `ggml/build/bin/`. The CPU build produces multiple
+`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
+Keep the binaries and `.so` files together.
+### Vulkan backend (embedded / integrated-GPU targets)
+Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
+CPU build — an additional `libggml-vulkan.so` is produced in
+`ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
+present, otherwise it falls back to the CPU variants.
+```bash
+cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
+cmake --build ggml/build -j$(nproc)
+```
+The Nix flake's dev shell already includes `vulkan-loader`,
+`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
+from your distro (Debian: `libvulkan-dev`, `vulkan-headers`, and `glslc`
+or `shaderc`).
+### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
+Measured with `bench` on Zen4 desktop (Ryzen 9 7900), 30 iters × 187 hops
+= 5 610 streaming hops per backend. Each hop is a full
+`ggml_backend_graph_compute`.
+| Backend                     | p50     | p99     | max (quiet) | max (with load) |
+|-----------------------------|--------:|--------:|------------:|----------------:|
+| CPU — 1 thread              | 3.46 ms | 3.59 ms |     4.93 ms |             —   |
+| CPU — 2 threads             | 2.05 ms | 2.17 ms |     3.34 ms |             —   |
+| CPU — 4 threads             | 1.26 ms | 1.48 ms |     3.07 ms |             —   |
+| Vulkan — AMD iGPU (RADV)    | 1.68 ms | 1.77 ms |     3.40 ms |       37.50 ms  |
+| Vulkan — NVIDIA RTX 5070 Ti | 1.68 ms | 1.79 ms |     3.40 ms |       31.72 ms  |
+Vulkan p50 and p99 are tight, but worst-case single-hop latency on a
+shared desktop is sensitive to external GPU clients (display compositor,
+browser): the loaded maxima of 31.72 ms and 37.50 ms both exceed the
+16 ms hop budget. On a dedicated embedded device with no compositor
+contending for the queue, the "quiet" column is the figure to expect.
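The gap between percentiles and the worst case is easy to reproduce on any list of per-hop timings. The sketch below is not the `bench` tool's code; it assumes nearest-rank percentiles and uses made-up timings chosen to resemble the Vulkan rows.

```python
import math


def percentile(sorted_values, p: float) -> float:
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100.0))
    return sorted_values[rank - 1]


def summarize(hop_ms, budget_ms: float = 16.0):
    """p50 / p99 / max of per-hop timings plus the count of budget overruns."""
    ordered = sorted(hop_ms)
    return {
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        "overruns": sum(1 for t in ordered if t > budget_ms),
    }


# Made-up timings: 99 quiet hops and one stall caused by GPU contention.
timings = [1.68] * 99 + [37.5]
print(summarize(timings))
```

A single stalled hop leaves p50 and p99 untouched yet still misses the 16 ms deadline, which is why the table reports the maximum separately.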
+## Running Inference
+Download `localvqe-v1-f32.gguf` from this repository (see the file list
+above) via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
+`huggingface_hub`. Then:
+### CLI
+```bash
+./ggml/build/bin/localvqe localvqe-v1-f32.gguf \
+    --in-wav mic.wav ref.wav \
+    --out-wav enhanced.wav
+```
+Expects 16 kHz mono PCM for both mic and far-end reference.
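Input files can be checked against that requirement before running the CLI. This is a minimal sketch using Python's standard `wave` module; the helper names are hypothetical and nothing here ships with LocalVQE. Note that `wave` only reads integer PCM and raises `wave.Error` for other encodings.

```python
import wave


def check_wav(path: str, expected_rate: int = 16_000):
    """Return (ok, message) for a PCM WAV file readable by the stdlib wave module."""
    with wave.open(path, "rb") as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
        seconds = wav.getnframes() / rate
    if rate != expected_rate:
        return False, f"{path}: sample rate is {rate} Hz, expected {expected_rate} Hz"
    if channels != 1:
        return False, f"{path}: {channels} channels, expected mono"
    return True, f"{path}: OK ({seconds:.2f} s)"


def check_pair(mic_path: str, ref_path: str):
    """Check the microphone and far-end reference files, in that order."""
    return [check_wav(mic_path), check_wav(ref_path)]
```

For example, `check_pair("mic.wav", "ref.wav")` returns one `(ok, message)` tuple per file, so a wrapper script can refuse to run inference when either check fails.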
+### Benchmark
+```bash
+./ggml/build/bin/bench localvqe-v1-f32.gguf \
+    --in-wav mic.wav ref.wav --iters 10 --profile
+```
+### Shared Library (C API)
+```bash
+cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
+cmake --build ggml/build -j$(nproc)
+```
+Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
+`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
+integration.
+### Quantizing (experimental)
+The model was designed with quantization in mind — power-of-two channel
+widths, kernel area 16, GGML-friendly real-valued arithmetic — but
+calibrated Q4_K / Q8_0 weights are not yet published. The `quantize` tool
+in the C++ build can produce GGUF variants from the F32 reference for
+experimentation:
+```bash
+./ggml/build/bin/quantize localvqe-v1-f32.gguf localvqe-v1-q8.gguf Q8_0
+```
+Expect end-to-end quality loss until proper per-tensor selection and
+calibration have been worked through.
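To see where that loss comes from, the sketch below mimics the arithmetic of a GGML Q8_0 block: 32 values share one scale and each value is stored as a signed 8-bit integer. It is a simplified illustration, not the `quantize` tool's code: the scale is kept in full precision here, whereas GGML stores it as a 16-bit float, and the function names are hypothetical.

```python
QK8_0 = 32  # values per Q8_0 block in GGML


def quantize_q8_0_block(values):
    """Map 32 floats to (scale, 32 signed 8-bit integers)."""
    if len(values) != QK8_0:
        raise ValueError(f"expected {QK8_0} values per block")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * QK8_0
    return scale, [round(v / scale) for v in values]


def dequantize_q8_0_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


# One large weight inflates the shared scale, so the small ones lose precision.
block = [0.001 * i for i in range(31)] + [1.0]
scale, quants = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, quants)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}, worst absolute error={worst:.5f}")
```

The worst-case error per value is half the block scale, so tensors with a few outlier weights are the ones most likely to need a higher-precision format or calibration.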
+## PyTorch Reference
+`localvqe-v1.pt` is the PyTorch checkpoint used to produce the GGUF export.
+It is provided for verification, ablation, and downstream research — not
+for end-user inference, which should go through the GGML build above. The
+model definition lives under `pytorch/` in the
+[GitHub repo](https://github.com/LocalAI-io/LocalVQE):
+```bash
+git clone https://github.com/LocalAI-io/LocalVQE.git
+cd LocalVQE/pytorch
+pip install -r requirements.txt
+```
+## Citing LocalVQE
+If you use LocalVQE in academic work, please cite the repository via the
+`CITATION.cff` at <https://github.com/LocalAI-io/LocalVQE> — GitHub renders
+a "Cite this repository" button that produces APA and BibTeX entries
+automatically.
+For a DOI, we recommend citing a specific release via
+[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
+also cite the upstream DeepVQE paper:
 ```bibtex
 @inproceedings{indenbom2023deepvqe,
+  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
+               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
+  author    = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
+               and Chernov, Mykola and Aichner, Robert},
+  booktitle = {Interspeech},
+  year      = {2023},
+  doi       = {10.21437/Interspeech.2023-2176}
 }
 ```
+## Dataset Attribution
+Published weights are trained on data from the
+[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
+(Microsoft, CC BY 4.0) and fine-tuned on the
+[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
+## Safety Note
+Training data was filtered by DNSMOS perceived-quality scores, which can
+misclassify distressed speech (screaming, crying) as noise. LocalVQE may
+attenuate or distort such signals and must not be relied upon for emergency
+call or safety-critical applications.
+## License
+Apache License 2.0.