| ---
| library_name: pytorch
| tags:
| - audio-to-audio
| - speech-enhancement
| - acoustic-echo-cancellation
| - noise-suppression
| - ggml
| license: apache-2.0
| ---
|
| # LocalVQE |
|
|
| **Local Voice Quality Enhancement**: a compact neural model for joint
| acoustic echo cancellation (AEC), noise suppression, and dereverberation of |
| 16 kHz speech, designed to run on commodity CPUs in real time. |
|
|
| - 1.3 M parameters (~5 MB F32) |
| - ~1.66 ms per 16 ms frame on Zen4 (24 threads), **≈9.6× realtime**
| - Causal, streaming: 256-sample hop, 16 ms algorithmic latency |
| - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml); |
| PyTorch reference included for verification and research |
| - Quantization-friendly by design (power-of-2 channel widths, kernel area 16) |
| to support future Q4_K / Q8_0 native inference |
| - Apache 2.0 |
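
The size and speed figures in the list above follow from simple arithmetic. A minimal sketch using only the numbers quoted there (the variable names are illustrative, not taken from the codebase):

```python
# Back-of-envelope check of the headline numbers quoted above.
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256
PARAMS = 1.3e6              # parameter count from the list above
BYTES_PER_F32 = 4
COMPUTE_MS_PER_HOP = 1.66   # measured figure quoted above

hop_ms = HOP_SAMPLES * 1000.0 / SAMPLE_RATE_HZ      # audio covered by one hop
weights_mb = PARAMS * BYTES_PER_F32 / 1e6           # F32 storage for the weights
realtime_factor = hop_ms / COMPUTE_MS_PER_HOP       # >1 means faster than realtime

print(f"hop duration:    {hop_ms:.1f} ms")
print(f"F32 weights:     {weights_mb:.1f} MB")
print(f"realtime factor: {realtime_factor:.1f}x")
```

A realtime factor above 1 means each hop is processed faster than it is captured; the remaining headroom is what absorbs scheduling jitter.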
|
|
| This page is the Hugging Face model card; it hosts the published weights.
| Source code, build system, tests, and training pipeline live in the GitHub |
| repository: <https://github.com/LocalAI-io/LocalVQE>. |
|
|
| The technical report describing the architecture, streaming-state contract, |
| and BatchNorm folding rules used for deployment is included in this repo as |
| [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would |
| like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement |
| from an existing author in those categories. If you can endorse, please
| reach out via the GitHub repo. |
|
|
| **Authors:** |
| - Richard Palethorpe ([richiejp](https://github.com/richiejp)) |
| - Claude (Anthropic) |
|
|
| LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023,
| *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo |
| Cancellation, Noise Suppression and Dereverberation*, |
| [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It keeps DeepVQE's |
| overall topology (mic/far-end encoders, soft-delay cross attention, decoder |
| with sub-pixel upsampling, complex convolving mask) but replaces the STFT |
| with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal |
| state-space model (S4D), and is ~6× smaller than the reference DeepVQE (1.3 M vs ~7.5 M parameters).
| Everything specific to LocalVQE is original to this repository; there is
| no LocalVQE paper. |
|
|
| ## A concrete example |
|
|
| Picture a video call from a laptop. Your microphone picks up three things |
| alongside your voice: |
|
|
| 1. The remote participant's voice, played back through your speakers and
| caught again by your mic: this is the **echo**. Without cancellation
| they hear themselves a fraction of a second later.
| 2. Your own voice bouncing off walls, desk, and monitor before reaching
| the mic: this is **reverberation**, the "tunnel" or "bathroom" sound
| that makes you feel far away from the listener.
| 3. A fan, keyboard clatter, a dog barking, or traffic outside: plain
| **background noise**.
|
|
| LocalVQE removes all three in a single causal pass, frame by frame, on |
| the CPU, so only your voice reaches the far end. |
|
|
| ## Why this, and not a classical AEC/NS stack? |
|
|
| Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction |
| NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per |
| frame and remain a strong baseline when the acoustic path is benign. LocalVQE |
| is interesting when you want: |
|
|
| - **Robustness to non-linear echo paths** (small loudspeakers, handheld |
| devices, plastic laptop chassis) where linear AEC leaves residual echo. |
| - **Non-stationary noise suppression** (babble, keyboards, fans changing |
| speed) that energy-based noise estimators struggle with. |
| - **One model, many conditions**: no per-device tuning of step sizes,
| forgetting factors, or VAD thresholds.
| - **A single deterministic causal pass**: no double-talk detector, no
| adaptation state that can diverge.
|
|
| The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE |
| ~1–2 ms/frame. On anything larger than a microcontroller that's still a
| small fraction of a real-time budget. |
|
|
| ## Why this, and not DeepVQE? |
|
|
| Microsoft never released DeepVQE: no weights, no reference implementation,
| no streaming runtime. We re-implemented it from the paper as a GGML graph |
| at [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml) (the |
| full-width ~7.5 M-parameter version) before starting LocalVQE. Comparing |
| that implementation to this one: |
|
|
| | | DeepVQE (our re-implementation) | LocalVQE | |
| |---|---|---| |
| | Parameters | ~7.5 M | 1.3 M | |
| | Weights (F32) | ~30 MB | ~5 MB | |
| | Analysis | STFT (complex FFT) | DCT-II (real, in-graph) | |
| | Bottleneck | GRU | S4D (diagonal state space) | |
| | CCM arithmetic | Complex | Real-valued (GGML-friendly) | |
| | Streaming inference | Yes, separate repo | Yes, in this repo | |
|
|
| The smaller parameter count comes from iterative channel pruning of the |
| full-width reference, not from distillation; S4D halves the bottleneck |
| parameter count vs GRU at similar quality. |
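
For readers unfamiliar with diagonal state-space layers, the sketch below shows why they suit streaming: each channel carries a short vector of complex states, updated once per frame with elementwise arithmetic only. This is a generic single-channel illustration, not LocalVQE's actual layer. The function name `s4d_step`, the choice to give the poles directly in discrete time, and the toy parameter values are all assumptions.

```python
def s4d_step(state, u, poles, b, c):
    """One streaming step of a diagonal state-space model for one channel.

    state, poles, b, c are equal-length lists of complex numbers;
    u is the real-valued input for this frame.
    Returns (new_state, real_output).
    """
    # Diagonal recurrence: every state evolves independently of the others.
    new_state = [a * h + bi * u for a, h, bi in zip(poles, state, b)]
    # Read-out: project the states back to one real value.
    y = sum(ci * h for ci, h in zip(c, new_state)).real
    return new_state, y

# Toy parameters (hypothetical): two stable poles with |a| < 1.
poles = [0.9 + 0.1j, 0.5 - 0.2j]
b = [1.0 + 0.0j, 1.0 + 0.0j]
c = [0.5 + 0.0j, 0.5 + 0.0j]

state = [0j, 0j]
outputs = []
for u in [1.0, 0.0, 0.0, 0.0]:   # unit impulse, fed one frame at a time
    state, y = s4d_step(state, u, poles, b, c)
    outputs.append(y)
```

The only thing carried between frames is `state`, so the streaming state of such a layer is one complex number per pole per channel.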
|
|
| ## Files in this repository |
|
|
| | File | Size | Description | |
| |---|---|---| |
| | `localvqe-v1-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| | `localvqe-v1-1.3M-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). This is what the C++ inference engine loads. | |
|
|
| Only F32 GGUF is published today. A `quantize` tool is included in the C++ |
| build (see below) and the architecture is designed to be Q4_K / Q8_0 |
| friendly, but quantized weights have not yet been calibrated and released. |
|
|
| ## Validation Results |
|
|
| Stratified 150-sample eval (30 per scenario) on the |
| [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge) |
| (real recordings, not synthetic mixes).
|
|
| | Scenario | AECMOS echo | AECMOS deg | blind ERLE | |
| |---|---:|---:|---:| |
| | doubletalk | 4.71 | 2.35 | 8.5 dB | |
| | doubletalk-with-movement | 4.67 | 2.33 | 8.1 dB | |
| | farend-singletalk | 4.12 | 4.94 | 40.6 dB | |
| | farend-singletalk-with-movement | 4.31 | 4.98 | 39.0 dB | |
| | nearend-singletalk | 5.00 | 4.15 | 1.9 dB | |
|
|
| - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC |
| quality predictor. "Echo" rates how well echo was removed; "degradation" |
| rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
| - **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
| far-end single-talk where the input is echo-only; on scenes with active |
| near-end speech it understates echo removal because both numerator and |
| denominator are dominated by speech. |
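
The blind ERLE formula above can be computed directly from the microphone and enhanced signals. A minimal sketch, assuming both signals are equal-length lists of float samples; the function name and the small `eps` guard against division by zero are additions for illustration and are not part of the definition above:

```python
import math

def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10*log10(E[mic^2] / E[enh^2])."""
    if len(mic) != len(enh) or not mic:
        raise ValueError("mic and enh must be non-empty and of equal length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))

# Toy far-end single-talk case: the output is the input attenuated 100x
# in amplitude, i.e. 10^4 in power, which is 40 dB.
mic = [0.5, -0.5, 0.25, -0.25]
enh = [x / 100.0 for x in mic]
erle = blind_erle_db(mic, enh)
print(round(erle, 1))
```

If `enh` still contains near-end speech at roughly its original level, the ratio stays close to 1 (near 0 dB) no matter how much echo was removed, which is why the table reports low values for the double-talk and near-end scenarios.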
|
|
| ## Architecture |
|
|
| | Component | Value | |
| |---|---| |
| | Sample rate | 16 kHz | |
| | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) | |
| | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
| | Far-end encoder | 2 blocks: 2 → 32 → 40 |
| | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
| | Bottleneck | S4D diagonal state-space, hidden 162 |
| | Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
| | CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
| | Kernel | (4, 4) time × freq, causal padding |
| | Parameters | 1.3 M | |
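
To make the "Analysis basis" row concrete, the sketch below builds a DCT-II basis and applies it as a strided filterbank, which is what a frozen Conv1d with DCT weights computes. It is an illustration under stated assumptions: kernel length equal to the number of filters, orthonormal scaling, no analysis window, and no causal padding. The actual filterbank in the repository may differ on any of these. The demo uses small sizes to stay fast; the table's values would be `N = 512`, `HOP = 256`.

```python
import math

def dct2_basis(n):
    """Orthonormal DCT-II basis: n filters, each of length n."""
    basis = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis.append([scale * math.cos(math.pi * (t + 0.5) * k / n)
                      for t in range(n)])
    return basis

def analyze(signal, basis, hop):
    """Correlate the signal with every filter at a fixed stride (Conv1d-style)."""
    n = len(basis[0])
    frames = []
    for start in range(0, len(signal) - n + 1, hop):
        window = signal[start:start + n]
        frames.append([sum(w * x for w, x in zip(filt, window))
                       for filt in basis])
    return frames

# Small demo sizes (assumed for speed only).
N, HOP = 8, 4
basis = dct2_basis(N)
signal = [math.sin(0.3 * t) for t in range(20)]
frames = analyze(signal, basis, HOP)   # one list of N real coefficients per hop
```

Because the basis is real, every coefficient is a plain float, so no complex arithmetic is needed downstream.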
|
| ## Building the C++ Inference Engine
|
| Source, build system, and tests live at
| <https://github.com/LocalAI-io/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
| compiler. A [Nix](https://nixos.org/) flake is provided:
|
| ```bash
| git clone --recursive https://github.com/LocalAI-io/LocalVQE.git
| cd LocalVQE
|
| # With Nix:
| nix develop
| cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
| cmake --build ggml/build -j$(nproc)
|
| # Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
| cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
| cmake --build ggml/build -j$(nproc)
| ```
|
| Binaries land in `ggml/build/bin/`. The CPU build produces multiple
| `libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
| Keep the binaries and `.so` files together.
|
| ### Vulkan backend (embedded / integrated-GPU targets)
|
| Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
| CPU build: an additional `libggml-vulkan.so` is produced in
| `ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
| present, otherwise it falls back to the CPU variants.
|
|
| ```bash |
| cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON |
| cmake --build ggml/build -j$(nproc) |
| ``` |
|
|
| The Nix flake's dev shell already includes `vulkan-loader`, |
| `vulkan-headers`, and `shaderc`. Without Nix, install the equivalents |
| from your distro (Debian: `libvulkan-dev vulkan-headers |
| glslc`/`shaderc`). |
|
|
| ### Streaming latency (per-hop, 16 kHz / 256-sample hop = 16 ms budget)
|
|
| Measured with `bench` on Zen4 desktop (Ryzen 9 7900), 30 iters × 187 hops
| = 5 610 streaming hops per backend. Each hop is a full |
| `ggml_backend_graph_compute`. |
|
|
| | Backend | p50 | p99 | max (quiet) | max (with load) | |
| |-----------------------------|--------:|--------:|------------:|----------------:| |
| | CPU, 1 thread | 3.46 ms | 3.59 ms | 4.93 ms | n/a |
| | CPU, 2 threads | 2.05 ms | 2.17 ms | 3.34 ms | n/a |
| | CPU, 4 threads | 1.26 ms | 1.48 ms | 3.07 ms | n/a |
| | Vulkan, AMD iGPU (RADV) | 1.68 ms | 1.77 ms | 3.40 ms | 37.50 ms |
| | Vulkan, NVIDIA RTX 5070 Ti | 1.68 ms | 1.79 ms | 3.40 ms | 31.72 ms |
|
|
| Vulkan p50/p99 are tight, but worst-case single-hop latency on a
| shared desktop is sensitive to external GPU clients (display compositor, |
| browser). On a dedicated embedded device with no compositor contending |
| for the queue, the "quiet" column is what you'll see. |
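
The gap between a tight p99 and a large maximum is easy to reproduce from raw per-hop timings. The sketch below summarizes a list of latencies with nearest-rank percentiles and counts hops that exceed the 16 ms budget. It is not the `bench` tool's implementation: the function names are hypothetical and the sample data is synthetic.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # integer ceil(pct*n/100)
    return ordered[rank - 1]

def summarize(latencies_ms, budget_ms=16.0):
    """Collect the statistics reported in the table above, plus budget misses."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
        "over_budget": sum(1 for x in latencies_ms if x > budget_ms),
    }

# Synthetic data: 99 quiet hops plus a single stall (values are made up).
latencies = [1.7] * 99 + [37.5]
stats = summarize(latencies)
print(stats)
```

A single stalled hop leaves p50 and p99 untouched but still causes an audible dropout, so for real-time audio the maximum and the over-budget count matter as much as the percentiles.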
|
|
| ## Running Inference |
|
|
| Download `localvqe-v1-1.3M-f32.gguf` from this repository (the file list above)
| via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
| `huggingface_hub`. Then:
|
|
| ### CLI |
|
|
| ```bash |
| ./ggml/build/bin/localvqe localvqe-v1-1.3M-f32.gguf \ |
| --in-wav mic.wav ref.wav \ |
| --out-wav enhanced.wav |
| ``` |
|
|
| Expects 16 kHz mono PCM for both mic and far-end reference. |
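
Before running the CLI it can help to confirm that both inputs match this format. A minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names are hypothetical and the demo files are generated on the fly:

```python
import os
import tempfile
import wave

def check_wav(path, expected_rate=16_000):
    """Return a list of format problems; an empty list means 16 kHz mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems

def write_silence(path, rate, channels, seconds=0.1):
    """Write a short 16-bit silent WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "mic.wav")
    bad = os.path.join(tmp, "stereo.wav")
    write_silence(good, 16_000, 1)
    write_silence(bad, 44_100, 2)
    good_problems = check_wav(good)   # empty list
    bad_problems = check_wav(bad)     # one entry per mismatch
```

Files that fail the check need to be resampled or downmixed with an audio tool of your choice before being passed to `localvqe`.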
|
|
| ### Benchmark |
|
|
| ```bash |
| ./ggml/build/bin/bench localvqe-v1-1.3M-f32.gguf \ |
| --in-wav mic.wav ref.wav --iters 10 --profile |
| ``` |
|
|
| ### Shared Library (C API) |
|
|
| ```bash |
| cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON |
| cmake --build ggml/build -j$(nproc) |
| ``` |
|
|
| Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See |
| `ggml/example_purego_test.go` in the GitHub repo for a Go / `purego` |
| integration. |
|
|
| ### Quantizing (experimental) |
|
|
| The model was designed with quantization in mind (power-of-two channel
| widths, kernel area 16, GGML-friendly real-valued arithmetic), but
| calibrated Q4_K / Q8_0 weights are not yet published. The `quantize` tool
| in the C++ build can produce GGUF variants from the F32 reference for |
| experimentation: |
|
|
| ```bash |
| ./ggml/build/bin/quantize localvqe-v1-1.3M-f32.gguf localvqe-v1-1.3M-q8.gguf Q8_0 |
| ``` |
|
|
| Expect end-to-end quality loss until proper per-tensor selection and |
| calibration have been worked through. |
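
To give a feel for where that quality loss comes from, the sketch below mimics the arithmetic of GGML's Q8_0 format: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. It is an illustration of the format, not the `quantize` tool's code. It keeps the scale in full precision, whereas GGML stores it as a 16-bit float, and Python's `round()` resolves exact ties differently from C's `roundf`. The function names and test weights are made up.

```python
QK8_0 = 32  # values per Q8_0 block in GGML

def quantize_q8_0(values):
    """Return a list of (scale, int8_values) blocks."""
    if len(values) % QK8_0 != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for i in range(0, len(values), QK8_0):
        chunk = values[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0                      # maps the largest value to +/-127
        inv = 1.0 / scale if scale else 0.0       # all-zero block stays zero
        blocks.append((scale, [round(v * inv) for v in chunk]))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, int8_values) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]

# Deterministic pseudo-weights in [-1, 1], two blocks' worth.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs round-trip error: {max_err:.5f}")
```

The per-value error is bounded by half a quantization step, i.e. the block's largest magnitude divided by 254, so a single outlier weight coarsens the resolution of the other 31 values in its block.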
|
|
| ## PyTorch Reference |
|
|
| `localvqe-v1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export. |
| It is provided for verification, ablation, and downstream research, not
| for end-user inference, which should go through the GGML build above. The |
| model definition lives under `pytorch/` in the |
| [GitHub repo](https://github.com/LocalAI-io/LocalVQE): |
|
|
| ```bash |
| git clone https://github.com/LocalAI-io/LocalVQE.git |
| cd LocalVQE/pytorch |
| pip install -r requirements.txt |
| ``` |
|
|
| ## Citing LocalVQE |
|
|
| If you use LocalVQE in academic work, please cite the repository via the |
| `CITATION.cff` at <https://github.com/LocalAI-io/LocalVQE>. GitHub renders
| a "Cite this repository" button that produces APA and BibTeX entries |
| automatically. |
|
|
| For a DOI, we recommend citing a specific release via |
| [Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please |
| also cite the upstream DeepVQE paper: |
|
|
| ```bibtex |
| @inproceedings{indenbom2023deepvqe, |
| title = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint |
| Acoustic Echo Cancellation, Noise Suppression and Dereverberation}, |
| author = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
| and Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v{z}}vin, Jegor
| and Cutler, Ross},
| booktitle = {Interspeech}, |
| year = {2023}, |
| doi = {10.21437/Interspeech.2023-2176} |
| } |
| ``` |
|
|
| ## Dataset Attribution |
|
|
| Published weights are trained on data from the |
| [ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge) |
| (Microsoft, CC BY 4.0) and fine-tuned on the |
| [ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge). |
|
|
| ## Safety Note |
|
|
| Training data was filtered by DNSMOS perceived-quality scores, which can |
| misclassify distressed speech (screaming, crying) as noise. LocalVQE may |
| attenuate or distort such signals and must not be relied upon for emergency |
| call or safety-critical applications. |
|
|
| ## License |
|
|
| Apache License 2.0. |
|
|