---
library_name: pytorch
tags:
- audio-to-audio
- speech-enhancement
- acoustic-echo-cancellation
- noise-suppression
- ggml
license: apache-2.0
---
# LocalVQE
**Local Voice Quality Enhancement**: a compact neural model for joint
acoustic echo cancellation (AEC), noise suppression, and dereverberation of
16 kHz speech, designed to run on commodity CPUs in real time.
- 1.3 M parameters (~5 MB F32)
- ~1.66 ms per 16 ms frame on Zen4 (24 threads): **≈9.6× realtime**
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
PyTorch reference included for verification and research
- Quantization-friendly by design (power-of-2 channel widths, kernel area 16)
to support future Q4_K / Q8_0 native inference
- Apache 2.0
This page is the Hugging Face model card; it hosts the published weights.
Source code, build system, tests, and training pipeline live in the GitHub
repository: <https://github.com/LocalAI-io/LocalVQE>.
The technical report describing the architecture, streaming-state contract,
and BatchNorm folding rules used for deployment is included in this repo as
[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
from an existing author in those categories; if you can endorse, please
reach out via the GitHub repo.
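The BatchNorm folding mentioned above is the standard conv + BN fusion rule. As a minimal numpy sketch of that rule (an illustration, not the repo's actual exporter):

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer that follows a conv into the conv itself.

    W: conv weights, shape (out_ch, ...); b: conv bias, shape (out_ch,).
    gamma, beta, mean, var: per-channel BN parameters, shape (out_ch,).
    Returns (W', b') such that BN(conv(x, W, b)) == conv(x, W', b').
    """
    scale = gamma / np.sqrt(var + eps)                        # per-channel scale
    W_folded = W * scale.reshape(-1, *([1] * (W.ndim - 1)))   # scale each filter
    b_folded = (b - mean) * scale + beta                      # shift the bias
    return W_folded, b_folded
```

Folding removes the BN multiply/add from the deployed graph, which is one reason the GGUF export runs a slightly different (but numerically equivalent) graph than the training checkpoint.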
**Authors:**
- Richard Palethorpe ([richiejp](https://github.com/richiejp))
- Claude (Anthropic)
LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023,
*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
Cancellation, Noise Suppression and Dereverberation*,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It keeps DeepVQE's
overall topology (mic/far-end encoders, soft-delay cross attention, decoder
with sub-pixel upsampling, complex convolving mask) but replaces the STFT
with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal
state-space model (S4D), and is ~9× smaller than the reference DeepVQE.
Everything specific to LocalVQE is original to this repository; there is
no LocalVQE paper.
## A concrete example
Picture a video call from a laptop. Your microphone picks up three things
alongside your voice:
1. The remote participant's voice, played back through your speakers and
caught again by your mic: this is the **echo**. Without cancellation
they hear themselves a fraction of a second later.
2. Your own voice bouncing off walls, desk, and monitor before reaching
the mic: this is **reverberation**, the "tunnel" or "bathroom" sound
that makes you feel far away from the listener.
3. A fan, keyboard clatter, a dog barking, or traffic outside: plain
**background noise**.
LocalVQE removes all three in a single causal pass, frame by frame, on
the CPU, so only your voice reaches the far end.
## Why this, and not a classical AEC/NS stack?
Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
frame and remain a strong baseline when the acoustic path is benign. LocalVQE
is interesting when you want:
- **Robustness to non-linear echo paths** (small loudspeakers, handheld
devices, plastic laptop chassis) where linear AEC leaves residual echo.
- **Non-stationary noise suppression** (babble, keyboards, fans changing
speed) that energy-based noise estimators struggle with.
- **One model, many conditions**: no per-device tuning of step sizes,
forgetting factors, or VAD thresholds.
- **A single deterministic causal pass**: no double-talk detector, no
adaptation state that can diverge.
The trade-off is CPU cost: a classical stack might run in ~0.1 ms/frame,
LocalVQE in ~1–2 ms/frame. On anything larger than a microcontroller that
is still a small fraction of a real-time budget.
## Why this, and not DeepVQE?
Microsoft never released DeepVQE: no weights, no reference implementation,
no streaming runtime. We re-implemented it from the paper as a GGML graph
at [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml) (the
full-width ~7.5 M-parameter version) before starting LocalVQE. Comparing
that implementation to this one:
| | DeepVQE (our re-implementation) | LocalVQE |
|---|---|---|
| Parameters | ~7.5 M | 1.3 M |
| Weights (F32) | ~30 MB | ~5 MB |
| Analysis | STFT (complex FFT) | DCT-II (real, in-graph) |
| Bottleneck | GRU | S4D (diagonal state space) |
| CCM arithmetic | Complex | Real-valued (GGML-friendly) |
| Streaming inference | Yes, separate repo | Yes, in this repo |
The smaller parameter count comes from iterative channel pruning of the
full-width reference, not from distillation; S4D halves the bottleneck
parameter count vs GRU at similar quality.
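The parameter saving from S4D is easy to see from the recurrence: each hidden dimension updates independently, so the state transition is a diagonal (element-wise) multiply rather than a dense GRU matrix. A toy per-frame sketch with made-up shapes (the trained model's discretization and projections live under `pytorch/` in the GitHub repo):

```python
import numpy as np

def s4d_step(state, u, A_diag, B, C):
    """One frame of a diagonal state-space (S4D-style) recurrence.

    state:  complex hidden state, shape (N,)
    u:      scalar input for this frame
    A_diag: diagonal of the discretized state matrix, complex (N,)
    B, C:   input/output projections, complex (N,)
    """
    state = A_diag * state + B * u   # N independent scalar updates, no NxN matmul
    y = 2.0 * np.real(C @ state)     # real output from the conjugate-pair halves
    return state, y
```

Because the transition is diagonal, the recurrence costs O(N) per frame instead of the O(N²) of a dense recurrent cell, and the streaming state is just the N complex values.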
## Files in this repository
| File | Size | Description |
|---|---|---|
| `localvqe-v1-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1-1.3M-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). This is what the C++ inference engine loads. |
Only F32 GGUF is published today. A `quantize` tool is included in the C++
build (see below) and the architecture is designed to be Q4_K / Q8_0
friendly, but quantized weights have not yet been calibrated and released.
## Validation Results
Stratified 150-sample eval (30 per scenario) on the
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
(real recordings, not synthetic mixes).
| Scenario | AECMOS echo | AECMOS deg | blind ERLE |
|---|---:|---:|---:|
| doubletalk | 4.71 | 2.35 | 8.5 dB |
| doubletalk-with-movement | 4.67 | 2.33 | 8.1 dB |
| farend-singletalk | 4.12 | 4.94 | 40.6 dB |
| farend-singletalk-with-movement | 4.31 | 4.98 | 39.0 dB |
| nearend-singletalk | 5.00 | 4.15 | 1.9 dB |
- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
quality predictor. "Echo" rates how well echo was removed; "degradation"
rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
far-end single-talk, where the input is echo-only; on scenes with active
near-end speech it understates echo removal because both numerator and
denominator are dominated by speech.
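The blind-ERLE definition above translates directly into code; a minimal numpy version over two time-aligned waveforms:

```python
import numpy as np

def blind_erle_db(mic, enhanced):
    """Blind ERLE in dB: how much input energy the model removed.

    Only a faithful echo-removal measure on far-end single-talk,
    where the mic signal is (ideally) pure echo.
    """
    return 10.0 * np.log10(np.mean(np.square(mic)) /
                           np.mean(np.square(enhanced)))
```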
## Architecture
| Component | Value |
|---|---|
| Sample rate | 16 kHz |
| Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
| Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
| Far-end encoder | 2 blocks: 2 → 32 → 40 |
| AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
| Bottleneck | S4D diagonal state-space, hidden 162 |
| Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
| CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
| Kernel | (4, 4) time × freq, causal padding |
| Parameters | 1.3 M |
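To make the analysis row concrete: a DCT-II filterbank with 512 filters and stride 256 amounts to projecting each frame onto fixed cosine rows. A sketch assuming an unnormalized DCT-II and a 512-sample window (the repo's frozen Conv1d may apply a different scaling or window):

```python
import numpy as np

def dct2_basis(n_filters=512, frame_len=512):
    """Unnormalized DCT-II basis: row k is cos(pi/L * (t + 0.5) * k)."""
    t = np.arange(frame_len)
    k = np.arange(n_filters)[:, None]
    return np.cos(np.pi / frame_len * (t + 0.5) * k)

def analyze(x, basis, hop=256):
    """Strided frame analysis, equivalent to a frozen Conv1d with stride=hop."""
    frame_len = basis.shape[1]
    starts = range(0, len(x) - frame_len + 1, hop)
    frames = np.stack([x[s:s + frame_len] for s in starts])
    return frames @ basis.T    # (n_frames, n_filters)
```

Keeping the basis real-valued (unlike a complex STFT) means analysis, model, and synthesis all stay in one GGML-friendly real graph.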
## Building the C++ Inference Engine
Source, build system, and tests live at
<https://github.com/LocalAI-io/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
compiler. A [Nix](https://nixos.org/) flake is provided:
```bash
git clone --recursive https://github.com/LocalAI-io/LocalVQE.git
cd LocalVQE
# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
# Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
```
Binaries land in `ggml/build/bin/`. The CPU build produces multiple
`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
Keep the binaries and `.so` files together.
### Vulkan backend (embedded / integrated-GPU targets)
Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
CPU build: an additional `libggml-vulkan.so` is produced in
`ggml/build/bin/`, and the runtime loader picks it up when a Vulkan ICD is
present, otherwise falling back to the CPU variants.
```bash
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)
```
The Nix flake's dev shell already includes `vulkan-loader`,
`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
from your distro (Debian: `libvulkan-dev vulkan-headers
glslc`/`shaderc`).
### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
Measured with `bench` on a Zen4 desktop (Ryzen 9 7900), 30 iters × 187 hops
= 5,610 streaming hops per backend. Each hop is a full
`ggml_backend_graph_compute`.
| Backend | p50 | p99 | max (quiet) | max (with load) |
|-----------------------------|--------:|--------:|------------:|----------------:|
| CPU (1 thread)              | 3.46 ms | 3.59 ms | 4.93 ms |         – |
| CPU (2 threads)             | 2.05 ms | 2.17 ms | 3.34 ms |         – |
| CPU (4 threads)             | 1.26 ms | 1.48 ms | 3.07 ms |         – |
| Vulkan (AMD iGPU, RADV)     | 1.68 ms | 1.77 ms | 3.40 ms |  37.50 ms |
| Vulkan (NVIDIA RTX 5070 Ti) | 1.68 ms | 1.79 ms | 3.40 ms |  31.72 ms |
Vulkan p50/p99 are tight, but worst-case single-hop latency on a
shared desktop is sensitive to external GPU clients (display compositor,
browser). On a dedicated embedded device with no compositor contending
for the queue, the "quiet" column is what you'll see.
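The per-hop numbers relate to real time through the hop budget: 256 samples at 16 kHz is 16 ms of audio, so the realtime factor is the budget divided by the compute time. A small numpy helper that summarizes per-hop timings the way the table does (an illustration, not the `bench` tool's own code):

```python
import numpy as np

HOP, SR = 256, 16000
BUDGET_MS = 1000.0 * HOP / SR     # 16 ms of audio per hop

def hop_stats(times_ms):
    """p50/p99/max of per-hop compute times, plus realtime factor at p50."""
    t = np.asarray(times_ms, dtype=float)
    p50 = float(np.percentile(t, 50))
    return {"p50": p50,
            "p99": float(np.percentile(t, 99)),
            "max": float(t.max()),
            "xrt": BUDGET_MS / p50}   # e.g. 16 / 1.66 ≈ 9.6x realtime
```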
## Running Inference
Download `localvqe-v1-1.3M-f32.gguf` from this repository (see the file list
above) via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
`huggingface_hub`. Then:
### CLI
```bash
./ggml/build/bin/localvqe localvqe-v1-1.3M-f32.gguf \
--in-wav mic.wav ref.wav \
--out-wav enhanced.wav
```
Expects 16 kHz mono PCM for both mic and far-end reference.
### Benchmark
```bash
./ggml/build/bin/bench localvqe-v1-1.3M-f32.gguf \
--in-wav mic.wav ref.wav --iters 10 --profile
```
### Shared Library (C API)
```bash
cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
```
Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
integration.
### Quantizing (experimental)
The model was designed with quantization in mind (power-of-two channel
widths, kernel area 16, GGML-friendly real-valued arithmetic), but
calibrated Q4_K / Q8_0 weights are not yet published. The `quantize` tool
in the C++ build can produce GGUF variants from the F32 reference for
experimentation:
```bash
./ggml/build/bin/quantize localvqe-v1-1.3M-f32.gguf localvqe-v1-1.3M-q8.gguf Q8_0
```
Expect end-to-end quality loss until proper per-tensor selection and
calibration have been worked through.
## PyTorch Reference
`localvqe-v1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research, not
for end-user inference, which should go through the GGML build above. The
model definition lives under `pytorch/` in the
[GitHub repo](https://github.com/LocalAI-io/LocalVQE):
```bash
git clone https://github.com/LocalAI-io/LocalVQE.git
cd LocalVQE/pytorch
pip install -r requirements.txt
```
## Citing LocalVQE
If you use LocalVQE in academic work, please cite the repository via the
`CITATION.cff` at <https://github.com/LocalAI-io/LocalVQE>; GitHub renders
a "Cite this repository" button that produces APA and BibTeX entries
automatically.
For a DOI, we recommend citing a specific release via
[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
also cite the upstream DeepVQE paper:
```bibtex
@inproceedings{indenbom2023deepvqe,
title = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
            and Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v{z}}vin,
            Jegor and Cutler, Ross},
booktitle = {Interspeech},
year = {2023},
doi = {10.21437/Interspeech.2023-2176}
}
```
## Dataset Attribution
Published weights are trained on data from the
[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
(Microsoft, CC BY 4.0) and fine-tuned on the
[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
## Safety Note
Training data was filtered by DNSMOS perceived-quality scores, which can
misclassify distressed speech (screaming, crying) as noise. LocalVQE may
attenuate or distort such signals and must not be relied upon for emergency
call or safety-critical applications.
## License
Apache License 2.0.