@@ -15,366 +15,155 @@ license: apache-2.0
 [![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
-**Local Voice Quality Enhancement** — a compact neural model for joint
-acoustic echo cancellation (AEC), noise suppression, and dereverberation of
-16 kHz speech, designed to run on commodity CPUs in real time.
-- Two sizes — choose by CPU budget:
-  - **v1.3 (current)** — 4.8 M parameters (~19 MB F32), ~3.2 ms per 16 ms
-    frame on Zen4 (4 threads), **≈5× realtime**, ~34 MiB peak RSS.
-  - **v1.2** — 1.3 M parameters (~5 MB F32), ~1.6 ms per 16 ms frame on
-    Zen4 (4 threads), **≈10× realtime**, ~20 MiB peak RSS.
-- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
-- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
-  PyTorch reference included for verification and research
-Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.
-This page is the Hugging Face model card — it hosts the published weights.
-Source code, build system, tests, and training pipeline live in the GitHub
-repository: <https://github.com/localai-org/LocalVQE>.
-The current release is **v1.3**. It widens the encoder/decoder
-(mic channels `[2,112,32,104,96,152]`, far `[2,64,32]`, bottleneck
-256) and trains from scratch under a noise-floor-aware loss recipe.
-On doubletalk it filters noise better than v1.2; on far-end-only
-echo it cancels harder but the residual rates rougher in AECMOS —
-some users will prefer v1.2's gentler trade-off on FE-ST scenes.
-v1.2 stays available as the small/fast option (~1/4 the per-hop
-cost). Both reuse v1.2's 1024 ms echo-search window.
-**Authors:**
-- Richard Palethorpe ([richiejp](https://github.com/richiejp))
-- Claude (Anthropic)
-LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 —
-*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
-Cancellation, Noise Suppression and Dereverberation*,
-[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) — smaller, GGML-native,
-and tuned for streaming CPU inference. The architecture is documented in
-the technical report linked above.
-## A concrete example
-Picture a video call from a laptop. Your microphone picks up three things
-alongside your voice:
-1. The remote participant's voice, played back through your speakers and
-   caught again by your mic — this is the **echo**. Without cancellation
-   they hear themselves a fraction of a second later.
-2. Your own voice bouncing off walls, desk, and monitor before reaching
-   the mic — this is **reverberation**, the "tunnel" or "bathroom" sound
-   that makes you feel far away from the listener.
-3. A fan, keyboard clatter, a dog barking, or traffic outside — plain
-   **background noise**.
-LocalVQE removes all three in a single causal pass, frame by frame, on
-the CPU, so only your voice reaches the far end.
-## Why this, and not a classical AEC/NS stack?
-Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
-NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
-frame and remain a strong baseline when the acoustic path is benign. LocalVQE
-is interesting when you want:
-- **Robustness to non-linear echo paths** (small loudspeakers, handheld
-  devices, plastic laptop chassis) where linear AEC leaves residual echo.
-- **Non-stationary noise suppression** (babble, keyboards, fans changing
-  speed) that energy-based noise estimators struggle with.
-- **One model, many conditions** — no per-device tuning of step sizes,
-  forgetting factors, or VAD thresholds.
-- **A single deterministic causal pass** — no double-talk detector, no
-  adaptation state that can diverge.
-The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
-~1–2 ms/frame. On anything larger than a microcontroller that's still a
-small fraction of a real-time budget.
-## Why this, and not DeepVQE?
-Microsoft never released DeepVQE — no weights, no reference
-implementation, no streaming runtime. We re-implemented it from the
-paper as a GGML graph at
-[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
-(the full-width ~7.5 M-parameter version) before starting LocalVQE.
-LocalVQE is the same idea rebuilt for streaming CPU inference, and
-published in two sizes: a 1.3 M-parameter compact build (v1.2,
-~5 MB F32) for tight CPU budgets, and a 4.8 M-parameter wider build
-(v1.3, ~19 MB F32) that filters noise better on some clips at ~2×
-the per-hop cost. Both are small enough to run real time on
-commodity CPUs.
 ## Files in this repository
-| File | Size | Description |
 |---|---|---|
-| `localvqe-v1.3-4.8M.pt` | 55 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune, wider arch + noise-floor-aware loss. **Current release.** |
-| `localvqe-v1.3-4.8M-f32.gguf` | 19 MB | GGML F32 export of the current release — what the C++ inference engine loads. |
-| `localvqe-v1.2-1.3M.pt` | 11 MB | Compact alternative — same arch family as v1.3 (`arch_version=3`), ~1/4 the cost per hop. |
-| `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | GGML F32 export of the compact variant. |
-| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | Older release (F32 GGUF). |
-| `localvqe-v1-1.3M-f32.gguf` | 5 MB | Original release. |
-Only F32 GGUF is published today. A `quantize` tool is included in the
-C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
-released.
-## Validation Results
 Full 800-clip eval on the
 [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
-— real recordings, not synthetic mixes.
-**v1.3** (current, 4.8 M):
-| Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
-|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
-| doubletalk                        | 115 |          4.73 |     **2.62** |       8.5 dB |          2.89 |
-| doubletalk-with-movement          | 185 |          4.67 |     **2.43** |       8.3 dB |          2.85 |
-| farend-singletalk                 | 107 |          3.69 |         4.83 |  **50.9 dB** |          1.94 |
-| farend-singletalk-with-movement   | 193 |          3.88 |         4.98 |  **49.9 dB** |          1.96 |
-| nearend-singletalk                | 200 |          5.00 |         4.18 |       2.4 dB |          3.17 |
-**v1.2** (compact alternative, 1.3 M):
-| Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
-|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
-| doubletalk                        | 115 |          4.72 |         2.37 |       8.4 dB |          2.83 |
-| doubletalk-with-movement          | 185 |          4.65 |         2.30 |       8.1 dB |          2.79 |
-| farend-singletalk                 | 107 |          3.78 |         4.91 |      45.7 dB |          1.80 |
-| farend-singletalk-with-movement   | 193 |          4.12 |         4.96 |      40.6 dB |          1.75 |
-| nearend-singletalk                | 200 |          5.00 |         4.16 |       2.1 dB |          3.17 |
-v1.3 vs v1.2 deltas (same 800-clip set, same eval pipeline):
-- **Doubletalk deg MOS +0.25**, dt-with-movement deg MOS +0.13 — the
-  wider model + noise-floor-aware loss recipe noticeably reduces
-  perceived speech degradation when both talkers are active. This is
-  the primary v1.3 release goal.
-- **FE-ST-with-movement ERLE +9.3 dB**, FE-ST ERLE +5.2 dB — v1.3
-  cancels far-end echo substantially harder. **AECMOS echo MOS drops
-  −0.24 / −0.09** at the same time: the residual after cancellation
-  rates rougher on AECMOS's perceptual scale even though there's
-  numerically less of it. Some users will prefer v1.2's gentler
-  trade-off on far-end-only scenes.
-- **Nearend-singletalk identical** within noise (deg +0.02,
-  OVRL +0.00) — wider capacity doesn't help (or hurt) when there's
-  nothing to cancel.
-- DNSMOS OVRL is up 0.04–0.21 across all scenarios — the wider
-  model produces consistently cleaner-rated output by DNS metrics.
-For the original v1.2 vs v1.1 deltas (the previous release's
-headline numbers), see the [v1.2 release notes on
-GitHub](https://github.com/localai-org/LocalVQE).
-- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
-  quality predictor. "Echo" rates how well echo was removed; "degradation"
-  rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
-- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
-  far-end single-talk where the input is echo-only; on scenes with active
-  near-end speech it understates echo removal because both numerator and
-  denominator are dominated by speech.
-## Building the C++ Inference Engine
-Source, build system, and tests live at
-<https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
-compiler. A [Nix](https://nixos.org/) flake is provided:
 ```bash
-git clone --recursive https://github.com/localai-org/LocalVQE.git
-cd LocalVQE
-# With Nix:
-nix develop
-cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
-cmake --build ggml/build -j$(nproc)
-# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
-cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
-cmake --build ggml/build -j$(nproc)
-```
-Binaries land in `ggml/build/bin/`. The CPU build produces multiple
-`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
-Keep the binaries and `.so` files together.
-### Vulkan backend (embedded / integrated-GPU targets)
-Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
-CPU build — an additional `libggml-vulkan.so` is produced in
-`ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
-present, otherwise it falls back to the CPU variants.
-```bash
-cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
-cmake --build ggml/build -j$(nproc)
 ```
-The Nix flake's dev shell already includes `vulkan-loader`,
-`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
-from your distro (Debian: `libvulkan-dev vulkan-headers
-glslc`/`shaderc`).
-### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
-Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
-full `ggml_backend_graph_compute`.
-**v1.3** (current — 4.8 M, wider encoder/decoder, bn 256):
-| Backend                     | Threads | p50     | p99     | RT factor |
-|-----------------------------|--------:|--------:|--------:|----------:|
-| CPU                         |       1 | 9.73 ms | 14.48 ms |    1.58× |
-| CPU                         |       2 | 5.41 ms |  5.62 ms |    2.95× |
-| CPU                         |       4 | 3.21 ms |  3.42 ms |    4.97× |
-| CPU                         |       8 | 3.47 ms |  3.80 ms |    4.59× |
-| CPU                         |      16 | 3.79 ms |  4.06 ms |    4.19× |
-| Vulkan — AMD iGPU (RADV)    |       — | 8.71 ms |  9.15 ms |    1.83× |
-| Vulkan — NVIDIA RTX 5070 Ti |       — | 2.57 ms |  4.21 ms |    6.07× |
-The wider v1.3 model is ~2× the per-hop cost of v1.2 in matching
-configurations. The dGPU (RTX 5070 Ti) ends up the fastest option
-by ~1.25× vs 4-thread CPU. The 1-thread case is the worst, still
-real-time (RT 1.58×) but with little margin — running v1.3 on a
-low-core / power-constrained device should use v1.2 instead.
-**v1.2** (compact alternative — 1.3 M, 1024 ms echo-search window):
-| Backend                     | Threads | p50     | p99     | RT factor |
-|-----------------------------|--------:|--------:|--------:|----------:|
-| CPU                         |       1 | 4.28 ms | 4.85 ms |     3.72× |
-| CPU                         |       2 | 2.59 ms | 3.80 ms |     6.09× |
-| CPU                         |       4 | 1.65 ms | 2.91 ms |     8.90× |
-| CPU                         |       8 | 1.93 ms | 2.41 ms |     8.22× |
-| CPU                         |      16 | 2.09 ms | 2.22 ms |     7.69× |
-| Vulkan — AMD iGPU (RADV)    |       — | 6.10 ms | 6.53 ms |     2.61× |
-| Vulkan — NVIDIA RTX 5070 Ti |       — | 1.96 ms | 3.64 ms |     7.85× |
-Beyond ≈4 threads both models are small enough that thread-launch
-and synchronisation overhead dominate; **four threads is the sweet
-spot on Zen4** for both v1.2 and v1.3.
-**v1.1** (older, 512 ms echo-search window) for comparison:
-| Backend                     | Threads | p50     | p99     | max     |
-|-----------------------------|--------:|--------:|--------:|--------:|
-| CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
-| CPU                         |       2 | 2.07 ms | 2.25 ms | 3.65 ms |
-| CPU                         |       4 | 1.32 ms | 1.57 ms | 6.91 ms |
-| Vulkan — AMD iGPU (RADV)    |       — | 4.43 ms | 4.62 ms | 5.07 ms |
-| Vulkan — NVIDIA RTX 5070 Ti |       — | 1.79 ms | 3.41 ms | 4.14 ms |
-Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
-shared desktop is sensitive to external GPU clients (display
-compositor, browser). On a dedicated embedded device with no
-compositor contending for the queue, expect the quieter end of the
-range.
-### Memory footprint (CPU)
-Process RSS from `bench` on Zen4, measured via `/proc/self/status`.
-Same numbers under every thread count from 1 to 16 — the runtime has
-no per-thread arenas of meaningful size, so peak RSS is set by
-weights + activations + history scratch.
-| Model               | Post-load delta ¹ | Peak RSS (VmHWM) ² |
-|---------------------|------------------:|-------------------:|
-| **v1.3** (4.8 M)    | +24.4 MiB         |  34.1 MiB          |
-| **v1.2** (1.3 M)    | +10.0 MiB         |  19.6 MiB          |
-¹ RSS added by loading the model + initialising the CPU backend, on
-top of a ~7 MiB binary-and-libs baseline. This is the portable
-"working set the model brings" number; the absolute peak will depend
-on your host process baseline.
-² Steady-state ceiling after warmup + sustained streaming. v1.3 is
-~1.75× v1.2 in RSS terms despite carrying ~3.7× more parameters —
-activation/history buffers don't scale with channel width. GPU
-backends are not reflected here (VRAM doesn't appear in
-`/proc/self/status`); for those, `bench --profile` prints the
-backend-internal weight/activation buffer sizes.
-## Running Inference
-Download a GGUF from the file list above — `localvqe-v1.3-4.8M-f32.gguf`
-for the current default, or `localvqe-v1.2-1.3M-f32.gguf` for the
-smaller / faster option — via `huggingface-cli`, the Hub web UI, or
-`hf_hub_download` from `huggingface_hub`. The CLI flags are the same
-either way; the examples below use v1.2 so the snippets are shorter
-to type. Swap the filename in to run v1.3.
-### CLI
-```bash
-./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
-    --in-wav mic.wav ref.wav \
-    --out-wav enhanced.wav
-```
-Expects 16 kHz mono PCM for both mic and far-end reference.
-### Benchmark
-```bash
-./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
-    --in-wav mic.wav ref.wav --iters 10 --profile
-```
-### Shared Library (C API)
-```bash
-cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
-cmake --build ggml/build -j$(nproc)
-```
-Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
-`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
-integration.
-### Quantizing (experimental)
-Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
-tool in the C++ build can produce GGUF variants from the F32 reference
-for experimentation:
-```bash
-./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8_0.gguf Q8_0
-```
-Expect end-to-end quality loss until proper per-tensor selection and
-calibration have been worked through.
-## PyTorch Reference
-`localvqe-v1.3-4.8M.pt` (current) and `localvqe-v1.2-1.3M.pt`
-(compact alternative) are the PyTorch checkpoints used to produce
-the GGUF exports. They are provided for verification, ablation, and
-downstream research — not for end-user inference, which should go
-through the GGML build above. Both share `arch_version=3` (pre-norm
-CausalGroupNorm + SiLU + STFT-256) and differ only in width
-(`mic_channels`, `far_channels`, `bottleneck_hidden`), which the
-loader reads from the saved `model_config` field. The model
-definition lives under `pytorch/` in the
-[GitHub repo](https://github.com/localai-org/LocalVQE):
-```bash
-git clone https://github.com/localai-org/LocalVQE.git
-cd LocalVQE/pytorch
-pip install -r requirements.txt
-```
-## Citing LocalVQE
-If you use LocalVQE in academic work, please cite the repository via the
-`CITATION.cff` at <https://github.com/localai-org/LocalVQE> — GitHub renders
-a "Cite this repository" button that produces APA and BibTeX entries
-automatically.
-For a DOI, we recommend citing a specific release via
-[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
-also cite the upstream DeepVQE paper:
 ```bibtex
 @inproceedings{indenbom2023deepvqe,
@@ -382,25 +171,23 @@ also cite the upstream DeepVQE paper:
                Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
   author    = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
                and Chernov, Mykola and Aichner, Robert},
-  booktitle = {Interspeech},
-  year      = {2023},
   doi       = {10.21437/Interspeech.2023-2176}
 }
 ```
-## Dataset Attribution
-Published weights are trained on data from the
-[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
 (Microsoft, CC BY 4.0) and fine-tuned on the
-[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
-## Safety Note
-Training data was filtered by DNSMOS perceived-quality scores, which can
-misclassify distressed speech (screaming, crying) as noise. LocalVQE may
-attenuate or distort such signals and must not be relied upon for emergency
-call or safety-critical applications.
 ## License

 [![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
+**Local Voice Quality Enhancement** — compact neural models for acoustic echo
+cancellation (AEC), noise suppression (NS), and dereverberation of 16 kHz
+speech, running on commodity CPUs in real time. Causal and streaming
+(256-sample hop, 16 ms latency).
+- **Try it:** <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>
+- **Source, build system, tests:** <https://github.com/localai-org/LocalVQE>
+This page hosts the published weights. Inference runs the GGML C++ engine on
+the GGUF files directly (build instructions on GitHub).
+**Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
+Claude (Anthropic). LocalVQE is a streaming, CPU-tuned derivative of **DeepVQE**
+([Indenbom et al., Interspeech 2023](https://arxiv.org/abs/2306.03177)).
+## Models
+Speed is the p50 time per 16 ms hop on a Ryzen 9 7900 (Zen4) with 4 threads
+(the 2.7K build is single-threaded). RT is the realtime factor: values above
+1× are faster than realtime, and higher is faster.
+| Version | Does | Params | Size (F32) | Speed | Pick it when |
+|---|---|---:|---:|---|---|
+| **v1.3** *(current)* | AEC + NS + dereverb | 4.8 M | ~19 MB | 3.2 ms · 5.0× RT | best joint quality, CPU budget available |
+| **v1.2** | AEC + NS + dereverb | 1.3 M | ~5 MB | 1.7 ms · 8.9× RT | tight CPU / low-power devices |
+| **v1.4-AEC** | echo only (keeps voice, noise, room) | 203 K | ~3 MB | 0.83 ms · 19× RT | NS is handled elsewhere, or you want the room kept |
+| **v1.4-AEC 2.7K** | echo only, linear filter (no mask) | 2.7 K | ~17 KB | 0.36 ms · 44× RT | lightest echo canceller; echo isn't heavily reverberant |
+| v1.1 / v1 | AEC + NS + dereverb | 1.3 M | ~5 MB | — | superseded by v1.2 |
+- **Joint models (v1.2 / v1.3)** clean echo, noise, and reverb in one pass.
+  v1.3 is wider and filters noise better; v1.2 has ~1/4 the parameters and
+  roughly half the per-hop cost.
+- **v1.4-AEC** removes only the far-end echo and passes voice, room, and
+  background through unchanged. It's a classical adaptive filter followed by a
+  small neural mask. The **2.7K** build is that filter alone — cheaper and
+  gentler, but it can't remove heavily reverberant echo the way the mask can.
+- Every model needs a far-end **reference** signal (a loopback of what your
+  speakers play) in addition to the mic.
+- `bf16` GGUFs are ~12 % smaller with identical quality and speed; pick `f32`
+  unless download size matters.
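To make the "classical adaptive filter" idea concrete, here is a minimal time-domain NLMS (normalised least-mean-squares) echo canceller in plain Python. This is an illustration only: the card does not say which adaptive algorithm v1.4-AEC uses, and the function name, tap count, and step size below are assumptions chosen for the toy example.

```python
import math
import random


def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    Illustrative NLMS filter, not the LocalVQE front-end.
    Returns the residual signal (mic minus estimated echo).
    """
    weights = [0.0] * taps   # running estimate of the echo path
    history = [0.0] * taps   # recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_est = sum(w * h for w, h in zip(weights, history))
        err = d - echo_est
        # Normalise the step by the reference energy in the window.
        step = mu * err / (sum(h * h for h in history) + eps)
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(err)
    return residual


def energy(signal):
    return sum(s * s for s in signal)


# Synthetic far-end-only scene: the mic hears the reference delayed by two
# samples and attenuated by half, with no near-end speech or noise.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0, 0.0] + [0.5 * r for r in ref[:-2]]

residual = nlms_cancel(mic, ref)
tail_erle_db = 10.0 * math.log10(
    energy(mic[-500:]) / max(energy(residual[-500:]), 1e-30)
)
print(f"echo reduction over the last 500 samples: {tail_erle_db:.1f} dB")
```

A purely linear filter like this can only model an echo path that fits inside its tap window, which is one way to read the note above that the filter-only build struggles with heavily reverberant echo.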
 ## Files in this repository
+| File | Size | Model |
 |---|---|---|
+| `localvqe-v1.4-aec-200K-f32.gguf` | 3 MB | v1.4-AEC (echo only) |
+| `localvqe-v1.4-aec-200K-bf16.gguf` | 2.6 MB | v1.4-AEC, conv weights in BF16 |
+| `localvqe-v1.4-aec-2.7K-f32.gguf` | 17 KB | v1.4-AEC front-end only (adaptive filter, no mask) |
+| `localvqe-v1.3-4.8M-f32.gguf` | 19 MB | v1.3 joint — GGUF the engine loads |
+| `localvqe-v1.3-4.8M.pt` | 55 MB | v1.3 joint — PyTorch checkpoint (research) |
+| `localvqe-v1.2-1.3M-f32.gguf` | 5 MB | v1.2 joint — GGUF |
+| `localvqe-v1.2-1.3M.pt` | 11 MB | v1.2 joint — PyTorch checkpoint |
+| `localvqe-v1.1-1.3M-f32.gguf`, `localvqe-v1-1.3M-f32.gguf` | 5 MB | older releases |
+v1.4-AEC is GGUF-only (no `.pt`). GGUF integrity is checked at load time against
+a built-in SHA256 allowlist in the engine.
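The engine performs its own check at load time. If you also want to verify a download independently, a SHA256 digest can be computed with the Python standard library. The sketch below is a hypothetical helper, not part of LocalVQE; the card does not list the allowlisted digests, so `allowlist` is whatever set of expected hex digests you supply.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_allowed(path, allowlist):
    """True if the file's digest is in a caller-supplied set of digests."""
    return sha256_of_file(path) in allowlist
```

Reading in chunks keeps memory use flat even for the larger `.pt` checkpoints.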
+## Performance
 Full 800-clip eval on the
 [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
+(real recordings). AECMOS echo and deg are 1–5 MOS scores (higher = more echo
+removed and cleaner speech, respectively); OVRL is the DNSMOS overall score;
+blind ERLE is `10·log10(E[mic²]/E[enh²])` and is only meaningful on
+far-end-only clips. For reference, the unprocessed mic scores an echo MOS of
+2.67 / 2.56 / 1.90 / 2.13 / 5.00 across the five scenarios.
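The blind ERLE formula can be written out directly. This is a minimal sketch that assumes `mic` and `enhanced` are equal-length sequences of float samples; the function name and the `floor` guard against division by zero are assumptions, not the evaluation pipeline used for the tables.

```python
import math


def blind_erle_db(mic, enhanced, floor=1e-12):
    """10 * log10(E[mic^2] / E[enh^2]) over two equal-length signals."""
    if not mic or len(mic) != len(enhanced):
        raise ValueError("signals must be non-empty and the same length")
    mic_power = sum(s * s for s in mic) / len(mic)
    enh_power = sum(s * s for s in enhanced) / len(enhanced)
    return 10.0 * math.log10(max(mic_power, floor) / max(enh_power, floor))


# Attenuating the amplitude by 10x lowers the power by 100x, i.e. 20 dB.
mic = [0.5, -0.5, 0.5, -0.5]
enhanced = [0.05, -0.05, 0.05, -0.05]
print(round(blind_erle_db(mic, enhanced), 1))
```

Because the ratio uses total mic energy, any near-end speech in the clip inflates both terms, which is why the number is only informative on far-end-only clips.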
+**v1.4-AEC** — keeps background noise and room by design, so its ERLE and
+far-end DNSMOS are intentionally lower than the joint models (it isn't deleting
+the ambience):
+| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL ↑ |
+|---|--:|--:|--:|--:|--:|
+| doubletalk | 115 | 4.20 | 2.45 | — | 2.59 |
+| doubletalk-with-movement | 185 | 4.19 | 2.45 | — | 2.55 |
+| farend-singletalk | 107 | 3.80 | 4.99 | 14.6 dB | 1.37 |
+| farend-singletalk-with-movement | 193 | 3.86 | 4.95 | 11.1 dB | 1.31 |
+| nearend-singletalk | 200 | 4.99 | 3.99 | — | 3.08 |
+**v1.4-AEC 2.7K** (front-end only) matches or beats the full model's far-end
+echo MOS at roughly 1/75 the parameters; the mask's extra work shows up as
+higher ERLE in the table above, not higher echo MOS:
+| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL ↑ |
+|---|--:|--:|--:|--:|--:|
+| doubletalk | 115 | 4.00 | 2.79 | — | 2.46 |
+| doubletalk-with-movement | 185 | 3.90 | 2.92 | — | 2.42 |
+| farend-singletalk | 107 | 4.06 | 5.00 | 6.5 dB | 1.24 |
+| farend-singletalk-with-movement | 193 | 4.05 | 4.97 | 3.9 dB | 1.22 |
+| nearend-singletalk | 200 | 4.98 | 3.77 | — | 3.03 |
+**v1.3** (joint) and **v1.2** (joint) — these also delete the background, so
+their far-end ERLE is much higher and not comparable to v1.4-AEC's:
+| Scenario | n | v1.3 echo / deg / ERLE / OVRL | v1.2 echo / deg / ERLE / OVRL |
+|---|--:|---|---|
+| doubletalk | 115 | 4.73 / 2.62 / 8.5 dB / 2.89 | 4.72 / 2.37 / 8.4 dB / 2.83 |
+| doubletalk-with-movement | 185 | 4.67 / 2.43 / 8.3 dB / 2.85 | 4.65 / 2.30 / 8.1 dB / 2.79 |
+| farend-singletalk | 107 | 3.69 / 4.83 / 50.9 dB / 1.94 | 3.78 / 4.91 / 45.7 dB / 1.80 |
+| farend-singletalk-with-movement | 193 | 3.88 / 4.98 / 49.9 dB / 1.96 | 4.12 / 4.96 / 40.6 dB / 1.75 |
+| nearend-singletalk | 200 | 5.00 / 4.18 / 2.4 dB / 3.17 | 5.00 / 4.16 / 2.1 dB / 3.17 |
+### Latency
+Per-hop p50 / RT factor on a Ryzen 9 7900 (Zen4). 16 kHz, 256-sample hop.
+| Model | 1 thread | 4 threads | dGPU (RTX 5070 Ti, Vulkan) |
+|---|---|---|---|
+| v1.4-AEC (203 K) | 1.29 ms · 12.2× | 0.83 ms · 18.6× | run on CPU¹ |
+| v1.4-AEC 2.7K | 0.36 ms · 44× (single-threaded) | — | run on CPU¹ |
+| v1.3 (4.8 M) | 9.73 ms · 1.58× | 3.21 ms · 4.97× | 2.57 ms · 6.07× |
+| v1.2 (1.3 M) | 4.28 ms · 3.72× | 1.65 ms · 8.90× | 1.96 ms · 7.85× |
+¹ v1.4-AEC's adaptive front-end always runs on CPU and the neural stage is too
+small for GPU offload to pay off. Four threads is the sweet spot on Zen4 for all
+models; the library defaults to `min(4, available CPUs)`.
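The relationship between hop size, per-hop time, and realtime factor is simple arithmetic, sketched below. The helper names are hypothetical. Note also that the RT factors listed in the table are slightly lower than 16 ms divided by the p50 time, so they were presumably computed from a different statistic (that is a guess); treat this as an approximation.

```python
import os

SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_budget_ms(hop_samples=HOP_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of audio covered by one hop, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(per_hop_ms):
    """How many times faster than realtime a given per-hop time is."""
    return hop_budget_ms() / per_hop_ms


def default_threads(available_cpus=None):
    """Mirror of the documented default, min(4, available CPUs)."""
    if available_cpus is None:
        available_cpus = os.cpu_count() or 1
    return min(4, available_cpus)


print(hop_budget_ms())                   # 16.0
print(round(realtime_factor(3.21), 2))   # 4.98 (v1.3, 4 threads, p50)
```

Anything with a realtime factor below 1× cannot keep up with a live stream, so the single-thread v1.3 figure leaves comparatively little headroom.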
+### Memory (CPU)
+Working set the model adds on top of the ~7 MiB binary baseline:
+| Model | Post-load delta | Peak RSS |
+|---|--:|--:|
+| v1.3 (4.8 M) | +24.4 MiB | 34.1 MiB |
+| v1.2 (1.3 M) | +10.0 MiB | 19.6 MiB |
+| v1.4-AEC (203 K) | +6.7 MiB | 17.0 MiB |
+## Running inference
+Download a GGUF (web UI, `huggingface-cli`, or `hf_hub_download`) and run the
+GGML CLI — same command for every model, just swap the file:
 ```bash
+./localvqe localvqe-v1.3-4.8M-f32.gguf --in-wav mic.wav ref.wav --out-wav out.wav
 ```
+The CLI expects 16 kHz mono PCM for both the mic and the far-end reference.
+Instructions for building the engine, the C API (`liblocalvqe.so`), and the OBS
+Studio plugin are in the
+[GitHub repository](https://github.com/localai-org/LocalVQE).
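Before running the CLI it can help to confirm that both inputs really are 16 kHz mono. The sketch below uses Python's standard `wave` module; `check_wav` and `build_command` are hypothetical helpers, and the binary path `./localvqe` is taken from the command shown above.

```python
import wave


def check_wav(path, expected_rate=16_000):
    """Return a list of problems; an empty list means 16 kHz mono."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != expected_rate:
            problems.append(f"sample rate is {w.getframerate()} Hz")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
    return problems


def build_command(model, mic_wav, ref_wav, out_wav, binary="./localvqe"):
    """Argument list for the CLI invocation shown in this section."""
    return [binary, model, "--in-wav", mic_wav, ref_wav, "--out-wav", out_wav]
```

The returned list can be passed to `subprocess.run(..., check=True)` once both `check_wav` calls come back empty.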
+## PyTorch reference
+`localvqe-v1.3-4.8M.pt` and `localvqe-v1.2-1.3M.pt` are the checkpoints used to
+produce the GGUF exports — for verification, ablation, and research, not
+end-user inference (use the GGML build). The model definition lives under
+`pytorch/` in the [GitHub repo](https://github.com/localai-org/LocalVQE).
+## Citing
+Cite the repository via `CITATION.cff` at
+<https://github.com/localai-org/LocalVQE> (GitHub's "Cite this repository"
+button produces APA / BibTeX), and the upstream DeepVQE paper:
 ```bibtex
 @inproceedings{indenbom2023deepvqe,
                Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
   author    = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
                and Chernov, Mykola and Aichner, Robert},
+  booktitle = {Interspeech},
+  year      = {2023},
   doi       = {10.21437/Interspeech.2023-2176}
 }
 ```
+## Dataset attribution
+Weights are trained on data from the
+[ICASSP 2023 DNS Challenge](https://github.com/microsoft/DNS-Challenge)
 (Microsoft, CC BY 4.0) and fine-tuned on the
+[ICASSP 2022/2023 AEC Challenge](https://github.com/microsoft/AEC-Challenge).
+## Safety
+Training data was filtered by DNSMOS, which can misclassify distressed speech
+(screaming, crying) as noise. LocalVQE may attenuate such signals and must not
+be relied upon for emergency or safety-critical applications.
 ## License