LiSenNet
Ultra-compact, causal, real-time speech enhancers trained on VoiceBank-DEMAND-16k — a sub-band U-Net with a magnitude-only mask (phase from a 2-iteration Griffin-Lim offline, or the noisy phase for real-time). Port of Yan, Zhou, Chen & Lu, LiSenNet, arXiv:2409.13285 (hyyan2k/LiSenNet, MIT).
This repo holds three variants, each in its own subfolder:
| subfolder | recipe | params | NPU-compiles | FP32 PESQ | real-time int8 PESQ |
|---|---|---|---|---|---|
| gru/ | dual-path GRU (faithful) | 36,783 | ✗ | 3.006 | 2.930 |
| conv/ | dual-path conv | 41,063 | ✗ | 2.970 | 2.855 |
| conv-hardened/ | conv + NPU-hardened | 36,288 | ✓ | 3.013 | 2.998 |
PESQ is wideband, on the full 824-utterance VoiceBank-DEMAND test split.
gru/ is the faithful reproduction and the original quality reference. Its GRU and 2-axis LayerNorm do not compile to the STM32N6 Neural-ART NPU.
conv/ replaces the GRU bottleneck with a dual-path conv one (0 GRU / 0 LayerNormalization). Its ops map to the NPU, but the FIFO-state streaming graph (conv/g_best_streaming_fp32.onnx, feat + N state_i_in -> est_mag + N state_i_out) crashes the Neural-ART codegen, so it is kept as the CPU/onnxruntime frame-by-frame reference.
conv-hardened/ is the NPU-deployable variant and the current best model overall. It uses per-channel BatchNorm (which folds into the convs), ReLU, plain ConvTranspose upsampling, and a stateless windowed deploy graph (conv-hardened/g_best_windowed_int8_static.onnx, signed QInt8, feat_window (B,3,132,257) -> est_mag (B,64,257), window = receptive field 68 + 64 emitted frames) that compiles to Neural-ART; this is the artifact handed to stedgeai. The hardened primitives also quantize far better (int8 drop −0.016 vs −0.115 for conv/).
Code + full write-up: https://github.com/LarocheC/eco8-neaixt — see RESULTS_LISENNET.md.
Files (per subfolder)
Every subfolder contains config.json, g_best (PyTorch {"generator": state_dict}), g_best_fp32.onnx
and g_best_int8_static.onnx (whole-utterance mask sub-network,
feat (B,3,T,F) -> est_mag (B,T,F)). conv/ additionally has
g_best_streaming_fp32.onnx and g_best_streaming_int8_static.onnx (single
frame + explicit state I/O); conv-hardened/ has g_best_windowed_fp32.onnx
and g_best_windowed_int8_static.onnx (stateless windowed deploy graph, the
stedgeai / Neural-ART target). The ONNX graphs are the mask sub-network only —
STFT, feature build and phase recovery stay host-side.
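Because phase recovery stays host-side, the real-time path (noisy phase) reduces to recombining the predicted magnitude with the phase of the noisy STFT. The snippet below is a minimal sketch of that step, not the repository's implementation. The function name `apply_noisy_phase` is hypothetical, and it assumes `est_mag` and the noisy STFT share the same (T, F) layout and the same magnitude scale; if the model works on a compressed magnitude, that compression would need to be inverted first.

```python
import numpy as np

def apply_noisy_phase(est_mag: np.ndarray, noisy_spec: np.ndarray) -> np.ndarray:
    """Combine an enhanced magnitude with the phase of the noisy STFT.

    est_mag    : real array, e.g. (T, F), the enhanced magnitude (assumed layout)
    noisy_spec : complex array of the same shape, the STFT of the noisy input
    Returns a complex spectrogram ready for an inverse STFT on the host.
    """
    if est_mag.shape != noisy_spec.shape:
        raise ValueError(f"shape mismatch: {est_mag.shape} vs {noisy_spec.shape}")
    phase = np.angle(noisy_spec)          # reuse the noisy phase unchanged
    return est_mag * np.exp(1j * phase)   # magnitude from the model, phase from the input
```

The offline path would replace this step with the 2-iteration Griffin-Lim refinement mentioned at the top of this README.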
Loading (PyTorch)
import json, torch
from huggingface_hub import hf_hub_download
from common.env import AttrDict
from lisennet.model import build_lisennet
REPO, SUB = "claroche1/LiSenNet", "conv-hardened" # or "gru" / "conv"
cfg = json.load(open(hf_hub_download(REPO, f"{SUB}/config.json")))
ckpt = torch.load(hf_hub_download(REPO, f"{SUB}/g_best"), map_location="cpu", weights_only=True)
model = build_lisennet(AttrDict(cfg)).eval()
model.load_state_dict(ckpt["generator"]) # model(noisy_wav)["est"]
Running the NPU windowed deploy graph (conv-hardened/)
Stateless: feed a sliding window of the last 68 + 64 = 132 feature frames and
read the 64 newest enhanced-magnitude frames (no state tensors to carry).
import numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download
sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv-hardened/g_best_windowed_int8_static.onnx"),
    providers=["CPUExecutionProvider"],
)
feat_window = np.zeros((1, 3, 132, 257), np.float32) # last 68+64 feature frames
est_mag = sess.run(["est_mag"], {"feat_window": feat_window})[0] # (1, 64, 257)
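To process a whole utterance with the windowed graph, the 132-frame window can be advanced 64 frames at a time so that each call emits the next 64 frames. The following is a sketch under assumptions: the helper name `enhance_windowed` is hypothetical, the zero padding used for the first 68 context frames and for the incomplete last chunk is an assumption (the README does not say how stream boundaries are handled), and `run_window` stands for any callable that maps a (1, 3, 132, F) window to (1, 64, F).

```python
import numpy as np

CONTEXT, EMIT = 68, 64  # receptive-field frames + emitted frames = 132-frame window

def enhance_windowed(feat, run_window, context=CONTEXT, emit=EMIT):
    """Slide the stateless window over feat (1, 3, T, F) and return (1, T, F).

    run_window: callable taking (1, 3, context + emit, F), returning (1, emit, F).
    """
    batch, _, n_frames, n_bins = feat.shape
    if n_frames == 0:
        return np.zeros((batch, 0, n_bins), np.float32)
    n_chunks = -(-n_frames // emit)             # ceil(T / emit)
    pad_tail = n_chunks * emit - n_frames       # fill the last chunk (assumption: zeros)
    # Assumption: zeros stand in for the missing history before the first frame.
    padded = np.pad(feat, ((0, 0), (0, 0), (context, pad_tail), (0, 0)))
    chunks = []
    for k in range(n_chunks):
        start = k * emit                        # window = 68 past frames + 64 new ones
        window = padded[:, :, start:start + context + emit, :]
        chunks.append(run_window(np.ascontiguousarray(window, dtype=np.float32)))
    return np.concatenate(chunks, axis=1)[:, :n_frames, :]  # drop the tail padding
```

With the session created above, `run_window` could be `lambda w: sess.run(["est_mag"], {"feat_window": w})[0]`.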
Running the CPU streaming graph frame-by-frame (conv/)
import numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download
sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv/g_best_streaming_fp32.onnx"),
    providers=["CPUExecutionProvider"],
)
state_in = [i for i in sess.get_inputs() if i.name != "feat"] # FIFO states
out_names = [o.name for o in sess.get_outputs()] # est_mag + state_*_out
zeros = lambda s: np.zeros([d if isinstance(d, int) else 1 for d in s], np.float32)
states = {i.name: zeros(i.shape) for i in state_in} # start-of-stream = zeros
def step(feat_t):                                          # feat_t: (1, 3, 1, 257)
    res = sess.run(out_names, {"feat": feat_t, **states})
    # Feed each returned state back in on the next frame (res[0] is est_mag).
    for i, v in zip(state_in, res[1:]):
        states[i.name] = v
    return res[0]                                          # est_mag (1, 1, 257)
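A full utterance can then be enhanced by calling `step` once per frame, in order, and stacking the results. This is a small sketch rather than code from the repository: the wrapper name `enhance_streaming` is hypothetical, and it takes the per-frame function as an argument so it works with the `step` defined above or with any callable of the same (1, 3, 1, F) -> (1, 1, F) shape.

```python
import numpy as np

def enhance_streaming(feat, step):
    """Run a stateful per-frame function over feat (1, 3, T, F), return (1, T, F).

    step: callable taking one frame (1, 3, 1, F) and returning (1, 1, F).
          It owns the recurrent state, so frames must be fed in time order.
    """
    batch, _, n_frames, n_bins = feat.shape
    if n_frames == 0:
        return np.zeros((batch, 0, n_bins), feat.dtype)
    frames = [step(feat[:, :, t:t + 1, :]) for t in range(n_frames)]
    return np.concatenate(frames, axis=1)
```

Since the states start at zeros for a new stream, the `states` dictionary should be reset to zeros before each new utterance.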
License
MIT. See the source repository for training code and full attribution.