LiSenNet

Ultra-compact, causal, real-time speech enhancers trained on VoiceBank-DEMAND-16k. Each is a sub-band U-Net that predicts a magnitude-only mask; phase comes from 2 iterations of Griffin-Lim when running offline, or from the noisy input when running in real time. Port of Yan, Zhou, Chen & Lu, LiSenNet, arXiv:2409.13285 (hyyan2k/LiSenNet, MIT).
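The real-time phase path can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming the predicted magnitude and the complex noisy STFT share the same (T, F) shape; the helper name `apply_noisy_phase` is illustrative and is not taken from the repository.

```python
import numpy as np

def apply_noisy_phase(est_mag, noisy_spec):
    """Illustrative helper (not part of the repository).

    est_mag:    real array, shape (T, F), enhanced magnitudes
    noisy_spec: complex array, shape (T, F), STFT of the noisy input
    Returns the complex enhanced spectrum that reuses the noisy phase.
    """
    phase = np.angle(noisy_spec)             # noisy phase in radians
    return est_mag * np.exp(1j * phase)      # magnitude from the model, phase from the input

# Toy check with random data (T=4 frames, F=257 bins).
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
mag = 0.5 * np.abs(noisy)                    # pretend the mask halves every bin
est = apply_noisy_phase(mag, noisy)
```

The host would then pass `est` to its own inverse STFT; the STFT parameters are not specified here, so they are left out of the sketch.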

This repo holds three variants, each in its own subfolder:

| subfolder | recipe | params | compiles to NPU | FP32 PESQ | real-time int8 PESQ |
|---|---|---|---|---|---|
| `gru/` | dual-path GRU (faithful) | 36,783 | ✗ | 3.006 | 2.930 |
| `conv/` | dual-path conv | 41,063 | ✗ | 2.970 | 2.855 |
| `conv-hardened/` | conv + NPU-hardened | 36,288 | ✓ | 3.013 | 2.998 |

PESQ is wideband, on the full 824-utterance VoiceBank-DEMAND test split.
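As a quick cross-check, the gap between the two PESQ columns can be recomputed from the rounded table values. The snippet below is only arithmetic on the numbers above; the dictionary layout is an illustrative choice.

```python
# FP32 PESQ and real-time int8 PESQ, copied from the table above.
pesq = {
    "gru": (3.006, 2.930),
    "conv": (2.970, 2.855),
    "conv-hardened": (3.013, 2.998),
}

# Drop = int8 score minus FP32 score (negative means quality was lost).
drops = {name: round(q8 - fp32, 3) for name, (fp32, q8) in pesq.items()}

for name, drop in drops.items():
    print(f"{name:14s} {drop:+.3f}")
```

Because the table entries are rounded to three decimals, a drop computed this way can differ in the last digit from one computed on unrounded scores.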

gru/ is the faithful reproduction and the original quality reference. Its GRU + 2-axis LayerNorm do not compile to the STM32N6 Neural-ART NPU.
conv/ replaces the GRU bottleneck with a dual-path conv bottleneck (0 GRU / 0 LayerNormalization ops). Its ops map to the NPU, but the FIFO-state streaming graph (conv/g_best_streaming_fp32.onnx, feat + N state_i_in -> est_mag + N state_i_out) crashes the Neural-ART codegen, so it is kept as the CPU/onnxruntime frame-by-frame reference.
conv-hardened/ is the NPU-deployable variant and the current best model overall. It uses per-channel BatchNorm (which folds into the convs), ReLU, and plain ConvTranspose upsampling. Its stateless windowed deploy graph, conv-hardened/g_best_windowed_int8_static.onnx (signed QInt8, feat_window (B,3,132,257) -> est_mag (B,64,257), window = receptive field 68 + 64 emitted frames), compiles to Neural-ART and is the artifact handed to stedgeai. The hardened primitives also quantize far better (int8 drop −0.016 vs −0.115 for conv/).
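The remark that per-channel BatchNorm folds into the convs refers to the standard inference-time identity: with running statistics frozen, the normalization is an affine map per output channel and can be absorbed into the conv weight and bias. The sketch below checks that identity on a 1x1 convolution. The function names, shapes, and `eps` value are assumptions chosen for illustration, not the repository's layer layout.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BatchNorm into the preceding conv (illustrative).

    weight: (C_out, C_in, kH, kW); bias, gamma, beta, mean, var: (C_out,)
    """
    scale = gamma / np.sqrt(var + eps)               # one factor per output channel
    folded_w = weight * scale[:, None, None, None]
    folded_b = (bias - mean) * scale + beta
    return folded_w, folded_b

def conv1x1(x, w, b):
    """1x1 conv on x of shape (C_in, T, F): a per-pixel matrix multiply."""
    return np.einsum("oi,itf->otf", w[:, :, 0, 0], x) + b[:, None, None]

rng = np.random.default_rng(0)
c_in, c_out = 3, 4
w = rng.standard_normal((c_out, c_in, 1, 1))
b = rng.standard_normal(c_out)
gamma, beta = rng.standard_normal(c_out), rng.standard_normal(c_out)
mean, var = rng.standard_normal(c_out), rng.uniform(0.5, 2.0, c_out)
x = rng.standard_normal((c_in, 5, 7))

# Reference: conv followed by an explicit BatchNorm with fixed statistics.
y_ref = conv1x1(x, w, b)
y_ref = (y_ref - mean[:, None, None]) / np.sqrt(var[:, None, None] + 1e-5)
y_ref = y_ref * gamma[:, None, None] + beta[:, None, None]

# Folded: a single conv with rescaled weight and shifted bias.
fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)
y_fold = conv1x1(x, fw, fb)
```

After folding, the deployed graph contains no separate normalization op. That is consistent with the NPU-friendliness described above, although the exact export path used by the authors is not shown here.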

Code and full write-up: https://github.com/LarocheC/eco8-neaixt (see RESULTS_LISENNET.md).

Files (per subfolder)

Each subfolder contains config.json, g_best (PyTorch checkpoint, {"generator": state_dict}), and g_best_fp32.onnx and g_best_int8_static.onnx (whole-utterance mask sub-network, feat (B,3,T,F) -> est_mag (B,T,F)). conv/ additionally has g_best_streaming_fp32.onnx and g_best_streaming_int8_static.onnx (single frame + explicit state I/O). conv-hardened/ additionally has g_best_windowed_fp32.onnx and g_best_windowed_int8_static.onnx (stateless windowed deploy graph, the stedgeai / Neural-ART target). The ONNX graphs contain the mask sub-network only; STFT, feature build and phase recovery stay host-side.
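The per-subfolder layout can be written down as a small lookup, which is handy when scripting downloads. This is a sketch built only from the file names listed above; the helper `repo_paths` and the two constants are hypothetical names.

```python
# File names shared by every variant subfolder (from the list above).
COMMON = ["config.json", "g_best", "g_best_fp32.onnx", "g_best_int8_static.onnx"]

# Extra graphs that only some variants ship.
EXTRA = {
    "gru": [],
    "conv": ["g_best_streaming_fp32.onnx", "g_best_streaming_int8_static.onnx"],
    "conv-hardened": ["g_best_windowed_fp32.onnx", "g_best_windowed_int8_static.onnx"],
}

def repo_paths(subfolder):
    """Return the repo-relative paths for one variant (hypothetical helper)."""
    if subfolder not in EXTRA:
        raise ValueError(f"unknown variant: {subfolder!r}")
    return [f"{subfolder}/{name}" for name in COMMON + EXTRA[subfolder]]

paths = repo_paths("conv-hardened")
```

Each returned path can be passed as the filename argument of `hf_hub_download`, as the loading examples in this card do.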

Loading (PyTorch)

import json

import torch
from huggingface_hub import hf_hub_download

# These two modules are assumed to come from the code repository linked above.
from common.env import AttrDict
from lisennet.model import build_lisennet

REPO, SUB = "claroche1/LiSenNet", "conv-hardened"      # or "gru" / "conv"

# Read the variant's config; the context manager closes the file handle.
with open(hf_hub_download(REPO, f"{SUB}/config.json")) as f:
    cfg = json.load(f)

# The checkpoint is a dict of the form {"generator": state_dict}.
ckpt = torch.load(hf_hub_download(REPO, f"{SUB}/g_best"), map_location="cpu", weights_only=True)
model = build_lisennet(AttrDict(cfg)).eval()
model.load_state_dict(ckpt["generator"])   # model(noisy_wav)["est"]

Running the NPU windowed deploy graph (`conv-hardened/`)

Stateless: feed a sliding window of the last 68 + 64 = 132 feature frames and read the 64 newest enhanced-magnitude frames (no state tensors to carry).

import numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download

sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv-hardened/g_best_windowed_int8_static.onnx"),
    providers=["CPUExecutionProvider"],
)
feat_window = np.zeros((1, 3, 132, 257), np.float32)   # last 68+64 feature frames
est_mag = sess.run(["est_mag"], {"feat_window": feat_window})[0]  # (1, 64, 257)
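To process a whole utterance with the windowed graph, the host slides the 132-frame window forward by 64 frames per call. The driver below is a sketch of that loop under two assumptions that are not confirmed by this card: missing history at the start of the stream is filled with zero frames, and the tail is zero-padded to a multiple of 64. `enhance_windowed` and `fake_run` are hypothetical names, and `fake_run` stands in for the ONNX session so the example runs without the model file.

```python
import numpy as np

CONTEXT, EMIT = 68, 64            # receptive field + frames emitted per call
WINDOW = CONTEXT + EMIT           # 132 frames, the graph's fixed time dimension

def enhance_windowed(feat, run_window):
    """feat: (1, 3, T, F) with T >= 1; run_window: (1, 3, 132, F) -> (1, 64, F)."""
    t = feat.shape[2]
    n_calls = -(-t // EMIT)                               # ceil(T / 64)
    pad_tail = n_calls * EMIT - t
    # Assumption: zero frames stand in for missing history and for the tail.
    padded = np.pad(feat, ((0, 0), (0, 0), (CONTEXT, pad_tail), (0, 0)))
    chunks = []
    for k in range(n_calls):
        start = k * EMIT                                  # hop of 64 frames
        chunks.append(run_window(padded[:, :, start:start + WINDOW, :]))
    return np.concatenate(chunks, axis=1)[:, :t, :]       # drop tail padding

def fake_run(window):
    """Stand-in for the session: echoes channel 0 of the 64 newest frames."""
    assert window.shape[2] == WINDOW
    return window[:, 0, -EMIT:, :]

feat = np.random.default_rng(0).standard_normal((1, 3, 200, 257)).astype(np.float32)
out = enhance_windowed(feat, fake_run)                    # (1, 200, 257)
```

With the real model, `run_window` would wrap the session call, for example `lambda w: sess.run(["est_mag"], {"feat_window": w})[0]`. Since each call emits 64 frames, the host has to buffer 64 new feature frames between calls.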

Running the CPU streaming graph frame-by-frame (`conv/`)

import numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download

sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv/g_best_streaming_fp32.onnx"),
    providers=["CPUExecutionProvider"],
)
state_in = [i for i in sess.get_inputs() if i.name != "feat"]   # FIFO states
out_names = [o.name for o in sess.get_outputs()]                # est_mag + state_*_out
zeros = lambda s: np.zeros([d if isinstance(d, int) else 1 for d in s], np.float32)
states = {i.name: zeros(i.shape) for i in state_in}            # start-of-stream = zeros

def step(feat_t):                                              # feat_t: (1, 3, 1, 257)
    res = sess.run(out_names, {"feat": feat_t, **states})
    for i, v in zip(state_in, res[1:]):
        states[i.name] = v
    return res[0]                                              # est_mag (1, 1, 257)
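The reason zero-initialized FIFO states are a sensible start-of-stream value can be seen on a toy causal convolution: a frame-by-frame step that carries the last K-1 inputs as state reproduces the offline result computed with K-1 zeros of left padding. The sketch below uses a 1-D signal and hypothetical function names; the real graph's state tensors are multi-dimensional and their layout is not described here.

```python
import numpy as np

def causal_conv_offline(x, kernel):
    """x: (T,), kernel: (K,). Left-pad with K-1 zeros, then slide the kernel."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

def causal_conv_step(x_t, state, kernel):
    """One sample in, one sample out; state holds the previous K-1 inputs."""
    window = np.concatenate([state, [x_t]])     # the K most recent samples
    return window @ kernel, window[1:]          # output and updated FIFO state

rng = np.random.default_rng(0)
x, kernel = rng.standard_normal(20), rng.standard_normal(3)

state = np.zeros(len(kernel) - 1)               # start-of-stream = zeros
streamed = []
for x_t in x:
    y_t, state = causal_conv_step(x_t, state, kernel)
    streamed.append(y_t)
streamed = np.array(streamed)

offline = causal_conv_offline(x, kernel)        # whole-signal reference
```

The `step` function above plays the same role for the ONNX graph: it feeds the current states in, reads the updated states out, and keeps them for the next frame.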

License

MIT. See the source repository for training code and full attribution.

Paper

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement. arXiv:2409.13285, published Sep 20, 2024.