GPT-SoVITS CPU Inference Fork
Inference-only GPT-SoVITS fork focused on CPU deployment and CPU-side optimization on Windows, Linux, and macOS.
What This Repo Is
- Inference-only fork of GPT-SoVITS.
- Designed around CPU usage rather than GPU-first features.
- Current practical focus is the S2 `v2Pro/v2ProPlus` path, while keeping versioned pretrained downloads for `v1`, `v2`, `v2Pro`, and `v2ProPlus`.
What Was Removed
This repository no longer keeps most training and dataset-preparation features from upstream.
- Training entrypoints and training-only utilities
- Dataset slicing / denoise / ASR / labeling workflows
- UVR5 and other non-inference WebUI tools
The remaining goal is straightforward: run GPT-SoVITS inference on CPU with less installation friction and a smaller runtime surface.
What Still Works
- `webui.py`: minimal inference launcher
- `GPT_SoVITS/inference_webui_fast.py`: high-performance CPU inference WebUI
- `api.py` and `api_v2.py`: inference APIs
Quick Start
1. Create Environment
Use Miniconda or an existing Conda environment:
conda create -n GPTSoVits python=3.10 -y
conda activate GPTSoVits
2. Install Dependencies and Download Inference Weights
CPU example with ModelScope and v2ProPlus:
bash install.sh --source ModelScope --version v2ProPlus
Windows PowerShell:
.\install.ps1 -Source ModelScope -Version v2ProPlus
Available versions:
- `v1`
- `v2`
- `v2Pro`
- `v2ProPlus`
- `all`
3. Launch
Recommended:
python webui.py
Direct high-performance inference WebUI:
python GPT_SoVITS/inference_webui_fast.py
Notes
- This fork is aimed at CPU inference, not training.
- Chinese inference is still heavier than English / Japanese / Korean because text preprocessing needs extra frontend work such as `g2pw` and BERT features.
- `install.sh` and `install.ps1` are now CPU-only installers and download inference assets by version instead of the full pretrained bundle.
- `NLTK` and `OpenJTalk` dictionary downloads remain enabled by default.
Speed Summary
Without changing recognition behavior, prosody, speaker identity, or audio quality, this CPU fork currently delivers:
- Chinese end-to-end `zh_pure`: `wall_sec 15.136264 -> 8.309471`, about `-45.1%`
- End-to-end: `wall_sec 10.431826 -> 7.046743`, about `-32.4%`
- Preprocessing: `frontend_sec 0.867065 -> 0.569571`, about `-34.3%`
- T2S: `t2s_sec 4.023657 -> 2.268571`, about `-43.6%`
- VITS: `vits_sec 5.137876 -> 3.969286`, about `-22.7%`
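The percentages in this summary follow the usual relative-change formula, (after - before) / before. The following is a minimal sketch for re-deriving them from the logged timings; the helper name `relative_change` is ours for illustration and is not part of the repository.

```python
def relative_change(before: float, after: float) -> float:
    """Return the relative change from before to after, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Timings copied from the summary above (seconds, before -> after).
timings = {
    "zh_pure wall_sec": (15.136264, 8.309471),
    "wall_sec": (10.431826, 7.046743),
    "frontend_sec": (0.867065, 0.569571),
    "t2s_sec": (4.023657, 2.268571),
    "vits_sec": (5.137876, 3.969286),
}

for name, (before, after) in timings.items():
    # Negative values mean the stage got faster.
    print(f"{name}: {relative_change(before, after):+.1f}%")
```

Because the stages contribute different shares of the total, the per-stage percentages are not expected to add up to the end-to-end figure.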
Largest landed stage wins so far:
- Preprocessing:
  - Chinese warm multi-sentence BERT frontend: `0.855s -> 0.131s`, about `-84.7%`
  - Korean cold frontend path: `0.791s -> 0.155s`, about `-80.4%`
- T2S: `stable_batch_remap` 7-case comparison `1.986543 -> 1.684446`, about `-15.2%`
- VITS: `remove_weight_norm` `vits_only` comparison `4.256119 -> 4.102004`, about `-3.6%`
How The Speedups Were Achieved
These gains do not come from sentence-level caching, quantization, or quality tradeoffs. The main work was removing CPU-side Python overhead, repeated preparation, repeated copies, and unnecessary cold-start costs from the real inference path.
Chinese shows an even larger end-to-end gain than the overall average because it started from the heaviest path: it pays for both g2pw and BERT text features, so frontend reductions propagate directly into total latency.
What Was Deliberately Not Used
- ONNX / ORT was not adopted. `dec-only ORT`, `flow + dec ORT`, and larger graph-level ORT experiments were all tried, but on the current machine and dependency stack they did not produce a solution that was simultaneously quality-safe, faster, and lighter than the PyTorch path, so the runtime stayed on pure PyTorch and the ONNX compatibility path was removed.
- The Chinese path was not simplified by dropping quality-critical frontend pieces. `g2pw` stayed, Chinese BERT stayed, and the project did not switch to lighter but lower-quality replacements such as `g2pm`, which are more likely to cause noticeable G2P errors on harder text, especially literary or classical material.
- The main path also does not rely on secondary splitting just to manufacture larger batches. That direction was explored in benchmark-only form, but the results did not stay stable, and the `repeats=3` verification did not justify moving it into the runtime path.
- VITS parallel synthesis is not exposed in the CPU WebUI. Local CPU measurements showed it can add overhead and slow inference, so the high-performance WebUI keeps VITS synthesis serial while preserving T2S parallel inference.
Preprocessing / Frontend
- English frontend work focused on cold start. Heavy first-request dependencies around `g2p_en.G2p` were reduced, `wordsegment.load()` was made lazy, `nltk.pos_tag` was narrowed to heteronym cases, and import-time overhead from `inflect` / `typeguard` was trimmed.
- Korean frontend work removed an unrelated cold-start chain. `g2pk2` used to pull in `nltk` / `cmudict` even for pure Korean input; that path is now stubbed lazily so Korean requests no longer pay for the English dictionary on first use.
- The biggest Chinese frontend win came from changing pure-Chinese multi-sentence BERT extraction from sentence-by-sentence serial execution to a batched path. The current path batches tokenization, BERT forward, and `word2ph` alignment for pure Chinese multi-sentence requests.
- Chinese `g2pw` cold start was also reduced by lazily importing `requests`, shrinking tokenizer initialization down to direct `tokenizer.json` loading, and avoiding the larger `transformers` auto-dispatch chain during startup.
- Chinese segmentation / POS loading was further reduced with local static assets so `jieba_fast.posseg` initialization does less repeated work, without introducing any sentence-level cache tied to user input.
- Non-Chinese zero-BERT paths also gained simple length-based zero-tensor reuse so repeated all-zero feature allocation does less work.
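Several of the frontend items above share one pattern: defer an expensive import or resource load until the first request that actually needs it, then reuse the result. The following is a minimal, stdlib-only sketch of that pattern. `LazyResource`, `load_english_dictionary`, and `handle_request` are hypothetical names for illustration and do not reflect the fork's actual code.

```python
import threading
from typing import Callable, Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Load an expensive resource on first use and reuse it afterwards."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Fast path: already loaded, so no lock is taken.
        if self._loaded:
            return self._value  # type: ignore[return-value]
        with self._lock:
            # Re-check inside the lock so concurrent first calls load once.
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
        return self._value  # type: ignore[return-value]


# Counts how often the (hypothetical) expensive load really runs.
load_counter = {"calls": 0}


def load_english_dictionary() -> Dict[str, List[str]]:
    """Hypothetical stand-in for an expensive dictionary load."""
    load_counter["calls"] += 1
    return {"read": ["R", "IY1", "D"]}


english_dictionary = LazyResource(load_english_dictionary)


def handle_request(language: str, word: str) -> List[str]:
    # Only English requests touch the dictionary, so a Korean-only
    # session never pays for loading it.
    if language == "en":
        return english_dictionary.get().get(word, [])
    return []
```

With this shape, a process that only ever serves Korean input never triggers the English load, and an English session pays for it once on the first request rather than at import time.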
T2S
- The first layer of gains came from removing pure Python overhead from the hot path, including `tqdm` and hot-loop prints during decoding. These changes do not alter model numerics, but they do matter on CPU.
- The next layer came from the shrink path. Previously, when some rows in a batch finished early, the code copied whole future-capacity buffers with `index_select`, including token buffers, KV caches, and masks that were never going to be used. The current path compacts only the valid prefix.
- Another landed gain is `stable_batch_remap`. It is not a “never shrink” hack. It keeps exact behavior while stabilizing how active rows are remapped inside the batch, which reduces unnecessary compaction churn.
- The deeper T2S gains came from addressing the actual `addmm` hotspots. The main path now uses exact-safe hybrid linear execution for high-frequency layers such as `MLP`, `qkv_proj`, and `out_proj`: `rows == 1` keeps the original `F.linear`, while larger cases use a more CPU-friendly `torch.addmm` path.
- That backend work only became safe after fixing the load path as well. `t2s_transformer` is rebuilt after checkpoint loading so cached transposed weights are bound to the real loaded parameters instead of the pre-load initialization weights.
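The shrink-path change can be illustrated without any tensor library. The sketch below uses plain Python lists as stand-ins for preallocated per-row buffers and compares how many elements each strategy copies when one row finishes early. The function names are hypothetical, and how the real code re-provisions capacity after compaction is not described here, so that part is left out.

```python
from typing import List, Sequence, Tuple


def compact_full_capacity(
    buffers: Sequence[List[int]], keep_rows: Sequence[int]
) -> Tuple[List[List[int]], int]:
    """Old-style shrink: copy every kept row across its full capacity."""
    out = [list(buffers[r]) for r in keep_rows]
    copied = sum(len(row) for row in out)
    return out, copied


def compact_valid_prefix(
    buffers: Sequence[List[int]], keep_rows: Sequence[int], valid_len: int
) -> Tuple[List[List[int]], int]:
    """Prefix-only shrink: copy just the first valid_len entries of kept rows."""
    out = [list(buffers[r][:valid_len]) for r in keep_rows]
    copied = sum(len(row) for row in out)
    return out, copied


# Three rows preallocated to capacity 8; only the first 3 steps are filled.
capacity, valid_len = 8, 3
buffers = [
    [11, 12, 13] + [0] * (capacity - valid_len),
    [21, 22, 23] + [0] * (capacity - valid_len),
    [31, 32, 33] + [0] * (capacity - valid_len),
]
keep_rows = [0, 2]  # row 1 finished early and is dropped

_, full_copied = compact_full_capacity(buffers, keep_rows)
compacted, prefix_copied = compact_valid_prefix(buffers, keep_rows, valid_len)
print(full_copied, prefix_copied)  # 16 6
```

The saving grows with the gap between allocated capacity and the number of steps decoded so far, which is largest early in decoding.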
VITS
- VITS optimization stayed conservative. The main strategy was to remove work that was being repeated for every batch instead of rewriting the model structure.
- The first landed step was a run-level runtime cache in `TTS.run()` for reference-side objects that do not depend on the current input text, such as `refer_audio_spec`, `sv_emb`, `prompt_semantic_tokens`, and `prompt_phones`.
- The second landed step moved decode-condition preparation out of repeated decode calls. `build_decode_condition()` lets `ge / ge_text` be computed once before the batch loop and then reused across decode calls.
- The third landed step applies `remove_weight_norm()` directly to non-vocoder `Generator.dec`, removing weight-norm reparameterization overhead during inference. This is the main clearly logged stage-local VITS win currently kept in the codebase.
- The worklog also records more aggressive VITS experiments that were tested but not kept, such as traced `dec` and more aggressive decode layout variants. The README only describes the parts that actually landed.
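The first two VITS steps are both instances of hoisting loop-invariant work: anything that depends only on the reference audio, not on the current text batch, is computed once per run. A minimal sketch of the idea follows. `build_condition`, `decode`, `run_naive`, and `run_hoisted` are hypothetical stand-ins and do not mirror the signatures of the fork's `build_decode_condition()` or `TTS.run()`.

```python
from typing import Dict, List, Sequence

# Counts how often the (hypothetical) expensive preparation runs.
condition_builds = {"count": 0}


def build_condition(reference: str) -> Dict[str, str]:
    """Hypothetical stand-in for an expensive, text-independent setup step."""
    condition_builds["count"] += 1
    return {"speaker": reference.upper()}


def decode(batch: Sequence[str], condition: Dict[str, str]) -> List[str]:
    """Hypothetical decode step that only reads the prepared condition."""
    return [f"{condition['speaker']}:{text}" for text in batch]


def run_naive(batches: Sequence[Sequence[str]], reference: str) -> List[str]:
    # Rebuilds the same condition for every batch.
    out: List[str] = []
    for batch in batches:
        out.extend(decode(batch, build_condition(reference)))
    return out


def run_hoisted(batches: Sequence[Sequence[str]], reference: str) -> List[str]:
    # Builds the condition once, before the loop, and reuses it.
    condition = build_condition(reference)
    out: List[str] = []
    for batch in batches:
        out.extend(decode(batch, condition))
    return out
```

Both functions return the same output for the same inputs; the hoisted version simply calls the preparation step once instead of once per batch, which is why this kind of change does not affect audio quality.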
Load Path And Memory
- `t2s_only` benchmark loading used to instantiate the full `TTS()` pipeline and then trim unused objects, which pushed peak memory far too high. That was replaced with a true lightweight load path that initializes only the minimum T2S-side objects.
- The main inference load path was also reordered to avoid rebuilding large transposed-weight structures while the checkpoint dictionary was still alive in memory.
- The biggest memory-side change is that inference no longer builds `self.h` at all on the main T2S path. The runtime now rebuilds `t2s_transformer` directly from the state dict and only keeps what inference actually uses.
- This is not benchmark-only machinery. It is wired into the real inference path and mainly helps reduce steady-state RSS and peak RSS so CPU machines are less likely to stall or be reclaimed under memory pressure.
Upstream and Credits
This project is based on and uses code from:
This fork keeps upstream credits and referenced projects below.
Referenced Projects
Theoretical Research
Main Model / Training / Vocoder Related
- RVC-Boss/GPT-SoVITS
- SoVITS
- GPT-SoVITS-beta
- Chinese Speech Pretrain
- Chinese-Roberta-WWM-Ext-Large
- eresnetv2
Text Frontend for Inference
Inherited Upstream Tool References
These projects were referenced by upstream GPT-SoVITS. Some related modules are removed in this inference-only fork, but credits are preserved here.