| --- |
| license: cc-by-nc-4.0 |
| language: |
| - zh |
| - en |
| tags: |
| - speech |
| - asr |
| - sensevoice |
| - paralinguistic |
| - nonverbal-vocalization |
| datasets: |
| - NV-Bench |
| - amphion/Emilia-NV |
| - nonverbalspeech/nonverbalspeech38k |
| - deepvk/NonverbalTTS |
| - xunyi/SMIIP-NV |
| pipeline_tag: automatic-speech-recognition |
| metrics: |
| - cer |
| base_model: |
| - FunAudioLLM/SenseVoiceSmall |
| --- |
| |
| # Multi-lingual NVASR |
|
|
| **Multi-lingual Nonverbal Vocalization Automatic Speech Recognition** |
|
|
[Project Page](https://nvbench.github.io)
[Dataset](https://huggingface.co/datasets/AnonyData/NV-Bench)
[Model](https://huggingface.co/AnonyData/Multilingual-NVASR)
|
|
| Multi-lingual NVASR is a speech recognition model fine-tuned from [SenseVoice-Small](https://github.com/FunAudioLLM/SenseVoice) for transcribing both regular speech and **nonverbal vocalizations (NVVs)** with a unified paralinguistic label taxonomy. It is a core component of the [NV-Bench](https://nvbench.github.io) evaluation pipeline. |
|
|
| ## Highlights |
|
|
- 🗣️ **Multi-lingual Support** — Chinese (zh), English (en)
- 🎯 **NVV-Aware Transcription** — Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within text
- 📊 **High-Quality General ASR** — Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
- 🏷️ **Unified Label Taxonomy** — Consistent paralinguistic labels across all supported languages
|
|
| ## NVV Taxonomy |
|
|
| NVVs are organized into three functional levels: |
|
|
| | Function | Categories | |
| |----------|------------| |
| | Vegetative | `[Cough]`, `[Sigh]`, `[Breathing]` | |
| | Affect Burst | `[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]` | |
| | Conversational Grunt | `[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]` | |
|
|
| > [!NOTE] |
| > Mandarin supports 13 NVV categories; English supports 7 categories. |
|
|
| ## Usage |
|
|
| ### Quick Start with FunASR |
|
|
| ```python |
| from funasr import AutoModel |
| |
| model = AutoModel(model="path/to/Multi-lingual-NVASR") |
| |
| # Single file inference |
| res = model.generate( |
| input="example/zh.mp3", |
| language="auto", |
| use_itn=True, |
| ) |
| print(res[0]["text"]) |
| ``` |
|
|
| ## Evaluation Metrics |
|
|
Multi-lingual NVASR is evaluated with the following metrics in the NV-Bench pipeline:
|
|
| | Metric | Description | |
| |--------|-------------| |
| | **OCER / OWER** | Overall Character/Word Error Rate (text + NVV tags) | |
| | **PCER / PWER** | Paralinguistic CER/WER (NVV tags only) | |
| | **CER / WER** | Text-only error rate (NVV tags removed) | |
|
|
| > Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks. β *NV-Bench* |
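The three metric variants differ only in which tokens are kept before computing edit distance: everything (OCER/OWER), tags only (PCER/PWER), or plain text only (CER/WER). A self-contained sketch of this idea; the tag regex, tokenization, and helper names are assumptions for illustration, not the official NV-Bench implementation:

```python
import re

# Hypothetical pattern for bracketed NVV tags such as [Laughter].
NVV_TAG = re.compile(r"\[[A-Za-z]+(?:-[a-z/]+)*\]")

def levenshtein(ref, hyp):
    """Plain edit distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def tokenize(text, keep="all"):
    """Split a transcript into character and tag tokens, filtered by `keep`."""
    tokens, pos = [], 0
    for m in NVV_TAG.finditer(text):
        tokens.extend(ch for ch in text[pos:m.start()] if not ch.isspace())
        tokens.append(m.group())
        pos = m.end()
    tokens.extend(ch for ch in text[pos:] if not ch.isspace())
    if keep == "text":
        return [t for t in tokens if not NVV_TAG.fullmatch(t)]
    if keep == "tags":
        return [t for t in tokens if NVV_TAG.fullmatch(t)]
    return tokens

def error_rate(ref, hyp, keep="all"):
    r, h = tokenize(ref, keep), tokenize(hyp, keep)
    return levenshtein(r, h) / max(len(r), 1)

ref, hyp = "好的[Laughter]没问题", "好的没问题"
print(error_rate(ref, hyp))           # OCER-style: text + tags
print(error_rate(ref, hyp, "tags"))   # PCER-style: tags only
print(error_rate(ref, hyp, "text"))   # CER-style: tags removed
```

Here the hypothesis misses one tag, so the tags-only rate is 1.0 while the text-only rate is 0.0; the overall rate sits in between, which is exactly the behavior the three-way split is designed to expose.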
|
|
| ## File Structure |
|
|
| ``` |
Multi-lingual NVASR/
├── model.pt                       # Model weights (~2.8 GB)
├── config.yaml                    # Model architecture configuration
├── configuration.json             # FunASR pipeline configuration
├── am.mvn                         # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model  # SentencePiece tokenizer with NVV vocabulary
└── example/                       # Example audio files
    ├── zh.mp3                     # Chinese example
    └── en.mp3                     # English example
| ``` |
|
|
| ## Related Resources |
|
|
| - **NV-Bench Project Page**: [https://nvbench.github.io](https://nvbench.github.io) |
| - **NV-Bench Dataset**: [Hugging Face](https://huggingface.co/datasets/AnonyData/NV-Bench) |
| - **SenseVoice**: [GitHub](https://github.com/FunAudioLLM/SenseVoice) |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| Coming soon |
| ``` |
|
|
| ## License |
|
|
| This project is licensed under the [CC BY-NC-4.0 License](LICENSE). |