---
license: apache-2.0
---
<h1 align="center">Hello-Chat</h1>
<h3 align="center">Towards Realistic Social Audio Interactions</h3>
<p align="center">
<a href='https://arxiv.org/abs/2602.23387'><img src='https://img.shields.io/badge/arXiv-2602.23387-b31b1b.svg'></a>
<a href="https://github.com/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow" alt="Hugging Face"></a>
</p>
<p align="center">
<img src="assets/img/model_architecture.png" width="100%" alt="Hello-Chat model architecture.">
</p>
## Hello-Chat
**Hello-Chat** is an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on specific audio understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, **Hello-Chat** enables realistic, context-aware spoken interaction between users and AI.
## 📊 Evaluation Results
### Evaluation of Audio to Text
#### Audio Understanding Evaluation
**ASR:** Automatic speech recognition is evaluated on a balanced subset of **AIShell**, **WeNet**, and **LibriSpeech**, with Chinese and English samples evenly represented.<br>
**NLP Question:** Question-answering data is sourced from **AlpacaEval**, **LLaMA Questions**, and **Web Questions**. Text inputs are converted to speech with a high-quality TTS system, and model responses are scored by **GPT-5**.<br>
**Translation:** Speech-to-text translation across Chinese, English, Japanese, and Korean is evaluated on synthetic multilingual data generated by **Claude** and converted to speech via TTS. Outputs are scored by **GPT-5**.<br>
**MMAU:** Audio-based question answering is evaluated on a subset of the **MMAU-Mini** benchmark.
| Model | ASR ↓ | NLP Question ↑ | Translation ↑ | MMAU ↑ |
|---|---|---|---|---|
| Gemini3-Preview | 4.06 | **8.85** | *8.87* | **0.75** |
| GPT-4o-Audio | 6.45 | 8.50 | 8.09 | 0.64 |
| Qwen3-Omni-32B | 3.51 | *8.66* | 8.07 | *0.74* |
| Step-Audio 2 Mini | **3.21** | 7.32 | 8.34 | 0.66 |
| MiDashengLM | 4.50 | 3.82 | 8.43 | 0.65 |
| Kimi-Audio | *3.36* | 7.41 | 8.26 | 0.59 |
| Qwen2.5-Omni-7B | 3.45 | 7.41 | 5.93 | 0.66 |
| **Hello-Chat** | 3.48 | 7.68 | **8.93** | 0.69 |
#### Performance of Paralinguistic Understanding
**SER (speech emotion recognition):** Evaluated on randomly sampled subsets of the **EmoBox** dataset, covering both Chinese and English speech.<br>
**AED (audio event detection):** Evaluated on samples drawn from **AudioSet** and **CochlScene**.
| Model | SER ↑ | AED ↑ |
|---|---|---|
| Gemini3-Preview | 0.791 | **0.861** |
| GPT-4o-Audio | 0.586 | 0.489 |
| Qwen3-Omni-32B | **0.856** | 0.644 |
| Step-Audio 2 Mini | 0.680 | 0.533 |
| MiDashengLM | 0.561 | 0.441 |
| Kimi-Audio | 0.625 | 0.392 |
| Qwen2.5-Omni-7B | 0.607 | 0.584 |
| Hello-Chat | *0.824* | *0.797* |
#### Instruction Following
**Only Yes:** To evaluate robustness in instruction following, we construct a stress test from audio inputs randomly sampled from the benchmarks above. Every input is paired with the same fixed prompt: “no matter the message in the audio, simply answer ‘yes’!”
| Model | Only-Yes Accuracy (%) ↑ |
|---|---|
| Gemini3-Preview | 88 |
| GPT-4o-Audio | 23 |
| Qwen3-Omni-32B | **100** |
| Step-Audio 2 Mini | 87 |
| MiDashengLM | 0 |
| Kimi-Audio | 22 |
| Qwen2.5-Omni-7B | *96* |
| Hello-Chat | **100** |
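
The card does not describe how Only-Yes responses are scored, so the following is only a minimal sketch of one plausible scorer. It assumes a response counts as correct when it consists solely of "yes" after case, whitespace, and surrounding punctuation are ignored; `is_only_yes` and `only_yes_accuracy` are hypothetical helper names, not part of any released evaluation code.

```python
import string


def is_only_yes(response: str) -> bool:
    """Return True if the response is just 'yes' (assumed matching rule).

    Case, surrounding whitespace, and surrounding punctuation or quotes
    are ignored; any additional words make the response incorrect.
    """
    strip_chars = string.punctuation + string.whitespace + "“”‘’"
    cleaned = response.lower().strip(strip_chars)
    return cleaned == "yes"


def only_yes_accuracy(responses: list[str]) -> float:
    """Percentage of responses that consist solely of 'yes'."""
    if not responses:
        raise ValueError("responses must not be empty")
    hits = sum(is_only_yes(r) for r in responses)
    return 100.0 * hits / len(responses)


# Illustrative (made-up) model outputs, not data from the evaluation.
responses = [
    "Yes",
    "yes!",
    "The speaker is asking about the weather.",
    " 'Yes.' ",
]
print(only_yes_accuracy(responses))  # 75.0
```

A stricter or looser matching rule (for example, accepting "Yes, sure") would change the reported percentages, so the rule should be stated alongside any reproduced numbers.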
### Evaluation of Text to Speech
**Seed-TTS-Eval:** We evaluate on the Chinese subset of the Seed-TTS-Eval benchmark, following the official Seed-TTS-Eval protocol, and report character error rate (CER) and speaker similarity (SS).<br>
**Conversational-style Mean Opinion Score (CMOS):** Native speakers rated the samples in a blind test. Each evaluator assigned scores on a 5-point scale (1–5), where a higher score signifies a **more authentic, human-like conversational flow and better alignment with the dialogue intent**.
| Model | CMOS ↑ | CER (%) ↓ | SS ↑ |
|---|---|---|---|
| F5-TTS | 3.48 | 1.56 | 0.741 |
| CosyVoice 2 | 3.66 | 1.45 | 0.748 |
| CosyVoice 3-0.5B | 3.59 | 1.16 | **0.780** |
| Qwen2.5-Omni-7B | - | 1.70 | 0.752 |
| Qwen3-TTS-12Hz-0.6B-Base | 4.12 | **0.92** | 0.763 |
| FireRedTTS-2 | 3.68 | 1.14 | 0.736 |
| IndexTTS2 | *4.16* | *1.008* | *0.764* |
| Hello-Chat | **4.19** | 1.023 | 0.748 |
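
For readers unfamiliar with the two objective columns, here is a minimal sketch of how such metrics are typically computed. It assumes CER is the character-level edit distance between the target text and an ASR transcript of the synthesized audio, divided by the target length, and that SS is the cosine similarity between speaker embeddings of the reference and synthesized audio. The transcript and the embedding vectors are hypothetical inputs here; the official Seed-TTS-Eval tooling defines the actual ASR and speaker models.

```python
import math


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, character by character."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


# Hypothetical target text and ASR transcript (one substituted character).
target = "过年应该跟家里人一起吃饭"
transcript = "过年应该和家里人一起吃饭"
print(round(cer(target, transcript), 2))  # 8.33

# Toy 2-dimensional "embeddings"; real speaker embeddings are much larger.
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

Benchmark-level numbers are usually averages of these per-utterance values, and text normalization before scoring (punctuation, numerals) can shift CER noticeably.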
## 🎧 Demos
### Single-Sentence Demo (zero-shot)
#### Speaker1
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/female1.mp3"></audio>
**generated:**
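##### “那肯定因为自个儿平时想吃点卤味儿。那肯定得得得来一点儿。”
*English gloss: “Well, of course, because I usually like to have some braised snacks myself. So of course I've, I've, I've got to have a bit.”*<br>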
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent1.mp3"></audio>
##### “过年应该应该跟家里人一起吃饭。”
*English gloss: “At New Year you should, should have a meal with your family.”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent2.mp3"></audio>
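##### “哎呀,不是了,现在法治社会哪有卖假货的,只是卖的价格贵。”
*English gloss: “Oh no, that's not it. We live under the rule of law now, so who would sell fakes? It's just that the prices are high.”*<br>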
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent3.mp3"></audio>
---
#### Speaker2
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/female2.mp3"></audio>
**generated:**
##### “但是这个时候上哪去找呢?找不到。”
*English gloss: “But where would you go to find one at this point? You can't.”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent4.mp3"></audio>
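##### “这种做法我感觉不适合,不是他那个年龄段该做出来的事情。”
*English gloss: “I don't think this kind of behavior is appropriate. It's not something a person his age should be doing.”*<br>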
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent5.mp3"></audio>
##### “咱们得趁这个时机啊,看看还要剩多多久啊。”
*English gloss: “We have to seize this moment and see how, how much longer is left.”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent6.mp3"></audio>
---
#### Speaker3
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/male1.mp3"></audio>
**generated:**
##### “我我不不怎么玩游戏,你你会玩游戏啊。”
*English gloss: “I, I don't, don't really play games. You, you play games?”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent7.mp3"></audio>
##### “对呀,就是不管你愿不愿意,时间都是一直往前推嘛。”
*English gloss: “Right, whether you like it or not, time just keeps moving forward.”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent8.mp3"></audio>
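##### “挺好,我看着我看你做菜做饭蛮有生活的那是鸡蛋糕吗?”
*English gloss: “Nice. Watching, I see that your cooking is really homey. Is that an egg cake?”*<br>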
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent9.mp3"></audio>
---
#### Speaker4
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/male2.mp3"></audio>
**generated:**
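##### “我也有二十多岁的时候,那个时候什么都不想,嗯,等那一点点沉淀,年龄大一点了,然后就什么都在乎,什么都想。”
*English gloss: “I was in my twenties once too. Back then I didn't think about anything. Mm, then things slowly settle, you get a bit older, and you start caring about everything and thinking about everything.”*<br>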
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent10.mp3"></audio>
##### “我看我一会儿,我我煮个泡面得了。”
*English gloss: “I think in a bit I'll, I'll just cook some instant noodles.”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent11.mp3"></audio>
##### “他们说那个茶茶饼就是渣子压出来的,是吗?”
*English gloss: “They say those tea, tea cakes are just pressed from the dregs, right?”*<br>
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent12.mp3"></audio>
---
### Multi-Turn Conversation Demo (zero-shot)
#### Conversation #1
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue1.mp3"></audio>
---
#### Conversation #2
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue2.mp3"></audio>
---
#### Conversation #3
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue3.mp3"></audio>
## 📜 Citation
If you find our work useful in your research, please consider citing:
```bibtex
@article{hellogroup2026hellochat,
title={Hello-Chat: Towards Realistic Social Audio Interactions},
author={{Computational Intelligence Dept, HelloGroup Inc.}},
year={2026},
journal={arXiv preprint arXiv:2602.23387}
}
```