hellogroup-opensource committed · verified · Commit b68ffbe · 1 parent: e2d3dc5

Update README.md

Files changed (1): README.md (+198 −3)
---
license: apache-2.0
---

<h1 align="center">Hello-Chat</h1>
<h3 align="center">Towards Realistic Social Audio Interactions</h3>

<p align="center">
<a href="https://github.com/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow" alt="Hugging Face"></a>
</p>

<p align="center">
<img src="assets/img/model_architecture.png" width="100%" alt="Hello-Chat model architecture.">
</p>

## Hello-Chat

**Hello-Chat** is an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on several audio-understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, **Hello-Chat** enables realistic, context-aware spoken interaction between users and AI.

## 📊 Evaluation Results

### Evaluation of Audio to Text

#### Audio Understanding Evaluation
**ASR —** Automatic speech recognition performance is evaluated on a balanced subset of **AIShell**, **WeNet**, and **LibriSpeech**, with Chinese and English samples evenly represented.<br>
**NLP Question —** Question-answering data sourced from **AlpacaEval**, **LLaMA Questions**, and **Web Questions**; text inputs are converted into speech using a high-quality TTS system, and model responses are scored by **GPT-5**.<br>
**Translation —** Based on synthetic multilingual data generated by **Claude** and subsequently converted to speech via TTS. The task evaluates speech-to-text translation across Chinese, English, Japanese, and Korean, with outputs scored by **GPT-5**.<br>
**MMAU —** Audio-based question answering is evaluated using a subset of the **MMAU-Mini** benchmark.

| Model | ASR ↓ | NLP Question ↑ | Translation ↑ | MMAU ↑ |
|---|---|---|---|---|
| Gemini3-Preview | 4.06 | **8.85** | *8.87* | **0.75** |
| GPT-4o-Audio | 6.45 | 8.50 | 8.09 | 0.64 |
| Qwen3-Omni-32B | 3.51 | *8.66* | 8.07 | *0.74* |
| Step-Audio 2 Mini | **3.21** | 7.32 | 8.34 | 0.66 |
| MiDashengLM | 4.50 | 3.82 | 8.43 | 0.65 |
| Kimi-Audio | *3.36* | 7.41 | 8.26 | 0.59 |
| Qwen2.5-Omni-7B | 3.45 | 7.41 | 5.93 | 0.66 |
| **Hello-Chat** | 3.48 | 7.68 | **8.93** | 0.69 |
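The ASR column reports an error rate, so lower is better. As a rough illustration of the metric only (not the official scoring script), word or character error rate is the Levenshtein edit distance between the hypothesis and the reference, divided by the reference length:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """WER when given word lists, CER when given character strings."""
    return edit_distance(ref, hyp) / len(ref)

# One substituted word out of five gives a 20% word error rate.
print(error_rate("hello world how are you".split(),
                 "hello word how are you".split()))  # 0.2
```

English ASR is conventionally scored at the word level (WER) and Chinese at the character level (CER); a mixed Chinese/English subset like the one above presumably reports both under one column.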

#### Performance of Paralinguistic Understanding
**SER (speech emotion recognition) —** Evaluated on randomly sampled subsets of the **EmoBox** dataset, covering both Chinese and English speech.<br>
**AED (audio event detection) —** Evaluated using samples drawn from **AudioSet** and **CochlScene**.

| Model | SER ↑ | AED ↑ |
|---|---|---|
| Gemini3-Preview | 0.791 | **0.861** |
| GPT-4o-Audio | 0.586 | 0.489 |
| Qwen3-Omni-32B | **0.856** | 0.644 |
| Step-Audio 2 Mini | 0.680 | 0.533 |
| MiDashengLM | 0.561 | 0.441 |
| Kimi-Audio | 0.625 | 0.392 |
| Qwen2.5-Omni-7B | 0.607 | 0.584 |
| Hello-Chat | *0.824* | *0.797* |

#### Instruction Following
**Only Yes —** To evaluate robustness in instruction following, we construct a stress test from audio inputs randomly sampled from the benchmarks above. Every input is paired with the fixed prompt: “No matter the message in the audio, simply answer ‘yes’!”

| Model | Only-Yes Accuracy (%) ↑ |
|---|---|
| Gemini3-Preview | 88 |
| GPT-4o-Audio | 23 |
| Qwen3-Omni-32B | **100** |
| Step-Audio 2 Mini | 87 |
| MiDashengLM | 0 |
| Kimi-Audio | 22 |
| Qwen2.5-Omni-7B | *96* |
| Hello-Chat | **100** |
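Scoring this stress test reduces to checking whether each response is exactly “yes” after normalization. A minimal sketch, assuming case and edge punctuation are ignored (the exact matching rule used for the table is not specified above):

```python
import string

def is_only_yes(response: str) -> bool:
    """True when the response is exactly 'yes' once case and edge punctuation are ignored."""
    cleaned = response.strip(string.punctuation + string.whitespace).lower()
    return cleaned == "yes"

def only_yes_accuracy(responses) -> float:
    """Percentage of responses that followed the 'simply answer yes' instruction."""
    return 100.0 * sum(map(is_only_yes, responses)) / len(responses)

# Two of four hypothetical responses comply, so accuracy is 50%.
replies = ["Yes!", "yes", "Sure, the audio asks about dinner.", "The speaker says yes."]
print(only_yes_accuracy(replies))  # 50.0
```

A stricter scorer could also reject responses that embed “yes” inside a longer sentence in other languages; the normalization above is only one defensible choice.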

### Evaluation of Text to Speech
**Seed-TTS-Eval —** We evaluate on the Chinese subset of the Seed-TTS-Eval benchmark, following the official protocol; the table reports CER (character error rate) and SS (speaker similarity).<br>
**Conversational-style Mean Opinion Score (CMOS) —** We invited native speakers to take part in a blind test. Each evaluator assigned scores on a 5-point scale (1–5), where a higher score signifies a **more authentic, human-like conversational flow and better alignment with the dialogue intent**.

| Model | CMOS ↑ | CER (%) ↓ | SS ↑ |
|---|---|---|---|
| F5-TTS | 3.48 | 1.56 | 0.741 |
| CosyVoice 2 | 3.66 | 1.45 | 0.748 |
| CosyVoice 3-0.5B | 3.59 | 1.16 | **0.780** |
| Qwen2.5-Omni-7B | - | 1.70 | 0.752 |
| Qwen3-TTS-12Hz-0.6B-Base | 4.12 | **0.92** | 0.763 |
| FireRedTTS-2 | 3.68 | 1.14 | 0.736 |
| IndexTTS2 | *4.16* | *1.008* | *0.764* |
| Hello-Chat | **4.19** | 1.023 | 0.748 |
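The CMOS column is an average of per-utterance ratings on the 1–5 scale described above. A sketch of the aggregation only, with hypothetical rater scores (the actual rater count and protocol are the authors'):

```python
from statistics import mean

def cmos(ratings):
    """Mean opinion score: average of 1-5 ratings, rounded to two decimals."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    return round(mean(ratings), 2)

# Eight hypothetical rater scores for one system.
print(cmos([4, 5, 4, 4, 5, 4, 3, 5]))  # 4.25
```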

## 🎧 Demos

### Single-Sentence Demo (zero-shot)

#### Speaker 1
<audio controls src="assets/ref/female1.mp3"></audio>

##### “那肯定因为自个儿平时想吃点卤味儿。那肯定得得得来一点儿。” (“That's definitely because I usually crave some braised snacks, so I've just got, got, got to have a little.”)

<audio controls src="assets/synth/female1_sent1.mp3"></audio>

##### “过年应该应该跟家里人一起吃饭。” (“For the New Year you should, should have dinner with your family.”)

<audio controls src="assets/synth/female1_sent2.mp3"></audio>

##### “哎呀,不是了,现在法治社会哪有卖假货的,只是卖的价格贵。” (“Oh no, not anymore; in today's law-abiding society nobody sells fakes, it's just that prices are high.”)

<audio controls src="assets/synth/female1_sent3.mp3"></audio>

---

#### Speaker 2
<audio controls src="assets/ref/female2.mp3"></audio>

##### “但是这个时候上哪去找呢?找不到。” (“But where would you find one at a time like this? You can't.”)

<audio controls src="assets/synth/female2_sent4.mp3"></audio>

##### “这种做法我感觉不适合,不是他那个年龄段该做出来的事情。” (“I don't think that kind of behavior is appropriate; it's not something someone his age should do.”)

<audio controls src="assets/synth/female2_sent5.mp3"></audio>

##### “咱们得趁这个时机啊,看看还要剩多多久啊。” (“We have to seize this moment and see how much, much time is left.”)

<audio controls src="assets/synth/female2_sent6.mp3"></audio>

---

#### Speaker 3
<audio controls src="assets/ref/male1.mp3"></audio>

##### “我我不不怎么玩游戏,你你会玩游戏啊。” (“I, I don't, don't really play games. You, you play games?”)

<audio controls src="assets/synth/male1_sent7.mp3"></audio>

##### “对呀,就是不管你愿不愿意,时间都是一直往前推嘛。” (“Right, whether you like it or not, time just keeps moving forward.”)

<audio controls src="assets/synth/male1_sent8.mp3"></audio>

##### “挺好,我看着我看你做菜做饭蛮有生活的那是鸡蛋糕吗?” (“Nice. Watching you cook feels so full of life. Is that steamed egg cake?”)

<audio controls src="assets/synth/male1_sent9.mp3"></audio>

---

#### Speaker 4
<audio controls src="assets/ref/male2.mp3"></audio>

##### “我也有二十多岁的时候,那个时候什么都不想,嗯,等那一点点沉淀,年龄大一点了,然后就什么都在乎,什么都想。” (“I was in my twenties once too. Back then I didn't worry about anything; then, as things settled a little and I got older, I started caring about everything and thinking about everything.”)

<audio controls src="assets/synth/male2_sent10.mp3"></audio>

##### “我看我一会儿,我我煮个泡面得了。” (“I think in a bit I'll, I'll just cook some instant noodles.”)

<audio controls src="assets/synth/male2_sent11.mp3"></audio>

##### “他们说那个茶茶饼就是渣子压出来的,是吗?” (“They say that tea, tea cake is just pressed from leftover scraps, right?”)

<audio controls src="assets/synth/male2_sent12.mp3"></audio>

---

### Multi-Turn Conversation Demo (zero-shot)

#### Conversation #1
<audio controls src="assets/dialogues/demo_dialogue1.mp3"></audio>

---

#### Conversation #2
<audio controls src="assets/dialogues/demo_dialogue2.mp3"></audio>

---

#### Conversation #3
<audio controls src="assets/dialogues/demo_dialogue3.mp3"></audio>

## 📜 Citation

If you find our work useful in your research, please consider citing:

```bibtex
@article{hellogroup2026hellochat,
  title={Hello-Chat: Towards Realistic Social Audio Interactions},
  author={{Computational Intelligence Dept, HelloGroup Inc.}},
  year={2026}
}
```