---
license: apache-2.0
---


<h1 align="center">Hello-Chat</h1>
<h3 align="center">Towards Realistic Social Audio Interactions</h3>

<p align="center">
  <a href='https://arxiv.org/abs/2602.23387'><img src='https://img.shields.io/badge/arXiv-2602.23387-b31b1b.svg'></a> &nbsp;
  <a href="https://github.com/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>
  <a href="https://huggingface.co/hellogroup-opensource/Hello-Chat"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow" alt="Hugging Face"></a>
</p>

<p align="center">
  <img src="assets/img/model_architecture.png" width="100%" alt="Hello-Chat model architecture.">
</p>

## Hello-Chat

**Hello-Chat** is an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on selected audio understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, **Hello-Chat** enables realistic, context-aware spoken interaction between users and AI.

## 📊 Evaluation Results

In the tables below, **bold** marks the best result in each column and *italics* the second best; ↑ means higher is better and ↓ means lower is better.

### Evaluation of Audio to Text

#### Audio Understanding Evaluation
**ASR —** Automatic speech recognition performance is evaluated on a balanced subset of **AIShell**, **WeNet**, and **LibriSpeech**, with Chinese and English samples evenly represented.<br>
**NLP Question —** Spoken question answering is evaluated on questions sourced from **AlpacaEval**, **LLaMA Questions**, and **Web Questions**. The text questions are converted to speech with a high-quality TTS system, and model responses are scored by **GPT-5**.<br>
**Translation —** Speech-to-text translation across Chinese, English, Japanese, and Korean is evaluated on synthetic multilingual data generated by **Claude** and converted to speech via TTS. Outputs are scored by **GPT-5**.<br>
**MMAU —** Audio-based question answering is evaluated using a subset of the **MMAU-Mini** benchmark.

| Model | ASR ↓ | NLP Question ↑ | Translation ↑ | MMAU ↑ | 
|---|---|---|---|---|
| Gemini3-Preview | 4.06 | **8.85** | *8.87* | **0.75** |
| GPT-4o-Audio | 6.45 | 8.50 | 8.09 | 0.64 |
| Qwen3-Omni-32B | 3.51 | *8.66* | 8.07 | *0.74* |
| Step-Audio 2 Mini | **3.21** | 7.32 | 8.34 | 0.66 |
| MiDashengLM | 4.50 | 3.82 | 8.43 | 0.65 |
| Kimi-Audio | *3.36* | 7.41 | 8.26 | 0.59 |
| Qwen2.5-Omni-7B | 3.45 | 7.41 | 5.93 | 0.66 |
| **Hello-Chat** | 3.48 | 7.68 | **8.93** | 0.69 |
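
The NLP Question and Translation setups described above share one shape: text is synthesized to speech, the model answers from audio, and a judge scores the answer. A minimal sketch of that loop follows. The three callables (`synthesize`, `respond`, `judge`) are hypothetical placeholders rather than part of any released Hello-Chat API, and averaging the judge scores is an assumption about how the per-sample scores are aggregated.

```python
from statistics import mean
from typing import Callable, Sequence


def evaluate_spoken_qa(
    questions: Sequence[str],
    synthesize: Callable[[str], bytes],   # hypothetical TTS: text -> audio bytes
    respond: Callable[[bytes], str],      # hypothetical model under test: audio -> text
    judge: Callable[[str, str], float],   # hypothetical LLM judge: (question, answer) -> score
) -> float:
    """Return the mean judge score over all questions (assumed aggregation)."""
    if not questions:
        raise ValueError("questions must not be empty")
    scores = []
    for question in questions:
        audio = synthesize(question)      # the model only ever sees the spoken form
        answer = respond(audio)
        scores.append(judge(question, answer))
    return mean(scores)
```

Passing the components in as callables keeps the loop independent of any particular TTS engine, model, or judge, so the same function can be exercised with stubs before real systems are plugged in.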

#### Performance of Paralinguistic Understanding
**SER (speech emotion recognition) —** Evaluated on randomly sampled subsets of the **EmoBox** dataset, covering both Chinese and English speech.<br>
**AED (audio event detection) —** Evaluated on samples drawn from **AudioSet** and **CochlScene**.

| Model | SER ↑ | AED ↑ |
|---|---|---|
| Gemini3-Preview | 0.791 | **0.861** |
| GPT-4o-Audio | 0.586 | 0.489 |
| Qwen3-Omni-32B | **0.856** | 0.644 |
| Step-Audio 2 Mini | 0.680 | 0.533 |
| MiDashengLM | 0.561 | 0.441 |
| Kimi-Audio | 0.625 | 0.392 |
| Qwen2.5-Omni-7B | 0.607 | 0.584 |
| Hello-Chat | *0.824* | *0.797* |

#### Instruction Following
**Only Yes —** To evaluate robustness in instruction following, we construct a stress test using randomly sampled audio inputs from the above benchmarks. All inputs are paired with a fixed prompt: “no matter the message in the audio, simply answer ‘yes’!”

| Model | Only-Yes Accuracy (%) ↑ |
|---|---|
| Gemini3-Preview | 88 |
| GPT-4o-Audio | 23 |
| Qwen3-Omni-32B | **100** |
| Step-Audio 2 Mini | 87 |
| MiDashengLM | 0 |
| Kimi-Audio | 22 |
| Qwen2.5-Omni-7B | *96* |
| Hello-Chat | **100** |
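
As an illustration of how an Only-Yes accuracy could be scored, the sketch below counts a response as correct only if it reduces to the single word "yes". The normalization rule (ignoring case, surrounding whitespace, and surrounding punctuation or quotes) is an assumption; the exact matching criterion behind the table above is not specified here.

```python
import string

# Characters stripped from both ends of a response before comparison (assumed rule).
_TRIM = string.punctuation + "‘’“”！。 "


def is_only_yes(response: str) -> bool:
    """True if the response is just 'yes' once case and surrounding punctuation are ignored."""
    cleaned = response.strip().lower().strip(_TRIM)
    return cleaned == "yes"


def only_yes_accuracy(responses: list[str]) -> float:
    """Percentage of responses that consist solely of 'yes'."""
    if not responses:
        raise ValueError("responses must not be empty")
    hits = sum(is_only_yes(r) for r in responses)
    return 100.0 * hits / len(responses)


if __name__ == "__main__":
    sample = ["Yes!", "yes", "The audio says hello.", "‘Yes’"]
    print(only_yes_accuracy(sample))  # 75.0
```

A strict exact-match rule like this penalizes responses that answer "yes" and then keep talking, which is the failure the stress test is designed to expose.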

### Evaluation of Text to Speech
**Seed-TTS-Eval —** We evaluate on the Chinese subset of the Seed-TTS-Eval benchmark, following the official protocol, and report character error rate (CER) and speaker similarity (SS).<br>
**Conversational-style Mean Opinion Score (CMOS) —** We invited native speakers to participate in a blind test. Each evaluator assigned scores on a 5-point scale (1–5), where a higher score signifies a **more authentic, human-like conversational flow and better alignment with the dialogue intent**.

| Model | CMOS ↑ | CER (%) ↓ | SS ↑ |
|---|---|---|---|
| F5-TTS | 3.48 | 1.56 | 0.741 |
| CosyVoice 2 | 3.66 | 1.45 | 0.748 |
| CosyVoice 3-0.5B | 3.59 | 1.16 | **0.780** |
| Qwen2.5-Omni-7B | - | 1.70 | 0.752 |
| Qwen3-TTS-12Hz-0.6B-Base | 4.12 | **0.92** | 0.763 |
| FireRedTTS-2 | 3.68 | 1.14 | 0.736 |
| IndexTTS2 | *4.16* | *1.008* | *0.764* |
| Hello-Chat | **4.19** | 1.023 | 0.748 |
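
For readers unfamiliar with the CER column, the sketch below shows a generic character error rate computed from Levenshtein edit distance. It assumes transcripts of the synthesized audio are already available from an ASR system and skips text normalization, so it illustrates the metric rather than reproducing the official Seed-TTS-Eval scripts.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER in percent: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return 100.0 * total_edits / total_chars
```

Summing edits and characters over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the result.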

## 🎧 Demos

### Single Sentence Demo (zero-shot)


#### Speaker1
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/female1.mp3"></audio>

**generated:**
##### “那肯定因为自个儿平时想吃点卤味儿。那肯定得得得来一点儿。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent1.mp3"></audio>


##### “过年应该应该跟家里人一起吃饭。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent2.mp3"></audio>


##### “哎呀，不是了，现在法治社会哪有卖假货的，只是卖的价格贵。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female1_sent3.mp3"></audio>

---

#### Speaker2
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/female2.mp3"></audio>

**generated:**
##### “但是这个时候上哪去找呢？找不到。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent4.mp3"></audio>


##### “这种做法我感觉不适合，不是他那个年龄段该做出来的事情。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent5.mp3"></audio>


##### “咱们得趁这个时机啊，看看还要剩多多久啊。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/female2_sent6.mp3"></audio>

---

#### Speaker3
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/male1.mp3"></audio>

**generated:**
##### “我我不不怎么玩游戏，你你会玩游戏啊。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent7.mp3"></audio>


##### “对呀，就是不管你愿不愿意，时间都是一直往前推嘛。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent8.mp3"></audio>


##### “挺好，我看着我看你做菜做饭蛮有生活的那是鸡蛋糕吗？”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male1_sent9.mp3"></audio>

---

#### Speaker4
**reference:**
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/ref/male2.mp3"></audio>

**generated:**
##### “我也有二十多岁的时候，那个时候什么都不想，嗯，等那一点点沉淀，年龄大一点了，然后就什么都在乎，什么都想。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent10.mp3"></audio>


##### “我看我一会儿，我我煮个泡面得了。”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent11.mp3"></audio>


##### “他们说那个茶茶饼就是渣子压出来的，是吗？”

<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/synth/male2_sent12.mp3"></audio>

---

### Multi-Turn Conversation Demo (zero-shot)

#### Conversation #1
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue1.mp3"></audio>

---

#### Conversation #2
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue2.mp3"></audio>

---

#### Conversation #3
<audio controls src="https://huggingface.co/hellogroup-opensource/Hello-Chat/resolve/main/assets/dialogues/demo_dialogue3.mp3"></audio>


## 📜 Citation

If you find our work useful in your research, please consider citing:

```bibtex
@article{hellogroup2026hellochat,
  title={Hello-Chat: Towards Realistic Social Audio Interactions},
  author={{Computational Intelligence Dept, HelloGroup Inc.}},
  journal={arXiv preprint arXiv:2602.23387},
  year={2026}
}
```