	---
	license: cc-by-4.0
	language:
	- hi
	tags:
	- moshi
	- speech-to-speech
	- hindi
	- conversational-ai
	- audio
	- full-duplex
	- duplex-dialogue
	- indian-languages
	base_model: kyutai/moshiko-pytorch-bf16
	pipeline_tag: audio-to-audio
	---

	# Human-1: A Full-Duplex Conversational Model for Hindi
	🎙️ [Try the live demo →](https://ai.joshtalks.com/research/human-1) \| 📄 [Paper →](https://arxiv.org/pdf/2604.23295v1)

	Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It supports real-time Hindi conversation with interruptions, overlaps, backchannels, and natural turn-taking, and was trained on 26,000 hours of real, spontaneous Hindi conversations from 14,695 speakers.

	<p align="center">
	<img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
	</p>

	## Model Details

	| Attribute | Value |
	|---|---|
	| Developed by | Bhaskar Singh, Shobhit Banga, Pranav Sharma ([JoshTalks](https://joshtalks.com)) |
	| Base model | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) |
	| Language | Hindi (hi) |
	| Model type | Full-duplex speech-to-speech dialogue |
	| Format | SafeTensors (fp32) |
	| Tokenizer | Custom Hindi SentencePiece (32,000 vocabulary) |
	| Audio codec | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
	| License | CC-BY-4.0 |
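
The codec figures in the table above can be cross-checked with a little arithmetic. The sketch below assumes Mimi's configuration as reported in the Moshi paper (8 codebooks of 2,048 entries each); those two values are not stated in this README.

```python
import math

FRAME_RATE_HZ = 12.5   # from the table above
NUM_CODEBOOKS = 8      # assumption: value reported in the Moshi paper
CODEBOOK_SIZE = 2048   # assumption: value reported in the Moshi paper

# Each code indexes one of CODEBOOK_SIZE entries, so it costs log2(size) bits.
bits_per_code = math.log2(CODEBOOK_SIZE)
bits_per_frame = NUM_CODEBOOKS * bits_per_code
bitrate_kbps = bits_per_frame * FRAME_RATE_HZ / 1000

print(f"{bits_per_frame:.0f} bits per frame, {bitrate_kbps:.1f} kbps")
```

Under these assumptions the result matches the 1.1 kbps listed for the codec.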

	## What was changed from base Moshi

	The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:

	- `text_emb` — text token embedding in the Temporal Transformer
	- `depformer.emb.0` — text token embedding in the Depth Transformer
	- `text_linear` — text output projection layer
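
As a rough illustration of the reinitialisation step, the sketch below swaps out only the three vocabulary-dependent groups in a toy state dict and leaves everything else untouched. It is not the training code: the `.weight` suffixes, the toy shapes, the Gaussian initialisation with `std=0.02`, and the helper names are all assumptions, and real checkpoints hold tensors rather than nested lists.

```python
import random

# Parameter groups named in the list above.
VOCAB_DEPENDENT_PREFIXES = ("text_emb", "depformer.emb.0", "text_linear")


def is_vocab_dependent(name):
    """True if `name` belongs to one of the three reinitialised groups."""
    return any(name == p or name.startswith(p + ".") for p in VOCAB_DEPENDENT_PREFIXES)


def reinit_for_new_vocab(state, vocab_size, std=0.02, seed=0):
    """Return a new state dict with fresh vocabulary-dependent matrices.

    `state` maps parameter names to 2-D lists whose rows are indexed by
    vocabulary entries (a simplification of the real layer shapes).
    """
    rng = random.Random(seed)
    new_state = {}
    for name, matrix in state.items():
        if is_vocab_dependent(name):
            dim = len(matrix[0])
            new_state[name] = [
                [rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)
            ]
        else:
            new_state[name] = matrix  # pre-trained values are kept as-is
    return new_state


# Toy checkpoint: old vocabulary of 10 entries, hidden size 4.
toy_state = {
    "text_emb.weight": [[0.0] * 4 for _ in range(10)],
    "depformer.emb.0.weight": [[0.0] * 4 for _ in range(10)],
    "text_linear.weight": [[0.0] * 4 for _ in range(10)],
    "transformer.layers.0.linear.weight": [[1.0] * 4 for _ in range(4)],  # hypothetical name
}
adapted = reinit_for_new_vocab(toy_state, vocab_size=16)
print({name: len(rows) for name, rows in adapted.items()})
```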

	All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).

	For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).

	## Training

	### Data

	The model was trained on a purpose-built corpus of 26,000 hours of real Hindi spontaneous conversations — to our knowledge, the largest conversational speech corpus for any Indian language.

	| Characteristic | Value |
	|---|---|
	| Total duration | 26,000 hours |
	| Unique speakers | 14,695 |
	| Recording type | Spontaneous, unscripted conversations |
	| Channels | Stereo (separate per speaker) |
	| Quality control | Trained annotators + manual checks |

	The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions — without requiring artificial speaker diarisation.

	### Two-stage training recipe

	Stage 1: Pre-training on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
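
The Stage 1 figures can be related to each other with simple arithmetic. The sketch below uses only the rounded numbers quoted above, so the derived values are approximate; the variable names are illustrative.

```python
CORPUS_HOURS = 26_000     # from the text
HOURS_PER_UPDATE = 2.9    # from the text (approximate)
BATCH_SIZE = 64           # from the text
WALL_CLOCK_HOURS = 13     # from the text (approximate)
REPORTED_STEPS = 10_000   # from the text (approximate)

# Average audio length of one sequence in a batch.
seconds_per_sample = HOURS_PER_UPDATE * 3600 / BATCH_SIZE

# Updates needed to see the whole corpus once.
steps_per_epoch = CORPUS_HOURS / HOURS_PER_UPDATE

# Average wall-clock time per optimizer step.
seconds_per_step = WALL_CLOCK_HOURS * 3600 / REPORTED_STEPS

print(f"{seconds_per_sample:.0f} s of audio per sample")
print(f"{steps_per_epoch:.0f} steps per epoch")
print(f"{seconds_per_step:.2f} s per step")
```

The computed steps per epoch (roughly 9,000) is in the same range as the reported ~10,000, which is what one would expect from rounded inputs.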

	Stage 2 — Fine-tuning on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
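
One way to express split learning rates is with per-group optimizer settings. The sketch below only shows the routing logic. It assumes that Depth Transformer parameters share a `depformer` name prefix and that everything else belongs to the Temporal Transformer; `build_param_groups` and the toy parameter names are hypothetical.

```python
TEMPORAL_LR = 2e-6  # from the text
DEPTH_LR = 4e-6     # from the text


def build_param_groups(named_params, depth_prefix="depformer"):
    """Split (name, param) pairs into two groups with different learning rates.

    Assumption: names starting with `depth_prefix` are Depth Transformer
    parameters; all remaining parameters get the Temporal Transformer rate.
    """
    temporal, depth = [], []
    for name, param in named_params:
        (depth if name.startswith(depth_prefix) else temporal).append(param)
    return [
        {"params": temporal, "lr": TEMPORAL_LR},
        {"params": depth, "lr": DEPTH_LR},
    ]


# Strings stand in for real tensors here.
toy_params = [
    ("transformer.layers.0.w", "p0"),
    ("depformer.layers.0.w", "p1"),
    ("text_emb.weight", "p2"),
]
groups = build_param_groups(toy_params)
print(groups)
```

With a real PyTorch model, `named_params` would come from `model.named_parameters()`, and the returned list has the per-group dictionary shape that PyTorch optimizers accept.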

	### Training infrastructure

	8× NVIDIA H100 80GB GPUs with bf16 mixed precision.

	## Evaluation

	### Perplexity

	Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.

	| System (sampling temperature τ) | PPL ↓ |
	|---|---|
	| Ground-truth | 237.1 |
	| Human-1 (τ=0.8) | 356.9 |
	| Human-1 (τ=0.9) | 467.1 |
	| Human-1 (τ=1.0) | 640.6 |
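
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities have already been obtained from the scoring language model; how the numbers in the table were aggregated (per utterance or over the whole corpus) is not stated here, and the example values are made up.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over all tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for a five-token transcript.
example_logprobs = [-5.2, -6.1, -4.8, -7.0, -5.9]
print(round(perplexity(example_logprobs), 1))
```

Lower values mean the scoring model finds the transcribed speech more predictable, which is why the table marks PPL with a down arrow.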

	### Human Evaluation

	130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.

	Perceptual quality:

	| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
	|---|---|---|---|---|---|
	| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
	| Clarity | 4.05 | 3.04 | n/a | n/a | n/a |

	Generated speech comes close to human speech on naturalness (4.10 vs. 4.55), and most pairwise naturalness comparisons end in a tie (66.9%). Clarity lags further behind (3.04 vs. 4.05).
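
The README does not say how the preference columns were derived. One plausible reading, shown below purely as an assumption, is that each task's paired Likert ratings are compared directly: a higher human rating counts as "human preferred", a higher model rating as "model preferred", and equal ratings as a tie. The function name and the sample ratings are made up.

```python
from collections import Counter


def preference_rates(pairs):
    """Percentages of human-preferred, model-preferred, and tied tasks.

    `pairs` is an iterable of (human_score, model_score) Likert ratings.
    """
    counts = Counter()
    for human, model in pairs:
        if human > model:
            counts["human_preferred"] += 1
        elif model > human:
            counts["model_preferred"] += 1
        else:
            counts["tie"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rating pairs given")
    keys = ("human_preferred", "model_preferred", "tie")
    return {key: 100.0 * counts[key] / total for key in keys}


# Five made-up rating tasks.
ratings = [(5, 5), (5, 4), (4, 4), (4, 5), (5, 5)]
rates = preference_rates(ratings)
print(rates)
```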

	Conversational rubric evaluation:

	Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.

	| Rubric | Pass Rate |
	|---|---|
	| Human-like interaction | ≈85% |
	| Appropriateness (response follows prompt) | ≈53% |
	| Completion (response forms a complete reply) | ≈42% |

	While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.

	### Turn-Taking Analysis

	Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.

	| Model | τ | IPU/min | Pause | Gap | Overlap |
	|---|---|---|---|---|---|
	| Ground-truth | n/a | 35.30 | 10.49 | 8.51 | 3.03 |
	| Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
	| Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
	| Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
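
"Closest to ground-truth" can be made concrete in several ways. The sketch below uses one simple choice, the sum of relative absolute deviations over the four columns, which is an assumption and not necessarily the criterion used by the authors. The values are copied from the table above.

```python
GROUND_TRUTH = {"ipu_per_min": 35.30, "pause": 10.49, "gap": 8.51, "overlap": 3.03}

HUMAN_1 = {
    0.8: {"ipu_per_min": 23.12, "pause": 9.16, "gap": 6.77, "overlap": 1.67},
    0.9: {"ipu_per_min": 29.14, "pause": 9.24, "gap": 8.54, "overlap": 4.30},
    1.0: {"ipu_per_min": 38.90, "pause": 11.67, "gap": 8.10, "overlap": 9.68},
}


def relative_deviation(stats, reference):
    """Sum of |value - reference| / reference over all metrics."""
    return sum(abs(stats[key] - reference[key]) / reference[key] for key in reference)


deviations = {tau: relative_deviation(stats, GROUND_TRUTH) for tau, stats in HUMAN_1.items()}
best_tau = min(deviations, key=deviations.get)

for tau, value in sorted(deviations.items()):
    print(f"tau={tau}: {value:.3f}")
print("closest:", best_tau)
```

Under this measure τ=0.9 has the smallest deviation, mainly because τ=1.0 overshoots heavily on overlap, even though τ=1.0 is slightly closer on IPU/min and pause taken individually.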

	## Conversation Style

	Human-1 is trained on topic-driven conversations - real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.

	After an initial introduction, the model will typically propose a topic and steer the conversation toward it, preferring structured discussion over open-ended chitchat. Users can also introduce their own topic - the model will pick it up and engage in a focused discussion around it. This is an intentional design choice - the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.

	This makes the model particularly well-suited for domain-specific conversational applications. Our key finding is that the model's ability to stay on-topic emerges naturally from the structure of the training data alone - without any explicit prompting, reward shaping, or guardrails. This suggests that with sufficient hours of domain-specific conversational data, this approach can produce models that learn the conversational norms of virtually any domain - customer support, healthcare consultations, language tutoring, sales, therapy, and more - opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.

	## Files

	```
	├── model.safetensors # Human-1 LM weights
	├── tokenizer-e351c8d8-checkpoint125.safetensors # Mimi audio codec (frozen, from Moshi)
	├── tokenizer_hindi.model # Hindi SentencePiece tokenizer
	├── tokenizer_hindi.vocab # Vocabulary reference
	├── hindi_moshi_architecture.svg # Architecture diagram
	└── README.md
	```

	## Quick Start

	### 1. Install uv

	```bash
	curl -LsSf https://astral.sh/uv/install.sh | sh
	source $HOME/.local/bin/env
	```

	### 2. Create project and install dependencies

	```bash
	uv init human-1 && cd human-1
	uv python install 3.12
	uv python pin 3.12
	uv add moshi huggingface_hub
	```

	### 3. Download the model

	```bash
	uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights
	```
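
Before starting the server, it can help to confirm that the download produced the files the server command expects. The optional sketch below checks for the three files listed in the Files section; the helper name `missing_files` is illustrative, and `./weights` matches the `--local-dir` used above.

```python
from pathlib import Path

# File names taken from the "Files" section of this README.
REQUIRED_FILES = (
    "model.safetensors",
    "tokenizer-e351c8d8-checkpoint125.safetensors",
    "tokenizer_hindi.model",
)


def missing_files(weights_dir):
    """Return the required files that are not present in `weights_dir`."""
    root = Path(weights_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./weights")
    print("all files present" if not missing else f"missing: {missing}")
```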

	### 4. Run the server

	```bash
	uv run -m moshi.server \
	--moshi-weight ./weights/model.safetensors \
	--mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
	--tokenizer ./weights/tokenizer_hindi.model
	```

	## Intended Use

	The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.

	## Limitations

	- Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
	- Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
	- Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
	- Not intended for impersonation or any malicious use.
	- This model is for research purposes. We do not recommend it for providing advice or performing any professional duty.
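
The PAD ratio mentioned in the list above is the share of frames in the time-aligned text stream that carry a padding token instead of a word piece. The sketch below is a simplified illustration: it treats a single padding id, and both the id value and the token stream are made up.

```python
def pad_ratio(text_tokens, pad_id):
    """Fraction of frames in the text stream that hold the padding token."""
    if not text_tokens:
        raise ValueError("empty token stream")
    return sum(1 for token in text_tokens if token == pad_id) / len(text_tokens)


PAD = 3  # hypothetical padding id, for illustration only

# One entry per codec frame; at 12.5 Hz each frame covers 80 ms of audio.
stream = [PAD, 101, PAD, PAD, 57, PAD, PAD, 940]
ratio = pad_ratio(stream, PAD)
print(f"{ratio:.1%} of frames are padding")
```

A higher ratio means fewer text tokens are spread over the same amount of audio.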

	## Citation

	```bibtex
	@article{singh2026human1,
	title = {Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
	author = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
	year = {2026},
	journal = {arXiv preprint arXiv:2604.23295},
	institution = {JoshTalks}
	}
	```

	## Acknowledgments

	Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus.