How to use JoshTalksAI/Human-1 with Moshi:
```bash
# pip install moshi

# Run the interactive web server
python -m moshi.server --hf-repo "JoshTalksAI/Human-1"
# Then open https://localhost:8998 in your browser
```

```python
# pip install moshi
import torch
from moshi.models import loaders

# Load checkpoint info from HuggingFace
checkpoint = loaders.CheckpointInfo.from_hf_repo("JoshTalksAI/Human-1")

# Load the Mimi audio codec
mimi = checkpoint.get_mimi(device="cuda")
mimi.set_num_codebooks(8)

# Encode audio (24kHz, mono)
wav = torch.randn(1, 1, 24000 * 10)  # [batch, channels, samples]
with torch.no_grad():
    codes = mimi.encode(wav.cuda())
    decoded = mimi.decode(codes)
```
| license: cc-by-4.0 | |
| language: | |
| - hi | |
| tags: | |
| - moshi | |
| - speech-to-speech | |
| - hindi | |
| - conversational-ai | |
| - audio | |
| - full-duplex | |
| - duplex-dialogue | |
| - indian-languages | |
| base_model: kyutai/moshiko-pytorch-bf16 | |
| pipeline_tag: audio-to-audio | |
| # Human-1: A Full-Duplex Conversational Model for Hindi | |
| **[Try the live demo →](https://ai.joshtalks.com/research/human-1)** | **[Paper →](https://arxiv.org/pdf/2604.23295v1)** | |
| Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking, and was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers. | |
| <p align="center"> | |
| <img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/> | |
| </p> | |
| ## Model Details | |
| | | | | |
| |---|---| | |
| | **Developed by** | Bhaskar Singh, Shobhit Banga, Pranav Sharma ([JoshTalks](https://joshtalks.com)) | |
| | **Base model** | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) | | |
| | **Language** | Hindi (hi) | | |
| | **Model type** | Full-duplex speech-to-speech dialogue | | |
| | **Format** | SafeTensors (fp32) | | |
| | **Tokenizer** | Custom Hindi SentencePiece (32,000 vocabulary) | | |
| | **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) | | |
| | **License** | CC-BY-4.0 | | |
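The codec figures in the table can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the model code. It assumes 8 codebooks (the value passed to `set_num_codebooks` in the usage snippet) and 2,048 entries per codebook, which is not stated in this card and is taken here as an assumption about Mimi.

```python
import math

FRAME_RATE_HZ = 12.5   # from the Model Details table
NUM_CODEBOOKS = 8      # assumption: the value used with set_num_codebooks
CODEBOOK_SIZE = 2048   # assumption: not stated in this card

# Each codebook index needs log2(size) bits; every frame emits one index per codebook.
bits_per_code = math.log2(CODEBOOK_SIZE)
bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code

print(f"{bitrate_bps / 1000:.1f} kbps")
```

Under these assumptions the result matches the 1.1 kbps listed for the codec.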
| ## What was changed from base Moshi | |
| The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups: | |
| - `text_emb`: text token embedding in the Temporal Transformer | |
| - `depformer.emb.0`: text token embedding in the Depth Transformer | |
| - `text_linear`: text output projection layer | |
| All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55). | |
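The split between reinitialised and retained parameters described above can be sketched as a simple name-based partition. This is an illustrative sketch, not the authors' training code: the three prefixes come from the list above, while the `.weight` suffixes and the other parameter names in `example_names` are hypothetical.

```python
# Prefixes of the vocabulary-dependent parameter groups named in this card.
REINIT_PREFIXES = ("text_emb", "depformer.emb.0", "text_linear")


def needs_reinit(param_name: str) -> bool:
    """True if the parameter belongs to one of the vocabulary-dependent groups."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in REINIT_PREFIXES
    )


def partition(param_names):
    """Split parameter names into (reinitialise, keep pre-trained)."""
    reinit = [name for name in param_names if needs_reinit(name)]
    keep = [name for name in param_names if not needs_reinit(name)]
    return reinit, keep


# Hypothetical parameter names, for illustration only.
example_names = [
    "text_emb.weight",
    "depformer.emb.0.weight",
    "depformer.emb.1.weight",
    "text_linear.weight",
    "transformer.layers.0.self_attn.in_proj_weight",
]

reinit, keep = partition(example_names)
print("reinitialise:", reinit)
print("keep:", keep)
```

Matching on `prefix + "."` rather than a bare `startswith` avoids accidentally catching a sibling such as `depformer.emb.01`.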
| For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037). | |
| ## Training | |
| ### Data | |
| The model was trained on a purpose-built corpus of **26,000 hours** of real Hindi spontaneous conversations. To our knowledge, this is the largest conversational speech corpus for any Indian language. | |
| | Characteristic | Value | | |
| |---|---| | |
| | Total duration | 26,000 hours | | |
| | Unique speakers | 14,695 | | |
| | Recording type | Spontaneous, unscripted conversations | | |
| | Channels | Stereo (separate per speaker) | | |
| | Quality control | Trained annotators + manual checks | | |
| The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions, without requiring artificial speaker diarisation. | |
| ### Two-stage training recipe | |
| **Stage 1: Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (\~2.9 hours of audio per update). Trained for 1 epoch (\~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs. | |
| **Stage 2: Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370). | |
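To make the Stage 1 batch figures concrete, the sketch below derives the implied audio length per training sample from the effective batch size and the hours of audio per update. It assumes all samples in a batch have equal length, which the card does not state, and uses the 12.5 Hz Mimi frame rate from the Model Details table.

```python
EFFECTIVE_BATCH = 64        # samples per optimizer update (Stage 1)
HOURS_PER_UPDATE = 2.9      # approximate audio per update (Stage 1)
MIMI_FRAME_RATE_HZ = 12.5   # codec frame rate from the Model Details table

# Assumption: every sample in the batch has the same duration.
seconds_per_sample = HOURS_PER_UPDATE * 3600 / EFFECTIVE_BATCH
frames_per_sample = seconds_per_sample * MIMI_FRAME_RATE_HZ

print(f"~{seconds_per_sample:.0f} s of audio per sample")
print(f"~{frames_per_sample:.0f} codec frames per sample")
```

Under that assumption each sample covers a little under three minutes of conversation.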
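The split learning rates in Stage 2 could be expressed as per-group optimizer settings. The sketch below only builds the group descriptions from parameter names, in the list-of-dicts shape that optimizers such as `torch.optim.AdamW` accept for per-parameter options. It is not the authors' code: the `depformer` prefix rule and the example parameter names are assumptions, and the card does not say which group shared layers fall into.

```python
TEMPORAL_LR = 2e-6  # Temporal Transformer learning rate (Stage 2)
DEPTH_LR = 4e-6     # Depth Transformer learning rate (Stage 2)


def lr_for(param_name: str, depth_prefix: str = "depformer") -> float:
    """Assumption: Depth Transformer parameters share the given name prefix."""
    return DEPTH_LR if param_name.startswith(depth_prefix) else TEMPORAL_LR


def build_groups(param_names):
    """Group parameter names by learning rate, one dict per group."""
    grouped = {TEMPORAL_LR: [], DEPTH_LR: []}
    for name in param_names:
        grouped[lr_for(name)].append(name)
    return [{"lr": lr, "params": names} for lr, names in grouped.items()]


# Hypothetical parameter names, for illustration only.
groups = build_groups([
    "transformer.layers.0.self_attn.in_proj_weight",
    "depformer.layers.0.self_attn.in_proj_weight",
    "text_emb.weight",
])
for group in groups:
    print(group["lr"], group["params"])
```

In a real training script the `params` entries would hold the tensors themselves rather than their names.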
| ### Training infrastructure | |
| 8× NVIDIA H100 80GB GPUs with bf16 mixed precision. | |
| ## Evaluation | |
| ### Perplexity | |
| Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech. | |
| | System | PPL ↓ | |
| |---|---| | |
| | Ground-truth | 237.1 | | |
| | Human-1 (τ=0.8) | 356.9 | |
| | Human-1 (τ=0.9) | 467.1 | |
| | Human-1 (τ=1.0) | 640.6 | |
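For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that formula on toy values. In the setup described above the per-token log-probabilities would come from the scoring language model applied to the transcriptions; here they are made-up inputs, and the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy input: ten tokens, each assigned probability 1/4 by the scoring model.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))
```

A model that spreads probability evenly over four choices at every step has a perplexity of about 4, which is why lower values indicate text the scoring model finds more predictable.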
| ### Human Evaluation | |
| 130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity. | |
| **Perceptual quality:** | |
| | Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie | | |
| |---|---|---|---|---|---| | |
| | Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% | | |
| | Clarity | 4.05 | 3.04 | - | - | - | |
| Generated speech comes close to human speech on naturalness (4.10 vs. 4.55), and most pairwise comparisons result in ties. Clarity trails further behind (3.04 vs. 4.05). | |
| **Conversational rubric evaluation:** | |
| Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech. | |
| | Rubric | Pass Rate | | |
| |---|---| | |
| | Human-like interaction | ≈85% | |
| | Appropriateness (response follows prompt) | ≈53% | |
| | Completion (response forms a complete reply) | ≈42% | |
| While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge. | |
| ### Turn-Taking Analysis | |
| Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth. | |
| | Model | τ | IPU/min | Pause | Gap | Overlap | |
| |---|---|---|---|---|---| |
| | Ground-truth | - | 35.30 | 10.49 | 8.51 | 3.03 | |
| | Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 | | |
| | Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 | | |
| | Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 | | |
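One way to check which temperature sits closest to ground-truth is to sum the absolute differences across the four statistics in the table. The sketch below does that with the table values. The unweighted sum is an assumption chosen for illustration and is not necessarily the criterion the authors used.

```python
GROUND_TRUTH = {"ipu_per_min": 35.30, "pause": 10.49, "gap": 8.51, "overlap": 3.03}

# Rows of the table above, keyed by sampling temperature.
HUMAN_1 = {
    0.8: {"ipu_per_min": 23.12, "pause": 9.16, "gap": 6.77, "overlap": 1.67},
    0.9: {"ipu_per_min": 29.14, "pause": 9.24, "gap": 8.54, "overlap": 4.30},
    1.0: {"ipu_per_min": 38.90, "pause": 11.67, "gap": 8.10, "overlap": 9.68},
}


def l1_distance(stats, reference):
    """Sum of absolute differences over all statistics (unweighted)."""
    return sum(abs(stats[key] - reference[key]) for key in reference)


distances = {tau: l1_distance(stats, GROUND_TRUTH) for tau, stats in HUMAN_1.items()}
closest = min(distances, key=distances.get)

for tau, dist in sorted(distances.items()):
    print(f"tau={tau}: distance {dist:.2f}")
print("closest:", closest)
```

Under this measure τ=0.9 has the smallest distance, mainly because its gap and IPU rate are near the reference while τ=1.0 overshoots heavily on overlap.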
| ## Conversation Style | |
| Human-1 is trained on **topic-driven conversations** - real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking. | |
| After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic** - the model will pick it up and engage in a focused discussion around it. This is an intentional design choice - the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics. | |
| This makes the model particularly well-suited for **domain-specific conversational applications**. Our key finding is that the model's ability to stay on-topic emerges naturally from the structure of the training data alone - without any explicit prompting, reward shaping, or guardrails. This suggests that with sufficient hours of domain-specific conversational data, this approach can produce models that learn the conversational norms of virtually any domain - customer support, healthcare consultations, language tutoring, sales, therapy, and more - opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work. | |
| ## Files | |
| ``` | |
| ├── model.safetensors # Human-1 LM weights | |
| ├── tokenizer-e351c8d8-checkpoint125.safetensors # Mimi audio codec (frozen, from Moshi) | |
| ├── tokenizer_hindi.model # Hindi SentencePiece tokenizer | |
| ├── tokenizer_hindi.vocab # Vocabulary reference | |
| ├── hindi_moshi_architecture.svg # Architecture diagram | |
| └── README.md | |
| ``` | |
| ## Quick Start | |
| ### 1. Install uv | |
| ```bash | |
| curl -LsSf https://astral.sh/uv/install.sh | sh | |
| source $HOME/.local/bin/env | |
| ``` | |
| ### 2. Create project and install dependencies | |
| ```bash | |
| uv init human-1 && cd human-1 | |
| uv python install 3.12 | |
| uv python pin 3.12 | |
| uv add moshi huggingface_hub | |
| ``` | |
| ### 3. Download the model | |
| ```bash | |
| uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights | |
| ``` | |
| ### 4. Run the server | |
| ```bash | |
| uv run -m moshi.server \ | |
| --moshi-weight ./weights/model.safetensors \ | |
| --mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \ | |
| --tokenizer ./weights/tokenizer_hindi.model | |
| ``` | |
| ## Intended Use | |
| The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations. | |
| ## Limitations | |
| - Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed. | |
| - Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate. | |
| - Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token. | |
| - Not intended for impersonation or any malicious use. | |
| - This model is for research purposes. We do not recommend it for providing advice or performing any professional duty. | |
| ## Citation | |
| ```bibtex | |
| @article{singh2026human1, | |
| title = {Human-1 by Josh Talks : A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations}, | |
| author = {Bhaskar Singh and Shobhit Banga and Pranav Sharma}, | |
| year = {2026}, | |
| institution = {JoshTalks} | |
| } | |
| ``` | |
| ## Acknowledgments | |
| Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus. |