HEEP Indic

High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR

HEEP Indic is a state-of-the-art automatic speech recognition model that demonstrates how strategic entropy-based data curation outperforms brute-force data scaling. With an average word error rate (WER) of 11.9% on Hindi benchmarks — outperforming Google STT, Azure STT, Nvidia Conformer, and IndicWhisper — it challenges the "more data is better" paradigm by training on carefully selected high-information samples.

Model Overview

HEEP Indic supports transcription across 55 Indic languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for verbatim transcription, capturing spoken content word for word.

Core Insight: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

Mathematical Foundation

Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)

Where:

H_acoustic(x): Spectral/MFCC entropy measuring acoustic diversity
H_phonetic(x): Phoneme distribution entropy capturing phonetic complexity
H_linguistic(x): Vocabulary and syntax entropy measuring linguistic richness
H_contextual(x): Domain and discourse entropy
MI(x, D): Mutual information contribution relative to dataset
α₁...α₄, β: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
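
As a rough illustration of Equation 1, the sketch below combines five precomputed components into a single score. It assumes the default weights map to α₁...α₄ and β in the order listed, and that each entropy term has already been normalized to a comparable range; the class and function names are hypothetical and not part of the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreWeights:
    """Default weights from the model card, assumed to be listed in order."""
    acoustic: float = 0.25     # α₁
    phonetic: float = 0.20     # α₂
    linguistic: float = 0.25   # α₃
    contextual: float = 0.15   # α₄
    mutual_info: float = 0.15  # β


def sample_score(h_acoustic, h_phonetic, h_linguistic, h_contextual, mi,
                 w=ScoreWeights()):
    """Weighted sum S(x) of the four entropy terms and the MI term."""
    return (w.acoustic * h_acoustic
            + w.phonetic * h_phonetic
            + w.linguistic * h_linguistic
            + w.contextual * h_contextual
            + w.mutual_info * mi)


# Hypothetical normalized component values for one utterance.
print(sample_score(0.8, 0.5, 0.6, 0.4, 0.2))
```

Because the default weights sum to 1.0, a sample whose components are all 1.0 scores 1.0 under this sketch.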

Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
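
Equation 2 can be evaluated directly once the joint distribution is available. The sketch below assumes acoustic features have been discretized into bins f_j and that co-occurrence counts with transcription tokens y_ℓ are given as a 2-D table; how HEEP actually estimates this distribution is not specified here, so the function is illustrative only. It uses the natural logarithm, so the result is in nats.

```python
import numpy as np


def mutual_information(joint_counts):
    """Mutual information (nats) from a table of co-occurrence counts.

    Rows index feature bins f_j, columns index transcription tokens y_l.
    """
    joint = np.asarray(joint_counts, dtype=np.float64)
    joint = joint / joint.sum()               # p(f_j, y_l)
    p_f = joint.sum(axis=1, keepdims=True)    # p(f_j), shape (J, 1)
    p_y = joint.sum(axis=0, keepdims=True)    # p(y_l), shape (1, L)
    mask = joint > 0                          # skip zero cells (0 * log 0 = 0)
    ratio = joint[mask] / (p_f @ p_y)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))


print(mutual_information([[1, 1], [1, 1]]))  # independent variables
print(mutual_information([[1, 0], [0, 1]]))  # perfectly dependent variables
```

Independent variables give zero, and two perfectly dependent binary variables give log 2.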

Selection Criterion

Samples are selected based on a threshold:

D' = {x ∈ D : S(x) > τ}

Progressive Filtering (Equation 8)

The threshold increases exponentially across rounds:

τ_{k+1} = τ_k · g, where g is the growth factor (g > 1 for a rising threshold)
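
The selection criterion and the threshold update can be sketched together. Unrolling the recurrence gives τ_k = τ₀ · g^k, where g is the growth factor. The values of τ₀, the growth factor, and the scores below are hypothetical, chosen only to show the mechanics.

```python
def threshold_schedule(tau0, g, rounds):
    """Thresholds tau_0 .. tau_{rounds-1}, with tau_k = tau0 * g**k."""
    return [tau0 * g ** k for k in range(rounds)]


def select(scores, tau):
    """Keep samples whose score strictly exceeds the threshold tau."""
    return {sample_id: s for sample_id, s in scores.items() if s > tau}


# Hypothetical scores keyed by utterance id.
scores = {"utt_a": 0.35, "utt_b": 0.45, "utt_c": 0.60}
for tau in threshold_schedule(tau0=0.4, g=1.2, rounds=3):
    print(round(tau, 3), sorted(select(scores, tau)))
```

Each round keeps fewer samples, because the threshold grows geometrically while the scores stay fixed in this simplified example.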

Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
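
The adjustment can be sketched as follows. The model card does not define ErrorRelevance or CrossLingualOverlap, nor the values of λ_err and λ_cross, so everything below is a stand-in: error relevance is approximated as the fraction of a sample's words that the model misrecognized in the previous round, and the λ defaults are arbitrary.

```python
def error_relevance(transcript, error_words):
    """Hypothetical proxy: share of words in the transcript that appear
    in the set of words the model got wrong in the last round."""
    words = transcript.split()
    if not words:
        return 0.0
    return sum(w in error_words for w in words) / len(words)


def adjusted_score(base, err_rel, cross_overlap, lam_err=0.1, lam_cross=0.05):
    """S'(x) = S(x) + lam_err * ErrorRelevance + lam_cross * CrossLingualOverlap.

    The lambda defaults are placeholders, not values from the source.
    """
    return base + lam_err * err_rel + lam_cross * cross_overlap


rel = error_relevance("a b c d", {"b", "d"})
print(rel, adjusted_score(0.5, rel, cross_overlap=0.0))
```

Samples that contain material the model currently gets wrong receive a higher score, which makes them more likely to survive the next, stricter threshold.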

Algorithm Overview

Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g, minimum dataset size min_samples, round limit max_rounds
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
    a. For each x in D*:
        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
    b. If error_patterns available:
        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
       Otherwise: S'(x) = S(x)
    c. D* ← {x ∈ D* : S'(x) > τₖ}
    d. If train_callback: Train model on D*
    e. If eval_callback: Analyze errors, update error_patterns
    f. τₖ₊₁ ← τₖ · g
    g. k ← k + 1
6. Return D*
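
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released curation code: samples are hashable, unique ids, the scorer and the error adjustment are supplied as plain callables, and error patterns are represented as a set of strings. All names and signatures are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set


def heep_curate(
    dataset: Iterable[str],
    score: Callable[[str], float],
    tau0: float,
    g: float,
    min_samples: int = 1,
    max_rounds: int = 5,
    adjust: Optional[Callable[[str, float, Set[str]], float]] = None,
    train_callback: Optional[Callable[[List[str]], None]] = None,
    eval_callback: Optional[Callable[[List[str]], Set[str]]] = None,
) -> List[str]:
    curated = list(dataset)                                # step 3: D* <- D
    tau, k = tau0, 0                                       # step 4
    error_patterns: Set[str] = set()
    while len(curated) > min_samples and k < max_rounds:   # step 5
        scored = {x: score(x) for x in curated}            # 5a: S(x)
        if adjust is not None and error_patterns:          # 5b: S'(x)
            scored = {x: adjust(x, s, error_patterns)
                      for x, s in scored.items()}
        curated = [x for x in curated if scored[x] > tau]  # 5c: filter
        if train_callback is not None:                     # 5d: train
            train_callback(curated)
        if eval_callback is not None:                      # 5e: new errors
            error_patterns = eval_callback(curated)
        tau *= g                                           # 5f: raise tau
        k += 1                                             # 5g
    return curated                                         # step 6


# Hypothetical precomputed scores; no training or evaluation callbacks.
scores = {"u1": 0.30, "u2": 0.45, "u3": 0.52, "u4": 0.70}
print(heep_curate(scores, scores.get, tau0=0.4, g=1.2, max_rounds=3))
```

With these toy values the thresholds are 0.4, 0.48, and 0.576, so the dataset shrinks from four samples to three, then two, then one. Like the pseudocode, the sketch checks min_samples only before each round, so a single round can still drop the dataset below that size.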

Key Benefits

Training on 10-20% of data while matching or exceeding full-dataset performance
Efficient multilingual model development with cross-lingual transfer
Error-aware adaptive sample selection across training rounds
Significant reduction in computational resources and training time

Cross-Architecture Validation with HEEP-Indic

Resources

Reproducibility (Universal Model): https://huggingface.co/bc7ec356/heep-universal
Cross-Architecture Model (Indic): https://huggingface.co/bc7ec356/heep-indic

Cross-Architecture Generalization

To directly address concerns about generalization beyond Whisper V3 Turbo, we trained Qwen3-ASR (1.7B), an architecturally distinct audio-language model, on HEEP-curated data spanning 46 Indian languages (~4.78M utterances). The curation pipeline is identical to the one described in the paper with no architecture-specific tuning.

Multilingual Benchmark Results (7 Benchmarks)

Word error rates (%) on Indic benchmark datasets:

Dataset	Bengali	Bhojpuri	Chhattisgarhi	Gujarati	Hindi	Kannada	Magahi	Maithili	Malayalam	Marathi	Odia	Punjabi	Sanskrit	Tamil	Telugu	Urdu	Avg
Kathbath	14.6	–	–	17.4	8.5	23	–	–	39.3	19.2	25.4	15.8	41.4	30.3	29	12.1	23
Kathbath Hard	15.7	–	–	18.5	9	25.1	–	–	41.2	20.4	27.7	16.6	43.6	32.6	30.3	11.9	24.4
CommonVoice	21	–	–	–	9.96	–	–	–	46	21.5	34.6	17.5	–	34	–	20.6	25.7
FLEURS	22.4	–	–	23.3	11	23.1	–	–	34.4	25.5	33.3	25	–	35.1	31.9	22.4	26.1
IndicTTS	15.8	–	–	16.9	6.6	19.6	–	–	26.4	14.5	14.8	–	–	22.6	31.3	–	18.7
Gramvaani	–	–	–	–	26	–	–	–	–	–	–	–	–	–	–	–	26
RESPIN	32.5	21.3	21.6	–	12.1	45.6	27.7	41.1	–	32.7	–	–	–	–	37.5	–	30.2
Average	20.4	21.3	21.6	19	11.9	27.3	27.7	41.1	37.5	22.3	27.2	18.7	42.5	30.9	32	16.7	24.6
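
For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below implements that standard definition. The text normalization applied before scoring these benchmarks is not stated in this card, so the function simply splits on whitespace, and the resulting numbers are not directly comparable to the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```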

Hindi Benchmark Comparison

Comparison of publicly available models on the Hindi subset of the benchmark:

Model	Kathbath	Kathbath Hard	CommonVoice	FLEURS	IndicTTS	RESPIN	Gramvaani	Average
Google STT	14.3	16.7	20.8	19.4	18.3	–	59.9	24.9
IndicWav2Vec	12.2	16.2	20.2	18.3	15	–	42.1	20.7
Azure STT	13.6	15.1	14.6	24.3	15.2	–	42.3	20.8
Nvidia Conformer-CTC Medium	14	15.6	20.4	19.4	12.3	–	41.3	20.5
Nvidia Conformer-CTC Large	12.7	14.2	21.2	15.7	12.2	–	42.6	19.8
IndicWhisper	10.3	12	15	11.4	7.6	–	26.8	13.8
HEEP Indic	8.53	8.97	9.96	11.04	6.59	12.05	25.98	11.9

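HEEP Indic achieves an 11.9% average Hindi WER versus 13.8% for IndicWhisper, a relative improvement of about 14%. The HEEP Indic average covers all seven benchmarks, while the baseline averages exclude RESPIN, for which no baseline results are reported.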
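
The headline figures can be reproduced from the table with a few lines of arithmetic. The sketch below recomputes the HEEP Indic Hindi average from its seven per-benchmark values and the relative WER reduction against IndicWhisper. It also computes the average without RESPIN, which no baseline reports, as a like-for-like check; that last figure is derived here and does not appear in the source.

```python
def relative_wer_reduction(baseline: float, candidate: float) -> float:
    """Relative improvement in percent: (baseline - candidate) / baseline."""
    return 100.0 * (baseline - candidate) / baseline


# HEEP Indic Hindi WERs: Kathbath, second Kathbath column, CommonVoice,
# FLEURS, IndicTTS, RESPIN, Gramvaani (values from the table).
heep_hindi = [8.53, 8.97, 9.96, 11.04, 6.59, 12.05, 25.98]
heep_avg = sum(heep_hindi) / len(heep_hindi)
print(round(heep_avg, 1))                            # 11.9
print(round(relative_wer_reduction(13.8, 11.9), 1))  # 13.8

# Same average without RESPIN (index 5), which the baselines do not report.
no_respin = heep_hindi[:5] + heep_hindi[6:]
heep_avg_no_respin = sum(no_respin) / len(no_respin)
print(heep_avg_no_respin)                            # close to 11.845
```

The relative reduction of about 13.8% rounds to the 14% quoted above, and dropping RESPIN changes the HEEP Indic average only slightly.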

Key Takeaways

Cross-architecture generalization confirmed. The same HEEP pipeline improves two distinct backbones: Whisper V3 Turbo (0.8B, encoder-decoder) and Qwen3-ASR (1.7B, audio-language model), without modification.
Controlled multilingual evaluation. Results span 16 languages across Indo-Aryan, Dravidian, and Classical families on standardized benchmarks with consistent evaluation protocols.
Model-independent scoring. Entropy scoring operates on MFCCs, G2P phonemes, and token distributions, not model internals. The same curated dataset was used for both backbones.
Reproducibility. Model weights, curation code, and training scripts for both backbones are available in the anonymous repository.

Model Details

Architecture: Qwen3-ASR (1.7B), a Transformer-based audio-language model optimized for multilingual transcription
Languages: 55 Indic languages supported
Format: Transformers compatible (safetensors)
Sampling Rate: 16 kHz
Precision: BF16 weights; FP16/FP32 also supported
Optimization: Real-time inference capable with GPU acceleration

Key Features

Real-Time Performance: Average RTFx of 300 enables real-time applications
Verbatim Transcription: Optimized for accurate, word-for-word transcription
Multi-Domain Excellence: Superior performance across conversational, broadcast, and read speech
Multilingual Support: 55 Indic languages with cross-lingual transfer learning
HEEP-Curated Training: Strategic entropy-based data selection for maximum information density
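
To put the RTFx figure in context: RTFx is commonly defined as seconds of audio processed per second of wall-clock compute, the inverse of the real-time factor. Under that definition, and assuming the reported average of 300 holds for your hardware and batch size (neither is stated here), expected processing time can be estimated as follows.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / processing time."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Estimated wall-clock seconds needed at a given RTFx."""
    return audio_seconds / rtfx_value


# One hour of audio at the reported average RTFx of 300.
print(processing_time(3600, 300))  # 12.0
```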

Quick Start

Install

pip install "qwen-asr[vllm]"

Inference with vLLM (Recommended)

from qwen_asr import Qwen3ASRModel

# Load model with vLLM backend
asr = Qwen3ASRModel.LLM(
    model="bc7ec356/heep-indic",
    gpu_memory_utilization=0.8,
    max_new_tokens=4096,
)

# Transcribe from file path
results = asr.transcribe(
    audio="path/to/audio.wav",
    language="Hindi",
)
print(results[0].text)
print(results[0].language)

Inference with Transformers

import torch
from qwen_asr import Qwen3ASRModel

# Load model with Transformers backend
asr = Qwen3ASRModel.from_pretrained(
    "bc7ec356/heep-indic",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# Transcribe
results = asr.transcribe(
    audio="path/to/audio.wav",
    language="Hindi",
)
print(results[0].text)

Batch Transcription

# Transcribe multiple files at once
results = asr.transcribe(
    audio=["audio1.wav", "audio2.wav", "audio3.wav"],
    language=["Hindi", "Tamil", "Bengali"],
)
for r in results:
    print(f"[{r.language}] {r.text}")

Auto Language Detection

# Pass language=None to auto-detect
results = asr.transcribe(
    audio="path/to/audio.wav",
    language=None,
)
print(f"Detected: {results[0].language}")
print(f"Text: {results[0].text}")

Streaming Transcription (vLLM only)

import soundfile as sf

from qwen_asr import Qwen3ASRModel

asr = Qwen3ASRModel.LLM(
    model="bc7ec356/heep-indic",
    gpu_memory_utilization=0.8,
    max_new_tokens=4096,
)

# Load audio (assumed to be mono; the model expects 16 kHz input)
wav, sr = sf.read("path/to/audio.wav", dtype="float32")

# Initialize streaming state
state = asr.init_streaming_state(
    language="Hindi",
    chunk_size_sec=2.0,
    unfixed_chunk_num=2,
    unfixed_token_num=5,
)

# Feed audio in 1-second chunks
step = sr  # 1 second of samples
for pos in range(0, len(wav), step):
    chunk = wav[pos : pos + step]
    asr.streaming_transcribe(chunk, state)
    print(f"Partial: {state.text}")

# Finalize
asr.finish_streaming_transcribe(state)
print(f"Final: {state.text}")

NumPy Array Input

import numpy as np

# From a numpy array + sample rate
audio_array = np.random.randn(16000).astype(np.float32)  # 1 second at 16kHz
results = asr.transcribe(
    audio=(audio_array, 16000),
    language="Hindi",
)
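
When building arrays yourself, the audio should match the 16 kHz sampling rate listed under Model Details. Whether the library resamples or downmixes internally is not stated here, so the helper below is a defensive sketch only. It is a hypothetical function, not part of qwen_asr, and it uses plain linear interpolation with no anti-aliasing filter; a dedicated resampler is preferable for production use.

```python
import numpy as np


def to_mono_16k(wav, sr, target_sr=16000):
    """Downmix to mono and linearly resample to target_sr (rough sketch)."""
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # soundfile returns multi-channel audio as (frames, channels).
        wav = wav.mean(axis=1)
    if sr != target_sr:
        n_out = int(round(len(wav) * target_sr / sr))
        old_t = np.arange(len(wav)) / sr      # original sample times
        new_t = np.arange(n_out) / target_sr  # resampled sample times
        wav = np.interp(new_t, old_t, wav).astype(np.float32)
    return wav, target_sr


# One second of silent stereo audio at 44.1 kHz.
stereo = np.zeros((44100, 2), dtype=np.float32)
mono, sr = to_mono_16k(stereo, 44100)
print(mono.shape, sr)
```

The returned pair can be passed as the audio=(array, sample_rate) tuple shown above.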

Performance Optimization Tips

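GPU Acceleration: Load the model on a GPU (for example device_map="cuda:0" with the Transformers backend) for significantly faster inference
Precision: Use half precision (dtype=torch.bfloat16, as in the Transformers example above, or torch.float16) for faster inference on modern GPUs
Language Specification: Pass the language name (for example language="Hindi") when known to improve accuracy and speed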

Acknowledgments

HEEP Indic was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing the foundational tools that make this work possible.

Citation

If you use this model in your research, please cite:

@article{anonymous2026heep,
  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}

Safetensors

Model size: 2B params
Tensor type: BF16