Quranic Wav2Vec2 Phonetic ASR

Model Description

Quranic Wav2Vec2 Phonetic ASR is a fine-tuned wav2vec2 model designed specifically for phonetic transcription of Quranic recitation.

Unlike standard Arabic ASR systems that output orthographic Arabic text, this model outputs a phonetic (sound-level) representation, making it suitable for:

  • Tajweed research and education
  • Word-level pronunciation analysis
  • Forced alignment of Quranic recitation
  • Linguistic studies of Classical Arabic phonetics

To the best of our knowledge, this is the first publicly released wav2vec2 model trained explicitly for Quranic phonetic transcription.


Key Features

  • 🎙️ Outputs phonetic strings, not Arabic text
  • 📖 Trained on Quranic recitation, not conversational Arabic
  • 🔤 Custom phonetic vocabulary + tokenizer
  • 🧠 Compatible with CTC forced alignment
  • 🧩 Optimized for word-level Tajweed analysis

Intended Use

This model is intended for:

  • Quranic pronunciation analysis
  • Tajweed educational tools
  • Phoneme-level alignment
  • Academic research on Quranic recitation

⚠️ Not intended for:

  • Modern Arabic ASR
  • End-to-end Tajweed grading of full ayat
  • Automatic religious judgments
  • Replacement of qualified Quran teachers

Model Architecture

  • Base model: facebook/wav2vec2-base
  • Training objective: CTC (Connectionist Temporal Classification)
  • Parameters: ~94.4M (F32, safetensors)
  • Sampling rate: 16 kHz
  • Input: Mono audio waveform
  • Output: Phonetic transcription string
  • Tokenizer: Custom character-level phonetic tokenizer
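
For orientation, this is how a CTC head is typically attached to the base encoder with Hugging Face Transformers. A minimal sketch, not the author's training script; the repository id mirrors the placeholder used in the usage example below, and the ctc_loss_reduction value is an assumed setting:

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

# Load the custom phonetic tokenizer (see "Phonetic Vocabulary & Tokenizer" below)
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("USERNAME/quranic-wav2vec2-phonetic")

# Start from the pretrained encoder and initialize a fresh CTC head
# sized to the phonetic vocabulary
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",            # assumption: common choice for CTC fine-tuning
    pad_token_id=tokenizer.pad_token_id,  # [PAD] doubles as the CTC blank
    vocab_size=len(tokenizer),
)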

Training Dataset

📦 Dataset Used

Buraaq/quran-md-words

This dataset contains word-level Quranic recitation audio with aligned metadata.

Each sample includes:

  • Audio of a single Quranic word
  • Phonetic transcription (word_tr)
  • Arabic word (word_ar)
  • Surah and ayah identifiers
  • Word index within the ayah

This dataset was chosen intentionally to:

  • Preserve clear phonetic boundaries
  • Avoid coarticulation noise from full-ayah audio
  • Enable high-accuracy phonetic learning

🔗 Dataset: https://huggingface.co/datasets/Buraaq/quran-md-words
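
A minimal loading sketch with 🤗 Datasets; the audio column name ("audio") and the split name ("train") follow the usual Datasets conventions and are assumptions, so check the dataset card if they differ:

from datasets import load_dataset, Audio

ds = load_dataset("Buraaq/quran-md-words", split="train")
# Decode and resample audio to the model's 16 kHz input rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

sample = ds[0]
print(sample["word_tr"])  # phonetic transcription
print(sample["word_ar"])  # Arabic word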


Phonetic Vocabulary & Tokenizer

The model uses a custom phonetic vocabulary built directly from the dataset:

  • Vocabulary constructed from all unique characters in word_tr
  • Character-level CTC tokenizer
  • Includes:
    • | as word delimiter
    • [PAD] for CTC blank
    • [UNK] for unknown symbols

This design allows:

  • Robust forced alignment
  • Fine-grained phonetic decoding
  • Word boundary detection via delimiter token
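
The exact vocabulary-building script is not published with this card, but the standard construction for a wav2vec2 CTC tokenizer looks like the following sketch (reusing ds from the dataset snippet above):

import json
from transformers import Wav2Vec2CTCTokenizer

# Collect every unique character appearing in the phonetic transcriptions
chars = sorted({ch for tr in ds["word_tr"] for ch in tr if ch != " "})
vocab = {ch: i for i, ch in enumerate(chars)}
vocab["|"] = len(vocab)      # word delimiter
vocab["[UNK]"] = len(vocab)  # unknown symbols
vocab["[PAD]"] = len(vocab)  # CTC blank

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)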

Training Procedure

Training Setup

  • Framework: Hugging Face Transformers + Datasets
  • Platform: Windows-compatible (dataloader multiprocessing disabled to avoid worker crashes)
  • Audio handling: Hugging Face Audio feature
  • Batch size: 16
  • Epochs: 20
  • Learning rate: 2e-5
  • Optimizer: AdamW (default Trainer)
  • Precision: FP16 when CUDA available
  • Train / Eval split: 90% / 10%

The wav2vec2 feature extractor was not frozen, allowing adaptation to Quranic recitation acoustics.
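
The training script itself is not included in this card; the following sketch reconstructs the stated setup with the standard Trainer API (output_dir is arbitrary, and anything not listed above is an assumption):

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-quran-phonetics",
    per_device_train_batch_size=16,
    num_train_epochs=20,
    learning_rate=2e-5,
    fp16=torch.cuda.is_available(),  # FP16 when CUDA is available
    dataloader_num_workers=0,        # keeps training Windows-compatible
)

# 90% / 10% train / eval split
splits = ds.train_test_split(test_size=0.1)

Consistent with the unfrozen feature extractor described above, model.freeze_feature_encoder() would deliberately not be called in such a setup.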


Training Results

On the 90% / 10% train/eval split described above:

  • Training accuracy: 99.7%
  • Evaluation accuracy: 99.8%
  • Evaluation split size: 10% of the dataset

⚠️ Note:
These results reflect word-level phonetic transcription accuracy on a dataset with consistent recitation style.
Performance may degrade on:

  • Fast recitation
  • Strong coarticulation
  • Unseen riwayat styles

Example Usage

Load Model and Processor

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf
import librosa

processor = Wav2Vec2Processor.from_pretrained(
    "USERNAME/quranic-wav2vec2-phonetic"
)
model = Wav2Vec2ForCTC.from_pretrained(
    "USERNAME/quranic-wav2vec2-phonetic"
)

model.eval()

Phonetic Transcription Example

audio, sr = sf.read("recitation.wav")
# convert to mono
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# resample to 16kHz
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

inputs = processor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
    padding=True,
)

with torch.inference_mode():
    logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

phonetics = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(phonetics)

Example output:

tālik
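
Because decoding is CTC, the frame-level predictions can also give rough word-boundary timestamps for multi-word recordings by locating the | delimiter token. A sketch continuing from the snippet above; wav2vec2-base's convolutional encoder downsamples 16 kHz audio by a factor of 320, so each logit frame spans about 20 ms:

# One logit frame per 320 input samples: 320 / 16000 = 0.02 s
frame_sec = 320 / 16000
delim_id = processor.tokenizer.convert_tokens_to_ids("|")

ids = predicted_ids[0].tolist()
boundaries = [
    round(i * frame_sec, 2)
    for i, t in enumerate(ids)
    if t == delim_id and (i == 0 or ids[i - 1] != delim_id)  # collapse CTC repeats
]
print(boundaries)  # approximate word-boundary times in seconds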
