	---
	library_name: transformers
	tags: []
	---
	# DeepAr

	## Model Description

	DeepAr is a state-of-the-art Arabic Automatic Speech Recognition (ASR) model based on the whisper-turbo-v3 architecture. It is our latest and most advanced version, trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset.

	Key Features:
	- High-fidelity transcription: Transcribes exactly what is pronounced, maintaining authenticity of speech patterns
	- Speech improvement tool: Designed to help users identify and correct speech patterns
	- Superior performance: Outperforms many existing Arabic ASR models based on Whisper and its variants
	- Arabic with Tashkil: Provides accurate diacritization for comprehensive Arabic text output
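
Because the output carries Tashkil, comparing it against reference text that has no diacritics usually means stripping the marks first. The helper below is a minimal sketch of that step, not part of the model's API: the name `strip_tashkil` is hypothetical, and the character range covers only the core marks (fathatan through sukun, plus the superscript alef), which is an assumption about what a given application needs.

```python
import re

# Core Arabic diacritics: U+064B (fathatan) through U+0652 (sukun),
# plus U+0670 (superscript alef). Extend the class if other marks matter.
TASHKIL = re.compile(r"[\u064B-\u0652\u0670]")


def strip_tashkil(text: str) -> str:
    """Return `text` with the diacritic marks above removed."""
    return TASHKIL.sub("", text)


# Prints the same two words without their diacritics
print(strip_tashkil("كَتَبَ الطَّالِبُ"))
```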

	## What Makes DeepAr Different

	Unlike traditional ASR models that normalize speech to standard text, DeepAr transcribes exactly what is pronounced. This unique approach makes it particularly valuable for:

	- Speech therapy and improvement: Identifies pronunciation patterns and deviations
	- Language learning: Helps learners understand their actual pronunciation vs. intended speech
	- Linguistic research: Captures authentic speech patterns for analysis
	- Pronunciation assessment: Provides detailed feedback on spoken Arabic

	## Model Details

	- Base Architecture: whisper-turbo-v3
	- Language: Arabic (with Tashkil/diacritics)
	- Task: High-fidelity Automatic Speech Recognition
	- Training Data: Complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset
	- Model Type: Production-ready, latest version

	## Performance

	DeepAr outperforms many Arabic ASR models built on Whisper and its variants, particularly in:
	- Pronunciation accuracy detection
	- Diacritic prediction
	- Handling of Arabic speech variations
	- Authentic speech pattern recognition
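
To check these claims on your own recordings, word error rate (WER) and character error rate (CER) are the usual measures. The functions below are a plain-Python sketch using Levenshtein distance; the function names are illustrative, and no benchmark figures are implied. Since Arabic diacritics are separate Unicode code points, CER computed on diacritized text also counts diacritic mistakes.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Pass the model's transcription as `hypothesis` and a trusted transcript as `reference`.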

	## Intended Use

	This model is ideal for:
	- Speech therapy and pronunciation correction applications
	- Arabic language learning platforms
	- Linguistic research and analysis
	- Educational tools for speech improvement
	- Applications requiring authentic speech transcription
	- Quality assessment of spoken Arabic

	## Usage

	### Installation

	```bash
	pip install transformers torch torchaudio
	```

	### Quick Start

	```python
	from transformers import WhisperProcessor, WhisperForConditionalGeneration
	import torch
	import torchaudio

	# Load model and processor
	processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr")
	model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr")

	# Load audio; torchaudio returns a (channels, samples) tensor
	audio_path = "path_to_your_arabic_audio.wav"
	waveform, sample_rate = torchaudio.load(audio_path)

	# Downmix to mono so the processor receives a single 1-D signal
	waveform = waveform.mean(dim=0)

	# Resample to 16 kHz if necessary
	if sample_rate != 16000:
	    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
	    waveform = resampler(waveform)

	# Convert the waveform to log-mel input features
	input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features

	# Generate transcription
	with torch.no_grad():
	    predicted_ids = model.generate(input_features, language="ar")

	# Decode transcription (exactly as pronounced)
	transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
	print(f"Pronounced as: {transcription}")
	```
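
Whisper models work on 30-second windows of audio, so the snippet above is best suited to short clips. For longer recordings, one option is to split the signal into windows and transcribe each in turn. The helper below is a sketch that only computes the window boundaries in samples; `chunk_bounds` and its parameters are hypothetical names, and the optional overlap is an assumption, not something this model card prescribes.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=0.0):
    """Return (start, end) sample indices covering `num_samples` in fixed windows."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and larger than the overlap")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break  # the last window reached the end of the signal
        start += step
    return bounds


# A 70-second recording at 16 kHz yields two full windows and one partial window
print(chunk_bounds(70 * 16000))
```

Each `(start, end)` pair can be used to slice the sample axis of `waveform` before passing the slice through the processor and `model.generate`.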

	### Speech Analysis Example

	```python
	# Reuses `processor` and `model` (and the imports) from Quick Start
	def analyze_pronunciation(audio_path, target_text=None):
	    """
	    Analyze pronunciation and compare with target text if provided
	    """
	    waveform, sample_rate = torchaudio.load(audio_path)

	    # Downmix to mono so the processor receives a single 1-D signal
	    waveform = waveform.mean(dim=0)

	    if sample_rate != 16000:
	        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
	        waveform = resampler(waveform)

	    input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features

	    with torch.no_grad():
	        predicted_ids = model.generate(input_features, language="ar")

	    actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

	    print(f"Actual pronunciation: {actual_pronunciation}")

	    if target_text:
	        print(f"Target text: {target_text}")
	        print("Analysis: Compare the differences for speech improvement")

	    return actual_pronunciation

	# Example usage
	pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته")
	```
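
The example above prints both texts but leaves the comparison to the reader. One way to automate it, sketched here with Python's standard `difflib`, is a word-level diff between the target text and the transcription. `compare_words` is a hypothetical helper, not part of the model or of `transformers`.

```python
from difflib import SequenceMatcher


def compare_words(target_text: str, actual_text: str):
    """Return (operation, target_words, actual_words) for every mismatch."""
    target_words = target_text.split()
    actual_words = actual_text.split()
    matcher = SequenceMatcher(a=target_words, b=actual_words, autojunk=False)

    differences = []
    for op, t_start, t_end, a_start, a_end in matcher.get_opcodes():
        if op != "equal":  # op is "replace", "delete", or "insert"
            differences.append((op, target_words[t_start:t_end], actual_words[a_start:a_end]))
    return differences


# Example: the third word differs between target and transcription
for op, expected, spoken in compare_words("ذهب الولد إلى المدرسة", "ذهب الولد الى المدرسة"):
    print(op, expected, spoken)
```

An empty list means the transcription matches the target word for word.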

	### Batch Processing for Speech Assessment

	```python
	# Reuses `processor` and `model` (and the imports) from Quick Start
	def assess_multiple_recordings(audio_files, target_texts=None):
	    """
	    Process multiple recordings for comprehensive speech assessment
	    """
	    results = []

	    for i, audio_file in enumerate(audio_files):
	        waveform, sample_rate = torchaudio.load(audio_file)

	        # Downmix to mono so the processor receives a single 1-D signal
	        waveform = waveform.mean(dim=0)

	        if sample_rate != 16000:
	            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
	            waveform = resampler(waveform)

	        input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features

	        with torch.no_grad():
	            predicted_ids = model.generate(input_features, language="ar")

	        pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

	        # Pair each transcription with its target text, if one was given
	        result = {
	            'file': audio_file,
	            'pronunciation': pronunciation,
	            'target': target_texts[i] if target_texts else None
	        }
	        results.append(result)

	        print(f"File {i+1}: {pronunciation}")

	    return results

	# Example usage
	audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"]
	target_texts = ["النص الأول", "النص الثاني", "النص الثالث"]
	assessment_results = assess_multiple_recordings(audio_files, target_texts)
	```
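
The list returned above can be ranked so the recordings that deviate most from their target come first. The following is a sketch under the assumption that each result is a dict with `file`, `pronunciation`, and `target` keys, as in the function above; `rank_by_mismatch` and the word-level similarity score are illustrative choices, not a prescribed assessment metric.

```python
from difflib import SequenceMatcher


def rank_by_mismatch(results):
    """Sort results from least to most similar to their target text."""
    scored = []
    for result in results:
        if result["target"] is None:
            continue  # nothing to compare against
        similarity = SequenceMatcher(
            a=result["target"].split(),
            b=result["pronunciation"].split(),
            autojunk=False,
        ).ratio()
        scored.append({**result, "similarity": similarity})
    return sorted(scored, key=lambda item: item["similarity"])


# Example with hand-written results (no model call needed)
sample = [
    {"file": "recording1.wav", "pronunciation": "النص الأول", "target": "النص الأول"},
    {"file": "recording2.wav", "pronunciation": "النص الثالث", "target": "النص الثاني"},
]
for item in rank_by_mismatch(sample):
    print(item["file"], round(item["similarity"], 2))
```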


	## Training Data

	This model was trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset, using all of its Arabic speech data and the corresponding high-quality, diacritized transcriptions.

	## Model Advantages

	- Authentic transcription: Captures exactly what is spoken, not what should be spoken
	- High accuracy: Superior performance compared to similar Whisper-based Arabic models
	- Comprehensive training: Utilizes the complete dataset for optimal coverage
	- Practical applications: Specifically designed for speech improvement and assessment
	- Diacritic accuracy: Excellent performance in Arabic diacritization


	## Limitations

	- MSA focus: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations

	## License

	This model is released under the MIT License.

	```
	MIT License

	Copyright (c) 2024 CUAIStudents

	Permission is hereby granted, free of charge, to any person obtaining a copy
	of this software and associated documentation files (the "Software"), to deal
	in the Software without restriction, including without limitation the rights
	to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
	copies of the Software, and to permit persons to whom the Software is
	furnished to do so, subject to the following conditions:

	The above copyright notice and this permission notice shall be included in all
	copies or substantial portions of the Software.

	THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
	IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
	FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
	AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
	LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
	OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
	SOFTWARE.
	```