| library_name: transformers | |
| tags: [] | |
| # DeepAr | |
| ## Model Description | |
| DeepAr is a state-of-the-art Arabic Automatic Speech Recognition (ASR) model based on the whisper-turbo-v3 architecture. It is our latest and most advanced version, trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset. | |
| **Key Features:** | |
| - **High-fidelity transcription**: Transcribes exactly what is pronounced, maintaining authenticity of speech patterns | |
| - **Speech improvement tool**: Designed to help users identify and correct speech patterns | |
| - **Superior performance**: Outperforms many existing Arabic ASR models based on Whisper and its variants | |
| - **Arabic with Tashkil**: Provides accurate diacritization for comprehensive Arabic text output | |
| ## What Makes DeepAr Different | |
| Unlike traditional ASR models that normalize speech to standard text, DeepAr transcribes **exactly what is pronounced**. This unique approach makes it particularly valuable for: | |
| - **Speech therapy and improvement**: Identifies pronunciation patterns and deviations | |
| - **Language learning**: Helps learners understand their actual pronunciation vs. intended speech | |
| - **Linguistic research**: Captures authentic speech patterns for analysis | |
| - **Pronunciation assessment**: Provides detailed feedback on spoken Arabic | |
| ## Model Details | |
| - **Base Architecture**: whisper-turbo-v3 | |
| - **Language**: Arabic (with Tashkil/diacritics) | |
| - **Task**: High-fidelity Automatic Speech Recognition | |
| - **Training Data**: Complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset | |
| - **Model Type**: Production-ready, latest version | |
| ## Performance | |
| DeepAr demonstrates superior performance compared to many Arabic ASR models built on Whisper and its variants, particularly excelling in: | |
| - Pronunciation accuracy detection | |
| - Diacritic prediction | |
| - Handling of Arabic speech variations | |
| - Authentic speech pattern recognition | |
| ## Intended Use | |
| This model is ideal for: | |
| - Speech therapy and pronunciation correction applications | |
| - Arabic language learning platforms | |
| - Linguistic research and analysis | |
| - Educational tools for speech improvement | |
| - Applications requiring authentic speech transcription | |
| - Quality assessment of spoken Arabic | |
| ## Usage | |
| ### Installation | |
| ```bash | |
| pip install transformers torch torchaudio | |
| ``` | |
| ### Quick Start | |
| ```python | |
| from transformers import WhisperProcessor, WhisperForConditionalGeneration | |
| import torch | |
| import torchaudio | |
| # Load model and processor | |
| processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr") | |
| model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr") | |
| # Load audio and downmix to mono (the processor expects a single channel) | |
| audio_path = "path_to_your_arabic_audio.wav" | |
| waveform, sample_rate = torchaudio.load(audio_path) | |
| waveform = waveform.mean(dim=0) | |
| # Resample to 16 kHz if necessary | |
| if sample_rate != 16000: | |
|     resampler = torchaudio.transforms.Resample(sample_rate, 16000) | |
|     waveform = resampler(waveform) | |
| # Process audio | |
| input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features | |
| # Generate transcription | |
| with torch.no_grad(): | |
|     predicted_ids = model.generate(input_features, language="ar") | |
| # Decode transcription (exactly as pronounced) | |
| transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] | |
| print(f"Pronounced as: {transcription}") | |
| ``` | |
| ### Speech Analysis Example | |
| ```python | |
| def analyze_pronunciation(audio_path, target_text=None): | |
|     """ | |
|     Transcribe a recording and, if a target text is provided, print it for comparison. | |
|     Reuses `processor` and `model` from the Quick Start. | |
|     """ | |
|     waveform, sample_rate = torchaudio.load(audio_path) | |
|     waveform = waveform.mean(dim=0)  # downmix to mono | |
|     if sample_rate != 16000: | |
|         resampler = torchaudio.transforms.Resample(sample_rate, 16000) | |
|         waveform = resampler(waveform) | |
|     input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features | |
|     with torch.no_grad(): | |
|         predicted_ids = model.generate(input_features, language="ar") | |
|     actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] | |
|     print(f"Actual pronunciation: {actual_pronunciation}") | |
|     if target_text: | |
|         print(f"Target text: {target_text}") | |
|         print("Analysis: Compare the differences for speech improvement") | |
|     return actual_pronunciation | |
| # Example usage | |
| pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته") | |
| ``` | |
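The function above prints the transcription next to the target text but leaves the comparison to the reader. One possible way to surface the mismatches is a word-level diff with Python's standard `difflib` module. This is a minimal sketch, not part of DeepAr: the helper name `diff_words` and the sample strings are illustrative assumptions, and whitespace tokenization is a simplification.

```python
import difflib


def diff_words(target_text, transcription):
    """Return (operation, target_words, spoken_words) for every mismatching span.

    Hypothetical helper for illustration; words are split on whitespace only.
    """
    target_words = target_text.split()
    spoken_words = transcription.split()
    matcher = difflib.SequenceMatcher(a=target_words, b=spoken_words, autojunk=False)
    differences = []
    for op, t_start, t_end, s_start, s_end in matcher.get_opcodes():
        if op != "equal":  # keep only replace / delete / insert spans
            differences.append(
                (op, target_words[t_start:t_end], spoken_words[s_start:s_end])
            )
    return differences


# Illustrative strings only; in practice the second argument would be the
# value returned by analyze_pronunciation().
target = "ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ"
spoken = "ذَهَبَ الوَلَدَ إِلَى المَدْرَسَةِ"
for op, expected, heard in diff_words(target, spoken):
    print(op, expected, heard)
```

Because the model outputs diacritics, a word that differs only in its final vowel is reported as a `replace`, which is usually the kind of deviation a pronunciation tool wants to flag.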
| ### Batch Processing for Speech Assessment | |
| ```python | |
| def assess_multiple_recordings(audio_files, target_texts=None): | |
|     """ | |
|     Transcribe several recordings and pair each with its target text, if any. | |
|     Reuses `processor` and `model` from the Quick Start. | |
|     """ | |
|     if target_texts is not None and len(target_texts) != len(audio_files): | |
|         raise ValueError("target_texts must have one entry per audio file") | |
|     results = [] | |
|     for i, audio_file in enumerate(audio_files): | |
|         waveform, sample_rate = torchaudio.load(audio_file) | |
|         waveform = waveform.mean(dim=0)  # downmix to mono | |
|         if sample_rate != 16000: | |
|             resampler = torchaudio.transforms.Resample(sample_rate, 16000) | |
|             waveform = resampler(waveform) | |
|         input_features = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features | |
|         with torch.no_grad(): | |
|             predicted_ids = model.generate(input_features, language="ar") | |
|         pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0] | |
|         result = { | |
|             'file': audio_file, | |
|             'pronunciation': pronunciation, | |
|             'target': target_texts[i] if target_texts else None | |
|         } | |
|         results.append(result) | |
|         print(f"File {i+1}: {pronunciation}") | |
|     return results | |
| # Example usage | |
| audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"] | |
| target_texts = ["النص الأول", "النص الثاني", "النص الثالث"] | |
| assessment_results = assess_multiple_recordings(audio_files, target_texts) | |
| ``` | |
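The result dictionaries returned above could be turned into a numeric score per recording. The sketch below computes a character error rate (CER) twice, once on the full text and once with Tashkil removed, so that diacritic-only deviations can be told apart from consonant-level ones. It is an assumption-laden illustration rather than the authors' evaluation method: `strip_tashkil`, `edit_distance`, `character_error_rate`, and `score_results` are hypothetical names, and the diacritic set is limited to U+064B to U+0652 plus U+0670.

```python
import re

# Arabic tanwin, short vowels, shadda and sukun (U+064B-U+0652) plus dagger alif (U+0670).
TASHKIL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkil(text):
    """Remove Arabic diacritics so only the base letters remain."""
    return TASHKIL.sub("", text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed one row at a time."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(target, pronunciation):
    """Edit distance normalized by the length of the target text."""
    if not target:
        raise ValueError("target text must not be empty")
    return edit_distance(target, pronunciation) / len(target)


def score_results(results):
    """Score dictionaries shaped like those from assess_multiple_recordings()."""
    scored = []
    for result in results:
        target, spoken = result["target"], result["pronunciation"]
        if target is None:
            continue  # nothing to compare against
        scored.append({
            "file": result["file"],
            "cer": character_error_rate(target, spoken),
            "cer_no_tashkil": character_error_rate(strip_tashkil(target),
                                                   strip_tashkil(spoken)),
        })
    return scored


# Hypothetical result, hand-written for illustration (not model output).
sample = [{"file": "recording1.wav", "pronunciation": "كَتَبُ", "target": "كَتَبَ"}]
for row in score_results(sample):
    print(row)
```

A recording with a non-zero `cer` but a zero `cer_no_tashkil` differs from its target only in vowelization, which may be a useful distinction when giving feedback to learners.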
| ## Training Data | |
| This model was trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset, utilizing the full scope of available Arabic speech data with corresponding high-quality transcriptions including diacritics. | |
| ## Model Advantages | |
| - **Authentic transcription**: Captures exactly what is spoken, not what should be spoken | |
| - **High accuracy**: Superior performance compared to similar Whisper-based Arabic models | |
| - **Comprehensive training**: Utilizes the complete dataset for optimal coverage | |
| - **Practical applications**: Specifically designed for speech improvement and assessment | |
| - **Diacritic accuracy**: Excellent performance in Arabic diacritization | |
| ## Limitations | |
| - **MSA focus**: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations | |
| ## License | |
| This model is released under the MIT License. | |
| ``` | |
| MIT License | |
| Copyright (c) 2024 CUAIStudents | |
| Permission is hereby granted, free of charge, to any person obtaining a copy | |
| of this software and associated documentation files (the "Software"), to deal | |
| in the Software without restriction, including without limitation the rights | |
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | |
| copies of the Software, and to permit persons to whom the Software is | |
| furnished to do so, subject to the following conditions: | |
| The above copyright notice and this permission notice shall be included in all | |
| copies or substantial portions of the Software. | |
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | |
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | |
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | |
| SOFTWARE. | |
| ``` | |