---
language:
- en
tags:
- audio
- automatic-speech-recognition
- gqa
- rope
- pytorch
- safetensors
pipeline_tag: automatic-speech-recognition
license: other
license_name: gradient-ai-license-v1.0
license_link: https://huggingface.co/gradient-research/license
gated: auto
extra_gated_heading: License Agreement Required
extra_gated_prompt: >-
  By registering for access to this model, you agree to the terms and
  conditions of the Gradient-AI License. Use of this model for deception,
  weaponization, or illegal acts is strictly prohibited.
extra_gated_button_content: Acknowledge License and Request Access
extra_gated_fields:
  I have read and agree to be bound by the Gradient-AI License: checkbox
  Name / Organization: text
  Intended Use Case:
    type: select
    options:
    - Research
    - Education
    - label: Commercial (Requires Permission)
      value: commercial
    - label: Other
      value: other
library_name: transformers
---
# Gradient-Transcribe1 (125M)
Gradient-Transcribe1 is a high-efficiency transformer-based model for automatic speech recognition (ASR). It incorporates modern architectural advancements such as Grouped Query Attention (GQA) and Rotary Positional Embeddings (RoPE) to deliver superior inference performance and long-context stability.
Access to this model is gated. Users must agree to the Gradient-AI License and provide their intended use case before downloading the weights.
## Model Details

Gradient-Transcribe1 is a sequence-to-sequence encoder-decoder model optimized for 16 kHz audio. Key architectural features include:
- Grouped Query Attention (GQA): faster decoding and a reduced KV-cache memory footprint (see the sketch after this list).
- Rotary Positional Embeddings (RoPE): relative position encoding for better generalization across sequence lengths.
- Modern activations and norms: RMSNorm and SwiGLU for improved training stability.
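To make the GQA point concrete, below is a minimal, hypothetical PyTorch sketch of the head layout from the specifications table (8 query heads sharing 4 key/value heads). It is illustrative only, not the model's actual attention implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Illustrative GQA: more query heads than key/value heads, with each
    KV head shared by a contiguous group of query heads.

    Shapes: q is (batch, n_q_heads, seq, head_dim);
    k and v are (batch, n_kv_heads, seq, head_dim).
    """
    group_size = q.shape[1] // k.shape[1]  # 8 Q heads / 4 KV heads = groups of 2
    # Broadcast each KV head across its query group. Only n_kv_heads entries
    # need to live in the KV cache, which is where the memory saving comes from.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

# Toy tensors matching the card's head counts (hidden size 768 -> head_dim 96)
q = torch.randn(1, 8, 50, 96)
k = torch.randn(1, 4, 50, 96)
v = torch.randn(1, 4, 50, 96)
out = grouped_query_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 50, 96])
```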
### Specifications
| Component | Configuration |
|---|---|
| Parameters | 138,044,928 |
| Hidden Size | 768 |
| Encoder Layers | 8 |
| Decoder Layers | 10 |
| Attention Heads | 8 (Q), 4 (KV) |
| Vocab Size | 1024 |
| Mel Bins | 80 |
## Usage

Due to the custom nature of this architecture, you must set `trust_remote_code=True` when loading the model.

### Loading the Model
```python
from transformers import AutoModel, AutoTokenizer

# Load the model (requires approved access; token=True uses your saved
# Hugging Face credentials)
model = AutoModel.from_pretrained(
    "your-username/gradient-transcribe1-125m",
    trust_remote_code=True,
    token=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/gradient-transcribe1-125m")
```
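As a quick sanity check that the full checkpoint loaded, you can compare the parameter count against the specifications table above:

```python
# Count the parameters of the loaded model
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")  # expected: 138,044,928 per the specifications table
```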
### Transcription Example

```python
import torch
import librosa

# Load 16 kHz mono audio to match the model's expected sample rate
audio, _ = librosa.load("sample_audio.wav", sr=16000)

# Note: pre-processing to a log-Mel spectrogram must match the model's
# 80-bin configuration.
# transcription = model.generate(input_features)
```
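As a hypothetical end-to-end sketch, continuing from the block above and the model/tokenizer loaded earlier, the 80-bin log-Mel features can be computed with librosa and passed to `model.generate`. The STFT parameters here (25 ms windows, 10 ms hops) are common ASR defaults, not confirmed values for this model; check the repository's preprocessing code before relying on them.

```python
import numpy as np

# Hypothetical preprocessing: 80-bin log-Mel spectrogram at 16 kHz.
# n_fft=400 and hop_length=160 correspond to 25 ms windows with 10 ms hops,
# a common ASR default; the model's actual training settings may differ.
mel = librosa.feature.melspectrogram(
    y=audio, sr=16000, n_fft=400, hop_length=160, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-10, a_max=None))

# Add a batch dimension: (1, n_mels, frames)
input_features = torch.from_numpy(log_mel).unsqueeze(0).float()

# Decode with the model and tokenizer loaded above
with torch.no_grad():
    generated_ids = model.generate(input_features)
transcription = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```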
## Training Data

Gradient-Transcribe1 was trained on a combination of curated speech datasets and synthetic data to validate the performance of GQA in ASR tasks. It is currently optimized for English speech.
## Limitations and Biases

- Intended Use: This model is designed for research and educational purposes. Use for deceptive, weaponized, or illegal acts is strictly prohibited.
- Hallucinations: As a sequence-to-sequence model, it may generate text that is not present in the audio, particularly in high-noise environments.
- Domain Specificity: Performance may vary across accents, dialects, and technical terminology.
## License

This model is licensed under the Gradient-AI License v1.0. By requesting access, you agree to abide by the terms specified at [gradient-research/license](https://huggingface.co/gradient-research/license).