mazesmazes/tiny-audio

@@ -1,142 +1,267 @@
 ---
-library_name: transformers
 tags:
-- generated_from_trainer
-model-index:
-- name: tiny-audio-embedded
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# tiny-audio-embedded
-This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.1981
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 0.0001
-- train_batch_size: 32
-- eval_batch_size: 32
-- seed: 43
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 64
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: constant_with_warmup
-- lr_scheduler_warmup_steps: 500
-- num_epochs: 1
-### Training results
-| Training Loss | Epoch  | Step  | Validation Loss |
-|:-------------:|:------:|:-----:|:---------------:|
-| 0.6455        | 0.0119 | 1000  | 0.2053          |
-| 0.6832        | 0.0238 | 2000  | 0.2058          |
-| 0.6383        | 0.0357 | 3000  | 0.2058          |
-| 0.6507        | 0.0476 | 4000  | 0.2069          |
-| 0.6877        | 0.0596 | 5000  | 0.2060          |
-| 0.6479        | 0.0715 | 6000  | 0.2054          |
-| 0.7227        | 0.0834 | 7000  | 0.2056          |
-| 0.7055        | 0.0953 | 8000  | 0.2057          |
-| 0.6465        | 0.1072 | 9000  | 0.2052          |
-| 0.7416        | 0.1191 | 10000 | 0.2046          |
-| 0.7090        | 0.1310 | 11000 | 0.2048          |
-| 0.6912        | 0.1429 | 12000 | 0.2060          |
-| 0.5886        | 0.1549 | 13000 | 0.2056          |
-| 0.7237        | 0.1668 | 14000 | 0.2045          |
-| 0.6725        | 0.1787 | 15000 | 0.2046          |
-| 0.6518        | 0.1906 | 16000 | 0.2038          |
-| 0.6546        | 0.2025 | 17000 | 0.2042          |
-| 0.6793        | 0.2144 | 18000 | 0.2032          |
-| 0.6697        | 0.2263 | 19000 | 0.2035          |
-| 0.7108        | 0.2382 | 20000 | 0.2042          |
-| 0.7447        | 0.2502 | 21000 | 0.2038          |
-| 0.6575        | 0.2621 | 22000 | 0.2039          |
-| 0.7154        | 0.2740 | 23000 | 0.2034          |
-| 0.6833        | 0.2859 | 24000 | 0.2024          |
-| 0.6613        | 0.2978 | 25000 | 0.2028          |
-| 0.6906        | 0.3097 | 26000 | 0.2025          |
-| 0.6843        | 0.3216 | 27000 | 0.2027          |
-| 0.6966        | 0.3335 | 28000 | 0.2023          |
-| 0.6801        | 0.3454 | 29000 | 0.2027          |
-| 0.7171        | 0.3574 | 30000 | 0.2027          |
-| 0.7029        | 0.3693 | 31000 | 0.2017          |
-| 0.6876        | 0.3812 | 32000 | 0.2019          |
-| 0.6646        | 0.3931 | 33000 | 0.2022          |
-| 0.6834        | 0.4050 | 34000 | 0.2022          |
-| 0.6868        | 0.4169 | 35000 | 0.2014          |
-| 0.6831        | 0.4288 | 36000 | 0.2019          |
-| 0.6309        | 0.4407 | 37000 | 0.2009          |
-| 0.6603        | 0.4527 | 38000 | 0.2007          |
-| 0.6818        | 0.4646 | 39000 | 0.2006          |
-| 0.6539        | 0.4765 | 40000 | 0.2001          |
-| 0.6999        | 0.4884 | 41000 | 0.2001          |
-| 0.6870        | 0.5003 | 42000 | 0.1997          |
-| 0.5977        | 0.5122 | 43000 | 0.2000          |
-| 0.6747        | 0.5241 | 44000 | 0.2002          |
-| 0.6695        | 0.5360 | 45000 | 0.2005          |
-| 0.6763        | 0.5479 | 46000 | 0.1992          |
-| 0.6656        | 0.5599 | 47000 | 0.2006          |
-| 0.6674        | 0.5718 | 48000 | 0.2000          |
-| 0.7177        | 0.5837 | 49000 | 0.1995          |
-| 0.6904        | 0.5956 | 50000 | 0.1999          |
-| 0.6421        | 0.6075 | 51000 | 0.2003          |
-| 0.6555        | 0.6194 | 52000 | 0.2004          |
-| 0.7010        | 0.6313 | 53000 | 0.2003          |
-| 0.6520        | 0.6432 | 54000 | 0.1993          |
-| 0.6284        | 0.6552 | 55000 | 0.1999          |
-| 0.6770        | 0.6671 | 56000 | 0.1994          |
-| 0.7453        | 0.6790 | 57000 | 0.1993          |
-| 0.6441        | 0.6909 | 58000 | 0.1978          |
-| 0.6670        | 0.7028 | 59000 | 0.1980          |
-| 0.6380        | 0.7147 | 60000 | 0.1979          |
-| 0.7013        | 0.7266 | 61000 | 0.1984          |
-| 0.6442        | 0.7385 | 62000 | 0.1988          |
-| 0.6750        | 0.7505 | 63000 | 0.1981          |
-| 0.6776        | 0.7624 | 64000 | 0.1985          |
-| 0.6316        | 0.7743 | 65000 | 0.1992          |
-| 0.6929        | 0.7862 | 66000 | 0.1988          |
-| 0.6887        | 0.7981 | 67000 | 0.1982          |
-| 0.6502        | 0.8100 | 68000 | 0.1975          |
-| 0.7152        | 0.8219 | 69000 | 0.1983          |
-| 0.6906        | 0.8338 | 70000 | 0.1985          |
-| 0.6128        | 0.8457 | 71000 | 0.1978          |
-| 0.5966        | 0.8577 | 72000 | 0.1973          |
-| 0.6726        | 0.8696 | 73000 | 0.1983          |
-| 0.6668        | 0.8815 | 74000 | 0.1984          |
-| 0.6337        | 0.8934 | 75000 | 0.1982          |
-| 0.6272        | 0.9053 | 76000 | 0.1973          |
-| 0.7112        | 0.9172 | 77000 | 0.1978          |
-| 0.5871        | 0.9291 | 78000 | 0.1989          |
-| 0.6428        | 0.9410 | 79000 | 0.1972          |
-| 0.6740        | 0.9530 | 80000 | 0.1966          |
-| 0.6933        | 0.9649 | 81000 | 0.1976          |
-| 0.6668        | 0.9768 | 82000 | 0.1975          |
-| 0.5919        | 0.9887 | 83000 | 0.1977          |
-| 0.7215        | 1.0    | 83950 | 0.1981          |
-### Framework versions
-- Transformers 5.7.0
-- Pytorch 2.8.0+cu128
-- Datasets 3.6.0
-- Tokenizers 0.22.2

 ---
+license: mit
+language:
+- en
+datasets:
+- speechbrain/LoquaciousSet
+base_model:
+- zai-org/GLM-ASR-Nano-2512
+- Qwen/Qwen3-0.6B
+pipeline_tag: automatic-speech-recognition
 tags:
+- asr
+- speech-recognition
+- audio
+- qwen
+- glm-asr
+library_name: transformers
 ---
+# Tiny Audio
+A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio)—a minimal, hackable ASR framework.
+## Quick Start
+```python
+from transformers import pipeline
+pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+result = pipe("audio.wav")
+print(result["text"])
+```
+## Usage Examples
+### Basic Transcription
+```python
+import numpy as np
+from transformers import pipeline
+pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+# From file
+result = pipe("audio.wav")
+print(result["text"])
+# From URL
+result = pipe("https://example.com/audio.mp3")
+# From numpy array (must be 16kHz)
+audio = np.random.randn(16000).astype(np.float32)  # 1 second
+result = pipe(audio)
+```
+### Batch Processing
+```python
+# Process multiple files (reuses the `pipe` object created under Basic Transcription)
+files = ["audio1.wav", "audio2.wav", "audio3.wav"]
+results = pipe(files, batch_size=4)
+for r in results:
+    print(r["text"])
+```
+### Word-Level Timestamps
+```python
+result = pipe("audio.wav", return_timestamps="word")
+# Returns:
+# {
+#   "text": "hello world",
+#   "chunks": [
+#     {"text": "hello", "timestamp": (0.0, 0.5)},
+#     {"text": "world", "timestamp": (0.6, 1.0)}
+#   ]
+# }
+```
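The timestamped result shown above is a plain dictionary, so it can be post-processed without any model code. The following is a minimal sketch, assuming the output has exactly the `text`/`chunks` shape shown in the comment; `format_chunks` is a hypothetical helper name, not part of Tiny Audio or `transformers`.

```python
def format_chunks(result: dict) -> list[str]:
    """Turn word-level chunks into human-readable lines (hypothetical helper)."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{start:.2f}s - {end:.2f}s] {chunk['text']}")
    return lines


# Sample result copied from the structure documented above, not real model output
example = {
    "text": "hello world",
    "chunks": [
        {"text": "hello", "timestamp": (0.0, 0.5)},
        {"text": "world", "timestamp": (0.6, 1.0)},
    ],
}

for line in format_chunks(example):
    print(line)
```

With a real pipeline call, `example` would be replaced by the dictionary returned from `pipe("audio.wav", return_timestamps="word")`.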
+### Streaming Inference
+```python
+import librosa
+from tiny_audio import ASRModel, ASRProcessor
+model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+# Load audio at 16kHz and extract input features
+audio, sr = librosa.load("audio.wav", sr=16000)
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+# Stream tokens
+for token in model.generate_streaming(inputs["input_features"]):
+    print(token, end="", flush=True)
+```
+### Using with PyTorch Directly
+```python
+from tiny_audio import ASRModel, ASRProcessor
+import torch
+import librosa
+# Load model and processor
+model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+# Load audio (16kHz)
+audio, sr = librosa.load("audio.wav", sr=16000)
+# Process
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+# Generate
+with torch.no_grad():
+    output = model.generate(
+        input_features=inputs["input_features"],
+        attention_mask=inputs["attention_mask"],
+        max_new_tokens=256
+    )
+# Decode
+text = processor.batch_decode(output, skip_special_tokens=True)[0]
+print(text)
+```
+### GPU Inference
+```python
+from transformers import pipeline
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="mazesmazes/tiny-audio",
+    trust_remote_code=True,
+    device="cuda"  # or device=0
+)
+```
+### Half Precision
+```python
+import torch
+from transformers import pipeline
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="mazesmazes/tiny-audio",
+    trust_remote_code=True,
+    torch_dtype=torch.float16,
+    device="cuda"
+)
+```
+## Architecture
+```
+Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
+```
+Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
+| Component | Model | Parameters | Status |
+|-----------|-------|------------|--------|
+| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
+| Projector | 2-layer MLP | ~12M | Trained |
+| Language Model | Qwen3-0.6B | ~600M | Frozen |
+### How It Works
+1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
+2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
+3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
+The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`. For example, 100 encoder frames become 20 projected frames, roughly a 5x reduction.
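The length formula above can be illustrated with a small pure-Python sketch. It assumes a window and stride of 5 frames (inferred from the formula) and assumes the stacked frames are concatenated along the feature dimension. The actual projector code may combine them differently, and the function names here are hypothetical.

```python
def stacked_length(input_len: int, k: int = 5) -> int:
    """Output length after grouping k frames at a time, per the formula above."""
    return (input_len - k) // k + 1


def stack_frames(frames: list[list[float]], k: int = 5) -> list[list[float]]:
    """Concatenate each run of k consecutive frames into one wider frame.

    Trailing frames that do not fill a complete group are dropped.
    """
    stacked = []
    for i in range(stacked_length(len(frames), k)):
        group = frames[i * k:(i + 1) * k]
        stacked.append([value for frame in group for value in frame])
    return stacked


# Toy input: 12 frames of width 2 (real encoder frames are much wider)
frames = [[float(t), float(t) + 0.5] for t in range(12)]
stacked = stack_frames(frames)
print(len(frames), "->", len(stacked), "frames of width", len(stacked[0]))
```

Here 12 input frames yield 2 stacked frames of width 10, and the last 2 input frames are discarded because they do not fill a full group.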
+## Model Specifications
+| Specification | Value |
+|---------------|-------|
+| Input | Audio (16kHz mono) |
+| Output | Text transcription |
+| Max Audio Length | ~30 seconds (limited by encoder) |
+| Vocabulary | Qwen3 tokenizer |
+| Languages | English only |
+| Generation | Greedy decoding (num_beams=1, do_sample=False) |
+## Training Details
+| | |
+|---|---|
+| **Dataset** | LoquaciousSet (25,000 hours) |
+| **Hardware** | Single NVIDIA A40 |
+| **Time** | ~24 hours |
+| **Cost** | ~$12 |
+| **Optimizer** | AdamW |
+| **Learning Rate** | 1e-4 |
+| **Batch Size** | 4 |
+| **Steps** | 50,000 |
+## Limitations
+- **English only**: Not trained on other languages
+- **Sample rate**: Expects 16kHz audio (other rates are resampled automatically; raw numpy arrays must already be 16kHz)
+- **Audio length**: Best for clips under 30 seconds
+- **Accuracy**: May degrade on:
+  - Heavily accented speech
+  - Noisy or low-quality audio
+  - Domain-specific terminology
+  - Overlapping speakers
+- **No punctuation**: Output is lowercase without punctuation by default
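For recordings longer than the roughly 30-second limit listed above, one simple workaround is to split the waveform into fixed windows before transcription. This is a rough sketch, not a feature of the model: `chunk_bounds` is a hypothetical helper, and naive fixed-size splitting can cut through words at chunk boundaries.

```python
SAMPLE_RATE = 16_000  # input rate stated in this card
MAX_SECONDS = 30      # approximate encoder limit stated in this card


def chunk_bounds(
    num_samples: int,
    max_seconds: int = MAX_SECONDS,
    sample_rate: int = SAMPLE_RATE,
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for windows of at most max_seconds."""
    window = max_seconds * sample_rate
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 70-second clip splits into windows of 30 s, 30 s, and 10 s
bounds = chunk_bounds(70 * SAMPLE_RATE)
print(bounds)
```

Each slice `audio[start:end]` of a 16kHz array could then be passed to the pipeline in the same way as the numpy example under Basic Transcription.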
+## Requirements
+```
+transformers>=4.40.0
+torch>=2.0.0
+torchaudio>=2.0.0
+```
+Optional, for the streaming and direct PyTorch examples:
+```
+librosa
+soundfile
+```
+## Files
+| File | Description |
+|------|-------------|
+| `config.json` | Model configuration |
+| `model.safetensors` | Projector weights (~48MB) |
+| `preprocessor_config.json` | Audio preprocessing config |
+| `tokenizer.json` | Tokenizer |
+| `tokenizer_config.json` | Tokenizer config |
+| `special_tokens_map.json` | Special tokens |
+Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective Hugging Face repos.
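As a quick sanity check, the ~48MB file size is consistent with the ~12M projector parameters if they are stored as 32-bit floats. The float32 storage format is an assumption here, not something the card states.

```python
projector_params = 12_000_000  # "~12M" from the architecture table
bytes_per_param = 4            # assumption: float32 storage

size_mb = projector_params * bytes_per_param / 1e6
print(f"{size_mb:.0f} MB")
```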
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{tinyaudio2024,
+  author = {Alex Kroman},
+  title = {Tiny Audio: Minimal ASR Training},
+  year = {2024},
+  publisher = {GitHub},
+  url = {https://github.com/alexkroman/tiny-audio}
+}
+```
+## Links
+- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
+- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
+- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
+## Acknowledgments
+- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
+- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
+- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
+## License
+MIT