---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
---

# videoloc/seamless-basic
|
## Model Description

This is a **SeamlessBasic** model that takes audio and text inputs and predicts the **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) an editor would need to refine that segment.

The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset of audio-subtitle pairs annotated with editing times.

### Key Features

- **Multimodal Processing**: Processes audio (16 kHz) and text inputs together
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders, kept frozen for stability
- **TTE Prediction**: Predicts the editing time required for each subtitle segment
- **Direct Output**: Returns raw time values in seconds for immediate use
|
## Model Architecture

The model consists of the following components:

1. **Audio Processing**:
   - The SeamlessM4T speech encoder (frozen) processes the raw audio input
   - An audio projection layer maps the speech encoder output to 1024 dimensions
   - Mean pooling over the sequence length yields a fixed-size audio embedding

2. **Text Processing**:
   - The SeamlessM4T text encoder (frozen) processes the tokenized text input
   - A text projection layer maps the text encoder output to 1024 dimensions
   - Mean pooling over the sequence length yields a fixed-size text embedding

3. **Feature Fusion**:
   - The audio and text embeddings are concatenated (2048 dimensions in total)
   - No additional cross-modal attention or complex fusion mechanisms are used

4. **Regression Head**:
   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)

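The pooling and fusion steps above can be illustrated with a small dependency-free sketch. It uses toy 2-dimensional vectors in place of the 1024-dimensional projections, and the helper names (`mean_pool`, `fuse`, `mlp_param_count`) are hypothetical, not taken from the repository. The parameter count assumes every layer of the head is a plain linear layer with a bias term, which the card does not state.

```python
from typing import List, Sequence

# Layer widths of the regression head as listed in this card.
HEAD_SIZES = [2048, 1024, 512, 256, 1]


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average a (seq_len, dim) list of vectors over the sequence axis."""
    if not frames:
        raise ValueError("cannot pool an empty sequence")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def fuse(audio_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Concatenate the two pooled embeddings (no cross-modal attention)."""
    return list(audio_emb) + list(text_emb)


def mlp_param_count(sizes: Sequence[int], bias: bool = True) -> int:
    """Count weights (and biases) of a stack of linear layers.

    Assumption: each layer is a dense layer with bias; dropout and ReLU
    add no parameters.
    """
    return sum(i * o + (o if bias else 0) for i, o in zip(sizes, sizes[1:]))


# Toy inputs: 2 audio frames and 3 text tokens, each a 2-dimensional vector.
audio_frames = [[1.0, 3.0], [3.0, 5.0]]
text_tokens = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]

fused = fuse(mean_pool(audio_frames), mean_pool(text_tokens))
head_params = mlp_param_count(HEAD_SIZES)

print(fused)        # [2.0, 4.0, 2.0, 4.0]
print(head_params)  # 2754561
```

Because the sequence axis is averaged away before concatenation, the fused vector has the same size regardless of audio duration or text length. Whether the actual model excludes padding positions from the average is not stated here, so this sketch averages every position.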
|
## Quick Start

### Installation

```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage

```python
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_file)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the config and model using the custom classes
config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic")
model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: No translation features needed for basic model
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```
|
## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16 kHz, max 8.0 seconds
- **Text Input**: Max 256 tokens
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction
|
## Data Format

Your input data should be a list of dictionaries with the following keys:
- `raw_audio`: NumPy array of audio samples (16 kHz sampling rate)
- `raw_text`: String containing the subtitle text
- `labels`: Target TTE value in seconds (optional, for training)

Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```

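Before handing items to the collator, it can help to sanity-check them against the format above. The following is a minimal sketch under stated assumptions: `check_item` is a hypothetical helper, not part of this repository, and the 8.0-second limit is simply the `max_audio_length_sec` value used in Basic Usage. How the collator itself treats over-length audio is not documented in this card.

```python
from numbers import Real
from typing import Any, Dict, List

SAMPLE_RATE = 16_000   # Hz, the sampling rate this card specifies
MAX_AUDIO_SEC = 8.0    # same value as max_audio_length_sec in Basic Usage


def check_item(item: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = []

    audio = item.get("raw_audio")
    if audio is None or len(audio) == 0:
        problems.append("raw_audio is missing or empty")
    elif len(audio) / SAMPLE_RATE > MAX_AUDIO_SEC:
        problems.append("raw_audio exceeds the 8.0 s maximum")

    text = item.get("raw_text")
    if not isinstance(text, str) or not text.strip():
        problems.append("raw_text must be a non-empty string")

    label = item.get("labels")  # optional, only needed for training
    if label is not None and (not isinstance(label, Real) or label < 0):
        problems.append("labels must be a non-negative number of seconds")

    return problems


# A plain list stands in for a NumPy array here; len() works on both.
good = {"raw_audio": [0.0] * (SAMPLE_RATE * 5), "raw_text": "Subtitle text content", "labels": 2.5}
bad = {"raw_audio": [0.0] * (SAMPLE_RATE * 9), "raw_text": "   "}

good_problems = check_item(good)
bad_problems = check_item(bad)
print(good_problems)
print(bad_problems)
```

Returning a list of messages instead of raising on the first failure makes it easier to report every problem in a large dataset in one pass.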
|
## Performance Metrics

- **Best Eval RMSE**: 33.34 (in seconds, since TTE targets are not normalized)

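For reference, RMSE is the square root of the mean squared difference between predicted and target TTE values. The sketch below shows the computation on made-up numbers; the `rmse` helper and the example values are illustrative only and are not drawn from the evaluation set.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up TTE values in seconds: the errors are 1 s and 7 s.
example_rmse = rmse([11.0, 27.0], [10.0, 20.0])
print(example_rmse)  # 5.0
```

Because the errors are squared before averaging, a few segments with very large editing times can dominate the score.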
|
## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16 kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)

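To make the learning-rate settings concrete, here is a dependency-free sketch of a linear warmup followed by cosine decay with hard restarts. It is modeled on the shape of the `cosine_with_restarts` scheduler in Transformers, but it is an approximation written for illustration. The total step count (`TOTAL_STEPS = 1000`) and `num_cycles=1` are assumptions; neither value is given in this card.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 1.2e-4,
               warmup_ratio: float = 0.05, num_cycles: int = 1) -> float:
    """Learning rate at a given optimizer step (warmup, then cosine with restarts)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays from the peak towards 0, then restarts at the peak.
    cycle_position = (num_cycles * progress) % 1.0
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


TOTAL_STEPS = 1000  # hypothetical; the real value depends on dataset size and epochs
schedule = [lr_at_step(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]

print(f"start: {schedule[0]:.2e}, peak: {max(schedule):.2e}, end: {schedule[-1]:.2e}")
```

With a single cycle the curve is an ordinary cosine decay; restarts only appear when `num_cycles` is greater than 1.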
|
## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16 kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency

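A seeded 80/20 split can be sketched as follows. This only illustrates the idea of a reproducible split: the helper `train_test_split_indices` is hypothetical, and the actual split was presumably produced with a different library, so these indices will not match the ones used in training.

```python
import random
from typing import List, Tuple


def train_test_split_indices(n_items: int, test_fraction: float = 0.2,
                             seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle indices with a fixed seed and split them into train and test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, global state is untouched
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

Fixing the seed means the same items land in the test set on every run, which keeps evaluation numbers comparable across experiments.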
|
## Usage Notes

- This is the **basic** variant: it processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- The model expects 16 kHz audio input (automatically resampled by the data collator)
- Text is processed with the SeamlessM4T text encoder
- No feature normalization is applied; the model outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks
|
## Limitations

- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use the included data collator)
|
## Related Models

- **seamless-translation**: Adds translation awareness features
- **seamless-langpairs**: Includes language pair embeddings for multilingual scenarios
|