Training in progress - step 1000

Files changed (15)

README.md +199 -0
alignment.py +283 -0
asr_config.py +233 -0
asr_modeling.py +906 -0
asr_pipeline.py +322 -0
asr_processing.py +133 -0
chat_template.jinja +6 -0
config.json +436 -0
diarization.py +732 -0
generation_config.json +18 -0
model.safetensors +3 -0
preprocessor_config.json +18 -0
projectors.py +505 -0
tokenizer.json +0 -0
tokenizer_config.json +18 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

alignment.py ADDED

	@@ -0,0 +1,283 @@

+"""Forced alignment for word-level timestamps using Wav2Vec2."""
+import numpy as np
+import torch
+def _get_device() -> str:
+    """Get best available device for non-transformers models."""
+    if torch.cuda.is_available():
+        return "cuda"
+    if torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+class ForcedAligner:
+    """Lazy-loaded forced aligner for word-level timestamps using torchaudio wav2vec2.
+    Uses Viterbi trellis algorithm for optimal alignment path finding.
+    """
+    _bundle = None
+    _model = None
+    _labels = None
+    _dictionary = None
+    @classmethod
+    def get_instance(cls, device: str = "cuda"):
+        """Get or create the forced alignment model (singleton).
+        Args:
+            device: Device to run model on ("cuda", "mps", or "cpu"); only used on the first call, when the model is loaded
+        Returns:
+            Tuple of (model, labels, dictionary)
+        """
+        if cls._model is None:
+            import torchaudio
+            cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
+            cls._model = cls._bundle.get_model().to(device)
+            cls._model.eval()
+            cls._labels = cls._bundle.get_labels()
+            cls._dictionary = {c: i for i, c in enumerate(cls._labels)}
+        return cls._model, cls._labels, cls._dictionary
+    @staticmethod
+    def _get_trellis(emission: torch.Tensor, tokens: list[int], blank_id: int = 0) -> torch.Tensor:
+        """Build trellis for forced alignment using the Viterbi (max) recursion.
+        trellis[t, j] is the log probability of the best path that aligns the
+        first j tokens to the first t frames.
+        Args:
+            emission: Log-softmax emission matrix of shape (num_frames, num_classes)
+            tokens: List of target token indices
+            blank_id: Index of the blank/CTC token (default 0)
+        Returns:
+            Trellis matrix of shape (num_frames + 1, num_tokens + 1)
+        """
+        num_frames = emission.size(0)
+        num_tokens = len(tokens)
+        trellis = torch.full((num_frames + 1, num_tokens + 1), -float("inf"))
+        trellis[0, 0] = 0
+        for t in range(num_frames):
+            for j in range(num_tokens + 1):
+                # Stay: emit blank and stay at j tokens
+                stay = trellis[t, j] + emission[t, blank_id]
+                # Move: emit token j and advance to j+1 tokens
+                move = trellis[t, j - 1] + emission[t, tokens[j - 1]] if j > 0 else -float("inf")
+                trellis[t + 1, j] = max(stay, move)  # Viterbi: take best path
+        return trellis
+    @staticmethod
+    def _backtrack(
+        trellis: torch.Tensor, emission: torch.Tensor, tokens: list[int], blank_id: int = 0
+    ) -> list[tuple[int, float, float]]:
+        """Backtrack through trellis to find optimal forced monotonic alignment.
+        Guarantees:
+        - All tokens are emitted exactly once
+        - Strictly monotonic: each token's frames come after previous token's
+        - No frame skipping or token teleporting
+        Returns list of (token_id, start_frame, end_frame) for each token.
+        """
+        num_frames = emission.size(0)
+        num_tokens = len(tokens)
+        if num_tokens == 0:
+            return []
+        # Find the best ending point (should be at num_tokens)
+        # But verify trellis reached a valid state
+        if trellis[num_frames, num_tokens] == -float("inf"):
+            # Alignment failed - fall back to uniform distribution
+            frames_per_token = num_frames / num_tokens
+            return [
+                (tokens[i], i * frames_per_token, (i + 1) * frames_per_token)
+                for i in range(num_tokens)
+            ]
+        # Backtrack: find where each token transition occurred
+        # token_frames[i] = frames at which token i was emitted
+        token_frames: list[list[int]] = [[] for _ in range(num_tokens)]
+        t = num_frames
+        j = num_tokens
+        while t > 0 and j > 0:
+            # Check: did we transition from j-1 to j at frame t-1?
+            stay_score = trellis[t - 1, j] + emission[t - 1, blank_id]
+            move_score = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
+            if move_score >= stay_score:
+                # Token j-1 was emitted at frame t-1
+                token_frames[j - 1].insert(0, t - 1)
+                j -= 1
+            # Always decrement time (monotonic)
+            t -= 1
+        # Handle any remaining tokens at the start (edge case)
+        while j > 0:
+            token_frames[j - 1].insert(0, 0)
+            j -= 1
+        # Convert to spans
+        token_spans: list[tuple[int, float, float]] = []
+        for token_idx, frames in enumerate(token_frames):
+            if not frames:
+                # Token never emitted - assign minimal span after previous
+                if token_spans:
+                    prev_end = token_spans[-1][2]
+                    frames = [int(prev_end)]
+                else:
+                    frames = [0]
+            token_id = tokens[token_idx]
+            start_frame = float(min(frames))
+            end_frame = float(max(frames)) + 1.0
+            token_spans.append((token_id, start_frame, end_frame))
+        return token_spans
+    # Offset compensation for Wav2Vec2-BASE systematic bias (in seconds)
+    # Calibrated on librispeech-alignments dataset
+    START_OFFSET = 0.06  # Subtracted from start times (shifts starts earlier)
+    END_OFFSET = -0.03  # Subtracted from end times; negative, so ends shift later
+    @classmethod
+    def align(
+        cls,
+        audio: np.ndarray,
+        text: str,
+        sample_rate: int = 16000,
+        _language: str = "eng",
+        _batch_size: int = 16,
+    ) -> list[dict]:
+        """Align transcript to audio and return word-level timestamps.
+        Uses Viterbi trellis algorithm for optimal forced alignment.
+        Args:
+            audio: Audio waveform as numpy array
+            text: Transcript text to align
+            sample_rate: Audio sample rate (default 16000)
+            _language: ISO-639-3 language code (default "eng" for English, unused)
+            _batch_size: Batch size for alignment model (unused)
+        Returns:
+            List of dicts with 'word', 'start', 'end' keys
+        """
+        import torchaudio
+        device = _get_device()
+        model, _labels, dictionary = cls.get_instance(device)
+        assert cls._bundle is not None and dictionary is not None  # Initialized by get_instance
+        # Convert audio to tensor (copy to ensure array is writable)
+        if isinstance(audio, np.ndarray):
+            waveform = torch.from_numpy(audio.copy()).float()
+        else:
+            waveform = audio.clone().float()
+        # Ensure 2D (channels, time)
+        if waveform.dim() == 1:
+            waveform = waveform.unsqueeze(0)
+        # Resample if needed (wav2vec2 expects 16kHz)
+        if sample_rate != cls._bundle.sample_rate:
+            waveform = torchaudio.functional.resample(
+                waveform, sample_rate, cls._bundle.sample_rate
+            )
+        waveform = waveform.to(device)
+        # Get emissions from model
+        with torch.inference_mode():
+            emissions, _ = model(waveform)
+            emissions = torch.log_softmax(emissions, dim=-1)
+        emission = emissions[0].cpu()
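+        # Normalize text: uppercase; characters outside the label set are dropped below
+        transcript = text.upper()
+        # Build tokens from transcript (including word separators)
+        tokens = []
+        for char in transcript:
+            if char == " ":
+                tokens.append(dictionary.get("|", dictionary.get(" ", 0)))
+            elif char in dictionary and dictionary[char] != 0:
+                # Index 0 is the CTC blank (blank_id=0 below), so a character that
+                # maps to it (e.g. "-" when the label set uses it for blank) must
+                # not become an alignment target
+                tokens.append(dictionary[char])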
+        if not tokens:
+            return []
+        # Build Viterbi trellis and backtrack for optimal path
+        trellis = cls._get_trellis(emission, tokens, blank_id=0)
+        alignment_path = cls._backtrack(trellis, emission, tokens, blank_id=0)
+        # Convert frame indices to time (model stride is 320 samples at 16kHz = 20ms)
+        frame_duration = 320 / cls._bundle.sample_rate
+        # Apply separate offset compensation for start/end (Wav2Vec2 systematic bias)
+        start_offset = cls.START_OFFSET
+        end_offset = cls.END_OFFSET
+        # Group aligned tokens into words based on pipe separator
+        words = text.split()
+        word_timestamps = []
+        current_word_start = None
+        current_word_end = None
+        word_idx = 0
+        separator_id = dictionary.get("|", dictionary.get(" ", 0))
+        for token_id, start_frame, end_frame in alignment_path:
+            if token_id == separator_id:  # Word separator
+                if (
+                    current_word_start is not None
+                    and current_word_end is not None
+                    and word_idx < len(words)
+                ):
+                    start_time = max(0.0, current_word_start * frame_duration - start_offset)
+                    end_time = max(0.0, current_word_end * frame_duration - end_offset)
+                    word_timestamps.append(
+                        {
+                            "word": words[word_idx],
+                            "start": start_time,
+                            "end": end_time,
+                        }
+                    )
+                    word_idx += 1
+                current_word_start = None
+                current_word_end = None
+            else:
+                if current_word_start is None:
+                    current_word_start = start_frame
+                current_word_end = end_frame
+        # Don't forget the last word
+        if (
+            current_word_start is not None
+            and current_word_end is not None
+            and word_idx < len(words)
+        ):
+            start_time = max(0.0, current_word_start * frame_duration - start_offset)
+            end_time = max(0.0, current_word_end * frame_duration - end_offset)
+            word_timestamps.append(
+                {
+                    "word": words[word_idx],
+                    "start": start_time,
+                    "end": end_time,
+                }
+            )
+        return word_timestamps
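
Review note on alignment.py: the trellis and backtrack steps are easier to follow on a toy input. The sketch below is not the code from this commit. It re-implements the same recursion in plain Python (no torch), assuming a three-class label set (blank, A, B) and a hand-made four-frame emission matrix. The names `build_trellis`, `backtrack`, and `span_to_seconds` are hypothetical and chosen only for illustration; the 20 ms frame duration and the 0.06 / -0.03 offsets are taken from the diff above.

```python
import math

NEG_INF = float("-inf")


def build_trellis(emission, tokens, blank_id=0):
    """Viterbi trellis: best log-prob of aligning the first j tokens to the first t frames."""
    num_frames, num_tokens = len(emission), len(tokens)
    trellis = [[NEG_INF] * (num_tokens + 1) for _ in range(num_frames + 1)]
    trellis[0][0] = 0.0
    for t in range(num_frames):
        for j in range(num_tokens + 1):
            # Stay: emit blank at frame t, token count unchanged
            stay = trellis[t][j] + emission[t][blank_id]
            # Move: emit token j-1 at frame t, token count goes from j-1 to j
            move = trellis[t][j - 1] + emission[t][tokens[j - 1]] if j > 0 else NEG_INF
            trellis[t + 1][j] = max(stay, move)
    return trellis


def backtrack(trellis, emission, tokens, blank_id=0):
    """Walk back from the final cell and record the frame at which each token was emitted."""
    t, j = len(emission), len(tokens)
    frames = [None] * len(tokens)
    while t > 0 and j > 0:
        stay = trellis[t - 1][j] + emission[t - 1][blank_id]
        move = trellis[t - 1][j - 1] + emission[t - 1][tokens[j - 1]]
        if move >= stay:
            frames[j - 1] = t - 1
            j -= 1
        t -= 1
    return frames


def span_to_seconds(start_frame, end_frame, frame_duration=320 / 16000,
                    start_offset=0.06, end_offset=-0.03):
    """Convert a frame span to seconds; both offsets are subtracted, then clamped at 0."""
    start = max(0.0, start_frame * frame_duration - start_offset)
    end = max(0.0, end_frame * frame_duration - end_offset)
    return start, end


# Toy emission (assumed values): rows are frames, columns are [blank, A, B]
probs = [
    [0.8, 0.1, 0.1],  # frame 0: mostly blank
    [0.1, 0.8, 0.1],  # frame 1: mostly A
    [0.8, 0.1, 0.1],  # frame 2: mostly blank
    [0.1, 0.1, 0.8],  # frame 3: mostly B
]
emission = [[math.log(p) for p in row] for row in probs]
tokens = [1, 2]  # target sequence "A B"

trellis = build_trellis(emission, tokens)
emitted = backtrack(trellis, emission, tokens)
print(emitted)                  # frame index per token
print(span_to_seconds(50, 60))  # span for frames 50..60
```

In this formulation every token consumes exactly one frame on a "move" step and all other frames are explained by blank, so a word's span runs from its first token's frame to the frame after its last token.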

asr_config.py ADDED

	@@ -0,0 +1,233 @@

+from typing import Optional
+import transformers
+class ASRConfig(transformers.PretrainedConfig):
+    """Configuration class for the ASR model.
+    This config combines settings for:
+    - Audio encoder (GLM-ASR/Whisper)
+    - Text decoder (Qwen)
+    - Projector (MLP, MOSA, MoE, QFormer)
+    - Generation parameters
+    - Training options (SpecAugment, LoRA)
+    """
+    model_type = "asr_model"
+    is_composition = True
+    def __init__(
+        self,
+        audio_model_id: str = "zai-org/GLM-ASR-Nano-2512",
+        text_model_id: str = "Qwen/Qwen3-0.6B",
+        attn_implementation: str = "flash_attention_2",
+        model_dtype: str = "bfloat16",
+        num_beams: Optional[int] = None,
+        system_prompt: str = "You are a helpful assistant.",
+        encoder_dim: Optional[int] = None,
+        llm_dim: Optional[int] = None,
+        # Encoder conv layers: list of (padding, kernel_size, stride) tuples
+        # Default is Whisper/GLM-ASR structure: conv1(k=3,s=1,p=1) + conv2(k=3,s=2,p=1)
+        encoder_conv_layers: Optional[list] = None,
+        audio_sample_rate: int = 16000,
+        projector_pool_stride: int = 4,
+        downsample_rate: int = 5,  # Granite default
+        projector_hidden_dim: Optional[int] = None,
+        projector_type: str = "mlp",  # "mlp", "mosa", "moe", "qformer"
+        projector_num_layers: int = 2,  # Number of layers in MLP projector
+        projector_init_std: float = 0.02,  # Weight initialization std
+        projector_dropout: float = 0.0,  # Dropout rate for projector layers
+        # MoE-specific configuration
+        num_experts: int = 4,  # Number of experts in MoE projectors
+        num_experts_per_tok: int = 2,  # Top-k experts per token
+        router_aux_loss_coef: float = 0.01,  # Auxiliary loss coefficient for load balancing
+        # QFormer-specific configuration (Granite defaults)
+        qformer_window_size: int = 15,  # Window size for QFormer processing
+        qformer_hidden_size: Optional[int] = None,  # QFormer hidden size (defaults to encoder_dim)
+        qformer_num_layers: int = 2,  # Number of QFormer transformer layers
+        qformer_num_heads: int = 16,  # Number of attention heads in QFormer
+        qformer_intermediate_size: Optional[int] = None,  # FFN size (defaults to 4x hidden)
+        label_smoothing: float = 0.0,  # Label smoothing for cross-entropy loss
+        inference_warmup_tokens: int = 10,
+        # SpecAugment settings
+        use_specaugment: bool = False,
+        num_time_masks: int = 2,
+        time_mask_length: int = 10,
+        num_freq_masks: int = 0,
+        freq_mask_length: int = 10,
+        # LoRA configuration (for Stage 2 fine-tuning)
+        use_lora: bool = False,
+        lora_rank: int = 8,  # SALMONN default
+        lora_alpha: int = 32,  # SALMONN default (scaling factor 4.0)
+        lora_dropout: float = 0.0,
+        lora_target_modules: Optional[list] = None,  # Default: all linear layers
+        freeze_projector: bool = False,  # True for Stage 2 (LoRA-only training)
+        do_sample: bool = False,
+        temperature: Optional[float] = None,
+        top_p: Optional[float] = None,
+        top_k: Optional[int] = None,
+        max_new_tokens: Optional[int] = None,
+        min_new_tokens: Optional[int] = None,
+        repetition_penalty: Optional[float] = None,
+        length_penalty: Optional[float] = None,
+        no_repeat_ngram_size: Optional[int] = None,
+        use_cache: Optional[bool] = None,
+        **kwargs,
+    ):
+        """Initialize ASR model configuration.
+        Args:
+            audio_model_id: HuggingFace model ID for audio encoder (GLM-ASR/Whisper)
+            text_model_id: HuggingFace model ID for text decoder (Qwen)
+            attn_implementation: Attention implementation ("flash_attention_2", "sdpa", "eager")
+            model_dtype: Model dtype ("bfloat16", "float16", "float32")
+            projector_type: Projector architecture ("mlp", "mosa", "moe", "qformer")
+            use_lora: Enable LoRA adapters for Stage 2 fine-tuning
+            use_specaugment: Enable SpecAugment data augmentation
+        """
+        # Set default generation parameters (greedy decoding only)
+        generation_defaults = {
+            "num_beams": 1,
+            "max_new_tokens": 128,
+            "min_new_tokens": 0,
+            "repetition_penalty": 1.0,
+            "length_penalty": 1.0,
+            "no_repeat_ngram_size": 0,  # 0 disables n-gram blocking (3 would block repeats like "so so so")
+            "use_cache": True,
+        }
+        # Apply defaults (config.json values take precedence)
+        kwargs = {**generation_defaults, **kwargs}
+        self.audio_model_id = audio_model_id
+        self.text_model_id = text_model_id
+        self.attn_implementation = attn_implementation
+        self.model_dtype = model_dtype
+        self.system_prompt = system_prompt
+        self.encoder_dim = encoder_dim
+        self.llm_dim = llm_dim
+        # Default conv layers for Whisper/GLM-ASR: [(pad, kernel, stride), ...]
+        self.encoder_conv_layers = encoder_conv_layers or [(1, 3, 1), (1, 3, 2)]
+        self.audio_sample_rate = audio_sample_rate
+        self.projector_init_std = projector_init_std
+        self.projector_pool_stride = projector_pool_stride
+        self.downsample_rate = downsample_rate
+        self.projector_hidden_dim = projector_hidden_dim
+        self.projector_type = projector_type
+        self.projector_num_layers = projector_num_layers
+        self.projector_dropout = projector_dropout
+        # MoE-specific configuration
+        self.num_experts = num_experts
+        self.num_experts_per_tok = num_experts_per_tok
+        self.router_aux_loss_coef = router_aux_loss_coef
+        # QFormer-specific configuration
+        self.qformer_window_size = qformer_window_size
+        self.qformer_hidden_size = qformer_hidden_size
+        self.qformer_num_layers = qformer_num_layers
+        self.qformer_num_heads = qformer_num_heads
+        self.qformer_intermediate_size = qformer_intermediate_size
+        self.label_smoothing = label_smoothing
+        self.inference_warmup_tokens = inference_warmup_tokens
+        # SpecAugment configuration
+        self.use_specaugment = use_specaugment
+        self.num_time_masks = num_time_masks
+        self.time_mask_length = time_mask_length
+        self.num_freq_masks = num_freq_masks
+        self.freq_mask_length = freq_mask_length
+        # LoRA configuration
+        self.use_lora = use_lora
+        self.lora_rank = lora_rank
+        self.lora_alpha = lora_alpha
+        self.lora_dropout = lora_dropout
+        self.lora_target_modules = lora_target_modules or [
+            "q_proj",
+            "k_proj",
+            "v_proj",
+            "o_proj",
+            "gate_proj",
+            "up_proj",
+            "down_proj",
+        ]
+        self.freeze_projector = freeze_projector
+        # Generation parameters (use explicit value if provided, else use default)
+        self.num_beams = num_beams if num_beams is not None else generation_defaults["num_beams"]
+        self.max_new_tokens = (
+            max_new_tokens if max_new_tokens is not None else generation_defaults["max_new_tokens"]
+        )
+        self.min_new_tokens = (
+            min_new_tokens if min_new_tokens is not None else generation_defaults["min_new_tokens"]
+        )
+        self.repetition_penalty = (
+            repetition_penalty
+            if repetition_penalty is not None
+            else generation_defaults["repetition_penalty"]
+        )
+        self.length_penalty = (
+            length_penalty if length_penalty is not None else generation_defaults["length_penalty"]
+        )
+        self.no_repeat_ngram_size = (
+            no_repeat_ngram_size
+            if no_repeat_ngram_size is not None
+            else generation_defaults["no_repeat_ngram_size"]
+        )
+        self.use_cache = use_cache if use_cache is not None else generation_defaults["use_cache"]
+        self.do_sample = do_sample
+        self.temperature = temperature
+        self.top_p = top_p
+        self.top_k = top_k
+        if "audio_config" not in kwargs:
+            self.audio_config = transformers.AutoConfig.from_pretrained(audio_model_id)
+            # Override dtype to match model_dtype
+            self.audio_config.dtype = model_dtype
+        else:
+            self.audio_config = kwargs.pop("audio_config")
+        if "text_config" not in kwargs:
+            self.text_config = transformers.AutoConfig.from_pretrained(
+                text_model_id, trust_remote_code=True
+            )
+            # Override dtype to match model_dtype
+            self.text_config.dtype = model_dtype
+        else:
+            self.text_config = kwargs.pop("text_config")
+        if isinstance(self.text_config, dict):
+            # Reconstruct config from dict using the model_type stored in the dict
+            model_type = self.text_config["model_type"]
+            config_class = transformers.AutoConfig.for_model(model_type).__class__
+            self.text_config = config_class(**self.text_config)
+        if isinstance(self.audio_config, dict):
+            model_type = self.audio_config.get("model_type")
+            if model_type:
+                config_class = transformers.AutoConfig.for_model(model_type).__class__
+                self.audio_config = config_class(**self.audio_config)
+        super().__init__(**kwargs)
+        # Point encoder to audio_config so pipeline uses correct feature extractor
+        # The pipeline looks for config.encoder._name_or_path for feature extractor
+        self.encoder = self.audio_config
+        self.auto_map = {
+            "AutoConfig": "asr_config.ASRConfig",
+            "AutoModel": "asr_modeling.ASRModel",
+            "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
+            "AutoProcessor": "asr_processing.ASRProcessor",
+        }
+        self.custom_pipelines = {
+            "automatic-speech-recognition": {
+                "impl": "asr_pipeline.ASRPipeline",
+                "pt": ["AutoModelForSpeechSeq2Seq"],
+                "tf": [],
+                "type": "audio",
+            }
+        }
+        self.architectures = ["ASRModel"]
+        self.pipeline_tag = "automatic-speech-recognition"
+transformers.AutoConfig.register("asr_model", ASRConfig)
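
Review note on asr_config.py: two of the defaults above encode small pieces of arithmetic that are worth checking by hand. The sketch below is illustrative only; how the model consumes `encoder_conv_layers` is not shown in this part of the diff, so the length computation assumes the standard 1-D convolution output formula applied to each `(padding, kernel_size, stride)` tuple in order. The 3000-frame input is an assumed example (30 s of audio at a 10 ms hop), and `conv_output_length` and `lora_scaling` are hypothetical helper names.

```python
def conv_output_length(length: int, layers) -> int:
    """Apply floor((L + 2*padding - kernel) / stride) + 1 for each conv layer in order."""
    for padding, kernel_size, stride in layers:
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length


def lora_scaling(lora_alpha: int, lora_rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update: alpha / rank."""
    return lora_alpha / lora_rank


# Default from the config: conv1(k=3, s=1, p=1) followed by conv2(k=3, s=2, p=1)
DEFAULT_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]

print(conv_output_length(3000, DEFAULT_CONV_LAYERS))  # encoder frames for 3000 input frames
print(lora_scaling(32, 8))                            # matches the "scaling factor 4.0" comment
```

With the defaults, the first layer preserves length and the second halves it, so any further reduction (for example by `projector_pool_stride`) would apply on top of that halved frame count.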

asr_modeling.py ADDED

	@@ -0,0 +1,906 @@

+import json
+from pathlib import Path
+from threading import Thread
+from typing import Iterator, Optional, Union
+import torch
+import torch.nn as nn
+from transformers import (
+    AutoConfig,
+    AutoModel,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    PreTrainedModel,
+    TextIteratorStreamer,
+)
+from transformers.generation import GenerationMixin
+from transformers.modeling_outputs import CausalLMOutputWithPast
+try:
+    from .asr_config import ASRConfig
+    from .projectors import PROJECTOR_CLASSES
+except ImportError:
+    from asr_config import ASRConfig  # type: ignore[no-redef]
+    from projectors import PROJECTOR_CLASSES  # type: ignore[no-redef]
+from torchaudio.transforms import SpecAugment
+class ASRModel(PreTrainedModel, GenerationMixin):
+    """Audio-to-text model combining an audio encoder, projector, and language model."""
+    config_class = ASRConfig
+    base_model_prefix = "model"
+    main_input_name = "input_features"
+    _supports_flash_attn_2 = True
+    supports_gradient_checkpointing = True
+    _is_loading_from_pretrained: bool = False
+    _pretrained_model_path: Optional[str] = None
+    TRANSCRIBE_PROMPT = ""
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: str, *args, **kwargs) -> "ASRModel":
+        """Load model from pretrained, handling device placement correctly."""
+        from safetensors.torch import load_file
+        from transformers.utils.hub import cached_file
+        config = kwargs.pop("config", None)
+        if config is None:
+            config = ASRConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        # Set flag to avoid device_map="auto" in sub-model loaders
+        cls._is_loading_from_pretrained = True
+        cls._pretrained_model_path = pretrained_model_name_or_path
+        try:
+            model = cls(config, **kwargs)
+            # Load projector weights from safetensors
+            subfolder = kwargs.get("subfolder")
+            revision = kwargs.get("revision")
+            cache_kwargs = {}
+            if subfolder:
+                cache_kwargs["subfolder"] = subfolder
+            if revision:
+                cache_kwargs["revision"] = revision
+            model_file = cached_file(
+                pretrained_model_name_or_path,
+                "model.safetensors",
+                _raise_exceptions_for_missing_entries=False,
+                **cache_kwargs,
+            )
+            if model_file is not None:
+                state_dict = load_file(model_file)
+                model.load_state_dict(state_dict, strict=False)
+            # Load LoRA adapters if use_lora is enabled
+            if getattr(config, "use_lora", False):
+                # Check for adapter_config.json (required by PEFT to load adapters)
+                adapter_config_file = cached_file(
+                    pretrained_model_name_or_path,
+                    "adapter_config.json",
+                    _raise_exceptions_for_missing_entries=False,
+                    **cache_kwargs,
+                )
+                if adapter_config_file is not None:
+                    # Load saved adapter weights using the original repo_id/path
+                    # PEFT handles Hub downloads and caching internally
+                    from peft import PeftModel
+                    model.language_model = PeftModel.from_pretrained(
+                        model.language_model,
+                        pretrained_model_name_or_path,
+                        is_trainable=True,
+                        **cache_kwargs,
+                    )
+                else:
+                    # No saved adapters - initialize fresh LLM LoRA for training
+                    from peft import LoraConfig, get_peft_model
+                    lora_config = LoraConfig(
+                        r=config.lora_rank,
+                        lora_alpha=config.lora_alpha,
+                        target_modules=config.lora_target_modules,
+                        lora_dropout=config.lora_dropout,
+                        bias="none",
+                        task_type="CAUSAL_LM",
+                    )
+                    model.language_model = get_peft_model(model.language_model, lora_config)
+            return model
+        finally:
+            cls._is_loading_from_pretrained = False
+            cls._pretrained_model_path = None
+    def __init__(self, config: ASRConfig, **kwargs) -> None:
+        super().__init__(config)
+        self.system_prompt = config.system_prompt
+        target_dtype = getattr(torch, config.model_dtype)
+        # Audio encoder (frozen)
+        self.audio_tower = self._load_audio_encoder(config, target_dtype)
+        # Language model (frozen)
+        self.language_model = self._load_language_model(config, target_dtype)
+        # Initialize tokenizer and special tokens
+        self._init_tokenizer(config)
+        # Set up generation config with greedy decoding defaults
+        self.generation_config = self.language_model.generation_config
+        self.generation_config.max_new_tokens = config.max_new_tokens
+        self.generation_config.min_new_tokens = config.min_new_tokens
+        self.generation_config.num_beams = config.num_beams
+        self.generation_config.do_sample = config.do_sample
+        # Set sampling params from config (None means use model defaults)
+        self.generation_config.temperature = config.temperature
+        self.generation_config.top_p = config.top_p
+        self.generation_config.top_k = config.top_k
+        self.generation_config.use_cache = config.use_cache
+        self.generation_config.length_penalty = config.length_penalty
+        self.generation_config.repetition_penalty = config.repetition_penalty
+        self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
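+        # Set EOS tokens, keeping only candidates that exist in the tokenizer vocab.
+        # convert_tokens_to_ids() can map unknown tokens to unk_token_id rather than
+        # None when the tokenizer defines an unk token, so check membership explicitly.
+        vocab = self.tokenizer.get_vocab()
+        eos_candidates = [
+            self.tokenizer.convert_tokens_to_ids(token)
+            for token in ("<|im_end|>", "<|endoftext|>")
+            if token in vocab
+        ]
+        # Fall back to the tokenizer's own EOS if neither candidate exists
+        self.generation_config.eos_token_id = eos_candidates or self.tokenizer.eos_token_id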
+        self.generation_config.pad_token_id = self.tokenizer.pad_token_id
+        # Feature extractor for audio preprocessing
+        self.feature_extractor = self._create_feature_extractor(config)
+        # Audio projector (trainable unless freeze_projector is set)
+        self.projector = self._create_projector(config, target_dtype)
+        # Setup LoRA if enabled (Stage 2 fine-tuning)
+        # Skip if loading from pretrained - from_pretrained will handle adapter loading
+        if getattr(config, "use_lora", False) and not getattr(
+            self.__class__, "_is_loading_from_pretrained", False
+        ):
+            self._setup_lora(config)
+        # Freeze projector if specified (for Stage 2 LoRA-only training)
+        if getattr(config, "freeze_projector", False):
+            self.projector.requires_grad_(False)
+        # SpecAugment for data augmentation during training
+        if getattr(config, "use_specaugment", False):
+            self.spec_augment = SpecAugment(
+                n_time_masks=config.num_time_masks,
+                time_mask_param=config.time_mask_length,
+                n_freq_masks=config.num_freq_masks,
+                freq_mask_param=config.freq_mask_length,
+            )
+        else:
+            self.spec_augment = None
+        # For model parallelism
+        self._no_split_modules = getattr(self.language_model, "_no_split_modules", [])
+    def _create_feature_extractor(self, config: ASRConfig):
+        """Create the appropriate feature extractor for the audio encoder."""
+        from transformers import AutoFeatureExtractor
+        feature_extractor = AutoFeatureExtractor.from_pretrained(config.audio_model_id)
+        # Whisper's encoder requires a fixed 3000 mel frames (30s) and the
+        # feature extractor pads to that by default — leave it alone. Other
+        # encoders (e.g. GLM-ASR) accept variable-length input, so we disable
+        # padding to avoid wasting compute on silent frames.
+        if "whisper" not in config.audio_model_id.lower():
+            feature_extractor.padding = False
+        return feature_extractor
+    @classmethod
+    def _load_audio_encoder(cls, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
+        """Load and freeze the audio encoder."""
+        encoder_kwargs = {
+            "attn_implementation": config.attn_implementation,
+            "low_cpu_mem_usage": True,
+            "dtype": dtype,
+        }
+        if "whisper" in config.audio_model_id.lower():
+            from transformers import WhisperModel
+            full_model = WhisperModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+            encoder = full_model.encoder
+            del full_model
+        elif "glm" in config.audio_model_id.lower():
+            # GLM-ASR models use audio_tower as the encoder
+            # Requires transformers >= 5.x or installed from source
+            from transformers import AutoModelForSeq2SeqLM
+            full_model = AutoModelForSeq2SeqLM.from_pretrained(
+                config.audio_model_id, trust_remote_code=True, **encoder_kwargs
+            )
+            # GLM stores encoder at audio_tower (GlmAsrEncoder)
+            encoder = full_model.audio_tower
+            # Drop references to the LLM decoder and projector so their memory can be freed
+            full_model.language_model = None
+            full_model.multi_modal_projector = None
+            del full_model
+        else:
+            encoder = AutoModel.from_pretrained(config.audio_model_id, **encoder_kwargs)
+        encoder.requires_grad_(False)
+        encoder.eval()
+        return encoder
+    @classmethod
+    def _load_language_model(cls, config: ASRConfig, dtype: torch.dtype) -> PreTrainedModel:
+        """Load and freeze the language model."""
+        decoder_kwargs = {
+            "attn_implementation": config.attn_implementation,
+            "trust_remote_code": True,
+            "low_cpu_mem_usage": True,
+            "dtype": dtype,
+        }
+        decoder = AutoModelForCausalLM.from_pretrained(config.text_model_id, **decoder_kwargs)
+        decoder.config.use_cache = getattr(config, "use_cache", True)
+        decoder.requires_grad_(False)
+        decoder.eval()
+        return decoder
+    def _create_projector(self, config: ASRConfig, dtype: torch.dtype) -> nn.Module:
+        """Create the trainable audio projector."""
+        # Auto-detect dimensions if not specified
+        if config.encoder_dim is None:
+            enc_cfg = self.audio_tower.config
+            config.encoder_dim = getattr(enc_cfg, "hidden_size", None) or getattr(
+                enc_cfg, "d_model", None
+            )
+            if config.encoder_dim is None:
+                raise ValueError("Could not auto-detect encoder_dim. Please specify in config.")
+        if config.llm_dim is None:
+            dec_cfg = self.language_model.config
+            config.llm_dim = getattr(dec_cfg, "hidden_size", None) or getattr(
+                dec_cfg, "d_model", None
+            )
+            if config.llm_dim is None:
+                raise ValueError("Could not auto-detect llm_dim. Please specify in config.")
+        # Select projector type based on config
+        projector_type = getattr(config, "projector_type", "mlp")
+        projector_class = PROJECTOR_CLASSES.get(projector_type)
+        if projector_class is None:
+            raise ValueError(
+                f"Unknown projector_type: {projector_type}. "
+                f"Valid options: {list(PROJECTOR_CLASSES.keys())}"
+            )
+        projector = projector_class(config)
+        # Move projector to same device as language model (important when using quantization)
+        device = next(self.language_model.parameters()).device
+        return projector.to(device=device, dtype=dtype)
+    def _setup_lora(self, config: ASRConfig):
+        """Apply LoRA adapters to the language model for Stage 2 fine-tuning."""
+        from peft import LoraConfig, get_peft_model
+        lora_config = LoraConfig(
+            r=config.lora_rank,
+            lora_alpha=config.lora_alpha,
+            target_modules=config.lora_target_modules,
+            lora_dropout=config.lora_dropout,
+            bias="none",
+            task_type="CAUSAL_LM",
+        )
+        self.language_model = get_peft_model(self.language_model, lora_config)
+    def _init_tokenizer(self, config: ASRConfig):
+        """Initialize tokenizer with audio token."""
+        self.tokenizer = AutoTokenizer.from_pretrained(config.text_model_id, trust_remote_code=True)
+        # Set pad token. Prefer a dedicated pad token if the tokenizer has one
+        # (e.g. Llama 3.1's <|finetune_right_pad_id|>); otherwise fall back to
+        # eos_token, which is the standard pattern for tokenizers that ship
+        # without a separate pad token.
+        if (
+            self.tokenizer.pad_token is None
+            or self.tokenizer.pad_token_id == self.tokenizer.eos_token_id
+        ):
+            if "<|finetune_right_pad_id|>" in self.tokenizer.get_vocab():
+                self.tokenizer.pad_token = "<|finetune_right_pad_id|>"
+            elif self.tokenizer.pad_token is None:
+                self.tokenizer.pad_token = self.tokenizer.eos_token
+        # Add audio token
+        existing_special = getattr(self.tokenizer, "additional_special_tokens", None) or []
+        if "<audio>" not in existing_special:
+            self.tokenizer.add_special_tokens(
+                {"additional_special_tokens": existing_special + ["<audio>"]}
+            )
+            self.language_model.resize_token_embeddings(len(self.tokenizer), mean_resizing=False)
+        self.audio_token_id = self.tokenizer.convert_tokens_to_ids("<audio>")
+        self.tokenizer.padding_side = "right"
+        # Sync token IDs to configs
+        for cfg in [self.config.text_config, self.language_model.config, self.generation_config]:
+            if cfg is not None:
+                cfg.pad_token_id = self.tokenizer.pad_token_id
+                cfg.eos_token_id = self.tokenizer.eos_token_id
+                cfg.bos_token_id = self.tokenizer.bos_token_id
+    def _init_weights(self, _module):
+        """Weight initialization (projector weights are initialized in MoEAudioProjector)."""
+        pass
+    def _set_gradient_checkpointing(self, enable: bool = True, gradient_checkpointing_func=None):
+        """Enable/disable gradient checkpointing for the language model."""
+        # The LLM still stores activations during forward for backprop to projector
+        # Gradient checkpointing trades compute for memory by recomputing activations
+        if hasattr(self.language_model, "_set_gradient_checkpointing"):
+            self.language_model._set_gradient_checkpointing(enable, gradient_checkpointing_func)
+        elif hasattr(self.language_model, "gradient_checkpointing_enable") and enable:
+            self.language_model.gradient_checkpointing_enable(
+                gradient_checkpointing_kwargs={"use_reentrant": False}
+            )
+        elif hasattr(self.language_model, "gradient_checkpointing_disable") and not enable:
+            self.language_model.gradient_checkpointing_disable()
+    def get_input_embeddings(self) -> nn.Module:
+        return self.language_model.get_input_embeddings()
+    def set_input_embeddings(self, value: nn.Module) -> None:
+        self.language_model.set_input_embeddings(value)
+    def get_output_embeddings(self) -> nn.Module:
+        return self.language_model.get_output_embeddings()
+    def set_output_embeddings(self, value: nn.Module) -> None:
+        self.language_model.set_output_embeddings(value)
+    def get_processor(self):
+        """Get the processor for this model."""
+        try:
+            from .asr_processing import ASRProcessor
+        except ImportError:
+            from asr_processing import ASRProcessor  # type: ignore[no-redef]
+        return ASRProcessor(
+            feature_extractor=self.feature_extractor,
+            tokenizer=self.tokenizer,
+            projector=self.projector,
+            encoder_conv_layers=self.config.encoder_conv_layers,
+        )
+    def state_dict(self, *args, **kwargs) -> dict[str, torch.Tensor]:
+        """Only save trainable projector weights."""
+        return {f"projector.{k}": v for k, v in self.projector.state_dict().items()}
+    def _compute_encoder_output_lengths(
+        self,
+        audio_attention_mask: torch.Tensor,
+    ) -> torch.Tensor:
+        """Compute per-sample encoder output lengths using conv layer formulas.
+        Args:
+            audio_attention_mask: Mask indicating real vs padded mel frames (batch, mel_len)
+        Returns:
+            Tensor of encoder output lengths per sample (batch,)
+        """
+        # Get mel frame lengths from attention mask
+        lengths = audio_attention_mask.sum(dim=-1)
+        # Apply conv layer formulas: output = (input + 2*pad - (kernel-1) - 1) // stride + 1
+        for padding, kernel_size, stride in self.config.encoder_conv_layers:
+            lengths = (lengths + 2 * padding - (kernel_size - 1) - 1) // stride + 1
+        return lengths
+    def _encode_audio(
+        self,
+        audio_features: torch.Tensor,
+        audio_attention_mask: torch.Tensor,
+        expected_token_counts: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        """Encode audio and project to LLM embedding space.
+        Args:
+            audio_features: Mel spectrogram features (batch, n_mels, mel_len)
+            audio_attention_mask: Mask indicating real vs padded mel frames (batch, mel_len)
+            expected_token_counts: Expected number of audio tokens per sample from input_ids.
+                If provided, output will match these counts exactly (padding/truncating as needed).
+        Returns:
+            Flattened audio embeddings of shape (total_audio_tokens, hidden_dim).
+        """
+        with torch.no_grad():
+            encoder_out = self.audio_tower(input_features=audio_features)
+            hidden_states = encoder_out.last_hidden_state
+        # Project to LLM space
+        audio_embeds = self.projector(hidden_states)
+        # Use expected token counts if provided (from input_ids), otherwise compute from audio
+        if expected_token_counts is not None:
+            token_counts = expected_token_counts
+        else:
+            # Compute per-sample encoder output lengths using conv formulas
+            encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
+            token_counts = torch.tensor(
+                [
+                    self.projector.get_output_length(int(length.item()))
+                    for length in encoder_lengths
+                ],
+                device=audio_embeds.device,
+            )
+        # Extract embeddings matching expected token counts per sample
+        batch_size = audio_embeds.shape[0]
+        hidden_dim = audio_embeds.shape[2]
+        result_embeds = []
+        for i in range(batch_size):
+            count = int(token_counts[i].item())
+            sample_embeds = audio_embeds[i, :count, :]  # Take first 'count' embeddings
+            # Pad with zeros if we don't have enough embeddings
+            if sample_embeds.shape[0] < count:
+                padding = torch.zeros(
+                    count - sample_embeds.shape[0],
+                    hidden_dim,
+                    device=audio_embeds.device,
+                    dtype=audio_embeds.dtype,
+                )
+                sample_embeds = torch.cat([sample_embeds, padding], dim=0)
+            result_embeds.append(sample_embeds)
+        return torch.cat(result_embeds, dim=0)
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        input_features: Optional[torch.Tensor] = None,
+        audio_attention_mask: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        past_key_values: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        """Forward pass for training and inference."""
+        # Get text embeddings if not provided
+        if inputs_embeds is None:
+            inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        if input_features is not None and input_ids is not None:
+            # Apply SpecAugment during training if enabled
+            if self.training and self.spec_augment is not None:
+                input_features = self.spec_augment(input_features)
+            # Count expected audio tokens from input_ids (ground truth from collator)
+            audio_token_counts = (input_ids == self.audio_token_id).sum(dim=-1)
+            # Encode audio -> flattened (total_audio_tokens, hidden_dim)
+            audio_embeds = self._encode_audio(
+                input_features, audio_attention_mask, audio_token_counts
+            )
+            # Replace <audio> token placeholders with audio embeddings using masked_scatter
+            audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+            inputs_embeds = inputs_embeds.masked_scatter(
+                audio_token_mask.to(inputs_embeds.device),
+                audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+            )
+        # Run through language model (let it compute loss if labels provided)
+        outputs = self.language_model(
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        # Add auxiliary loss from MoE projectors if available
+        if outputs.loss is not None and hasattr(self.projector, "get_aux_loss"):
+            aux_loss = self.projector.get_aux_loss()
+            if aux_loss is not None and aux_loss.numel() > 0:
+                outputs.loss = outputs.loss + aux_loss.to(outputs.loss.device)
+        return outputs
+    def prepare_inputs_for_generation(self, *args, **kwargs):
+        """Prepare inputs for generation, handling audio features for cached decoding."""
+        input_features = kwargs.pop("input_features", None)
+        cache_position = kwargs.get("cache_position")
+        model_inputs = self.language_model.prepare_inputs_for_generation(*args, **kwargs)
+        # Only pass audio features on the first generation step (cache_position[0] == 0)
+        if cache_position is not None and cache_position[0] == 0 and input_features is not None:
+            model_inputs["input_features"] = input_features
+        return model_inputs
+    def _get_num_audio_tokens(
+        self,
+        audio_attention_mask: torch.Tensor,
+    ) -> int:
+        """Calculate number of audio tokens based on actual audio length.
+        Uses attention mask to get real audio length, then computes:
+        mel_frames -> encoder_frames (via conv formulas) -> projector output tokens
+        """
+        encoder_lengths = self._compute_encoder_output_lengths(audio_attention_mask)
+        # Use max length for batch (all samples should have same token count for generation)
+        encoder_output_len = int(encoder_lengths.max().item())
+        return int(self.projector.get_output_length(encoder_output_len))
+    @torch.no_grad()
+    def generate(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        input_features: Optional[torch.Tensor] = None,
+        audio_attention_mask: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        system_prompt: Optional[str] = None,
+        **generate_kwargs,
+    ) -> torch.Tensor:
+        """Generate transcription from audio input.
+        Can be called in two ways:
+        1. With input_ids containing <audio> tokens (from processor)
+        2. With just audio, and we build the prompt internally
+        """
+        if input_features is None:
+            raise ValueError("input_features required for generation")
+        if audio_attention_mask is None:
+            raise ValueError("audio_attention_mask required for generation")
+        device = input_features.device
+        batch_size = input_features.shape[0]
+        # If input_ids not provided, build prompt with correct number of audio tokens
+        if input_ids is None:
+            num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
+            audio_placeholder = "<audio>" * num_audio_tokens
+            system_prompt = system_prompt or self.system_prompt
+            messages: list[dict[str, str]] = []
+            if system_prompt:
+                messages.append({"role": "system", "content": system_prompt})
+            # Audio placeholders, optionally followed by the transcription instruction
+            user_content = audio_placeholder
+            if self.TRANSCRIBE_PROMPT:
+                user_content += " " + self.TRANSCRIBE_PROMPT
+            messages.append({"role": "user", "content": user_content})
+            chat_result = self.tokenizer.apply_chat_template(
+                messages,
+                tokenize=True,
+                add_generation_prompt=True,
+                return_tensors="pt",
+                return_dict=True,  # Return a BatchEncoding so .input_ids is always available
+                enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+            )
+            input_ids = chat_result.input_ids.to(device)
+            if input_ids.dim() == 1:
+                input_ids = input_ids.unsqueeze(0)
+            if input_ids.shape[0] == 1 and batch_size > 1:
+                input_ids = input_ids.expand(batch_size, -1)
+            attention_mask = torch.ones_like(input_ids)
+        # Encode audio -> flattened embeddings. Take per-sample token counts from
+        # input_ids (as forward() does) so the number of embeddings matches the number
+        # of <audio> placeholders even when batch items differ in audio length.
+        audio_token_counts = (input_ids == self.audio_token_id).sum(dim=-1)
+        audio_embeds = self._encode_audio(
+            input_features, audio_attention_mask, audio_token_counts
+        )
+        # Get text embeddings and replace audio tokens with audio embeddings
+        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+        inputs_embeds = inputs_embeds.masked_scatter(
+            audio_token_mask.to(inputs_embeds.device),
+            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+        )
+        # Generate using language model
+        # Pass both input_ids and inputs_embeds so repetition_penalty works correctly
+        # (it needs input_ids to track which tokens have been used)
+        output = self.language_model.generate(
+            input_ids=input_ids,
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            generation_config=self.generation_config,
+            **generate_kwargs,
+        )
+        # When using inputs_embeds with input_ids, generate returns full sequence
+        # Strip the input tokens to return only generated tokens
+        sequences = output if isinstance(output, torch.Tensor) else output.sequences
+        input_len = input_ids.shape[1]
+        return sequences[:, input_len:]
+    def generate_streaming(
+        self,
+        input_features: torch.Tensor,
+        audio_attention_mask: torch.Tensor,
+        system_prompt: Optional[str] = None,
+        **generate_kwargs,
+    ) -> Iterator[str]:
+        """Generate transcription with streaming token output.
+        Yields partial transcript strings as tokens are generated.
+        Reduces time-to-first-word by streaming tokens as they're decoded.
+        Args:
+            input_features: Mel spectrogram features (batch, n_mels, mel_len)
+            audio_attention_mask: Mask for real vs padded mel frames (batch, mel_len)
+            system_prompt: Optional system prompt override
+            **generate_kwargs: Additional generation arguments
+        Yields:
+            Partial transcript text as each token is generated
+        """
+        device = input_features.device
+        batch_size = input_features.shape[0]
+        # Encode audio -> flattened embeddings
+        audio_embeds = self._encode_audio(input_features, audio_attention_mask)
+        # Build prompt with correct number of audio tokens
+        num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
+        audio_placeholder = "<audio>" * num_audio_tokens
+        system_prompt = system_prompt or self.system_prompt
+        messages: list[dict[str, str]] = []
+        if system_prompt:
+            messages.append({"role": "system", "content": system_prompt})
+        # Audio placeholders, optionally followed by the transcription instruction
+        user_content = audio_placeholder
+        if self.TRANSCRIBE_PROMPT:
+            user_content += " " + self.TRANSCRIBE_PROMPT
+        messages.append({"role": "user", "content": user_content})
+        chat_result = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_tensors="pt",
+            return_dict=True,  # Return a BatchEncoding so .input_ids is always available
+            enable_thinking=False,  # Disable Qwen3 thinking mode for ASR
+        )
+        input_ids = chat_result.input_ids.to(device)
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        if input_ids.shape[0] == 1 and batch_size > 1:
+            input_ids = input_ids.expand(batch_size, -1)
+        attention_mask = torch.ones_like(input_ids)
+        # Get text embeddings and replace audio tokens with audio embeddings
+        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
+        audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
+        inputs_embeds = inputs_embeds.masked_scatter(
+            audio_token_mask.to(inputs_embeds.device),
+            audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
+        )
+        # Setup streamer for token-by-token output
+        streamer = TextIteratorStreamer(
+            self.tokenizer,
+            skip_prompt=True,
+            skip_special_tokens=True,
+        )
+        # Prepare generation kwargs
+        gen_kwargs = {
+            "inputs_embeds": inputs_embeds,
+            "attention_mask": attention_mask,
+            "generation_config": self.generation_config,
+            "streamer": streamer,
+            **generate_kwargs,
+        }
+        # Run generation in background thread
+        thread = Thread(target=self.language_model.generate, kwargs=gen_kwargs)
+        thread.start()
+        # Yield tokens as they're generated, filtering out <think>...</think> blocks
+        # Start assuming no think block - only filter when we see <think>
+        in_think_block = False
+        buffer = ""
+        for text in streamer:
+            buffer += text
+            # Check for think block start (in case model outputs think blocks)
+            while "<think>" in buffer:
+                in_think_block = True
+                # Yield any text before <think>
+                before_think = buffer.split("<think>")[0]
+                if before_think:
+                    yield before_think
+                buffer = buffer.split("<think>", 1)[-1]
+            # Check for think block end
+            while in_think_block and "</think>" in buffer:
+                in_think_block = False
+                buffer = buffer.split("</think>", 1)[-1]
+            # Yield text if not in think block
+            if not in_think_block and buffer:
+                yield buffer
+                buffer = ""
+        # Yield any remaining buffer
+        if buffer and not in_think_block:
+            yield buffer
+        thread.join()
+    @torch.no_grad()
+    def generate_text_only(
+        self,
+        messages: list[dict[str, str]],
+        max_new_tokens: int = 256,
+        **generate_kwargs,
+    ) -> str:
+        """Generate text using only the LLM (no audio encoding).
+        Used for SIFT-style response generation from metadata prompts.
+        Args:
+            messages: List of chat messages [{"role": "user", "content": "..."}]
+            max_new_tokens: Maximum tokens to generate
+            **generate_kwargs: Additional generation arguments
+        Returns:
+            Generated text response
+        """
+        device = next(self.language_model.parameters()).device
+        # Apply chat template
+        chat_result = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_tensors="pt",
+            return_dict=True,  # Return a BatchEncoding so .input_ids is always available
+            enable_thinking=False,  # Disable Qwen3 thinking mode
+        )
+        input_ids = chat_result.input_ids.to(device)
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        attention_mask = torch.ones_like(input_ids)
+        # Generate using language model directly
+        output = self.language_model.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+            pad_token_id=self.tokenizer.pad_token_id,
+            eos_token_id=self.tokenizer.eos_token_id,
+            **generate_kwargs,
+        )
+        # Decode only the new tokens
+        new_tokens = output[0, input_ids.shape[1] :]
+        response = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+        return response.strip()
+    def save_pretrained(self, save_directory: Union[str, Path], **kwargs) -> None:
+        """Save model, tokenizer, and processor."""
+        import shutil
+        from pathlib import Path as PathlibPath
+        save_dir = PathlibPath(save_directory)
+        save_dir.mkdir(parents=True, exist_ok=True)
+        # Update config with actual vocab size
+        self.config.vocab_size = self.language_model.config.vocab_size
+        self.config.text_config.vocab_size = self.language_model.config.vocab_size
+        if hasattr(self.audio_tower.config, "num_mel_bins"):
+            self.config.audio_config.num_mel_bins = self.audio_tower.config.num_mel_bins
+        # Save model (temporarily remove non-serializable attributes)
+        tokenizer = self.tokenizer
+        del self.tokenizer
+        try:
+            super().save_pretrained(save_dir, **kwargs)
+        finally:
+            self.tokenizer = tokenizer
+        # Save tokenizer and feature extractor
+        self.tokenizer.save_pretrained(save_dir)
+        self.feature_extractor.save_pretrained(save_dir)
+        # Save LoRA adapters if present (creates adapter_model.safetensors and adapter_config.json)
+        # Don't save embedding layers - the <audio> token embedding is never used
+        # (it's replaced with projected audio embeddings before the LLM sees it)
+        if hasattr(self.language_model, "peft_config"):
+            self.language_model.save_pretrained(save_dir, save_embedding_layers=False)
+            # Clear base_model_name_or_path in adapter_config.json to prevent HF pipeline
+            # from redirecting to the base LLM repo (like Qwen) which breaks feature
+            # extractor loading for multimodal models. If a repo_id is provided, use that
+            # so the model can be loaded directly from the Hub.
+            adapter_config_path = save_dir / "adapter_config.json"
+            if adapter_config_path.exists():
+                with adapter_config_path.open() as f:
+                    adapter_config = json.load(f)
+                # Use repo_id if available, otherwise clear to prevent redirect.
+                # Use empty string instead of None to avoid str(None) -> "None" bug
+                # in some transformers/PEFT versions.
+                repo_id = (
+                    kwargs.get("repo_id")
+                    or kwargs.get("push_to_hub_model_id")
+                    or getattr(self.config, "pretrained_model_path", None)
+                    or ""  # Use empty string instead of None
+                )
+                adapter_config["base_model_name_or_path"] = repo_id
+                with adapter_config_path.open("w") as f:
+                    json.dump(adapter_config, f, indent=2)
+        # Add processor auto_map to preprocessor_config.json
+        config_path = save_dir / "preprocessor_config.json"
+        if config_path.exists():
+            with config_path.open() as f:
+                processor_config = json.load(f)
+        else:
+            processor_config = {}
+        processor_config.update(
+            {
+                "processor_class": "ASRProcessor",
+                "auto_map": {"AutoProcessor": "asr_processing.ASRProcessor"},
+            }
+        )
+        with config_path.open("w") as f:
+            json.dump(processor_config, f, indent=2)
+        # Copy source files for auto-loading
+        src_dir = PathlibPath(__file__).parent
+        for asr_file in src_dir.glob("asr_*.py"):
+            shutil.copy(asr_file, save_dir / asr_file.name)
+        # Copy projectors module
+        shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
+        # Copy alignment module
+        shutil.copy(src_dir / "alignment.py", save_dir / "alignment.py")
+        # Copy diarization module
+        shutil.copy(src_dir / "diarization.py", save_dir / "diarization.py")
+    def push_to_hub(self, repo_id: str, **kwargs) -> str:
+        """Push model to HuggingFace Hub, ensuring adapter_config points to repo.
+        IMPORTANT: Sets base_model_name_or_path in adapter_config.json to repo_id
+        so that transformers pipeline() can load the model correctly. Without this,
+        the pipeline tries to load from "None" which fails.
+        """
+        # Store repo_id in config so save_pretrained can access it
+        self.config.pretrained_model_path = repo_id
+        # Call parent's push_to_hub
+        return super().push_to_hub(repo_id, **kwargs)
+    def create_or_update_model_card(self, output_dir: Union[str, Path]) -> None:
+        """No-op for model card creation - we use MODEL_CARD.md in repo instead."""
+        pass
+# Register with transformers Auto classes
+AutoConfig.register("asr_model", ASRConfig)
+AutoModel.register(ASRConfig, ASRModel)

asr_pipeline.py ADDED

	@@ -0,0 +1,322 @@

+"""ASR pipeline for audio-to-text transcription with optional timestamps and diarization."""
+import re
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import transformers
+try:
+    from .alignment import ForcedAligner
+    from .asr_modeling import ASRModel
+    from .diarization import SpeakerDiarizer
+except ImportError:
+    from alignment import ForcedAligner  # type: ignore[no-redef]
+    from asr_modeling import ASRModel  # type: ignore[no-redef]
+    from diarization import SpeakerDiarizer  # type: ignore[no-redef]
+# Re-export for backwards compatibility
+__all__ = ["ForcedAligner", "SpeakerDiarizer", "ASRPipeline"]
+class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
+    """ASR Pipeline for audio-to-text transcription."""
+    model: ASRModel
+    def __init__(self, model: ASRModel, **kwargs):
+        """Initialize ASR pipeline.
+        Args:
+            model: ASRModel instance for transcription
+            **kwargs: Additional arguments (feature_extractor, tokenizer, device)
+        """
+        feature_extractor = kwargs.pop("feature_extractor", None)
+        tokenizer = kwargs.pop("tokenizer", model.tokenizer)
+        if feature_extractor is None:
+            feature_extractor = model.get_processor().feature_extractor
+        super().__init__(
+            model=model, feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs
+        )
+        self._current_audio = None
+    def _sanitize_parameters(self, **kwargs):
+        """Intercept our custom parameters before parent class validates them."""
+        # Remove our custom parameters so parent doesn't see them
+        kwargs.pop("return_timestamps", None)
+        kwargs.pop("return_speakers", None)
+        kwargs.pop("num_speakers", None)
+        kwargs.pop("min_speakers", None)
+        kwargs.pop("max_speakers", None)
+        kwargs.pop("hf_token", None)
+        kwargs.pop("user_prompt", None)
+        kwargs.pop("diarization_backend", None)
+        return super()._sanitize_parameters(**kwargs)
+    def __call__(
+        self,
+        inputs,
+        **kwargs,
+    ):
+        """Transcribe audio with optional word-level timestamps and speaker diarization.
+        Args:
+            inputs: Audio input (file path, dict with array/sampling_rate, etc.)
+            return_timestamps: If True, return word-level timestamps using forced alignment
+            return_speakers: If True, return speaker labels for each word (implies return_timestamps)
+            user_prompt: Custom transcription prompt (default: "Transcribe: ")
+            num_speakers: Exact number of speakers (if known, for diarization)
+            min_speakers: Minimum number of speakers (for diarization)
+            max_speakers: Maximum number of speakers (for diarization)
+            **kwargs: Additional arguments passed to the pipeline
+        Returns:
+            Dict with 'text'; 'words' if return_timestamps=True; 'speaker_segments' and per-word
+            speaker labels if return_speakers=True. Failures set 'timestamp_error'/'diarization_error'.
+        """
+        # Extract our params before super().__call__ (which will also call _sanitize_parameters)
+        return_timestamps = kwargs.pop("return_timestamps", False)
+        return_speakers = kwargs.pop("return_speakers", False)
+        user_prompt = kwargs.pop("user_prompt", None)
+        diarization_params = {
+            "num_speakers": kwargs.pop("num_speakers", None),
+            "min_speakers": kwargs.pop("min_speakers", None),
+            "max_speakers": kwargs.pop("max_speakers", None),
+        }
+        if return_speakers:
+            return_timestamps = True
+        # Set custom user prompt if provided
+        original_prompt = None
+        if user_prompt:
+            original_prompt = self.model.TRANSCRIBE_PROMPT
+            self.model.TRANSCRIBE_PROMPT = user_prompt
+        # Store audio for timestamp alignment and diarization
+        if return_timestamps or return_speakers:
+            self._current_audio = self._extract_audio(inputs)
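+        # Run standard transcription; undo the temporary state if it raises
+        try:
+            result = super().__call__(inputs, **kwargs)
+        except Exception:
+            self._current_audio = None
+            if original_prompt is not None:
+                self.model.TRANSCRIBE_PROMPT = original_prompt
+            raise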
+        # Add timestamps if requested
+        if return_timestamps and self._current_audio is not None:
+            text = result.get("text", "")
+            if text:
+                try:
+                    words = ForcedAligner.align(
+                        self._current_audio["array"],
+                        text,
+                        sample_rate=self._current_audio.get("sampling_rate", 16000),
+                    )
+                    result["words"] = words
+                except Exception as e:
+                    result["words"] = []
+                    result["timestamp_error"] = str(e)
+            else:
+                result["words"] = []
+        # Add speaker diarization if requested
+        if return_speakers and self._current_audio is not None:
+            try:
+                # Run diarization
+                speaker_segments = SpeakerDiarizer.diarize(
+                    self._current_audio["array"],
+                    sample_rate=self._current_audio.get("sampling_rate", 16000),
+                    **{k: v for k, v in diarization_params.items() if v is not None},
+                )
+                result["speaker_segments"] = speaker_segments
+                # Assign speakers to words
+                if result.get("words"):
+                    result["words"] = SpeakerDiarizer.assign_speakers_to_words(
+                        result["words"],
+                        speaker_segments,
+                    )
+            except Exception as e:
+                result["speaker_segments"] = []
+                result["diarization_error"] = str(e)
+        # Clean up
+        self._current_audio = None
+        if original_prompt is not None:
+            self.model.TRANSCRIBE_PROMPT = original_prompt
+        return result
+    def _extract_audio(self, inputs) -> dict | None:
+        """Extract audio array from various input formats using HF utilities."""
+        from transformers.pipelines.audio_utils import ffmpeg_read
+        if isinstance(inputs, dict):
+            if "array" in inputs:
+                return {
+                    "array": inputs["array"],
+                    "sampling_rate": inputs.get("sampling_rate", 16000),
+                }
+            if "raw" in inputs:
+                return {
+                    "array": inputs["raw"],
+                    "sampling_rate": inputs.get("sampling_rate", 16000),
+                }
+        elif isinstance(inputs, str):
+            # File path - load audio using ffmpeg (same as HF pipeline)
+            with Path(inputs).open("rb") as f:
+                audio = ffmpeg_read(f.read(), sampling_rate=16000)
+            return {"array": audio, "sampling_rate": 16000}
+        elif isinstance(inputs, bytes):
+            audio = ffmpeg_read(inputs, sampling_rate=16000)
+            return {"array": audio, "sampling_rate": 16000}
+        elif isinstance(inputs, np.ndarray):
+            return {"array": inputs, "sampling_rate": 16000}
+        return None
+    def preprocess(self, inputs, **preprocess_params):
+        """Preprocess audio inputs for the model.
+        Args:
+            inputs: Audio input (dict with array, file path, etc.)
+            **preprocess_params: Additional preprocessing parameters
+        Yields:
+            Model input dicts with input_features and attention_mask
+        """
+        # Handle dict with "array" key (from datasets)
+        if isinstance(inputs, dict) and "array" in inputs:
+            inputs = {
+                "raw": inputs["array"],
+                "sampling_rate": inputs.get("sampling_rate", self.feature_extractor.sampling_rate),
+            }
+        for item in super().preprocess(inputs, **preprocess_params):
+            if "is_last" not in item:
+                item["is_last"] = True
+            yield item
+    def _forward(self, model_inputs, **generate_kwargs) -> dict[str, Any]:
+        """Run model forward pass to generate transcription.
+        Args:
+            model_inputs: Dict with input_features and attention_mask
+            **generate_kwargs: Generation parameters
+        Returns:
+            Dict with generated token IDs
+        """
+        # Extract audio features and is_last flag
+        is_last = model_inputs.pop("is_last", True) if isinstance(model_inputs, dict) else True
+        input_features = model_inputs["input_features"].to(self.model.device)
+        audio_attention_mask = model_inputs["attention_mask"].to(self.model.device)
+        generated_ids = self.model.generate(
+            input_features=input_features,
+            audio_attention_mask=audio_attention_mask,
+            **generate_kwargs,
+        )
+        return {"tokens": generated_ids, "is_last": is_last}
+    def postprocess(self, model_outputs, **kwargs) -> dict[str, str]:
+        """Convert model output tokens to text.
+        Args:
+            model_outputs: Dict with 'tokens' key containing generated IDs
+            **kwargs: Additional postprocessing parameters
+        Returns:
+            Dict with 'text' key containing transcription
+        """
+        # Handle list of outputs (from chunking); only the first element is decoded
+        if isinstance(model_outputs, list):
+            model_outputs = model_outputs[0] if model_outputs else {}
+        tokens = model_outputs.get("tokens")
+        if tokens is None:
+            return super().postprocess(model_outputs, **kwargs)
+        if torch.is_tensor(tokens):
+            tokens = tokens.cpu()
+            if tokens.dim() > 1:
+                tokens = tokens[0]
+        # Filter out eos tokens that the tokenizer doesn't recognize as special
+        # (generation_config.eos_token_id may differ from tokenizer.eos_token_id)
+        if hasattr(self, "model") and hasattr(self.model, "generation_config"):
+            eos_ids = self.model.generation_config.eos_token_id
+            if eos_ids is not None:
+                eos_set = set(eos_ids) if isinstance(eos_ids, list) else {eos_ids}
+                tokens = [t for t in tokens.tolist() if t not in eos_set]
+        text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
+        # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
+        text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+        # Truncate repetitions at end of text
+        text = _truncate_repetitions(text)
+        return {"text": text}
+def _truncate_repetitions(text: str, min_repeats: int = 3) -> str:
+    """Truncate repeated words/phrases/characters at end of text.
+    Detects patterns like:
+    - Repeated words: "the the the the" -> "the"
+    - Repeated phrases: "i am sorry i am sorry i am sorry" -> "i am sorry"
+    - Repeated characters: "444444" -> "4"
+    Args:
+        text: Input text to process
+        min_repeats: Minimum repetitions to trigger truncation (default 3)
+    Returns:
+        Text with trailing repetitions removed
+    """
+    if not text:
+        return text
+    # 1. Truncate repeated characters at end (e.g., "444444" -> "4")
+    char_pattern = re.compile(r"(.)\1{" + str(min_repeats - 1) + r",}$")
+    text = char_pattern.sub(r"\1", text)
+    # 2. Truncate repeated words at end (e.g., "the the the" -> "the")
+    word_pattern = re.compile(
+        r"\b(\w+)(?:\s+\1){" + str(min_repeats - 1) + r",}\s*$", re.IGNORECASE
+    )
+    while word_pattern.search(text):
+        text = word_pattern.sub(r"\1", text)
+    # 3. Truncate repeated phrases (2-20 words) at end
+    # e.g., "i am sorry i am sorry i am sorry" -> "i am sorry"
+    words = text.split()
+    if len(words) >= min_repeats * 2:
+        # Try phrase lengths from 2 to 20 words
+        for phrase_len in range(2, min(21, len(words) // min_repeats + 1)):
+            # Check if the last phrase_len words repeat
+            phrase = " ".join(words[-phrase_len:])
+            # Build pattern to match repeated phrases at end
+            phrase_escaped = re.escape(phrase)
+            phrase_pattern = re.compile(
+                r"(^|.*?\s)("
+                + phrase_escaped
+                + r")(?:\s+"
+                + phrase_escaped
+                + r"){"
+                + str(min_repeats - 1)
+                + r",}\s*$",
+                re.IGNORECASE,
+            )
+            match = phrase_pattern.match(text)
+            if match:
+                # Keep prefix + one instance of the phrase
+                text = (match.group(1) + match.group(2)).strip()
+                words = text.split()
+                break
+    return text
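The first two steps of `_truncate_repetitions` above can be illustrated on their own. This is a simplified sketch that reuses the same regular expressions; the function names `collapse_trailing_chars` and `collapse_trailing_words` are made up for the example, and the phrase-level step is left out.

```python
import re


def collapse_trailing_chars(text: str, min_repeats: int = 3) -> str:
    """Collapse a character repeated at least min_repeats times at the end of text."""
    pattern = re.compile(r"(.)\1{" + str(min_repeats - 1) + r",}$")
    return pattern.sub(r"\1", text)


def collapse_trailing_words(text: str, min_repeats: int = 3) -> str:
    """Collapse a word repeated at least min_repeats times at the end of text."""
    pattern = re.compile(
        r"\b(\w+)(?:\s+\1){" + str(min_repeats - 1) + r",}\s*$", re.IGNORECASE
    )
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    print(collapse_trailing_chars("code 444444"))
    print(collapse_trailing_words("so I said the the the the"))
    # Two repeats are below the default threshold, so the text is left alone.
    print(collapse_trailing_words("the the"))
    # The character rule is not limited to digits or letters.
    print(collapse_trailing_chars("wait..."))
```

One consequence worth keeping in mind: because the character pattern uses `.`, a legitimate trailing ellipsis is shortened to a single period as well.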

asr_processing.py ADDED

	@@ -0,0 +1,133 @@

+from typing import Optional, Union
+import torch
+import transformers
+from transformers import ProcessorMixin
+try:
+    from .asr_config import ASRConfig
+except ImportError:
+    from asr_config import ASRConfig  # type: ignore[no-redef]
+class ASRProcessor(ProcessorMixin):
+    """Processor for Whisper-based ASR models."""
+    attributes = ["feature_extractor", "tokenizer"]
+    feature_extractor_class = "AutoFeatureExtractor"
+    tokenizer_class = "AutoTokenizer"
+    AUDIO_TOKEN = "<audio>"
+    TRANSCRIBE_PROMPT = ""
+    # Default conv layers for Whisper/GLM-ASR: [(pad, kernel, stride), ...]
+    DEFAULT_ENCODER_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]
+    def __init__(
+        self,
+        feature_extractor,
+        tokenizer,
+        projector=None,
+        encoder_conv_layers: Optional[list] = None,
+    ):
+        """Initialize the ASR processor.
+        Args:
+            feature_extractor: Audio feature extractor (WhisperFeatureExtractor)
+            tokenizer: Text tokenizer for the language model
+            projector: Audio projector module (for computing output lengths)
+            encoder_conv_layers: Conv layer specs [(pad, kernel, stride), ...]
+        """
+        self.feature_extractor = feature_extractor
+        self.tokenizer = tokenizer
+        self.audio_token_id = tokenizer.convert_tokens_to_ids(self.AUDIO_TOKEN)
+        self.projector = projector
+        self.encoder_conv_layers = encoder_conv_layers or self.DEFAULT_ENCODER_CONV_LAYERS
+    def _compute_encoder_output_length(self, mel_length: int) -> int:
+        """Compute encoder output length using conv layer formulas."""
+        length = mel_length
+        for padding, kernel_size, stride in self.encoder_conv_layers:
+            length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
+        return length
+    def __call__(
+        self,
+        audio: Optional[Union[list, "torch.Tensor"]] = None,
+        text: Optional[str] = None,
+        system_prompt: Optional[str] = None,
+        return_tensors: str = "pt",
+        **kwargs,
+    ) -> dict:
+        """Process audio and text inputs for inference.
+        Args:
+            audio: Raw audio waveform(s)
+            text: Target transcription (optional; for training, prefer the DataCollator)
+            system_prompt: Optional system prompt
+            return_tensors: Return format ("pt" for PyTorch)
+        Returns:
+            Dict with input_ids and attention_mask, plus input_features and audio_attention_mask when audio is given
+        """
+        result = {}
+        # Process audio
+        if audio is not None:
+            audio_inputs = self.feature_extractor(
+                audio,
+                sampling_rate=getattr(self.feature_extractor, "sampling_rate", 16000),
+                return_attention_mask=True,
+                return_tensors=return_tensors,
+                **kwargs,
+            )
+            result["input_features"] = audio_inputs["input_features"]
+            result["audio_attention_mask"] = audio_inputs["attention_mask"]
+            # Use actual audio length (from attention mask) for token count
+            real_mel_len = int(audio_inputs["attention_mask"].sum(dim=-1).max().item())
+            encoder_output_len = self._compute_encoder_output_length(real_mel_len)
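+            if self.projector is None:
+                raise ValueError("ASRProcessor needs a projector to count audio tokens")
+            num_audio_tokens = self.projector.get_output_length(encoder_output_len)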
+        else:
+            num_audio_tokens = 0
+        # Build prompt with audio token placeholders (instruction-free)
+        if num_audio_tokens > 0:
+            user_content = self.AUDIO_TOKEN * num_audio_tokens
+            if self.TRANSCRIBE_PROMPT:
+                user_content += " " + self.TRANSCRIBE_PROMPT
+        else:
+            user_content = self.TRANSCRIBE_PROMPT or ""
+        messages = []
+        if system_prompt:
+            messages.append({"role": "system", "content": system_prompt})
+        messages.append({"role": "user", "content": user_content})
+        if text is not None:
+            messages.append({"role": "assistant", "content": text})
+        # Tokenize
+        tokenized = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=(text is None),
+            return_tensors=return_tensors,
+            enable_thinking=False,  # Disable Qwen3 thinking mode; templates without it ignore this
+        )
+        # Handle both tensor and BatchEncoding returns
+        if isinstance(tokenized, torch.Tensor):
+            input_ids = tokenized
+        else:
+            # BatchEncoding or dict-like object
+            input_ids = tokenized["input_ids"]
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        result["input_ids"] = input_ids
+        result["attention_mask"] = torch.ones_like(input_ids)
+        return result
+ASRProcessor.register_for_auto_class()
+transformers.AutoProcessor.register(ASRConfig, ASRProcessor)
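The length formula in `_compute_encoder_output_length` above is the standard 1-D convolution output-size formula with dilation 1. The sketch below restates it as a free function (the name `conv_output_length` is chosen for this example) and applies it to the default `(pad, kernel, stride)` layers from the diff. With 3000 mel frames, the length of a 30-second Whisper input, it yields 1500, which matches `max_source_positions` in the `config.json` added by this commit.

```python
def conv_output_length(length: int, conv_layers: list[tuple[int, int, int]]) -> int:
    """Apply the conv output-length formula once per (padding, kernel, stride) layer."""
    for padding, kernel_size, stride in conv_layers:
        length = (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1
    return length


# Default layers from ASRProcessor.DEFAULT_ENCODER_CONV_LAYERS in the diff.
WHISPER_CONV_LAYERS = [(1, 3, 1), (1, 3, 2)]

if __name__ == "__main__":
    for mel_frames in (3000, 100, 101):
        print(mel_frames, "->", conv_output_length(mel_frames, WHISPER_CONV_LAYERS))
```

The first layer (stride 1) preserves the length and the second (stride 2) roughly halves it, rounding up for odd inputs.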

chat_template.jinja ADDED

	@@ -0,0 +1,6 @@

+{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
+You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
+' }}{% endif %}{{'<|im_start|>' + message['role'] + '
+' + message['content'] + '<|im_end|>' + '
+'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
+' }}{% endif %}
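To make the template's behavior concrete, here is a plain-Python sketch of what it appears to render, written without Jinja so it runs with the standard library alone. The function name `render_chatml` is an assumption for illustration, and the output is intended to match the template only up to trailing whitespace.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Approximate the chat template: ChatML turns with a default system turn."""
    parts: list[str] = []
    # Like the template, inject the default system turn only when the
    # conversation does not already start with a system message.
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = render_chatml(
        [{"role": "user", "content": "<audio><audio>"}], add_generation_prompt=True
    )
    print(prompt)
```

If this is the template the tokenizer actually applies, then a call to `ASRProcessor` without a `system_prompt` would still produce a prompt that begins with the default SmolLM system turn, since the processor only adds a system message when one is supplied.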

config.json ADDED

	@@ -0,0 +1,436 @@

+{
+  "architectures": [
+    "ASRModel"
+  ],
+  "attn_implementation": "sdpa",
+  "audio_config": {
+    "_name_or_path": "openai/whisper-base",
+    "activation_dropout": 0.0,
+    "activation_function": "gelu",
+    "apply_spec_augment": false,
+    "architectures": [
+      "WhisperForConditionalGeneration"
+    ],
+    "attention_dropout": 0.0,
+    "begin_suppress_tokens": [
+      220,
+      50257
+    ],
+    "bos_token_id": 50257,
+    "classifier_proj_size": 256,
+    "d_model": 512,
+    "decoder_attention_heads": 8,
+    "decoder_ffn_dim": 2048,
+    "decoder_layerdrop": 0.0,
+    "decoder_layers": 6,
+    "decoder_start_token_id": 50258,
+    "dropout": 0.0,
+    "dtype": "bfloat16",
+    "encoder_attention_heads": 8,
+    "encoder_ffn_dim": 2048,
+    "encoder_layerdrop": 0.0,
+    "encoder_layers": 6,
+    "eos_token_id": 50257,
+    "forced_decoder_ids": [
+      [
+        1,
+        50259
+      ],
+      [
+        2,
+        50359
+      ],
+      [
+        3,
+        50363
+      ]
+    ],
+    "init_std": 0.02,
+    "mask_feature_length": 10,
+    "mask_feature_min_masks": 0,
+    "mask_feature_prob": 0.0,
+    "mask_time_length": 10,
+    "mask_time_min_masks": 2,
+    "mask_time_prob": 0.05,
+    "max_source_positions": 1500,
+    "max_target_positions": 448,
+    "median_filter_width": 7,
+    "model_type": "whisper",
+    "num_mel_bins": 80,
+    "pad_token_id": 50257,
+    "scale_embedding": false,
+    "suppress_tokens": [
+      1,
+      2,
+      7,
+      8,
+      9,
+      10,
+      14,
+      25,
+      26,
+      27,
+      28,
+      29,
+      31,
+      58,
+      59,
+      60,
+      61,
+      62,
+      63,
+      90,
+      91,
+      92,
+      93,
+      359,
+      503,
+      522,
+      542,
+      873,
+      893,
+      902,
+      918,
+      922,
+      931,
+      1350,
+      1853,
+      1982,
+      2460,
+      2627,
+      3246,
+      3253,
+      3268,
+      3536,
+      3846,
+      3961,
+      4183,
+      4667,
+      6585,
+      6647,
+      7273,
+      9061,
+      9383,
+      10428,
+      10929,
+      11938,
+      12033,
+      12331,
+      12562,
+      13793,
+      14157,
+      14635,
+      15265,
+      15618,
+      16553,
+      16604,
+      18362,
+      18956,
+      20075,
+      21675,
+      22520,
+      26130,
+      26161,
+      26435,
+      28279,
+      29464,
+      31650,
+      32302,
+      32470,
+      36865,
+      42863,
+      47425,
+      49870,
+      50254,
+      50258,
+      50358,
+      50359,
+      50360,
+      50361,
+      50362
+    ],
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "use_weighted_layer_sum": false,
+    "vocab_size": 51865
+  },
+  "audio_model_id": "openai/whisper-base",
+  "audio_sample_rate": 16000,
+  "auto_map": {
+    "AutoConfig": "asr_config.ASRConfig",
+    "AutoModel": "asr_modeling.ASRModel",
+    "AutoModelForSpeechSeq2Seq": "asr_modeling.ASRModel",
+    "AutoProcessor": "asr_processing.ASRProcessor"
+  },
+  "custom_pipelines": {
+    "automatic-speech-recognition": {
+      "impl": "asr_pipeline.ASRPipeline",
+      "pt": [
+        "AutoModelForSpeechSeq2Seq"
+      ],
+      "tf": [],
+      "type": "audio"
+    }
+  },
+  "do_sample": false,
+  "downsample_rate": 5,
+  "dtype": "bfloat16",
+  "encoder": {
+    "_name_or_path": "openai/whisper-base",
+    "activation_dropout": 0.0,
+    "activation_function": "gelu",
+    "apply_spec_augment": false,
+    "architectures": [
+      "WhisperForConditionalGeneration"
+    ],
+    "attention_dropout": 0.0,
+    "begin_suppress_tokens": [
+      220,
+      50257
+    ],
+    "bos_token_id": 50257,
+    "classifier_proj_size": 256,
+    "d_model": 512,
+    "decoder_attention_heads": 8,
+    "decoder_ffn_dim": 2048,
+    "decoder_layerdrop": 0.0,
+    "decoder_layers": 6,
+    "decoder_start_token_id": 50258,
+    "dropout": 0.0,
+    "dtype": "bfloat16",
+    "encoder_attention_heads": 8,
+    "encoder_ffn_dim": 2048,
+    "encoder_layerdrop": 0.0,
+    "encoder_layers": 6,
+    "eos_token_id": 50257,
+    "forced_decoder_ids": [
+      [
+        1,
+        50259
+      ],
+      [
+        2,
+        50359
+      ],
+      [
+        3,
+        50363
+      ]
+    ],
+    "init_std": 0.02,
+    "mask_feature_length": 10,
+    "mask_feature_min_masks": 0,
+    "mask_feature_prob": 0.0,
+    "mask_time_length": 10,
+    "mask_time_min_masks": 2,
+    "mask_time_prob": 0.05,
+    "max_source_positions": 1500,
+    "max_target_positions": 448,
+    "median_filter_width": 7,
+    "model_type": "whisper",
+    "num_mel_bins": 80,
+    "pad_token_id": 50257,
+    "scale_embedding": false,
+    "suppress_tokens": [
+      1,
+      2,
+      7,
+      8,
+      9,
+      10,
+      14,
+      25,
+      26,
+      27,
+      28,
+      29,
+      31,
+      58,
+      59,
+      60,
+      61,
+      62,
+      63,
+      90,
+      91,
+      92,
+      93,
+      359,
+      503,
+      522,
+      542,
+      873,
+      893,
+      902,
+      918,
+      922,
+      931,
+      1350,
+      1853,
+      1982,
+      2460,
+      2627,
+      3246,
+      3253,
+      3268,
+      3536,
+      3846,
+      3961,
+      4183,
+      4667,
+      6585,
+      6647,
+      7273,
+      9061,
+      9383,
+      10428,
+      10929,
+      11938,
+      12033,
+      12331,
+      12562,
+      13793,
+      14157,
+      14635,
+      15265,
+      15618,
+      16553,
+      16604,
+      18362,
+      18956,
+      20075,
+      21675,
+      22520,
+      26130,
+      26161,
+      26435,
+      28279,
+      29464,
+      31650,
+      32302,
+      32470,
+      36865,
+      42863,
+      47425,
+      49870,
+      50254,
+      50258,
+      50358,
+      50359,
+      50360,
+      50361,
+      50362
+    ],
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "use_weighted_layer_sum": false,
+    "vocab_size": 51865
+  },
+  "encoder_conv_layers": [
+    [
+      1,
+      3,
+      1
+    ],
+    [
+      1,
+      3,
+      2
+    ]
+  ],
+  "encoder_dim": 512,
+  "freeze_projector": false,
+  "freq_mask_length": 27,
+  "inference_warmup_tokens": 10,
+  "label_smoothing": 0.0,
+  "length_penalty": 1.0,
+  "llm_dim": 960,
+  "lora_alpha": 32,
+  "lora_dropout": 0.0,
+  "lora_rank": 8,
+  "lora_target_modules": [
+    "q_proj",
+    "k_proj",
+    "v_proj",
+    "o_proj",
+    "gate_proj",
+    "up_proj",
+    "down_proj"
+  ],
+  "max_new_tokens": 128,
+  "min_new_tokens": 0,
+  "model_dtype": "bfloat16",
+  "model_type": "asr_model",
+  "no_repeat_ngram_size": 0,
+  "num_beams": 1,
+  "num_experts": 4,
+  "num_experts_per_tok": 2,
+  "num_freq_masks": 2,
+  "num_time_masks": 2,
+  "pipeline_tag": "automatic-speech-recognition",
+  "pretrained_model_path": "mazesmazes/tiny-audio-embedded",
+  "projector_dropout": 0.0,
+  "projector_hidden_dim": 512,
+  "projector_init_std": 0.02,
+  "projector_num_layers": 2,
+  "projector_pool_stride": 4,
+  "projector_type": "mlp",
+  "qformer_hidden_size": null,
+  "qformer_intermediate_size": null,
+  "qformer_num_heads": 16,
+  "qformer_num_layers": 2,
+  "qformer_window_size": 15,
+  "repetition_penalty": 1.0,
+  "router_aux_loss_coef": 0.01,
+  "system_prompt": "",
+  "temperature": null,
+  "text_config": {
+    "_name_or_path": "HuggingFaceTB/SmolLM2-360M-Instruct",
+    "architectures": [
+      "LlamaForCausalLM"
+    ],
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bos_token_id": 1,
+    "dtype": "bfloat16",
+    "eos_token_id": 2,
+    "head_dim": 64,
+    "hidden_act": "silu",
+    "hidden_size": 960,
+    "initializer_range": 0.02,
+    "intermediate_size": 2560,
+    "is_llama_config": true,
+    "max_position_embeddings": 8192,
+    "mlp_bias": false,
+    "model_type": "llama",
+    "num_attention_heads": 15,
+    "num_hidden_layers": 32,
+    "num_key_value_heads": 5,
+    "pad_token_id": 2,
+    "pretraining_tp": 1,
+    "rms_norm_eps": 1e-05,
+    "rope_interleaved": false,
+    "rope_parameters": {
+      "rope_theta": 100000,
+      "rope_type": "default"
+    },
+    "tie_word_embeddings": true,
+    "transformers.js_config": {
+      "kv_cache_dtype": {
+        "fp16": "float16",
+        "q4f16": "float16"
+      }
+    },
+    "use_cache": true,
+    "vocab_size": 49153
+  },
+  "text_model_id": "HuggingFaceTB/SmolLM2-360M-Instruct",
+  "time_mask_length": 100,
+  "top_k": null,
+  "top_p": null,
+  "transformers_version": "5.6.0",
+  "use_cache": false,
+  "use_lora": false,
+  "use_specaugment": true,
+  "vocab_size": 49153
+}
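Several top-level fields in this config repeat values from the nested encoder and LLM configs (`encoder_dim` and `audio_config.d_model`, `llm_dim` and `text_config.hidden_size`, and the two `vocab_size` entries). A small consistency check can catch drift between them. The sketch below is a hypothetical helper, not something the commit provides, and `EXAMPLE` is an abbreviated copy of the values shown above.

```python
def find_dim_mismatches(config: dict) -> list[str]:
    """Return one message per top-level field that disagrees with its nested source."""
    pairs = [
        ("encoder_dim", config["encoder_dim"],
         "audio_config.d_model", config["audio_config"]["d_model"]),
        ("llm_dim", config["llm_dim"],
         "text_config.hidden_size", config["text_config"]["hidden_size"]),
        ("vocab_size", config["vocab_size"],
         "text_config.vocab_size", config["text_config"]["vocab_size"]),
    ]
    return [
        f"{left_name}={left} != {right_name}={right}"
        for left_name, left, right_name, right in pairs
        if left != right
    ]


# Abbreviated copy of the relevant values from config.json in this commit.
EXAMPLE = {
    "encoder_dim": 512,
    "llm_dim": 960,
    "vocab_size": 49153,
    "audio_config": {"d_model": 512},
    "text_config": {"hidden_size": 960, "vocab_size": 49153},
}

if __name__ == "__main__":
    print(find_dim_mismatches(EXAMPLE))
    print(find_dim_mismatches({**EXAMPLE, "llm_dim": 1024}))
```

For the values in this commit the check reports no mismatches; changing `llm_dim` alone produces a single message naming both fields.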

diarization.py ADDED

	@@ -0,0 +1,732 @@

+"""Speaker diarization using TEN-VAD + ECAPA-TDNN + spectral clustering.
+Spectral clustering implementation adapted from FunASR/3D-Speaker:
+https://github.com/alibaba-damo-academy/FunASR
+MIT License (https://opensource.org/licenses/MIT)
+"""
+import warnings
+import numpy as np
+import scipy
+import sklearn.metrics.pairwise
+import torch
+from sklearn.cluster._kmeans import k_means
+from sklearn.preprocessing import normalize
+def _get_device() -> torch.device:
+    """Get best available device for inference."""
+    if torch.cuda.is_available():
+        return torch.device("cuda")
+    if torch.backends.mps.is_available():
+        return torch.device("mps")
+    return torch.device("cpu")
+class SpectralCluster:
+    """Spectral clustering using unnormalized Laplacian of affinity matrix.
+    Adapted from FunASR/3D-Speaker and SpeechBrain implementations.
+    Uses eigenvalue gap to automatically determine number of speakers.
+    """
+    def __init__(self, min_num_spks: int = 1, max_num_spks: int = 15, pval: float = 0.06):
+        self.min_num_spks = min_num_spks
+        self.max_num_spks = max_num_spks
+        self.pval = pval
+    def __call__(self, embeddings: np.ndarray, oracle_num: int | None = None) -> np.ndarray:
+        """Run spectral clustering on embeddings.
+        Args:
+            embeddings: Speaker embeddings of shape [N, D]
+            oracle_num: Optional known number of speakers
+        Returns:
+            Cluster labels of shape [N]
+        """
+        # Similarity matrix computation
+        sim_mat = self.get_sim_mat(embeddings)
+        # Refine the similarity matrix with p-pruning
+        pruned_sim_mat = self.p_pruning(sim_mat)
+        # Symmetrization
+        sym_pruned_sim_mat = 0.5 * (pruned_sim_mat + pruned_sim_mat.T)
+        # Laplacian calculation
+        laplacian = self.get_laplacian(sym_pruned_sim_mat)
+        # Get Spectral Embeddings
+        emb, num_of_spk = self.get_spec_embs(laplacian, oracle_num)
+        # Perform clustering
+        return self.cluster_embs(emb, num_of_spk)
+    def get_sim_mat(self, embeddings: np.ndarray) -> np.ndarray:
+        """Compute cosine similarity matrix."""
+        return sklearn.metrics.pairwise.cosine_similarity(embeddings, embeddings)
+    def p_pruning(self, affinity: np.ndarray) -> np.ndarray:
+        """Zero all but the top pval fraction of each row (modifies affinity in place)."""
+        n = affinity.shape[0]
+        pval = max(self.pval, 6.0 / n)
+        k_keep = max(1, int(pval * n))
+        # Vectorized: find top-k indices per row and zero out the rest
+        top_k_idx = np.argpartition(affinity, -k_keep, axis=1)[:, -k_keep:]
+        mask = np.zeros_like(affinity, dtype=bool)
+        np.put_along_axis(mask, top_k_idx, True, axis=1)
+        affinity[~mask] = 0
+        return affinity
+    def get_laplacian(self, sim_mat: np.ndarray) -> np.ndarray:
+        """Compute unnormalized Laplacian matrix."""
+        from scipy.sparse.csgraph import laplacian
+        np.fill_diagonal(sim_mat, 0)
+        return laplacian(sim_mat, normed=False)
+    def get_spec_embs(
+        self, laplacian: np.ndarray, k_oracle: int | None = None
+    ) -> tuple[np.ndarray, int]:
+        """Extract spectral embeddings from Laplacian."""
+        lambdas, eig_vecs = scipy.linalg.eigh(laplacian)
+        if k_oracle is not None:
+            num_of_spk = k_oracle
+        else:
+            lambda_gap_list = self.get_eigen_gaps(
+                lambdas[self.min_num_spks - 1 : self.max_num_spks + 1]
+            )
+            num_of_spk = int(np.argmax(lambda_gap_list)) + self.min_num_spks
+        emb = eig_vecs[:, :num_of_spk]
+        return emb, num_of_spk
+    def cluster_embs(self, emb: np.ndarray, k: int) -> np.ndarray:
+        """Cluster spectral embeddings using k-means."""
+        _, labels, _ = k_means(emb, k, n_init=10)
+        return labels
+    def get_eigen_gaps(self, eig_vals: np.ndarray) -> np.ndarray:
+        """Compute gaps between consecutive eigenvalues."""
+        return np.diff(eig_vals)
+class SpeakerClusterer:
+    """Speaker clustering backend using spectral clustering with speaker merging.
+    Features:
+    - Spectral clustering with eigenvalue gap for auto speaker count detection
+    - P-pruning for affinity matrix refinement
+    - Post-clustering speaker merging by cosine similarity
+    """
+    def __init__(
+        self,
+        min_num_spks: int = 2,
+        max_num_spks: int = 10,
+        merge_thr: float = 0.90,  # Moderate merging
+    ):
+        self.min_num_spks = min_num_spks
+        self.max_num_spks = max_num_spks
+        self.merge_thr = merge_thr
+        self._spectral_cluster: SpectralCluster | None = None
+    def _get_spectral_cluster(self) -> SpectralCluster:
+        """Lazy-load spectral clusterer."""
+        if self._spectral_cluster is None:
+            self._spectral_cluster = SpectralCluster(
+                min_num_spks=self.min_num_spks,
+                max_num_spks=self.max_num_spks,
+            )
+        return self._spectral_cluster
+    def __call__(self, embeddings: np.ndarray, num_speakers: int | None = None) -> np.ndarray:
+        """Cluster speaker embeddings and return labels.
+        Args:
+            embeddings: Speaker embeddings of shape [N, D]
+            num_speakers: Optional oracle number of speakers
+        Returns:
+            Cluster labels of shape [N]
+        """
+        import warnings
+        if len(embeddings.shape) != 2:
+            raise ValueError(f"Expected 2D array, got shape {embeddings.shape}")
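+        # Handle edge cases: empty input and single embeddings
+        if embeddings.shape[0] == 0:
+            return np.array([], dtype=int)
+        if embeddings.shape[0] == 1:
+            return np.array([0], dtype=int)
+        # Too few embeddings for spectral clustering: assign everything to one speaker
+        if embeddings.shape[0] < 6:
+            return np.zeros(embeddings.shape[0], dtype=int)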
+        # Replace NaN/inf, then L2-normalize the embeddings
+        embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
+        embeddings = normalize(embeddings)
+        spectral = self._get_spectral_cluster()
+        # Pin min/max to the oracle speaker count for this call
+        if num_speakers is not None:
+            spectral.min_num_spks = num_speakers
+            spectral.max_num_spks = num_speakers
+        try:
+            # Run spectral clustering (suppress numerical warnings)
+            with warnings.catch_warnings():
+                warnings.filterwarnings("ignore", category=RuntimeWarning)
+                labels = spectral(embeddings, oracle_num=num_speakers)
+        finally:
+            # Restore min/max even if clustering raises
+            if num_speakers is not None:
+                spectral.min_num_spks = self.min_num_spks
+                spectral.max_num_spks = self.max_num_spks
+        # Merge similar speakers if no oracle
+        if num_speakers is None:
+            labels = self._merge_by_cos(labels, embeddings, self.merge_thr)
+        # Re-index labels sequentially
+        _, labels = np.unique(labels, return_inverse=True)
+        return labels
+    def _merge_by_cos(self, labels: np.ndarray, embs: np.ndarray, cos_thr: float) -> np.ndarray:
+        """Merge similar speakers by cosine similarity of centroids."""
+        from scipy.cluster.hierarchy import fcluster, linkage
+        from scipy.spatial.distance import pdist
+        unique_labels = np.unique(labels)
+        if len(unique_labels) <= 1:
+            return labels
+        # Compute normalized speaker centroids
+        centroids = np.array([embs[labels == lbl].mean(0) for lbl in unique_labels])
+        centroids = normalize(centroids)
+        # Hierarchical clustering with cosine distance
+        distances = pdist(centroids, metric="cosine")
+        linkage_matrix = linkage(distances, method="average")
+        merged_labels = fcluster(linkage_matrix, t=1.0 - cos_thr, criterion="distance") - 1
+        # Map original labels to merged labels
+        label_map = dict(zip(unique_labels, merged_labels))
+        return np.array([label_map[lbl] for lbl in labels])
+class LocalSpeakerDiarizer:
+    """Local speaker diarization using TEN-VAD + ECAPA-TDNN + spectral clustering.
+    Pipeline:
+    1. TEN-VAD detects speech segments
+    2. Sliding window (WINDOW_SIZE = 0.75s, STEP_SIZE = 0.15s, i.e. 80% overlap) for uniform embedding extraction
+    3. ECAPA-TDNN extracts speaker embeddings per window
+    4. Spectral clustering with eigenvalue gap for auto speaker detection
+    5. Frame-level consensus voting for segment reconstruction
+    6. Post-processing merges short segments to reduce flicker
+    Tunable Parameters (class attributes):
+    - WINDOW_SIZE: Embedding extraction window size in seconds
+    - STEP_SIZE: Sliding window step size (overlap = WINDOW_SIZE - STEP_SIZE)
+    - VAD_THRESHOLD: Speech detection threshold (lower = more sensitive)
+    - VAD_MIN_DURATION: Minimum speech segment duration
+    - VAD_MAX_GAP: Maximum gap to bridge between segments
+    - VAD_PAD_ONSET/OFFSET: Padding added to speech segments
+    - VOTING_RATE: Frame resolution for consensus voting
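+    - MIN_SEGMENT_DURATION: Minimum final segment duration
+    - SHORT_SEGMENT_GAP: Maximum gap to absorb a short segment into the previous same-speaker segment
+    - SAME_SPEAKER_GAP: Maximum gap to merge same-speaker segments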
+    - TAIL_COVERAGE_RATIO: Minimum tail coverage to add extra window
+    """
+    _ten_vad_model = None
+    _ecapa_model = None
+    _device = None
+    # ==================== TUNABLE PARAMETERS ====================
+    # Sliding window for embedding extraction
+    WINDOW_SIZE = 0.75  # seconds - shorter window for finer resolution
+    STEP_SIZE = 0.15  # seconds (80% overlap for more votes)
+    TAIL_COVERAGE_RATIO = 0.1  # Add extra window if tail > this ratio of window
+    # VAD hysteresis parameters
+    VAD_THRESHOLD = 0.25  # Balanced threshold
+    VAD_MIN_DURATION = 0.05  # Minimum speech segment duration (seconds)
+    VAD_MAX_GAP = 0.50  # Bridge gaps shorter than this (seconds)
+    VAD_PAD_ONSET = 0.05  # Padding at segment start (seconds)
+    VAD_PAD_OFFSET = 0.05  # Padding at segment end (seconds)
+    # Frame-level voting
+    VOTING_RATE = 0.01  # 10ms resolution for consensus voting
+    # Post-processing
+    MIN_SEGMENT_DURATION = 0.15  # Minimum final segment duration (seconds)
+    SHORT_SEGMENT_GAP = 0.1  # Gap threshold for merging short segments
+    SAME_SPEAKER_GAP = 0.5  # Gap threshold for merging same-speaker segments
+    # ===========================================================
+    @classmethod
+    def _get_ten_vad_model(cls):
+        """Lazy-load TEN-VAD model (singleton)."""
+        if cls._ten_vad_model is None:
+            from ten_vad import TenVad
+            cls._ten_vad_model = TenVad(hop_size=256, threshold=cls.VAD_THRESHOLD)
+        return cls._ten_vad_model
+    @classmethod
+    def _get_device(cls) -> torch.device:
+        """Get the best available device."""
+        if cls._device is None:
+            cls._device = _get_device()
+        return cls._device
+    @classmethod
+    def _get_ecapa_model(cls):
+        """Lazy-load ECAPA-TDNN speaker embedding model (singleton)."""
+        if cls._ecapa_model is None:
+            # Suppress torchaudio deprecation warning from SpeechBrain
+            with warnings.catch_warnings():
+                warnings.filterwarnings("ignore", message="torchaudio._backend")
+                from speechbrain.inference.speaker import EncoderClassifier
+                device = cls._get_device()
+                cls._ecapa_model = EncoderClassifier.from_hparams(
+                    source="speechbrain/spkrec-ecapa-voxceleb",
+                    run_opts={"device": str(device)},
+                )
+        return cls._ecapa_model
+    @classmethod
+    def diarize(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int = 2,
+        max_speakers: int = 10,
+        **_kwargs,
+    ) -> list[dict]:
+        """Run speaker diarization on audio.
+        Args:
+            audio: Audio waveform as numpy array or path to audio file
+            sample_rate: Audio sample rate (default 16000)
+            num_speakers: Exact number of speakers (if known)
+            min_speakers: Minimum number of speakers
+            max_speakers: Maximum number of speakers
+        Returns:
+            List of dicts with 'speaker', 'start', 'end' keys
+        """
+        # Handle file path input
+        if isinstance(audio, str):
+            import librosa
+            audio, sample_rate = librosa.load(audio, sr=16000)
+        # Ensure correct sample rate
+        if sample_rate != 16000:
+            import librosa
+            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
+            sample_rate = 16000
+        audio = audio.astype(np.float32)
+        total_duration = len(audio) / sample_rate
+        # Step 1: VAD (returns segments and raw frame-level decisions)
+        segments, vad_frames = cls._get_speech_segments(audio, sample_rate)
+        if not segments:
+            return []
+        # Step 2: Extract embeddings
+        embeddings, window_segments = cls._extract_embeddings(audio, segments, sample_rate)
+        if len(embeddings) == 0:
+            return []
+        # Step 3: Cluster
+        clusterer = SpeakerClusterer(min_num_spks=min_speakers, max_num_spks=max_speakers)
+        labels = clusterer(embeddings, num_speakers)
+        # Step 4: Post-process with consensus voting (VAD-aware)
+        return cls._postprocess_segments(window_segments, labels, total_duration, vad_frames)
+    @classmethod
+    def _get_speech_segments(
+        cls, audio_array: np.ndarray, sample_rate: int = 16000
+    ) -> tuple[list[dict], list[bool]]:
+        """Get speech segments using TEN-VAD.
+        Returns:
+            Tuple of (segments list, vad_frames list of per-frame speech decisions)
+        """
+        vad_model = cls._get_ten_vad_model()
+        # Convert to int16 as required by TEN-VAD
+        # Clip to prevent integer overflow
+        if audio_array.dtype != np.int16:
+            audio_int16 = (np.clip(audio_array, -1.0, 1.0) * 32767).astype(np.int16)
+        else:
+            audio_int16 = audio_array
+        # Process frame by frame (a trailing partial frame is dropped)
+        hop_size = 256
+        frame_duration = hop_size / sample_rate
+        speech_frames: list[bool] = []
+        # "+ 1" keeps the last full frame when the length is an exact multiple of hop_size
+        for i in range(0, len(audio_int16) - hop_size + 1, hop_size):
+            frame = audio_int16[i : i + hop_size]
+            _, is_speech = vad_model.process(frame)
+            speech_frames.append(bool(is_speech))
+        # Convert frame-level decisions to segments
+        segments = []
+        in_speech = False
+        start_idx = 0
+        for i, is_speech in enumerate(speech_frames):
+            if is_speech and not in_speech:
+                start_idx = i
+                in_speech = True
+            elif not is_speech and in_speech:
+                start_time = start_idx * frame_duration
+                end_time = i * frame_duration
+                segments.append(
+                    {
+                        "start": start_time,
+                        "end": end_time,
+                        "start_sample": int(start_time * sample_rate),
+                        "end_sample": int(end_time * sample_rate),
+                    }
+                )
+                in_speech = False
+        # Handle trailing speech
+        if in_speech:
+            start_time = start_idx * frame_duration
+            end_time = len(speech_frames) * frame_duration
+            segments.append(
+                {
+                    "start": start_time,
+                    "end": end_time,
+                    "start_sample": int(start_time * sample_rate),
+                    "end_sample": int(end_time * sample_rate),
+                }
+            )
+        return cls._apply_vad_hysteresis(segments, sample_rate), speech_frames
+    @classmethod
+    def _apply_vad_hysteresis(cls, segments: list[dict], sample_rate: int = 16000) -> list[dict]:
+        """Apply hysteresis-like post-processing to VAD segments."""
+        if not segments:
+            return segments
+        segments = sorted(segments, key=lambda x: x["start"])
+        # Fill short gaps
+        merged = [segments[0].copy()]
+        for seg in segments[1:]:
+            gap = seg["start"] - merged[-1]["end"]
+            if gap <= cls.VAD_MAX_GAP:
+                merged[-1]["end"] = seg["end"]
+                merged[-1]["end_sample"] = seg["end_sample"]
+            else:
+                merged.append(seg.copy())
+        # Remove short segments
+        filtered = [seg for seg in merged if (seg["end"] - seg["start"]) >= cls.VAD_MIN_DURATION]
+        # Dilate segments (add padding)
+        for seg in filtered:
+            seg["start"] = max(0.0, seg["start"] - cls.VAD_PAD_ONSET)
+            seg["end"] = seg["end"] + cls.VAD_PAD_OFFSET
+            seg["start_sample"] = int(seg["start"] * sample_rate)
+            seg["end_sample"] = int(seg["end"] * sample_rate)
+        return filtered
+    @classmethod
+    def _extract_embeddings(
+        cls, audio_array: np.ndarray, segments: list[dict], sample_rate: int
+    ) -> tuple[np.ndarray, list[dict]]:
+        """Extract speaker embeddings using sliding windows."""
+        speaker_model = cls._get_ecapa_model()
+        window_samples = int(cls.WINDOW_SIZE * sample_rate)
+        step_samples = int(cls.STEP_SIZE * sample_rate)
+        embeddings = []
+        window_segments = []
+        with torch.no_grad():
+            for seg in segments:
+                seg_start = seg["start_sample"]
+                seg_end = seg["end_sample"]
+                seg_len = seg_end - seg_start
+                # Generate window positions
+                if seg_len <= window_samples:
+                    starts = [seg_start]
+                    ends = [seg_end]
+                else:
+                    starts = list(range(seg_start, seg_end - window_samples + 1, step_samples))
+                    ends = [s + window_samples for s in starts]
+                    # Cover tail if > TAIL_COVERAGE_RATIO of window remains
+                    if ends and ends[-1] < seg_end:
+                        remainder = seg_end - ends[-1]
+                        if remainder > (window_samples * cls.TAIL_COVERAGE_RATIO):
+                            starts.append(seg_end - window_samples)
+                            ends.append(seg_end)
+                for c_start, c_end in zip(starts, ends):
+                    chunk = audio_array[c_start:c_end]
+                    # Pad short chunks with reflection
+                    if len(chunk) < window_samples:
+                        pad_width = window_samples - len(chunk)
+                        chunk = np.pad(chunk, (0, pad_width), mode="reflect")
+                    # Extract embedding using SpeechBrain's encode_batch
+                    chunk_tensor = torch.from_numpy(chunk).float().unsqueeze(0)
+                    embedding = (
+                        speaker_model.encode_batch(chunk_tensor).squeeze(0).squeeze(0).cpu().numpy()
+                    )
+                    # Validate embedding
+                    if np.isfinite(embedding).all() and np.linalg.norm(embedding) > 1e-8:
+                        embeddings.append(embedding)
+                        window_segments.append(
+                            {
+                                "start": c_start / sample_rate,
+                                "end": c_end / sample_rate,
+                            }
+                        )
+        # Normalize all embeddings at once
+        if embeddings:
+            return normalize(np.array(embeddings)), window_segments
+        return np.array([]), []
+    @classmethod
+    def _resample_vad(cls, vad_frames: list[bool], num_frames: int) -> np.ndarray:
+        """Resample VAD frame decisions to match voting grid resolution.
+        VAD operates at 256 samples / 16000 Hz = 16ms per frame.
+        Voting operates at VOTING_RATE (default 10ms) per frame.
+        This maps VAD decisions to the finer voting grid.
+        """
+        if not vad_frames:
+            return np.zeros(num_frames, dtype=bool)
+        vad_rate = 256 / 16000  # 16ms per VAD frame
+        vad_arr = np.array(vad_frames)
+        # Vectorized: compute VAD frame indices for each voting frame
+        voting_times = np.arange(num_frames) * cls.VOTING_RATE
+        vad_indices = np.clip((voting_times / vad_rate).astype(int), 0, len(vad_arr) - 1)
+        return vad_arr[vad_indices]
+    @classmethod
+    def _postprocess_segments(
+        cls,
+        window_segments: list[dict],
+        labels: np.ndarray,
+        total_duration: float,
+        vad_frames: list[bool],
+    ) -> list[dict]:
+        """Post-process using frame-level consensus voting with VAD-aware silence."""
+        if not window_segments or len(labels) == 0:
+            return []
+        # Correct labels to be contiguous
+        unique_labels = np.unique(labels)
+        label_map = {old: new for new, old in enumerate(unique_labels)}
+        clean_labels = np.array([label_map[lbl] for lbl in labels])
+        num_speakers = len(unique_labels)
+        if num_speakers == 0:
+            return []
+        # Create voting grid
+        num_frames = int(np.ceil(total_duration / cls.VOTING_RATE)) + 1
+        votes = np.zeros((num_frames, num_speakers), dtype=np.float32)
+        # Accumulate votes
+        for win, label in zip(window_segments, clean_labels):
+            start_frame = int(win["start"] / cls.VOTING_RATE)
+            end_frame = int(win["end"] / cls.VOTING_RATE)
+            end_frame = min(end_frame, num_frames)
+            if start_frame < end_frame:
+                votes[start_frame:end_frame, label] += 1.0
+        # Determine winner per frame
+        frame_speakers = np.argmax(votes, axis=1)
+        max_votes = np.max(votes, axis=1)
+        # Resample VAD to voting grid resolution for silence-aware voting
+        vad_resampled = cls._resample_vad(vad_frames, num_frames)
+        # Convert frames to segments
+        final_segments = []
+        current_speaker = -1
+        seg_start = 0.0
+        for f in range(num_frames):
+            speaker = int(frame_speakers[f])
+            score = max_votes[f]
+            # Force silence if VAD says no speech OR no votes
+            if score == 0 or not vad_resampled[f]:
+                speaker = -1
+            if speaker != current_speaker:
+                if current_speaker != -1:
+                    final_segments.append(
+                        {
+                            "speaker": f"SPEAKER_{current_speaker}",
+                            "start": seg_start,
+                            "end": f * cls.VOTING_RATE,
+                        }
+                    )
+                current_speaker = speaker
+                seg_start = f * cls.VOTING_RATE
+        # Close last segment
+        if current_speaker != -1:
+            final_segments.append(
+                {
+                    "speaker": f"SPEAKER_{current_speaker}",
+                    "start": seg_start,
+                    "end": num_frames * cls.VOTING_RATE,
+                }
+            )
+        return cls._merge_short_segments(final_segments)
+    @classmethod
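+    def _merge_short_segments(cls, segments: list[dict]) -> list[dict]:
+        """Merge or drop short segments to reduce flicker.
+        Segments shorter than MIN_SEGMENT_DURATION extend the previous segment when it has
+        the same speaker and lies within SHORT_SEGMENT_GAP; otherwise they are dropped.
+        Longer segments merge with the previous same-speaker segment within SAME_SPEAKER_GAP.
+        """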
+        if not segments:
+            return []
+        clean: list[dict] = []
+        for seg in segments:
+            dur = seg["end"] - seg["start"]
+            if dur < cls.MIN_SEGMENT_DURATION:
+                if (
+                    clean
+                    and clean[-1]["speaker"] == seg["speaker"]
+                    and seg["start"] - clean[-1]["end"] < cls.SHORT_SEGMENT_GAP
+                ):
+                    clean[-1]["end"] = seg["end"]
+                continue
+            if (
+                clean
+                and clean[-1]["speaker"] == seg["speaker"]
+                and seg["start"] - clean[-1]["end"] < cls.SAME_SPEAKER_GAP
+            ):
+                clean[-1]["end"] = seg["end"]
+            else:
+                clean.append(seg)
+        return clean
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: list[dict],
+        speaker_segments: list[dict],
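+    ) -> list[dict]:
+        """Assign speaker labels to words using each word's midpoint timestamp.
+        A word takes the speaker of the first segment containing its midpoint; if none
+        does, it takes the speaker of the segment whose midpoint is nearest.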
+        Args:
+            words: List of word dicts with 'word', 'start', 'end' keys
+            speaker_segments: List of speaker dicts with 'speaker', 'start', 'end' keys
+        Returns:
+            Words list with 'speaker' key added to each word
+        """
+        for word in words:
+            word_mid = (word["start"] + word["end"]) / 2
+            # Find the speaker segment that contains this word's midpoint
+            best_speaker = None
+            for seg in speaker_segments:
+                if seg["start"] <= word_mid <= seg["end"]:
+                    best_speaker = seg["speaker"]
+                    break
+            # If no segment contains the midpoint, fall back to the segment with the nearest midpoint
+            if best_speaker is None and speaker_segments:
+                min_dist = float("inf")
+                for seg in speaker_segments:
+                    seg_mid = (seg["start"] + seg["end"]) / 2
+                    dist = abs(word_mid - seg_mid)
+                    if dist < min_dist:
+                        min_dist = dist
+                        best_speaker = seg["speaker"]
+            word["speaker"] = best_speaker
+        return words
+class SpeakerDiarizer:
+    """Speaker diarization using TEN-VAD + ECAPA-TDNN + spectral clustering.
+    Example:
+        >>> segments = SpeakerDiarizer.diarize(audio_array)
+        >>> for seg in segments:
+        ...     print(f"{seg['speaker']}: {seg['start']:.2f} - {seg['end']:.2f}")
+    """
+    @classmethod
+    def diarize(
+        cls,
+        audio: np.ndarray | str,
+        sample_rate: int = 16000,
+        num_speakers: int | None = None,
+        min_speakers: int | None = None,
+        max_speakers: int | None = None,
+        **_kwargs,
+    ) -> list[dict]:
+        """Run speaker diarization on audio.
+        Args:
+            audio: Audio waveform as numpy array or path to audio file
+            sample_rate: Audio sample rate (default 16000)
+            num_speakers: Exact number of speakers (if known)
+            min_speakers: Minimum number of speakers
+            max_speakers: Maximum number of speakers
+        Returns:
+            List of dicts with 'speaker', 'start', 'end' keys
+        """
+        return LocalSpeakerDiarizer.diarize(
+            audio,
+            sample_rate=sample_rate,
+            num_speakers=num_speakers,
+            min_speakers=min_speakers or 2,
+            max_speakers=max_speakers or 10,
+        )
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: list[dict],
+        speaker_segments: list[dict],
+    ) -> list[dict]:
+        """Assign speaker labels to words by word midpoint (delegates to LocalSpeakerDiarizer)."""
+        return LocalSpeakerDiarizer.assign_speakers_to_words(words, speaker_segments)
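
Note on `diarization.py`: step 5 of the `LocalSpeakerDiarizer` pipeline (frame-level consensus voting) may be easier to follow on a toy input. The sketch below is a simplified, pure-Python illustration and not the code from this commit. It assumes integer millisecond timestamps instead of float seconds, omits the VAD mask, and uses hypothetical helper names (`vote_frames`, `frames_to_segments`).

```python
def vote_frames(windows, labels, total_ms, frame_ms=10):
    """Return the winning speaker per frame, or -1 where no window voted."""
    num_frames = -(-total_ms // frame_ms)  # ceiling division
    num_speakers = max(labels) + 1
    votes = [[0] * num_speakers for _ in range(num_frames)]
    # Every window casts one vote for its cluster label on each frame it covers
    for (start_ms, end_ms), label in zip(windows, labels):
        first = start_ms // frame_ms
        last = min(end_ms // frame_ms, num_frames)
        for f in range(first, last):
            votes[f][label] += 1
    winners = []
    for row in votes:
        best = max(row)
        # list.index returns the first maximum, so ties go to the lower label
        winners.append(row.index(best) if best > 0 else -1)
    return winners


def frames_to_segments(winners, frame_ms=10):
    """Collapse runs of identical frame winners into speaker segments."""
    segments = []
    current, seg_start = -1, 0
    # The trailing -1 is a sentinel that closes the final run
    for f, speaker in enumerate(winners + [-1]):
        if speaker != current:
            if current != -1:
                segments.append(
                    {
                        "speaker": f"SPEAKER_{current}",
                        "start": seg_start * frame_ms / 1000,
                        "end": f * frame_ms / 1000,
                    }
                )
            current, seg_start = speaker, f
    return segments


# Three overlapping 100 ms windows; the last one belongs to a second speaker
windows = [(0, 100), (50, 150), (100, 200)]
labels = [0, 0, 1]
winners = vote_frames(windows, labels, total_ms=200)
segments = frames_to_segments(winners)
```

In this toy case frames 10 to 14 receive one vote for each speaker, and the tie resolves to speaker 0, so the speaker change lands at 0.15 s rather than at the 0.10 s window boundary. Overlapping windows therefore smooth the boundary instead of snapping it to window edges.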

generation_config.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "do_sample": false,
+  "eos_token_id": [
+    2,
+    0
+  ],
+  "length_penalty": 1.0,
+  "max_new_tokens": 128,
+  "min_new_tokens": 0,
+  "no_repeat_ngram_size": 0,
+  "num_beams": 1,
+  "pad_token_id": 2,
+  "repetition_penalty": 1.0,
+  "transformers_version": "5.6.0",
+  "use_cache": true
+}
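
Note on `generation_config.json`: with `do_sample` false and `num_beams` 1 the settings describe greedy decoding that stops at the first of two end-of-sequence ids (2 or 0) or after 128 new tokens. The toy loop below sketches that stopping rule only. It is not how `transformers` implements generation, and `next_token` is a hypothetical callable standing in for the model's argmax step.

```python
def greedy_generate(next_token, bos_token_id=1, eos_token_ids=(2, 0), max_new_tokens=128):
    """Toy greedy loop: stop at the first EOS id or after max_new_tokens steps."""
    tokens = [bos_token_id]
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # hypothetical: returns the most likely next id
        tokens.append(token)
        if token in eos_token_ids:
            break
    return tokens


def scripted(tokens):
    """Stand-in model that replays a fixed script of token ids."""
    script = [7, 8, 9, 0, 5]
    return script[len(tokens) - 1]


out = greedy_generate(scripted)
capped = greedy_generate(lambda tokens: 7, max_new_tokens=3)
```

Here `out` ends at id 0, the second EOS id, and the trailing 5 in the script is never produced. `capped` shows the length limit taking over when no EOS id appears.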

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2c5c68a46ee3dd399313e095fc4f8526d4c25bcf92932d67729c85c9ed386dc6
+size 3081536
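
Note on `model.safetensors`: the three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical, and the parser assumes exactly the three keys shown in this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict; size is converted to int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:2c5c68a46ee3dd399313e095fc4f8526d4c25bcf92932d67729c85c9ed386dc6
size 3081536
"""
info = parse_lfs_pointer(pointer)
size_mib = info["size"] / (1024 * 1024)  # roughly 2.94 MiB
```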

preprocessor_config.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "chunk_length": 30,
+  "dither": 0.0,
+  "feature_extractor_type": "WhisperFeatureExtractor",
+  "feature_size": 80,
+  "hop_length": 160,
+  "n_fft": 400,
+  "n_samples": 480000,
+  "nb_max_frames": 3000,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": false,
+  "sampling_rate": 16000,
+  "processor_class": "ASRProcessor",
+  "auto_map": {
+    "AutoProcessor": "asr_processing.ASRProcessor"
+  }
+}
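
Note on `preprocessor_config.json`: several of these values are derived from one another. The short check below reproduces the arithmetic using only the numbers in the config; the variable names are illustrative.

```python
config = {
    "chunk_length": 30,      # seconds
    "sampling_rate": 16000,  # Hz
    "hop_length": 160,       # samples between frames
    "n_fft": 400,            # samples per analysis window
    "feature_size": 80,      # mel bins
    "n_samples": 480000,
    "nb_max_frames": 3000,
}

# 30 s of audio at 16 kHz
n_samples = config["chunk_length"] * config["sampling_rate"]
# One feature frame per hop
nb_max_frames = n_samples // config["hop_length"]
# Frame spacing and window length in milliseconds
hop_ms = 1000 * config["hop_length"] / config["sampling_rate"]
window_ms = 1000 * config["n_fft"] / config["sampling_rate"]
# Shape of the log-mel features for one padded 30 s chunk
feature_shape = (config["feature_size"], nb_max_frames)
```

So each 30 s chunk yields 3000 frames spaced 10 ms apart, each computed from a 25 ms window.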

projectors.py ADDED

	@@ -0,0 +1,505 @@

+"""Audio projector modules for bridging encoder and decoder embeddings.
+This module contains all projector architectures:
+- MLPAudioProjector: Simple 2-layer MLP with frame stacking downsampling
+- MOSAProjector: MOSA-style dense mixture of experts
+- MoEAudioProjector: Shared expert + sparse routed experts
+- QFormerAudioProjector: BLIP-2 QFormer with learnable queries (Granite-style)
+"""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+from transformers import AutoModel, Blip2QFormerConfig
+from transformers.models.llama.modeling_llama import LlamaRMSNorm
+# =============================================================================
+# MLP Projector
+# =============================================================================
+class MLPAudioProjector(nn.Module):
+    """2-layer MLP projector with frame-stacking downsampling (matches GLM-ASR)."""
+    def __init__(self, config):
+        """Initialize MLP projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, projector_pool_stride
+        """
+        super().__init__()
+        encoder_dim = getattr(config, "encoder_dim", 768)
+        llm_dim = getattr(config, "llm_dim", 2048)
+        self.k = getattr(config, "projector_pool_stride", 4)
+        # Frame stacking: concat k adjacent frames then project
+        in_dim = encoder_dim * self.k
+        # Hidden dim defaults to llm_dim, can be overridden via config
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or llm_dim
+        self.linear_1 = nn.Linear(in_dim, hidden_dim, bias=False)
+        self.norm = LlamaRMSNorm(hidden_dim, eps=1e-6)
+        self.act = nn.GELU()
+        self.linear_2 = nn.Linear(hidden_dim, llm_dim, bias=False)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length (matches GLM-ASR)."""
+        # GLM-ASR formula: (L - merge_factor) // merge_factor + 1
+        return (input_length - self.k) // self.k + 1
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features to LLM embedding space.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, (seq_len - k) // k + 1, llm_dim]
+        """
+        batch, seq, dim = x.shape
+        # Truncate to match GLM-ASR: use (seq - k) // k + 1 frames
+        # This drops trailing frames that don't fill a complete k-frame window
+        out_len = (seq - self.k) // self.k + 1
+        x = x[:, : out_len * self.k, :]  # Truncate to exact multiple
+        x = x.reshape(batch, out_len, dim * self.k)
+        x = self.linear_1(x)
+        x = self.norm(x)
+        x = self.act(x)
+        return self.linear_2(x)
+# =============================================================================
+# MoE Projector (MOSA-style)
+# =============================================================================
+class SimpleAdapter(nn.Module):
+    """Simple 2-layer GELU adapter (from MOSA paper)."""
+    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
+        super().__init__()
+        self.fc1 = nn.Linear(input_dim, hidden_dim)
+        self.act = nn.GELU()
+        self.fc2 = nn.Linear(hidden_dim, output_dim)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.fc2(self.act(self.fc1(x)))
+class SwiGLU(nn.Module):
+    """SwiGLU activation with gated linear units (used in LLaMA, Mistral, etc.)."""
+    def __init__(self, dim: int, hidden_dim: int, bias: bool = False):
+        super().__init__()
+        self.w1 = nn.Linear(dim, hidden_dim, bias=bias)  # Gate
+        self.w2 = nn.Linear(dim, hidden_dim, bias=bias)  # Value
+        self.w3 = nn.Linear(hidden_dim, dim, bias=bias)  # Output
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.w3(F.silu(self.w1(x)) * self.w2(x))
+class AsymmetricSwiGLU(nn.Module):
+    """SwiGLU that handles different input and output dimensions."""
+    def __init__(
+        self, in_features: int, hidden_features: int, out_features: int, bias: bool = False
+    ):
+        super().__init__()
+        self.w1 = nn.Linear(in_features, hidden_features, bias=bias)  # Gate
+        self.w2 = nn.Linear(in_features, hidden_features, bias=bias)  # Value
+        self.w3 = nn.Linear(hidden_features, out_features, bias=bias)  # Output
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.w3(F.silu(self.w1(x)) * self.w2(x))
+class MOSAProjector(nn.Module):
+    """MOSA-Base projector: simple 2-layer ReLU router with 4 simple adapters.
+    Based on "MOSA: Mixtures of Simple Adapters" (arXiv:2508.18998).
+    Uses softmax gating over all experts (dense MoE) with only cross-entropy loss.
+    Uses Conv1d for downsampling (2 layers, stride 2 each = 4x total).
+    """
+    def __init__(self, config):
+        """Initialize MOSA projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, num_experts
+        """
+        super().__init__()
+        self.encoder_dim = getattr(config, "encoder_dim", None) or 1280
+        self.llm_dim = getattr(config, "llm_dim", None) or 2048
+        self.num_experts = getattr(config, "num_experts", None) or 4  # MOSA-Base uses 4
+        adapter_hidden = getattr(config, "adapter_hidden_dim", None) or 4096
+        router_hidden = getattr(config, "router_hidden_dim", None) or 512
+        # --- 1. Conv1d Downsampler (4x reduction) ---
+        # 2 layers of stride-2 convolution
+        self.downsampler = nn.Sequential(
+            nn.Conv1d(self.encoder_dim, self.encoder_dim, kernel_size=3, stride=2, padding=1),
+            nn.GELU(),
+            nn.Conv1d(self.encoder_dim, self.llm_dim, kernel_size=3, stride=2, padding=1),
+            nn.GELU(),
+        )
+        # --- 2. Simple Router (MOSA-Base: 2 layers with ReLU) ---
+        # Takes downsampled features (llm_dim) -> 512 -> num_experts
+        self.router = nn.Sequential(
+            nn.Linear(self.llm_dim, router_hidden),
+            nn.ReLU(),
+            nn.Linear(router_hidden, self.num_experts),
+        )
+        # --- 3. Experts (Simple 2-layer GELU adapters) ---
+        # Each expert: llm_dim -> hidden -> llm_dim (much smaller than frame-stacking)
+        self.experts = nn.ModuleList(
+            [
+                SimpleAdapter(self.llm_dim, adapter_hidden, self.llm_dim)
+                for _ in range(self.num_experts)
+            ]
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features using mixture of experts.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, out_len, llm_dim]
+        """
+        # --- 1. Conv1d Downsampling ---
+        # Permute for Conv1d: [B, S, D] -> [B, D, S]
+        x = x.transpose(1, 2)
+        x = self.downsampler(x)
+        # Permute back: [B, D, S] -> [B, S, D]
+        x = x.transpose(1, 2)
+        # --- 2. Routing ---
+        routing_weights = F.softmax(self.router(x), dim=-1)  # (B, out_len, num_experts)
+        # --- 3. Expert Mixture (Dense Execution) ---
+        expert_outputs = torch.stack([expert(x) for expert in self.experts])  # (E, B, out_len, D)
+        return torch.einsum("ebsd, bse -> bsd", expert_outputs, routing_weights)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length after Conv1d downsampling (4x reduction)."""
+        # Conv1d with stride 2, kernel 3, padding 1: out = (in + 2*1 - 3) // 2 + 1 = (in - 1) // 2 + 1
+        # Applied twice for 4x total reduction
+        after_conv1 = (input_length + 2 * 1 - 3) // 2 + 1
+        return (after_conv1 + 2 * 1 - 3) // 2 + 1
+# =============================================================================
+# MoE Projector (Pure PyTorch with Shared Expert)
+# =============================================================================
+class MoEAudioProjector(nn.Module):
+    """MoE projector with shared expert (DeepSeek-style), pure PyTorch implementation.
+    By default, uses 4 sparse experts with top-2 routing plus a shared expert that processes all tokens.
+    No external dependencies (megablocks removed).
+    Architecture matches main branch: norm → experts(in_dim → hidden → out_dim)
+    """
+    def __init__(self, config):
+        """Initialize MoE projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, num_experts, num_experts_per_tok
+        """
+        super().__init__()
+        self.k = getattr(config, "projector_pool_stride", 4)
+        self.aux_coef = getattr(config, "router_aux_loss_coef", 0.01)
+        # Stability coefficients
+        self.router_z_loss_coef = getattr(
+            config, "router_z_loss_coef", 1e-4
+        )  # Prevents logit explosion
+        self.router_jitter_noise = getattr(
+            config, "router_jitter_noise", 0.01
+        )  # Prevents expert collapse
+        in_dim = config.encoder_dim * self.k
+        out_dim = config.llm_dim
+        # Expert hidden dim (default = output dim)
+        hidden_dim = getattr(config, "projector_hidden_dim", None) or out_dim
+        # Number of experts and top-k selection
+        self.num_experts = getattr(config, "num_experts", 4)
+        self.top_k = getattr(config, "num_experts_per_tok", 2)
+        # A. Normalize stacked input (like main branch SharedMoEBlock)
+        self.norm = LlamaRMSNorm(in_dim, eps=1e-6)
+        # B. Router (operates on stacked input)
+        self.router = nn.Linear(in_dim, self.num_experts, bias=False)
+        # C. Experts: simple 2-layer MLP (same as MLPAudioProjector)
+        self.experts = nn.ModuleList(
+            [SimpleAdapter(in_dim, hidden_dim, out_dim) for _ in range(self.num_experts)]
+        )
+        # D. Shared Expert (same architecture)
+        self.shared_expert = SimpleAdapter(in_dim, hidden_dim, out_dim)
+        # E. Initialize weights for stable training
+        self._init_weights()
+        self.last_aux_loss = torch.tensor(0.0)
+    def _init_weights(self):
+        """Initialize weights for stable training start."""
+        with torch.no_grad():
+            # Router: small weights -> near-uniform routing probabilities at init
+            nn.init.normal_(self.router.weight, mean=0.0, std=0.02)
+            # Experts: xavier for fc1, small for fc2 (output)
+            for expert in [self.shared_expert, *self.experts]:
+                nn.init.xavier_uniform_(expert.fc1.weight)
+                nn.init.normal_(expert.fc2.weight, mean=0.0, std=0.01)  # Small init
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length (matches MLP projector)."""
+        return (input_length - self.k) // self.k + 1
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project audio features using shared + sparse MoE.
+        Args:
+            x: Audio encoder output of shape [batch, seq_len, encoder_dim]
+        Returns:
+            Projected features of shape [batch, out_len, llm_dim]
+        """
+        # 1. Frame Stacking
+        batch, seq, dim = x.shape
+        out_len = (seq - self.k) // self.k + 1
+        x = x[:, : out_len * self.k, :]
+        x = x.reshape(batch, out_len, dim * self.k)
+        # 2. Normalize stacked input (like main branch SharedMoEBlock)
+        x = self.norm(x)
+        flat_x = x.view(-1, x.size(-1))  # [tokens, in_dim]
+        # 3. Shared Expert (compute first, creates output tensor)
+        output = self.shared_expert(flat_x)
+        # 4. Sparse Experts (in-place add to shared output)
+        self.last_aux_loss = self._forward_sparse(flat_x, output)
+        return output.view(batch, out_len, -1)
+    def _forward_sparse(self, x: torch.Tensor, output: torch.Tensor) -> torch.Tensor:
+        """Stability-hardened sparse expert dispatch (in-place add to output).
+        Args:
+            x: Flattened input of shape [tokens, dim]
+            output: Output tensor to add sparse expert results into (in-place)
+        Returns:
+            Auxiliary loss tensor
+        """
+        # A. Router Logic with Jitter
+        logits = self.router(x)
+        if self.training and self.router_jitter_noise > 0:
+            # Jitter: multiply by uniform noise (1-eps, 1+eps) to shake decision boundary
+            # Prevents router from getting stuck on one expert early in training
+            noise = torch.empty_like(logits).uniform_(
+                1.0 - self.router_jitter_noise, 1.0 + self.router_jitter_noise
+            )
+            logits = logits * noise
+        # Force float32 for softmax (bf16/fp16 exponentials can overflow)
+        probs = torch.softmax(logits, dim=-1, dtype=torch.float32).type_as(x)
+        # B. Top-K Selection
+        top_k_weights, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
+        # Normalize weights so they sum to 1.0
+        top_k_weights = top_k_weights / (top_k_weights.sum(dim=-1, keepdim=True) + 1e-6)
+        # C. Aux Loss + Z-Loss
+        aux_loss = torch.tensor(0.0, device=x.device)
+        if self.training:
+            # Load balancing loss (batch-size invariant)
+            prob_per_expert = probs.mean(0)  # [num_experts]
+            target = 1.0 / self.num_experts
+            balance_loss = (
+                self.aux_coef * ((prob_per_expert - target) ** 2).mean() * self.num_experts
+            )
+            # Z-loss: penalty on large logits to prevent softmax saturation
+            z_loss = self.router_z_loss_coef * torch.logsumexp(logits, dim=-1).pow(2).mean()
+            aux_loss = balance_loss + z_loss
+        # D. Dispatch Loop (in-place add to output)
+        for i, expert in enumerate(self.experts):
+            # Create boolean mask for tokens that selected Expert 'i'
+            mask = top_k_indices == i
+            if mask.any():
+                # token_idx = which tokens, k_idx = 1st or 2nd choice
+                token_idx, k_idx = torch.where(mask)
+                # Gather inputs and compute
+                expert_input = x[token_idx]
+                expert_output = expert(expert_input)
+                # Apply routing weight
+                weight = top_k_weights[token_idx, k_idx].unsqueeze(-1)
+                weighted_output = (expert_output * weight).type_as(output)
+                # Scatter back in-place. top-k indices are distinct per token, so token_idx
+                # has no duplicates here and each output row is updated at most once per expert.
+                output.index_add_(0, token_idx, weighted_output)
+        return aux_loss
+    def get_aux_loss(self) -> torch.Tensor:
+        """Return the auxiliary loss from the last forward pass (load balancing + router z-loss)."""
+        return self.last_aux_loss
+# =============================================================================
+# QFormer Projector (Granite-style)
+# =============================================================================
+class QFormerAudioProjector(nn.Module):
+    """
+    BLIP-2 QFormer projector with learnable queries.
+    Based on GraniteSpeechEncoderProjector - uses a QFormer model with learnable
+    query embeddings to compress and project audio encoder outputs. The audio
+    sequence is processed in windows and downsampled via cross-attention.
+    """
+    def __init__(self, config):
+        """Initialize QFormer projector.
+        Args:
+            config: ASRConfig with encoder_dim, llm_dim, qformer_* settings
+        """
+        super().__init__()
+        encoder_dim = config.encoder_dim
+        llm_dim = config.llm_dim
+        # Window and downsampling parameters (Granite defaults: window=15, downsample=5)
+        self.window_size = getattr(config, "qformer_window_size", 15)
+        self.downsample_rate = getattr(config, "downsample_rate", 5)
+        self.num_queries = self.window_size // self.downsample_rate
+        # QFormer hidden size (matches encoder for cross-attention)
+        qformer_hidden = getattr(config, "qformer_hidden_size", None) or encoder_dim
+        qformer_num_layers = getattr(config, "qformer_num_layers", 2)
+        qformer_num_heads = getattr(config, "qformer_num_heads", 16)
+        qformer_intermediate = getattr(config, "qformer_intermediate_size", None) or (
+            qformer_hidden * 4
+        )
+        # Learnable query embeddings (Granite uses std=1.0)
+        self.query = nn.Parameter(torch.zeros(1, self.num_queries, qformer_hidden))
+        self.query.data.normal_(mean=0.0, std=1.0)
+        # Optional projection if encoder dim != qformer hidden
+        if encoder_dim != qformer_hidden:
+            self.encoder_proj = nn.Linear(encoder_dim, qformer_hidden, bias=False)
+        else:
+            self.encoder_proj = None
+        # Configure QFormer to match Granite's exact config
+        qformer_config = Blip2QFormerConfig(
+            hidden_size=qformer_hidden,
+            num_hidden_layers=qformer_num_layers,
+            num_attention_heads=qformer_num_heads,
+            intermediate_size=qformer_intermediate,
+            encoder_hidden_size=qformer_hidden,
+            cross_attention_frequency=1,
+            # Granite-specific settings
+            hidden_act="gelu",
+            attention_probs_dropout_prob=0.1,
+            hidden_dropout_prob=0.1,
+            layer_norm_eps=1e-12,
+            initializer_range=0.02,
+        )
+        self.qformer = AutoModel.from_config(qformer_config)
+        # Final projection to LLM dimension (Granite uses bias=True)
+        self.linear = nn.Linear(qformer_hidden, llm_dim)
+    def get_output_length(self, input_length: int) -> int:
+        """Calculate output sequence length given input length."""
+        # QFormer uses window-based processing with num_queries per window
+        nblocks = math.ceil(input_length / self.window_size)
+        return nblocks * self.num_queries
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            hidden_states: [batch_size, seq_len, encoder_dim]
+        Returns:
+            projected: [batch_size, num_output_tokens, llm_dim]
+        """
+        batch_size, seq_len, dim = hidden_states.size()
+        # Match the dtype of the learnable query parameter
+        target_dtype = self.query.dtype
+        if hidden_states.dtype != target_dtype:
+            hidden_states = hidden_states.to(target_dtype)
+        # Optional encoder projection
+        if self.encoder_proj is not None:
+            hidden_states = self.encoder_proj(hidden_states)
+        # Compute number of windows and pad to fit
+        nblocks = math.ceil(seq_len / self.window_size)
+        pad = nblocks * self.window_size - seq_len
+        if pad > 0:
+            hidden_states = F.pad(hidden_states, (0, 0, 0, pad), "constant", 0)
+        # Reshape to process each window: [batch*nblocks, window_size, dim]
+        effective_batch = batch_size * nblocks
+        hidden_states = hidden_states.reshape(effective_batch, self.window_size, -1)
+        # Expand queries to match batch size
+        query_embeds = self.query.expand(effective_batch, -1, -1)
+        # QFormer cross-attention
+        query_output = self.qformer(
+            query_embeds=query_embeds,
+            encoder_hidden_states=hidden_states,
+            return_dict=True,
+        )
+        # Reshape back: [batch, nblocks * num_queries, hidden]
+        output_tokens = nblocks * self.num_queries
+        query_proj = query_output.last_hidden_state.view(batch_size, output_tokens, -1)
+        # Project to LLM dimension
+        return self.linear(query_proj)
+# =============================================================================
+# Projector Registry
+# =============================================================================
+PROJECTOR_CLASSES = {
+    "mlp": MLPAudioProjector,
+    "mosa": MOSAProjector,
+    "moe": MoEAudioProjector,
+    "qformer": QFormerAudioProjector,
+}
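
Reviewer note (not part of the commit): the three `get_output_length` implementations in this file can be compared with plain integer arithmetic. The sketch below restates them as standalone functions. The function names are hypothetical, and the defaults (`k=4`, `window_size=15`, `downsample_rate=5`) are the fallback values used in the diff, so a real config may differ.

```python
import math


def conv_downsample_len(n: int) -> int:
    """Two Conv1d layers with kernel 3, stride 2, padding 1 (MOSAProjector)."""
    for _ in range(2):
        n = (n + 2 * 1 - 3) // 2 + 1
    return n


def frame_stack_len(n: int, k: int = 4) -> int:
    """Stack k consecutive frames, dropping the remainder (MoEAudioProjector)."""
    return (n - k) // k + 1


def qformer_len(n: int, window_size: int = 15, downsample_rate: int = 5) -> int:
    """Pad to whole windows, then emit window_size // downsample_rate queries per window."""
    return math.ceil(n / window_size) * (window_size // downsample_rate)


# Print the three output lengths side by side for a few input lengths.
for n in (100, 101, 1500):
    print(n, conv_downsample_len(n), frame_stack_len(n), qformer_len(n))
```

The convolutional path rounds up while frame stacking rounds down, so the two 4x projectors can differ by one token whenever the input length is not a multiple of 4.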

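Reviewer note (not part of the commit): the per-token routing math in `_forward_sparse` can be traced by hand without PyTorch. The sketch below is a pure-Python illustration for a single token. `route_token` is a hypothetical helper, the `top_k=2` and `1e-6` epsilon values mirror the diff, and the training-time jitter noise is left out.

```python
import math


def route_token(logits, top_k=2, eps=1e-6):
    """Return (expert_indices, renormalized_weights, logsumexp) for one token."""
    # Numerically stable softmax over the router logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = [probs[i] for i in order]
    # Renormalize the kept weights so they sum to (almost exactly) 1.
    denom = sum(kept) + eps
    weights = [p / denom for p in kept]
    # logsumexp of the logits; the z-loss in the diff squares and averages this value.
    lse = m + math.log(total)
    return order, weights, lse


indices, weights, lse = route_token([2.0, 0.0, 1.0, -1.0])
print(indices, [round(w, 4) for w in weights], round(lse, 4))
```

Because the selected indices for one token are always distinct, each sparse expert sees a given token at most once, which is what keeps the per-expert scatter step free of duplicate row indices.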
tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": "<|im_start|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<audio>"
+  ],
+  "is_local": false,
+  "local_files_only": false,
+  "model_max_length": 8192,
+  "pad_token": "<|im_end|>",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
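
Reviewer note (not part of the commit): this config assigns `<|im_end|>` to both `eos_token` and `pad_token`. A minimal sketch of a check that surfaces such shared roles is shown below. `shared_special_tokens` is a hypothetical helper, and the inline JSON copies only the relevant fields from the diff.

```python
import json
from collections import defaultdict

# Subset of the tokenizer_config.json fields shown in the diff.
CONFIG_TEXT = """
{
  "bos_token": "<|im_start|>",
  "eos_token": "<|im_end|>",
  "pad_token": "<|im_end|>",
  "unk_token": "<|endoftext|>"
}
"""


def shared_special_tokens(config: dict) -> dict:
    """Map each token string to the roles using it, keeping only shared ones."""
    roles = defaultdict(list)
    for key, value in config.items():
        if key.endswith("_token") and isinstance(value, str):
            roles[value].append(key)
    return {tok: sorted(names) for tok, names in roles.items() if len(names) > 1}


print(shared_special_tokens(json.loads(CONFIG_TEXT)))
```

When padding and end-of-sequence share a token, training code that masks padding by token id may also mask the real end-of-sequence position. Whether that matters here depends on the data collator, which is not shown in this diff.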