Transformers documentation
VibeVoice ASR
This model was released on 2026-01-26 and added to Hugging Face Transformers on 2026-03-02.
Overview
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice’s acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. See the technical report for more details.
The model checkpoint is available at: microsoft/VibeVoice-ASR-HF
Highlights:
🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to 60 minutes of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.
🌍 Multilingual & Code-Switching Support: It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found here.
This model was contributed by Eric Bezzam.
Usage
The model supports various automatic speech recognition functionalities.
Speaker-timestamped transcription
A notable feature of VibeVoice ASR is its ability to transcribe multi-speaker content, denoting who spoke and when.
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
print(f"Model loaded on {model.device} with dtype {model.dtype}")
# Prepare inputs using `apply_transcription_request`
inputs = processor.apply_transcription_request(
audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
).to(model.device, model.dtype)
# Apply model
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids)[0]
print("\n" + "=" * 60)
print("RAW OUTPUT")
print("=" * 60)
print(transcription)
transcription = processor.decode(generated_ids, return_format="parsed")[0]
print("\n" + "=" * 60)
print("TRANSCRIPTION (list of dicts)")
print("=" * 60)
for speaker_transcription in transcription:
    print(speaker_transcription)
# Remove speaker labels, only get raw transcription
transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
print("\n" + "=" * 60)
print("TRANSCRIPTION ONLY")
print("=" * 60)
print(transcription)
"""
============================================================
RAW OUTPUT
============================================================
<|im_start|>assistant
[{"Start":0,"End":15.43,"Speaker":0,"Content":"Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me."},{"Start":15.43,"End":21.05,"Speaker":1,"Content":"Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings."},{"Start":21.05,"End":31.66,"Speaker":0,"Content":"Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible."},{"Start":31.66,"End":40.93,"Speaker":1,"Content":"Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."}]<|im_end|>
<|endoftext|>
============================================================
TRANSCRIPTION (list of dicts)
============================================================
{'Start': 0, 'End': 15.43, 'Speaker': 0, 'Content': "Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me."}
{'Start': 15.43, 'End': 21.05, 'Speaker': 1, 'Content': "Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings."}
{'Start': 21.05, 'End': 31.66, 'Speaker': 0, 'Content': "Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible."}
{'Start': 31.66, 'End': 40.93, 'Speaker': 1, 'Content': "Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."}
============================================================
TRANSCRIPTION ONLY
============================================================
Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it.
"""

The VibeVoice ASR model is trained to generate a string that resembles a JSON structure. The flag return_format="parsed" tries to parse the generated output into a list of dicts, while return_format="transcription_only" tries to extract only the transcribed text. If parsing fails, the generated output is returned as-is.
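If you need custom post-processing, the JSON-like raw string can also be parsed by hand. Below is a minimal sketch, assuming chat-template markers like those in the raw output above; parse_vibevoice_output is a hypothetical helper for illustration, not part of the library:

```python
import json

def parse_vibevoice_output(raw):
    # Hypothetical best-effort parser: strip chat-template markers,
    # then try to load the remainder as JSON. Falls back to returning
    # the raw string if parsing fails, mirroring the behavior described above.
    cleaned = raw
    for marker in ("<|im_start|>assistant", "<|im_end|>", "<|endoftext|>"):
        cleaned = cleaned.replace(marker, "")
    cleaned = cleaned.strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return raw

raw = '<|im_start|>assistant\n[{"Start":0,"End":1.5,"Speaker":0,"Content":"Hi"}]<|im_end|>'
print(parse_vibevoice_output(raw))
```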
Providing context
It is also possible to provide context. This can be useful if certain words cannot be transcribed correctly, such as proper nouns.
Below we transcribe an audio where the speaker (with a German accent) talks about VibeVoice, comparing with and without the context “About VibeVoice”.
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
print(f"Model loaded on {model.device} with dtype {model.dtype}")
# Without context
inputs = processor.apply_transcription_request(
audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
print(f"WITHOUT CONTEXT: {transcription}")
# With context
inputs = processor.apply_transcription_request(
audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
prompt="About VibeVoice",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
print(f"WITH CONTEXT : {transcription}")
"""
WITHOUT CONTEXT: Revevoices is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio.
WITH CONTEXT : VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio.
"""

Batch inference

Batch inference is possible by passing a list of audio inputs and, optionally, a list of prompts. The number of audio inputs and prompts must match; set a prompt entry to None if it is not needed for a given audio.
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
model_id = "microsoft/VibeVoice-ASR-HF"
audio = [
"https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
"https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav"
]
prompts = ["About VibeVoice", None]
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
print(f"Model loaded on {model.device} with dtype {model.dtype}")
inputs = processor.apply_transcription_request(audio, prompt=prompts).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)

Adjusting the tokenizer chunk size (e.g. if out of memory)

A key feature of VibeVoice ASR is that it can transcribe up to 60 minutes of continuous audio. This is done by chunking audio into 60-second segments (1440000 samples at 24kHz) and caching the convolution states between segments.

However, if 60-second chunks are too large for your device, you can adjust the acoustic_tokenizer_chunk_size argument passed to generate. Note that it should be a multiple of the hop length (3200 samples for the original acoustic tokenizer).
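As a quick sanity check, a valid chunk size for a target duration can be derived by rounding down to a multiple of the hop length. chunk_size_for_seconds below is a hypothetical helper, using only the sample rate (24kHz) and hop length (3200 samples) stated above:

```python
# Assumed constants from the text above: 24 kHz audio, hop length of 3200 samples.
SAMPLE_RATE = 24_000
HOP_LENGTH = 3_200

def chunk_size_for_seconds(seconds):
    # Convert the target duration to samples, then round down to the
    # nearest multiple of the hop length (at least one hop).
    samples = int(seconds * SAMPLE_RATE)
    return max(HOP_LENGTH, (samples // HOP_LENGTH) * HOP_LENGTH)

print(chunk_size_for_seconds(60))   # 1440000, the default (60s @ 24kHz)
print(chunk_size_for_seconds(2.5))  # 57600
```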
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
acoustic_tokenizer_chunk_size = 64000 # default is 1440000 (60s @ 24kHz)
model_id = "microsoft/VibeVoice-ASR-HF"
audio = [
"https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
"https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav"
]
prompts = ["About VibeVoice", None]
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
print(f"Model loaded on {model.device} with dtype {model.dtype}")
inputs = processor.apply_transcription_request(audio, prompt=prompts).to(model.device, model.dtype)
output_ids = model.generate(**inputs, acoustic_tokenizer_chunk_size=acoustic_tokenizer_chunk_size)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)

Chat template

VibeVoice ASR also accepts chat-template inputs (apply_transcription_request is a convenience wrapper around apply_chat_template):
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
chat_template = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "About VibeVoice"},
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
},
],
}
],
]
inputs = processor.apply_chat_template(
chat_template,
tokenize=True,
return_dict=True,
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)

Training

VibeVoice ASR can be trained with the loss returned by the model.
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()
# Prepare batch of 2
# -- NOTE: the original model is trained to output transcription, speaker ID, and timestamps in JSON-like format. Below we are only using the transcription text as the label
chat_template = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio."},
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
},
],
}
],
[
{
"role": "user",
"content": [
{"type": "text", "text": "Hello everyone and welcome to the VibeVoice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."},
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
},
],
}
],
]
inputs = processor.apply_chat_template(
chat_template,
tokenize=True,
return_dict=True,
output_labels=True,
).to(model.device, model.dtype)
loss = model(**inputs).loss
print("Loss:", loss.item())
loss.backward()

Torch compile

The model can be compiled with torch.compile for faster inference and training.
import time
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
model_id = "microsoft/VibeVoice-ASR-HF"
num_warmup = 5
num_runs = 20
# Load processor + model
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")
# Prepare static inputs
chat_template = [
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio.",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
},
],
}
],
] * 4 # batch size 4
inputs = processor.apply_chat_template(
chat_template,
tokenize=True,
return_dict=True,
).to("cuda", torch.bfloat16)
# Benchmark without compile
print("Warming up without compile...")
with torch.no_grad():
    for _ in range(num_warmup):
        _ = model(**inputs)
torch.cuda.synchronize()
print("\nBenchmarking without torch.compile...")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    for _ in range(num_runs):
        _ = model(**inputs)
torch.cuda.synchronize()
no_compile_time = (time.time() - start) / num_runs
print(f"Average time without compile: {no_compile_time:.4f}s")
# Benchmark with compile
print("\nCompiling model...")
model = torch.compile(model)
print("Warming up with compile (includes graph capture)...")
with torch.no_grad():
    for _ in range(num_warmup):
        _ = model(**inputs)
torch.cuda.synchronize()
print("\nBenchmarking with torch.compile...")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    for _ in range(num_runs):
        _ = model(**inputs)
torch.cuda.synchronize()
compile_time = (time.time() - start) / num_runs
print(f"Average time with compile: {compile_time:.4f}s")
speedup = no_compile_time / compile_time
print(f"\nSpeedup: {speedup:.2f}x")

Pipeline usage

The model can also be used through the pipeline API, but the raw output must be parsed manually, e.g. with the processor's extract_speaker_dict and extract_transcription helpers.
from transformers import pipeline
model_id = "microsoft/VibeVoice-ASR-HF"
pipe = pipeline("any-to-any", model=model_id, device_map="auto")
chat_template = [
{
"role": "user",
"content": [
{"type": "text", "text": "About VibeVoice"},
{
"type": "audio",
"path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
},
],
}
]
outputs = pipe(text=chat_template, return_full_text=False)
print("\n" + "=" * 60)
print("RAW PIPELINE OUTPUT")
print("=" * 60)
print(outputs[0]["generated_text"])
print("\n" + "=" * 60)
print("DICT OUTPUT")
print("=" * 60)
dict_output = pipe.processor.extract_speaker_dict(outputs[0]["generated_text"])
print(dict_output)
print("\n" + "=" * 60)
print("TRANSCRIPT OUTPUT")
print("=" * 60)
transcription = pipe.processor.extract_transcription(outputs[0]["generated_text"])
print(transcription)

VibeVoiceAsrConfig
class transformers.VibeVoiceAsrConfig
< source >( acoustic_tokenizer_encoder_config = None semantic_tokenizer_encoder_config = None text_config = None audio_token_id = 151648 audio_bos_token_id = 151646 audio_eos_token_id = 151647 acoustic_tokenizer_chunk_size = 1440000 **kwargs )
Parameters
- acoustic_tokenizer_encoder_config (Union[VibeVoiceAcousticTokenizerConfig, dict], optional) — The config object or dictionary of the acoustic tokenizer. This tokenizer extracts acoustic features from audio.
- semantic_tokenizer_encoder_config (Union[VibeVoiceAcousticTokenizerConfig, dict], optional) — The config object or dictionary of the semantic tokenizer. This tokenizer extracts semantic features from audio.
- text_config (Union[AutoConfig, dict], optional, defaults to Qwen2Config) — The config object or dictionary of the text backbone (language model).
- audio_token_id (int, optional, defaults to 151648) — The audio token index to encode the audio prompt.
- audio_bos_token_id (int, optional, defaults to 151646) — The audio beginning-of-sequence token index.
- audio_eos_token_id (int, optional, defaults to 151647) — The audio end-of-sequence token index.
- acoustic_tokenizer_chunk_size (int, optional, defaults to 1440000) — The chunk size (in number of samples) to use when tokenizing audio inputs. The default corresponds to 60 seconds at 24kHz.
This is the configuration class to store the configuration of a VibeVoiceAsrForConditionalGeneration. It is used to instantiate a VibeVoice ASR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of Microsoft’s VibeVoice ASR architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import VibeVoiceAsrForConditionalGeneration, VibeVoiceAsrConfig, VibeVoiceAcousticTokenizerEncoderConfig, Qwen2Config
>>> # Initializing VibeVoice acoustic and semantic encoder configs
>>> acoustic_config = VibeVoiceAcousticTokenizerEncoderConfig()
>>> semantic_config = VibeVoiceAcousticTokenizerEncoderConfig(hidden_size=128)
>>> # Initializing a Qwen2 config
>>> text_config = Qwen2Config()
>>> # Initializing a VibeVoice ASR configuration
>>> configuration = VibeVoiceAsrConfig(acoustic_config, semantic_config, text_config)
>>> # Initializing a model from the vibevoice_asr style configuration
>>> model = VibeVoiceAsrForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

VibeVoiceAsrProcessor
class transformers.VibeVoiceAsrProcessor
< source >( feature_extractor tokenizer chat_template = None audio_token = '<|box_start|>' audio_bos_token = '<|object_ref_start|>' audio_eos_token = '<|object_ref_end|>' audio_duration_token = '<|AUDIO_DURATION|>' )
Parameters
- feature_extractor (VibeVoiceAcousticTokenizerFeatureExtractor) — The feature extractor for audio processing.
- tokenizer (Qwen2TokenizerFast) — The tokenizer for text processing.
- chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
- audio_token (str, optional, defaults to "<|box_start|>") — The audio token placeholder to use in the chat template.
- audio_bos_token (str, optional, defaults to "<|object_ref_start|>") — The audio beginning-of-sequence token placeholder to use in the chat template.
- audio_eos_token (str, optional, defaults to "<|object_ref_end|>") — The audio end-of-sequence token placeholder to use in the chat template.
- audio_duration_token (str, optional, defaults to "<|AUDIO_DURATION|>") — The audio duration token placeholder to use in the chat template.
Constructs a VibeVoice ASR processor which wraps VibeVoiceAcousticTokenizerFeatureExtractor and Qwen2TokenizerFast into a single processor that inherits both the audio feature extraction and tokenizer functionalities.
See __call__() for more information.
__call__
< source >( text: str | list[str] audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor']] output_labels: bool | None = False **kwargs: typing_extensions.Unpack[transformers.models.vibevoice_asr.processing_vibevoice_asr.VibeVoiceAsrProcessorKwargs] ) → BatchFeature
Parameters
- text (str, List[str]) — The input text(s) to process, typically prepared by apply_chat_template with audio token placeholders.
- audio (List[Union[str, np.ndarray]]) — Audio samples for transcription. Should match the number of audio token placeholders in text.
- output_labels (bool, optional, defaults to False) — Whether to return labels for training.
- **kwargs — Additional keyword arguments passed to the tokenizer and feature extractor.
Returns
A dictionary with tokenized text (input_ids, attention_mask) and
audio features (input_features, input_features_mask).
Main method to process text inputs with optional audio samples for ASR.
This method processes text inputs (typically prepared by apply_chat_template) and optional audio samples for transcription. It replaces the audio duration placeholder and expands audio token placeholders based on the actual audio length.
apply_transcription_request
< source >( audio: typing.Union[str, list[str], numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor']] prompt: str | list[str] | None = None **kwargs: typing_extensions.Unpack[transformers.models.vibevoice_asr.processing_vibevoice_asr.VibeVoiceAsrProcessorKwargs] ) → BatchFeature
Parameters
- audio (str, list[str], np.ndarray, torch.Tensor, list[np.ndarray], list[torch.Tensor]) — Audio to transcribe. Strings are interpreted as local paths or URLs and will be loaded automatically by the chat template loader; NumPy arrays and PyTorch tensors are forwarded directly.
- prompt (str or list[str], optional) — Custom prompt(s) to include in the user turn as extra context. A list must be the same length as the batch. When None, no additional context is provided.
- **kwargs — Additional keyword arguments forwarded to apply_chat_template() (for example text_kwargs, audio_kwargs, …).
Returns
Processor outputs ready to be passed to VibeVoiceAsrForConditionalGeneration.generate().
Prepare inputs for automatic speech recognition without manually writing the chat template.
decode
< source >( *args return_format = 'raw' **kwargs )
Parameters
- return_format (str, optional, defaults to "raw") — Options are:
  - "raw": Return a list of raw decoded strings from the tokenizer, without any parsing.
  - "parsed": Return a list of lists of parsed dictionary objects for each speaker utterance with timestamps.
  - "transcription_only": Return a list of extracted transcription strings.

skip_special_tokens is automatically enforced (hard-set) to True for "parsed" and "transcription_only".
Forward arguments to decode() and optionally parse the dict-like output.
VibeVoiceAsrForConditionalGeneration
class transformers.VibeVoiceAsrForConditionalGeneration
< source >( config: VibeVoiceAsrConfig )
Parameters
- config (VibeVoiceAsrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The VibeVoice ASR model with pre-trained acoustic tokenizers and a language model.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None input_values: torch.FloatTensor | None = None padding_mask: torch.BoolTensor | None = None acoustic_tokenizer_chunk_size: int | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- input_values (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Float values of the input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], numpy.ndarray or torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See VibeVoiceAsrProcessor.__call__() for details.
- padding_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing operations on padding feature indices.
- acoustic_tokenizer_chunk_size (int, optional) — Size of audio chunks processed by the acoustic and semantic tokenizers. Defaults to config.acoustic_tokenizer_chunk_size, but can be modified to fit the available memory.
Returns
CausalLMOutputWithPast or tuple(torch.FloatTensor)
A CausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (VibeVoiceAsrConfig) and inputs.
The VibeVoiceAsrForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from transformers import VibeVoiceAsrForConditionalGeneration, AutoProcessor
>>> model_id = "microsoft/VibeVoice-ASR-HF"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, dtype="auto", device_map="auto")
>>> inputs = processor.apply_transcription_request("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
>>> inputs = inputs.to(model.device, dtype=model.dtype)
>>> outputs = model.generate(**inputs)
>>> decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1] :], skip_special_tokens=True)
>>> print(decoded_outputs)

get_audio_features
< source >( input_values: FloatTensor padding_mask: torch.BoolTensor | None = None acoustic_tokenizer_chunk_size: int | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- input_values (torch.FloatTensor of shape (batch_size, num_samples)) — Input audio tensor. Audio should be sampled at 24kHz.
- padding_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing operations on padding feature indices.
- acoustic_tokenizer_chunk_size (int, optional) — Size of audio chunks to process at once through the tokenizers. Defaults to config.acoustic_tokenizer_chunk_size, but can be modified to fit the available memory.
Returns
BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (VibeVoiceAsrConfig) and inputs.
Encode audio into embeddings that can be used by the language model.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("microsoft/VibeVoice-ASR-HF")
>>> model = VibeVoiceAsrForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR-HF")
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
...
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
...