# AutoencoderOobleck
The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms.
The abstract from the paper is:
*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*
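The autoencoder's overall temporal compression follows directly from its configuration: the encoder downsamples by the product of `downsampling_ratios`. A quick sketch using the documented default values (pure arithmetic, no model needed):

```python
import math

# Default AutoencoderOobleck configuration values.
downsampling_ratios = [2, 4, 4, 8, 8]
sampling_rate = 44100   # Hz
latent_channels = 64    # decoder_input_channels, i.e. the latent dimension

# The encoder's total temporal compression is the product of the
# per-stage downsampling ratios.
total_downsampling = math.prod(downsampling_ratios)
print(total_downsampling)  # 2048

# One second of 44.1 kHz audio maps to ~21.5 latent frames, each a
# 64-dimensional vector.
latent_frames_per_second = sampling_rate / total_downsampling
print(round(latent_frames_per_second, 2))  # 21.53
```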
## AutoencoderOobleck[[diffusers.AutoencoderOobleck]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.AutoencoderOobleck</name><anchor>diffusers.AutoencoderOobleck</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/autoencoders/autoencoder_oobleck.py#L295</source><parameters>[{"name": "encoder_hidden_size", "val": " = 128"}, {"name": "downsampling_ratios", "val": " = [2, 4, 4, 8, 8]"}, {"name": "channel_multiples", "val": " = [1, 2, 4, 8, 16]"}, {"name": "decoder_channels", "val": " = 128"}, {"name": "decoder_input_channels", "val": " = 64"}, {"name": "audio_channels", "val": " = 2"}, {"name": "sampling_rate", "val": " = 44100"}]</parameters><paramsdesc>- **encoder_hidden_size** (`int`, *optional*, defaults to 128) --
Intermediate representation dimension for the encoder.
- **downsampling_ratios** (`List[int]`, *optional*, defaults to `[2, 4, 4, 8, 8]`) --
Ratios for downsampling in the encoder. These are used in reverse order for upsampling in the decoder.
- **channel_multiples** (`List[int]`, *optional*, defaults to `[1, 2, 4, 8, 16]`) --
Multiples used to determine the hidden sizes of the hidden layers.
- **decoder_channels** (`int`, *optional*, defaults to 128) --
Intermediate representation dimension for the decoder.
- **decoder_input_channels** (`int`, *optional*, defaults to 64) --
Input dimension for the decoder. Corresponds to the latent dimension.
- **audio_channels** (`int`, *optional*, defaults to 2) --
Number of channels in the audio data. Either 1 for mono or 2 for stereo.
- **sampling_rate** (`int`, *optional*, defaults to 44100) --
The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).</paramsdesc><paramgroups>0</paramgroups></docstring>
An autoencoder for encoding waveforms into latents and decoding latent representations into waveforms. First
introduced in Stable Audio.
This model inherits from [ModelMixin](/docs/diffusers/pr_12595/en/api/models/overview#diffusers.ModelMixin). Check the superclass documentation for its generic methods implemented
for all models (such as downloading or saving).
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>decode</name><anchor>diffusers.AutoencoderOobleck.decode</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/utils/accelerate_utils.py#L43</source><parameters>[{"name": "*args", "val": ""}, {"name": "**kwargs", "val": ""}]</parameters></docstring>
</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>encode</name><anchor>diffusers.AutoencoderOobleck.encode</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/utils/accelerate_utils.py#L43</source><parameters>[{"name": "*args", "val": ""}, {"name": "**kwargs", "val": ""}]</parameters></docstring>
</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>forward</name><anchor>diffusers.AutoencoderOobleck.forward</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/autoencoders/autoencoder_oobleck.py#L426</source><parameters>[{"name": "sample", "val": ": Tensor"}, {"name": "sample_posterior", "val": ": bool = False"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "generator", "val": ": typing.Optional[torch._C.Generator] = None"}]</parameters><paramsdesc>- **sample** (`torch.Tensor`) -- Input sample.
- **sample_posterior** (`bool`, *optional*, defaults to `False`) --
Whether to sample from the posterior.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
Whether or not to return an `OobleckDecoderOutput` instead of a plain tuple.</paramsdesc><paramgroups>0</paramgroups></docstring>
</div></div>
## OobleckDecoderOutput[[diffusers.models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput</name><anchor>diffusers.models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/autoencoders/autoencoder_oobleck.py#L203</source><parameters>[{"name": "sample", "val": ": Tensor"}]</parameters><paramsdesc>- **sample** (`torch.Tensor` of shape `(batch_size, audio_channels, sequence_length)`) --
The decoded output sample from the last layer of the model.</paramsdesc><paramgroups>0</paramgroups></docstring>
Output of the decoding method.
</div>
## AutoencoderOobleckOutput[[diffusers.models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput]]
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput</name><anchor>diffusers.models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/autoencoders/autoencoder_oobleck.py#L188</source><parameters>[{"name": "latent_dist", "val": ": OobleckDiagonalGaussianDistribution"}]</parameters><paramsdesc>- **latent_dist** (`OobleckDiagonalGaussianDistribution`) --
Encoded outputs of `Encoder` represented as the mean and standard deviation of
`OobleckDiagonalGaussianDistribution`. `OobleckDiagonalGaussianDistribution` allows for sampling latents
from the distribution.</paramsdesc><paramgroups>0</paramgroups></docstring>
Output of the AutoencoderOobleck encoding method.
</div>
<EditOnGithub source="https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/models/autoencoder_oobleck.md" />
