EasyAnimate
EasyAnimate by Alibaba PAI.
The description from its GitHub page: EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.
This pipeline was contributed by bubbliiiing. The original codebase can be found here. The original weights can be found under hf.co/alibaba-pai.
There are two official EasyAnimate checkpoints for text-to-video and video-to-video.
| checkpoints | recommended inference dtype |
|---|---|
| alibaba-pai/EasyAnimateV5.1-12b-zh | torch.float16 |
| alibaba-pai/EasyAnimateV5.1-12b-zh-InP | torch.float16 |
There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.
| checkpoints | recommended inference dtype |
|---|---|
| alibaba-pai/EasyAnimateV5.1-12b-zh-InP | torch.float16 |
There are two official EasyAnimate checkpoints available for control-to-video.
| checkpoints | recommended inference dtype |
|---|---|
| alibaba-pai/EasyAnimateV5.1-12b-zh-Control | torch.float16 |
| alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera | torch.float16 |
For the EasyAnimateV5.1 series:
- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can each vary from 256 to 1024.
- Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.
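The limits listed above can be checked before launching a long generation run. The following is a minimal sketch, not part of the library: `check_request` is a hypothetical helper that encodes only the ranges stated in this section (side lengths from 256 to 1024, 1 to 49 frames) and reports the clip length at the recommended 8 FPS. The pipeline itself may enforce further constraints that are not covered here.

```python
FPS = 8                          # export frame rate recommended in this section
MIN_SIDE, MAX_SIDE = 256, 1024   # documented range for width and height
MAX_FRAMES = 49                  # documented upper bound on num_frames


def check_request(height: int, width: int, num_frames: int) -> float:
    """Hypothetical helper: validate a request against the documented ranges.

    Returns the clip duration in seconds when exported at 8 FPS.
    """
    for name, side in (("height", height), ("width", width)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            raise ValueError(f"{name}={side} is outside {MIN_SIDE}..{MAX_SIDE}")
    if not 1 <= num_frames <= MAX_FRAMES:
        raise ValueError(f"num_frames={num_frames} is outside 1..{MAX_FRAMES}")
    return num_frames / FPS


print(check_request(512, 512, 49))  # 6.125
```

A full 49-frame clip at 8 FPS lasts 6.125 seconds, which matches the "approximately 6 seconds" quoted from the project description.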
Quantization
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized EasyAnimatePipeline for inference with bitsandbytes.
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
    from diffusers.utils import export_to_video

    # Load only the transformer with 8-bit weights.
    quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
    transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
        "alibaba-pai/EasyAnimateV5.1-12b-zh",
        subfolder="transformer",
        quantization_config=quant_config,
        torch_dtype=torch.float16,
    )

    # Build the pipeline around the quantized transformer.
    pipeline = EasyAnimatePipeline.from_pretrained(
        "alibaba-pai/EasyAnimateV5.1-12b-zh",
        transformer=transformer_8bit,
        torch_dtype=torch.float16,
        device_map="balanced",
    )

    prompt = "A cat walks on the grass, realistic style."
    negative_prompt = "bad detailed"
    video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
    export_to_video(video, "cat.mp4", fps=8)
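As a rough back-of-the-envelope illustration of why lower-precision weights reduce memory, the sketch below estimates weight storage alone. It assumes that "12b" in the checkpoint name means about 12 billion parameters and that all of them are stored at the given precision. Both are simplifying assumptions: the example above quantizes only the transformer, and the estimate ignores activations, the text encoder, the VAE, and any overhead added by the quantization backend.

```python
# Assumption: "12b" in the checkpoint name means roughly 12 billion parameters.
PARAMS = 12e9

# Storage cost per parameter for the two precisions discussed in this section.
BYTES_PER_PARAM = {"float16": 2, "int8": 1}


def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GiB, ignoring all runtime overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(PARAMS, dtype):.1f} GiB")
# float16: 22.4 GiB
# int8: 11.2 GiB
```

Under these assumptions, 8-bit storage halves the weight footprint. Actual savings depend on which components are quantized.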
EasyAnimatePipeline
class diffusers.EasyAnimatePipeline
Pipeline for text-to-video generation using EasyAnimate.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
EasyAnimate V5.1 uses a single text encoder, Qwen2-VL.
__call__
Source: https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L524
__call__(
    prompt: Union[str, List[str]] = None,
    num_frames: Optional[int] = 49,
    height: Optional[int] = 512,
    width: Optional[int] = 512,
    num_inference_steps: Optional[int] = 50,
    guidance_scale: Optional[float] = 5.0,
    negative_prompt: Union[str, List[str], None] = None,
    num_images_per_prompt: Optional[int] = 1,
    eta: Optional[float] = 0.0,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    latents: Optional[torch.Tensor] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    timesteps: Optional[List[int]] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    prompt_attention_mask: Optional[torch.Tensor] = None,
    negative_prompt_attention_mask: Optional[torch.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    callback_on_step_end: Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, None] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    guidance_rescale: float = 0.0,
) → EasyAnimatePipelineOutput or tuple
Generates images or video using the EasyAnimate pipeline based on the provided prompts.
Examples:

    >>> import torch
    >>> from diffusers import EasyAnimatePipeline
    >>> from diffusers.utils import export_to_video

    >>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
    >>> pipe = EasyAnimatePipeline.from_pretrained(
    ...     "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
    ... ).to("cuda")
    >>> prompt = (
    ...     "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    ...     "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    ...     "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    ...     "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    ...     "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    ...     "atmosphere of this unique musical performance."
    ... )
    >>> sample_size = (512, 512)
    >>> video = pipe(
    ...     prompt=prompt,
    ...     guidance_scale=6,
    ...     negative_prompt="bad detailed",
    ...     height=sample_size[0],
    ...     width=sample_size[1],
    ...     num_inference_steps=50,
    ... ).frames[0]
    >>> export_to_video(video, "output.mp4", fps=8)
prompt (str or List[str], optional):
Text prompts to guide the image or video generation. If not provided, use prompt_embeds instead.
num_frames (int, optional, defaults to 49):
Length of the generated video (in frames).
height (int, optional, defaults to 512):
Height of the generated video frames in pixels.
width (int, optional, defaults to 512):
Width of the generated video frames in pixels.
num_inference_steps (int, optional, defaults to 50):
Number of denoising steps during generation. More steps generally yield higher quality images but slow
down inference.
guidance_scale (float, optional, defaults to 5.0):
Encourages the model to align outputs with prompts. A higher value may decrease image quality.
negative_prompt (str or List[str], optional):
Prompts indicating what to exclude in generation. If not specified, use negative_prompt_embeds.
num_images_per_prompt (int, optional, defaults to 1):
Number of images to generate for each prompt.
eta (float, optional, defaults to 0.0):
Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
generator (torch.Generator or List[torch.Generator], optional):
A generator to ensure reproducibility in image generation.
latents (torch.Tensor, optional):
Predefined latent tensors to condition generation.
prompt_embeds (torch.Tensor, optional):
Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
negative_prompt_embeds (torch.Tensor, optional):
Embeddings for negative prompts. Overrides string inputs if defined.
prompt_attention_mask (torch.Tensor, optional):
Attention mask for the primary prompt embeddings.
negative_prompt_attention_mask (torch.Tensor, optional):
Attention mask for negative prompt embeddings.
output_type (str, optional, defaults to "pil"):
Format of the generated output, either PIL images or a NumPy array.
return_dict (bool, optional, defaults to True):
If True, returns a structured output. Otherwise returns a simple tuple.
callback_on_step_end (Callable, optional):
Functions called at the end of each denoising step.
callback_on_step_end_tensor_inputs (List[str], optional):
Tensor names to be included in callback function calls.
guidance_rescale (float, optional, defaults to 0.0):
Adjusts noise levels based on guidance scale.
original_size (Tuple[int, int], optional, defaults to (1024, 1024)):
Original dimensions of the output.
target_size (Tuple[int, int], optional):
Desired output dimensions for calculations.
crops_coords_top_left (Tuple[int, int], optional, defaults to (0, 0)):
Coordinates for cropping.
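The callback_on_step_end and callback_on_step_end_tensor_inputs arguments described above can be illustrated with a small progress logger. This is a sketch under one stated assumption: that the pipeline invokes the callback as `callback(pipe, step, timestep, callback_kwargs)` and expects the dictionary back, which is the convention used by other Diffusers pipelines and is not confirmed here for EasyAnimate. `make_progress_callback` is a hypothetical helper, and the dry run at the end uses placeholder values instead of a real pipeline.

```python
from typing import Any, Callable, Dict, List


def make_progress_callback(log: List[str]) -> Callable[..., Dict[str, Any]]:
    """Hypothetical helper: build a callback that records each finished step."""

    def on_step_end(pipe: Any, step: int, timestep: Any,
                    callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        # callback_kwargs is assumed to hold the tensors named in
        # callback_on_step_end_tensor_inputs (by default only "latents").
        log.append(f"step {step} (t={timestep}): {sorted(callback_kwargs)}")
        # Return the dictionary unchanged so generation is not altered.
        return callback_kwargs

    return on_step_end


# Dry run with placeholder values, no pipeline required.
log: List[str] = []
callback = make_progress_callback(log)
callback(None, 0, 999, {"latents": "placeholder"})
print(log[0])  # step 0 (t=999): ['latents']
```

With a loaded pipeline, the callback would be passed as `pipe(..., callback_on_step_end=callback, callback_on_step_end_tensor_inputs=["latents"])`.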
Parameters:
vae (AutoencoderKLMagvit) : Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations.
text_encoder (Optional[transformers.Qwen2VLForConditionalGeneration, transformers.BertModel]) : The text encoder. EasyAnimate V5.1 uses Qwen2-VL.
tokenizer (Optional[transformers.Qwen2Tokenizer, transformers.BertTokenizer]) : A Qwen2Tokenizer or BertTokenizer to tokenize text.
transformer (EasyAnimateTransformer3DModel) : The EasyAnimate model designed by EasyAnimate Team.
scheduler (FlowMatchEulerDiscreteScheduler) : A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents.
Returns:
EasyAnimatePipelineOutput or tuple
If return_dict is True, an EasyAnimatePipelineOutput is returned. Otherwise a tuple is returned whose first element contains the generated video frames.
encode_prompt
Encodes the prompt into text encoder hidden states.
Parameters:
prompt (str or List[str], optional) : prompt to be encoded
device (torch.device) : torch device
dtype (torch.dtype) : torch dtype
num_images_per_prompt (int) : number of images that should be generated per prompt
do_classifier_free_guidance (bool) : whether to use classifier free guidance or not
negative_prompt (str or List[str], optional) : The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) : Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
prompt_attention_mask (torch.Tensor, optional) : Attention mask for the prompt. Required when prompt_embeds is passed directly.
negative_prompt_attention_mask (torch.Tensor, optional) : Attention mask for the negative prompt. Required when negative_prompt_embeds is passed directly.
max_sequence_length (int, optional) : maximum sequence length to use for the prompt.
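To show what do_classifier_free_guidance and the negative prompt contribute, the sketch below applies the standard classifier-free guidance combination to two toy predictions. It is an illustration of the general technique, not code taken from the EasyAnimate pipeline. `apply_guidance` and the toy vectors are hypothetical; in a real pipeline the inputs would be noise predictions conditioned on the negative and positive prompt embeddings.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float],
                   guidance_scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy stand-ins for predictions made with the negative and the positive prompt.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0]
print(apply_guidance(uncond, cond, 5.0))  # [5.0, 11.0]
```

With a scale of 1 the result equals the conditional prediction, so the negative prompt has no effect. Larger scales push the output further away from the negative-prompt prediction.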
EasyAnimatePipelineOutput
class diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput
Output class for EasyAnimate pipelines.
Parameters:
frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) : List of video outputs - It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).
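The nested-list layout of frames explains why the examples on this page index the result with `.frames[0]`. The sketch below mimics that layout with a hypothetical stand-in class, `FakeOutput`, and placeholder strings instead of PIL images, so it runs without loading a model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeOutput:
    """Hypothetical stand-in that mirrors only the `frames` field."""
    frames: List[List[str]]


batch_size, num_frames = 2, 49

# Outer list: one entry per video in the batch. Inner list: that video's frames.
output = FakeOutput(
    frames=[[f"video{b}_frame{f}" for f in range(num_frames)]
            for b in range(batch_size)]
)

first_video = output.frames[0]  # same indexing as `.frames[0]` in the examples
print(len(output.frames), len(first_video), first_video[-1])
# 2 49 video0_frame48
```

The inner list is what the examples pass to `export_to_video`.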