	<!--
	#
	# Licensed under the Apache License, Version 2.0 (the "License");
	# you may not use this file except in compliance with the License.
	# You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
	#
	# Unless required by applicable law or agreed to in writing, software
	# distributed under the License is distributed on an "AS IS" BASIS,
	# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	# See the License for the specific language governing permissions and
	# limitations under the License.
	-->

	# EasyAnimate
	[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.

	The description from its GitHub page:
	EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.

	This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://github.com/aigc-apps/EasyAnimate). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).

	There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

	| checkpoints | recommended inference dtype |
	|:---:|:---:|
	| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 |
	| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

	There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

	| checkpoints | recommended inference dtype |
	|:---:|:---:|
	| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

	There are two official EasyAnimate checkpoints available for control-to-video.

	| checkpoints | recommended inference dtype |
	|:---:|:---:|
	| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 |
	| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 |

	For the EasyAnimateV5.1 series:
	- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can each vary from 256 to 1024.
	- Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.
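The ranges above can be checked before a request is sent to the pipeline. The sketch below is only an illustration: `check_easyanimate_request` is a hypothetical helper, not part of diffusers, and it simply encodes the 256 to 1024 pixel range, the 1 to 49 frame range, and the recommended 8 FPS quoted in this section.

```python
def check_easyanimate_request(height: int, width: int, num_frames: int, fps: int = 8) -> float:
    """Validate a request against the documented ranges and return the clip length in seconds.

    Hypothetical helper for illustration; the limits are the ones quoted in the text above.
    """
    for name, value in (("height", height), ("width", width)):
        if not 256 <= value <= 1024:
            raise ValueError(f"{name}={value} is outside the documented 256 to 1024 range")
    if not 1 <= num_frames <= 49:
        raise ValueError(f"num_frames={num_frames} is outside the documented 1 to 49 range")
    # Clip duration is simply the frame count divided by the export frame rate.
    return num_frames / fps


duration = check_easyanimate_request(height=512, width=512, num_frames=49)
print(f"{duration:.3f} s")  # 49 frames at 8 FPS -> 6.125 s
```

This matches the "approximately 6 seconds" figure in the project description: 49 frames exported at 8 FPS last 6.125 seconds.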

	## Quantization

	Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

	Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [EasyAnimatePipeline](/docs/diffusers/pr_11739/en/api/pipelines/easyanimate#diffusers.EasyAnimatePipeline) for inference with bitsandbytes.

	```py
	import torch
	from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
	from diffusers.utils import export_to_video

	# Load only the transformer in 8-bit; the other components stay in float16.
	quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
	transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
	    "alibaba-pai/EasyAnimateV5.1-12b-zh",
	    subfolder="transformer",
	    quantization_config=quant_config,
	    torch_dtype=torch.float16,
	)

	# Pass the quantized transformer to the pipeline in place of the default one.
	pipeline = EasyAnimatePipeline.from_pretrained(
	    "alibaba-pai/EasyAnimateV5.1-12b-zh",
	    transformer=transformer_8bit,
	    torch_dtype=torch.float16,
	    device_map="balanced",
	)

	prompt = "A cat walks on the grass, realistic style."
	negative_prompt = "bad detailed"
	video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
	export_to_video(video, "cat.mp4", fps=8)
	```
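As a rough back-of-the-envelope illustration of why storing weights in lower precision reduces memory, the sketch below estimates weight storage alone. It assumes that the "12b" in the checkpoint name means roughly 12 billion transformer parameters, which this page does not state. It also ignores activations, the VAE, the text encoder, and any overhead added by the quantization backend. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: "12b" in the checkpoint name is read as about 12 billion parameters.
num_params = 12e9

fp16_gb = weight_memory_gb(num_params, 2)  # float16 stores 2 bytes per weight
int8_gb = weight_memory_gb(num_params, 1)  # 8-bit storage uses 1 byte per weight

print(f"float16: {fp16_gb:.1f} GB, 8-bit: {int8_gb:.1f} GB")  # float16: 24.0 GB, 8-bit: 12.0 GB
```

Actual memory use during inference will be higher than these figures, so treat them as a lower bound for planning rather than a measurement.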

	## EasyAnimatePipeline[[diffusers.EasyAnimatePipeline]]

	#### diffusers.EasyAnimatePipeline[[diffusers.EasyAnimatePipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L186)

	Pipeline for text-to-video generation using EasyAnimate.

	This model inherits from [DiffusionPipeline](/docs/diffusers/pr_11739/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
	library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

	EasyAnimate uses one text encoder [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.

	#### __call__[[diffusers.EasyAnimatePipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L524)

	`__call__(prompt: Union[str, List[str]] = None, num_frames: Optional[int] = 49, height: Optional[int] = 512, width: Optional[int] = 512, num_inference_steps: Optional[int] = 50, guidance_scale: Optional[float] = 5.0, negative_prompt: Union[str, List[str], None] = None, num_images_per_prompt: Optional[int] = 1, eta: Optional[float] = 0.0, generator: Union[torch.Generator, List[torch.Generator], None] = None, latents: Optional[torch.Tensor] = None, prompt_embeds: Optional[torch.Tensor] = None, timesteps: Optional[List[int]] = None, negative_prompt_embeds: Optional[torch.Tensor] = None, prompt_attention_mask: Optional[torch.Tensor] = None, negative_prompt_attention_mask: Optional[torch.Tensor] = None, output_type: Optional[str] = 'pil', return_dict: bool = True, callback_on_step_end: Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, None] = None, callback_on_step_end_tensor_inputs: List[str] = ['latents'], guidance_rescale: float = 0.0)`

	Generates images or video using the EasyAnimate pipeline based on the provided prompts.

	Examples:
	```python
	>>> import torch
	>>> from diffusers import EasyAnimatePipeline
	>>> from diffusers.utils import export_to_video

	>>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
	>>> pipe = EasyAnimatePipeline.from_pretrained(
	...     "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
	... ).to("cuda")
	>>> prompt = (
	...     "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
	...     "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
	...     "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
	...     "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
	...     "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
	...     "atmosphere of this unique musical performance."
	... )
	>>> sample_size = (512, 512)
	>>> video = pipe(
	...     prompt=prompt,
	...     guidance_scale=6,
	...     negative_prompt="bad detailed",
	...     height=sample_size[0],
	...     width=sample_size[1],
	...     num_inference_steps=50,
	... ).frames[0]
	>>> export_to_video(video, "output.mp4", fps=8)
	```

	prompt (`str` or `List[str]`, optional):
	Text prompts to guide the image or video generation. If not provided, use `prompt_embeds` instead.
	num_frames (`int`, optional, defaults to 49):
	Length of the generated video (in frames).
	height (`int`, optional, defaults to 512):
	Height of the generated image or video frames in pixels.
	width (`int`, optional, defaults to 512):
	Width of the generated image or video frames in pixels.
	num_inference_steps (`int`, optional, defaults to 50):
	Number of denoising steps during generation. More steps generally yield higher quality images but slow
	down inference.
	guidance_scale (`float`, optional, defaults to 5.0):
	Encourages the model to align outputs with prompts. A higher value may decrease image quality.
	negative_prompt (`str` or `List[str]`, optional):
	Prompts indicating what to exclude in generation. If not specified, use `negative_prompt_embeds`.
	num_images_per_prompt (`int`, optional, defaults to 1):
	Number of images to generate for each prompt.
	eta (`float`, optional, defaults to 0.0):
	Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
	generator (`torch.Generator` or `List[torch.Generator]`, optional):
	A generator to ensure reproducibility in image generation.
	latents (`torch.Tensor`, optional):
	Predefined latent tensors to condition generation.
	prompt_embeds (`torch.Tensor`, optional):
	Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
	negative_prompt_embeds (`torch.Tensor`, optional):
	Embeddings for negative prompts. Overrides string inputs if defined.
	prompt_attention_mask (`torch.Tensor`, optional):
	Attention mask for the primary prompt embeddings.
	negative_prompt_attention_mask (`torch.Tensor`, optional):
	Attention mask for negative prompt embeddings.
	output_type (`str`, optional, defaults to `"pil"`):
	Format of the generated output, either as a PIL image or as a NumPy array.
	return_dict (`bool`, optional, defaults to `True`):
	If `True`, returns a structured output. Otherwise returns a simple tuple.
	callback_on_step_end (`Callable`, optional):
	Functions called at the end of each denoising step.
	callback_on_step_end_tensor_inputs (`List[str]`, optional):
	Tensor names to be included in callback function calls.
	guidance_rescale (`float`, optional, defaults to 0.0):
	Adjusts noise levels based on guidance scale.
	original_size (`Tuple[int, int]`, optional, defaults to `(1024, 1024)`):
	Original dimensions of the output.
	target_size (`Tuple[int, int]`, optional):
	Desired output dimensions for calculations.
	crops_coords_top_left (`Tuple[int, int]`, optional, defaults to `(0, 0)`):
	Coordinates for cropping.

	Parameters:

	vae ([AutoencoderKLMagvit](/docs/diffusers/pr_11739/en/api/models/autoencoderkl_magvit#diffusers.AutoencoderKLMagvit)) : Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations.

	text_encoder (Optional[`~transformers.Qwen2VLForConditionalGeneration`, `~transformers.BertModel`]) : EasyAnimate uses [qwen2 vl](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) in V5.1.

	tokenizer (Optional[`~transformers.Qwen2Tokenizer`, `~transformers.BertTokenizer`]) : A `Qwen2Tokenizer` or `BertTokenizer` to tokenize text.

	transformer ([EasyAnimateTransformer3DModel](/docs/diffusers/pr_11739/en/api/models/easyanimate_transformer3d#diffusers.EasyAnimateTransformer3DModel)) : The EasyAnimate model designed by EasyAnimate Team.

	scheduler ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/pr_11739/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) : A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents.

	Returns:

	[EasyAnimatePipelineOutput](/docs/diffusers/pr_11739/en/api/pipelines/easyanimate#diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput) or `tuple`

	If `return_dict` is `True`, an [EasyAnimatePipelineOutput](/docs/diffusers/pr_11739/en/api/pipelines/easyanimate#diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput) is returned,
	whose `frames` attribute holds the generated video frames. Otherwise a `tuple` is returned where the
	first element is the generated frames.

	#### encode_prompt[[diffusers.EasyAnimatePipeline.encode_prompt]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/easyanimate/pipeline_easyanimate.py#L241)

	Encodes the prompt into text encoder hidden states.

	Parameters:

	prompt (`str` or `List[str]`, optional) : prompt to be encoded

	device (`torch.device`) : torch device

	dtype (`torch.dtype`) : torch dtype

	num_images_per_prompt (`int`) : number of images that should be generated per prompt

	do_classifier_free_guidance (`bool`) : whether to use classifier free guidance or not

	negative_prompt (`str` or `List[str]`, optional) : The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).

	prompt_embeds (`torch.Tensor`, optional) : Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.

	negative_prompt_embeds (`torch.Tensor`, optional) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.

	prompt_attention_mask (`torch.Tensor`, optional) : Attention mask for the prompt. Required when `prompt_embeds` is passed directly.

	negative_prompt_attention_mask (`torch.Tensor`, optional) : Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly.

	max_sequence_length (`int`, optional) : maximum sequence length to use for the prompt.

	## EasyAnimatePipelineOutput[[diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput]]

	#### diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput[[diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/easyanimate/pipeline_output.py#L9)

	Output class for EasyAnimate pipelines.

	Parameters:

	frames (`torch.Tensor`, `np.ndarray`, or `List[List[PIL.Image.Image]]`) : List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
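
To make the nested-list layout of `frames` concrete, here is a small sketch that mimics it with plain Python objects. `FakeVideoOutput` is a hypothetical stand-in, not the real output class, and strings take the place of PIL images so the block runs without any dependencies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeVideoOutput:
    """Hypothetical stand-in that mirrors only the `frames` field described above."""

    frames: List[List[str]]


batch_size, num_frames = 2, 49

# Outer list: one entry per video in the batch. Inner list: one entry per frame.
output = FakeVideoOutput(
    frames=[[f"video{b}_frame{t}" for t in range(num_frames)] for b in range(batch_size)]
)

first_video = output.frames[0]  # same indexing as `.frames[0]` in the examples on this page
print(len(output.frames), len(first_video))  # 2 49
```

This is why the examples on this page index `.frames[0]` before calling `export_to_video`: the first index selects one video from the batch, and the result is the sequence of frames to export.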
