library_name: diffusers
tags:
  - modular-diffusers
  - diffusers
  - qwenimage-layered
  - text-to-image

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: QwenImageLayeredAutoBlocks

Description: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]

Pipeline Architecture

This modular pipeline is composed of the following blocks:

text_encoder (QwenImageLayeredTextEncoderStep)
- QwenImage-Layered text encoder step that encodes the text prompt. If no prompt is provided, it generates one from the input image.
- resize: QwenImageLayeredResizeStep
  - Image resize step that resizes the image to a target area (set by the user-supplied resolution parameter) while maintaining the aspect ratio.
- get_image_prompt: QwenImageLayeredGetImagePromptStep
  - Auto-caption step that generates a text prompt from the input image if none is provided.
- encode: QwenImageTextEncoderStep
  - Text Encoder step that generates text embeddings to guide the image generation.
vae_encoder (QwenImageLayeredVaeEncoderStep)
- VAE encoder step that encodes the image inputs into their latent representations.
- resize: QwenImageLayeredResizeStep
  - Image resize step that resizes the image to a target area (set by the user-supplied resolution parameter) while maintaining the aspect ratio.
- preprocess: QwenImageEditProcessImagesInputStep
  - Image preprocess step. Images need to be resized first.
- encode: QwenImageVaeEncoderStep
  - VAE Encoder step that converts processed_image into latent representations image_latents.
- permute: QwenImageLayeredPermuteLatentsStep
  - Permute image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for Layered packing.
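The resize step shared by text_encoder and vae_encoder can be illustrated with a small sketch. This is not the pipeline's own code: the function name `resize_to_target_area` and the snapping of each side to a multiple of 32 are assumptions made for illustration. It only shows the arithmetic of hitting a target area of roughly resolution × resolution while keeping the aspect ratio.

```python
import math


def resize_to_target_area(width, height, resolution=640, multiple=32):
    """Return (new_width, new_height) with area close to resolution**2.

    The aspect ratio of the input is preserved up to rounding.
    `multiple` is an assumed alignment, not a documented parameter.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    target_area = resolution * resolution
    aspect = width / height

    # Solve new_w * new_h = target_area subject to new_w / new_h = aspect.
    new_w = math.sqrt(target_area * aspect)
    new_h = new_w / aspect

    def snap(value):
        # Round to the nearest multiple, but never below one multiple.
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(new_w), snap(new_h)
```

With these assumptions, a square input at resolution 640 maps to 640 × 640, and a 16:9 input keeps roughly the same pixel budget with a wider frame.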
denoise (QwenImageLayeredCoreDenoiseStep)
- Core denoising workflow for QwenImage-Layered img2img task.
- input: QwenImageLayeredInputStep
  - Input step that prepares the inputs for the layered denoising step.
- prepare_latents: QwenImageLayeredPrepareLatentsStep
  - Prepares the initial random noise of shape (B, layers+1, C, H, W) for the generation process.
- set_timesteps: QwenImageLayeredSetTimestepsStep
  - Set timesteps step for QwenImage Layered with custom mu calculation based on image_latents.
- prepare_rope_inputs: QwenImageLayeredRoPEInputsStep
  - Step that prepares the RoPE inputs for the denoising process. Should be placed after the prepare_latents step.
- denoise: QwenImageLayeredDenoiseStep
  - Denoise step that iteratively denoises the latents.
- after_denoise: QwenImageLayeredAfterDenoiseStep
  - Unpack latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
decode (QwenImageLayeredDecoderStep)
- Decode unpacked latents (B, C, layers+1, H, W) into layer images.
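
The tensor shapes quoted in the block descriptions above can be traced end to end with a shapes-only sketch. The helper name `layered_latent_shapes` is hypothetical, and the packed sequence length is an assumption: it presumes 2×2 spatial patches, which is consistent with the C*4 channel width stated above but is not spelled out in this card.

```python
def layered_latent_shapes(batch, channels, height, width, layers=4, patch=2):
    """Return the shape at each stage as a dict of tuples (no tensors involved).

    `height` and `width` are latent-space sizes. `patch=2` is an assumption
    chosen so that the packed channel width equals channels * 4.
    """
    if height % patch or width % patch:
        raise ValueError("latent height and width must be divisible by patch")

    frames = layers + 1  # the card describes layers+1 slots along the layer axis
    tokens_per_frame = (height // patch) * (width // patch)

    return {
        # vae_encoder.encode output for a single image
        "image_latents": (batch, channels, 1, height, width),
        # vae_encoder.permute: (B, C, 1, H, W) -> (B, 1, C, H, W)
        "permuted": (batch, 1, channels, height, width),
        # denoise.prepare_latents: initial random noise
        "noise": (batch, frames, channels, height, width),
        # packed form consumed by the denoiser: (B, seq, C*4), seq is assumed
        "packed": (batch, frames * tokens_per_frame, channels * patch * patch),
        # denoise.after_denoise: (B, C, layers+1, H, W), passed to decode
        "unpacked": (batch, channels, frames, height, width),
    }
```

For example, `layered_latent_shapes(1, 16, 80, 80)` lists the five stages for a single image with the default of four layers. The values 16 and 80 are arbitrary illustration values, not properties of the model.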

Model Components

image_resize_processor (VaeImageProcessor)
text_encoder (Qwen2_5_VLForConditionalGeneration)
processor (Qwen2VLProcessor)
tokenizer (Qwen2Tokenizer): The tokenizer to use
guider (ClassifierFreeGuidance)
image_processor (VaeImageProcessor)
vae (AutoencoderKLQwenImage)
pachifier (QwenImageLayeredPachifier)
scheduler (FlowMatchEulerDiscreteScheduler)
transformer (QwenImageTransformer2DModel)

Input/Output Specification

Inputs:

image (Image | list): Reference image(s) for denoising. Can be a single image or list of images.
resolution (int, optional, defaults to 640): The target area to resize the image to. Can be 640 or 1024.
prompt (str, optional): The prompt or prompts to guide image generation.
use_en_prompt (bool, optional, defaults to False): Whether to use English prompt template
negative_prompt (str, optional): The prompt or prompts not to guide the image generation.
max_sequence_length (int, optional, defaults to 1024): Maximum sequence length for prompt encoding.
generator (Generator, optional): Torch generator for deterministic generation.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
latents (Tensor, optional): Pre-generated noisy latents for image generation.
layers (int, optional, defaults to 4): Number of layers to extract from the image
num_inference_steps (int, optional, defaults to 50): The number of denoising steps.
sigmas (list, optional): Custom sigmas for the denoising process.
attention_kwargs (dict, optional): Additional kwargs for attention processors.
**denoiser_input_fields (None, optional): Conditional model inputs for the denoiser, e.g. prompt_embeds, negative_prompt_embeds.
output_type (str, optional, defaults to pil): Output format: 'pil', 'np', 'pt'.
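
As a rough aid for assembling call arguments, the documented defaults and allowed values can be captured in a small container. `LayeredInputs` and `to_kwargs` are hypothetical names, not part of Diffusers. The sketch covers only a subset of the inputs listed above and checks only the constraints this card states (resolution of 640 or 1024, output_type of 'pil', 'np', or 'pt'); the positivity checks on layers and num_inference_steps are added assumptions.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class LayeredInputs:
    """Hypothetical holder mirroring a subset of the documented inputs."""

    image: Any
    resolution: int = 640
    prompt: Optional[str] = None
    use_en_prompt: bool = False
    negative_prompt: Optional[str] = None
    max_sequence_length: int = 1024
    num_images_per_prompt: int = 1
    layers: int = 4
    num_inference_steps: int = 50
    output_type: str = "pil"

    def to_kwargs(self):
        """Validate the documented constraints and drop unset optionals."""
        if self.image is None:
            raise ValueError("image is required")
        if self.resolution not in (640, 1024):
            raise ValueError("resolution must be 640 or 1024")
        if self.output_type not in ("pil", "np", "pt"):
            raise ValueError("output_type must be 'pil', 'np', or 'pt'")
        if self.layers < 1 or self.num_inference_steps < 1:
            # Assumed sanity check; the card does not state lower bounds.
            raise ValueError("layers and num_inference_steps must be positive")
        # Read fields directly so the image object is not copied.
        values = {f.name: getattr(self, f.name) for f in fields(self)}
        return {k: v for k, v in values.items() if v is not None}
```

The resulting dictionary could then be unpacked into a pipeline call. Leaving prompt unset keeps it out of the dictionary, which matches the documented behavior of generating a prompt from the image when none is given.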

Outputs:

images (list): Generated images.