library_name: diffusers
tags:
  - modular-diffusers
  - diffusers
  - qwenimage-layered
  - text-to-image

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: QwenImageLayeredAutoBlocks

Description: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (QwenImageLayeredTextEncoderStep)
    • QwenImage-Layered text encoder step that encodes the text prompt. If no prompt is provided, it generates one from the image.
    • resize: QwenImageLayeredResizeStep
      • Image resize step that resizes the image to a target area (set by the user's resolution parameter) while maintaining the aspect ratio.
    • get_image_prompt: QwenImageLayeredGetImagePromptStep
      • Auto-caption step that generates a text prompt from the input image if none is provided.
    • encode: QwenImageTextEncoderStep
      • Text Encoder step that generates text embeddings to guide the image generation.
  2. vae_encoder (QwenImageLayeredVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
    • resize: QwenImageLayeredResizeStep
      • Image resize step that resizes the image to a target area (set by the user's resolution parameter) while maintaining the aspect ratio.
    • preprocess: QwenImageEditProcessImagesInputStep
      • Image preprocess step. Images need to be resized first.
    • encode: QwenImageVaeEncoderStep
      • VAE Encoder step that converts processed_image into latent representations image_latents.
    • permute: QwenImageLayeredPermuteLatentsStep
      • Permute image latents from (B, C, 1, H, W) to (B, 1, C, H, W) for Layered packing.
  3. denoise (QwenImageLayeredCoreDenoiseStep)
    • Core denoising workflow for QwenImage-Layered img2img task.
    • input: QwenImageLayeredInputStep
      • Input step that prepares the inputs for the layered denoising step. It:
    • prepare_latents: QwenImageLayeredPrepareLatentsStep
      • Prepares initial random noise of shape (B, layers+1, C, H, W) for the generation process.
    • set_timesteps: QwenImageLayeredSetTimestepsStep
      • Set timesteps step for QwenImage Layered with custom mu calculation based on image_latents.
    • prepare_rope_inputs: QwenImageLayeredRoPEInputsStep
      • Step that prepares the RoPE inputs for the denoising process. Should be placed after the prepare_latents step.
    • denoise: QwenImageLayeredDenoiseStep
      • Denoise step that iteratively denoises the latents.
    • after_denoise: QwenImageLayeredAfterDenoiseStep
      • Unpack latents from (B, seq, C*4) to (B, C, layers+1, H, W) after denoising.
  4. decode (QwenImageLayeredDecoderStep)
    • Decode unpacked latents (B, C, layers+1, H, W) into layer images.
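
The shape annotations in the block descriptions above can be traced end to end. The following is a minimal sketch in plain Python that only manipulates shape tuples, not tensors. The helper names, the example sizes, and the 2x2 patch size (inferred from the `C*4` in the packed shape) are assumptions for illustration and are not taken from the pipeline's source.

```python
def permute_shape(shape, order):
    """Shape after reordering axes, in the style of Tensor.permute."""
    return tuple(shape[i] for i in order)


def packed_shape(batch, frames, channels, height, width, patch=2):
    """Shape after folding each patch x patch latent window into the channel axis.

    Assumption: 2x2 patches, which is what turns C into C*4.
    """
    seq = frames * (height // patch) * (width // patch)
    return (batch, seq, channels * patch * patch)


def unpacked_shape(packed, frames, height, width, patch=2):
    """Invert packed_shape and place channels ahead of the layer axis."""
    batch, seq, packed_channels = packed
    assert seq == frames * (height // patch) * (width // patch)
    return (batch, packed_channels // (patch * patch), frames, height, width)


# Hypothetical sizes: batch 1, 16 latent channels, 80x80 latent grid, 4 layers.
B, C, H, W, layers = 1, 16, 80, 80, 4

# permute step: (B, C, 1, H, W) -> (B, 1, C, H, W)
image_latents = permute_shape((B, C, 1, H, W), (0, 2, 1, 3, 4))

# prepare_latents step: one noise frame per layer plus one extra frame
noise = (B, layers + 1, C, H, W)

# packed sequence seen by the denoiser, then after_denoise unpacking
packed = packed_shape(B, layers + 1, C, H, W)
decoder_input = unpacked_shape(packed, layers + 1, H, W)

print(image_latents)  # (1, 1, 16, 80, 80)
print(noise)          # (1, 5, 16, 80, 80)
print(packed)         # (1, 8000, 64)
print(decoder_input)  # (1, 16, 5, 80, 80)
```

With these sizes, the unpacked result matches the (B, C, layers+1, H, W) layout that the decode block expects.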

Model Components

  1. image_resize_processor (VaeImageProcessor)
  2. text_encoder (Qwen2_5_VLForConditionalGeneration)
  3. processor (Qwen2VLProcessor)
  4. tokenizer (Qwen2Tokenizer): The tokenizer to use
  5. guider (ClassifierFreeGuidance)
  6. image_processor (VaeImageProcessor)
  7. vae (AutoencoderKLQwenImage)
  8. pachifier (QwenImageLayeredPachifier)
  9. scheduler (FlowMatchEulerDiscreteScheduler)
  10. transformer (QwenImageTransformer2DModel)

Input/Output Specification

Inputs:

  • image (Image | list): Reference image(s) for denoising. Can be a single image or list of images.
  • resolution (int, optional, defaults to 640): The target area to resize the image to. Can be 1024 or 640.
  • prompt (str, optional): The prompt or prompts to guide image generation.
  • use_en_prompt (bool, optional, defaults to False): Whether to use the English prompt template.
  • negative_prompt (str, optional): The prompt or prompts describing what to exclude from the generated image.
  • max_sequence_length (int, optional, defaults to 1024): Maximum sequence length for prompt encoding.
  • generator (Generator, optional): Torch generator for deterministic generation.
  • num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
  • latents (Tensor, optional): Pre-generated noisy latents for image generation.
  • layers (int, optional, defaults to 4): Number of layers to extract from the image
  • num_inference_steps (int, optional, defaults to 50): The number of denoising steps.
  • sigmas (list, optional): Custom sigmas for the denoising process.
  • attention_kwargs (dict, optional): Additional kwargs for attention processors.
  • **denoiser_input_fields (None, optional): Conditional model inputs for the denoiser, e.g. prompt_embeds, negative_prompt_embeds.
  • output_type (str, optional, defaults to pil): Output format: 'pil', 'np', 'pt'.
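
To illustrate how the `resolution` input could translate into a concrete image size, here is a minimal sketch of an aspect-preserving resize to a target area. It assumes the target area is `resolution * resolution` pixels and that both sides are snapped to a multiple of 32. Both of these, along with the function name `target_size`, are assumptions for illustration and may differ from what the resize block actually does.

```python
import math


def target_size(width, height, resolution=640, multiple=32):
    """Return a (width, height) whose area is close to resolution**2
    while keeping the aspect ratio close to the input's."""
    ratio = width / height
    target_area = resolution * resolution  # assumed meaning of `resolution`

    # Solve new_width * new_height = target_area with new_width / new_height = ratio.
    new_width = math.sqrt(target_area * ratio)
    new_height = new_width / ratio

    # Snap each side to the nearest multiple (assumed to be 32).
    new_width = round(new_width / multiple) * multiple
    new_height = round(new_height / multiple) * multiple
    return new_width, new_height


print(target_size(1920, 1080))         # (864, 480)
print(target_size(1000, 1000, 1024))   # (1024, 1024)
```

Because of the snapping, the resulting area and aspect ratio are only approximately equal to the targets: a 16:9 input at resolution 640 lands on 864x480, which is 1.8:1.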

Outputs:

  • images (list): Generated images.