# DiffEdit

[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.

The abstract from the paper is:

*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*

The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).

This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️

## Tips

* The pipeline can generate masks that can be fed into other inpainting pipelines.
* In order to generate an image using this pipeline, both an image mask (manually specified or generated with [generate_mask()](/docs/diffusers/pr_12509/en/api/pipelines/diffedit#diffusers.StableDiffusionDiffEditPipeline.generate_mask) from source and target prompts) and a set of partially inverted latents (generated using [invert()](/docs/diffusers/pr_12509/en/api/pipelines/diffedit#diffusers.StableDiffusionDiffEditPipeline.invert)) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
* The function [generate_mask()](/docs/diffusers/pr_12509/en/api/pipelines/diffedit#diffusers.StableDiffusionDiffEditPipeline.generate_mask) exposes two prompt arguments, `source_prompt` and `target_prompt`, that let you control the locations of the semantic edits in the final image. Say you want to translate from "cat" to "dog"; the edit direction is then "cat -> dog". To reflect this in the generated mask, set the embeddings related to the phrases including "cat" to `source_prompt` and those including "dog" to `target_prompt`.
* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt` and the target concept to `prompt`. Taking the above example, you simply set the embeddings related to the phrases including "cat" to `negative_prompt` and "dog" to `prompt`.
* If you want to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
    * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
    * Change the input prompt in [invert()](/docs/diffusers/pr_12509/en/api/pipelines/diffedit#diffusers.StableDiffusionDiffEditPipeline.invert) to include "dog".
    * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Refer to the [DiffEdit](../../using-diffusers/diffedit) guide for more details. The sketch after this list walks through the full workflow.
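A minimal end-to-end sketch of the "cat -> dog" workflow described in the tips above. The image URL is a hypothetical placeholder (substitute any RGB image of a cat), and any Stable Diffusion checkpoint compatible with this pipeline should work in place of the one shown:

```py
import torch
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
from diffusers.utils import load_image

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()

# Hypothetical URL: replace with your own image of a cat.
init_image = load_image("https://example.com/cat.png").resize((768, 768))

source_prompt = "a cat sitting on a bench"
target_prompt = "a dog sitting on a bench"

# 1. Contrast the two prompts to mask the regions that need to change.
mask_image = pipeline.generate_mask(
    image=init_image, source_prompt=source_prompt, target_prompt=target_prompt
)

# 2. Partially invert the image into latents, guided by a caption of the source image.
image_latents = pipeline.invert(image=init_image, prompt=source_prompt).latents

# 3. Denoise with the target concept as `prompt` and the source concept as `negative_prompt`.
image = pipeline(
    prompt=target_prompt,
    negative_prompt=source_prompt,
    mask_image=mask_image,
    image_latents=image_latents,
).images[0]

# To reverse the direction ("dog -> cat"), swap the prompts in `generate_mask`,
# describe the dog image in `invert`, and swap `prompt`/`negative_prompt` above.
```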
## StableDiffusionDiffEditPipeline[[diffusers.StableDiffusionDiffEditPipeline]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.StableDiffusionDiffEditPipeline</name><anchor>diffusers.StableDiffusionDiffEditPipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12509/src/diffusers/pipelines/stable_diffusion_diffedit/pipeline_stable_diffusion_diffedit.py#L244</source><parameters>[{"name": "vae", "val": ": AutoencoderKL"}, {"name": "text_encoder", "val": ": CLIPTextModel"}, {"name": "tokenizer", "val": ": CLIPTokenizer"}, {"name": "unet", "val": ": UNet2DConditionModel"}, {"name": "scheduler", "val": ": KarrasDiffusionSchedulers"}, {"name": "safety_checker", "val": ": StableDiffusionSafetyChecker"}, {"name": "feature_extractor", "val": ": CLIPImageProcessor"}, {"name": "inverse_scheduler", "val": ": DDIMInverseScheduler"}, {"name": "requires_safety_checker", "val": ": bool = True"}]</parameters><paramsdesc>- **vae** ([AutoencoderKL](/docs/diffusers/pr_12509/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) --
Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- **text_encoder** ([CLIPTextModel](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTextModel)) --
Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
- **tokenizer** ([CLIPTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTokenizer)) --
A `CLIPTokenizer` to tokenize text.
- **unet** ([UNet2DConditionModel](/docs/diffusers/pr_12509/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) --
A `UNet2DConditionModel` to denoise the encoded image latents.
- **scheduler** ([SchedulerMixin](/docs/diffusers/pr_12509/en/api/schedulers/overview#diffusers.SchedulerMixin)) --
A scheduler to be used in combination with `unet` to denoise the encoded image latents.
- **inverse_scheduler** ([DDIMInverseScheduler](/docs/diffusers/pr_12509/en/api/schedulers/ddim_inverse#diffusers.DDIMInverseScheduler)) --
A scheduler to be used in combination with `unet` to fill in the unmasked part of the input latents.
- **safety_checker** (`StableDiffusionSafetyChecker`) --
Classification module that estimates whether generated images could be considered offensive or harmful.
Please refer to the [model card](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) for
more details about a model's potential harms.
- **feature_extractor** ([CLIPImageProcessor](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPImageProcessor)) --
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.</paramsdesc><paramgroups>0</paramgroups></docstring>
> [!WARNING]
> This is an experimental feature!
Pipeline for text-guided image inpainting using Stable Diffusion and DiffEdit.

This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12509/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading and saving methods:

- [load_textual_inversion()](/docs/diffusers/pr_12509/en/api/loaders/textual_inversion#diffusers.loaders.TextualInversionLoaderMixin.load_textual_inversion) for loading textual inversion embeddings
- [load_lora_weights()](/docs/diffusers/pr_12509/en/api/loaders/lora#diffusers.loaders.StableDiffusionLoraLoaderMixin.load_lora_weights) for loading LoRA weights
- [save_lora_weights()](/docs/diffusers/pr_12509/en/api/loaders/lora#diffusers.loaders.StableDiffusionLoraLoaderMixin.save_lora_weights) for saving LoRA weights

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>generate_mask</name><anchor>diffusers.StableDiffusionDiffEditPipeline.generate_mask</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12509/src/diffusers/pipelines/stable_diffusion_diffedit/pipeline_stable_diffusion_diffedit.py#L843</source><parameters>[{"name": "image", "val": ": typing.Union[torch.Tensor, PIL.Image.Image] = None"}, {"name": "target_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "target_negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "target_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "target_negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "source_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "source_negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "source_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "source_negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "num_maps_per_mask", "val": ": typing.Optional[int] = 10"}, {"name": "mask_encode_strength", "val": ": typing.Optional[float] = 0.5"}, {"name": "mask_thresholding_ratio", "val": ": typing.Optional[float] = 3.0"}, {"name": "num_inference_steps", "val": ": int = 50"}, {"name": "guidance_scale", "val": ": float = 7.5"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'np'"}, {"name": "cross_attention_kwargs", "val": ": typing.Optional[typing.Dict[str, typing.Any]] = None"}]</parameters><paramsdesc>- **image** (`PIL.Image.Image`) --
`Image` or tensor representing an image batch to be used for computing the mask.
- **target_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide semantic mask generation. If not defined, you need to pass
`prompt_embeds`.
- **target_negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide what to not include in image generation. If not defined, you need to
pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
- **target_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings are generated from the `prompt` input argument.
- **target_negative_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
- **source_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide semantic mask generation using DiffEdit. If not defined, you need to
pass `source_prompt_embeds` or `source_image` instead.
- **source_negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide semantic mask generation away from using DiffEdit. If not defined, you
need to pass `source_negative_prompt_embeds` or `source_image` instead.
- **source_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings to guide the semantic mask generation. Can be used to easily tweak text
inputs (prompt weighting). If not provided, text embeddings are generated from `source_prompt` input
argument.
- **source_negative_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings to negatively guide the semantic mask generation. Can be used to easily
tweak text inputs (prompt weighting). If not provided, text embeddings are generated from
`source_negative_prompt` input argument.
- **num_maps_per_mask** (`int`, *optional*, defaults to 10) --
The number of noise maps sampled to generate the semantic mask using DiffEdit.
- **mask_encode_strength** (`float`, *optional*, defaults to 0.5) --
The strength of the noise maps sampled to generate the semantic mask using DiffEdit. Must be between 0
and 1.
- **mask_thresholding_ratio** (`float`, *optional*, defaults to 3.0) --
The maximum multiple of the mean absolute difference used to clamp the semantic guidance map before
mask binarization.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
generation deterministic.
- **output_type** (`str`, *optional*, defaults to `"np"`) --
The output format of the generated mask. Choose between `PIL.Image` or `np.array`.
- **cross_attention_kwargs** (`dict`, *optional*) --
A kwargs dictionary that if specified is passed along to the
[AttnProcessor](/docs/diffusers/pr_12509/en/api/attnprocessor#diffusers.models.attention_processor.AttnProcessor) as defined in
[`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).</paramsdesc><paramgroups>0</paramgroups><rettype>`List[PIL.Image.Image]` or `np.array`</rettype><retdesc>When returning a `List[PIL.Image.Image]`, the list consists of a batch of single-channel binary images
with dimensions `(height // self.vae_scale_factor, width // self.vae_scale_factor)`. If it's
`np.array`, the shape is `(batch_size, height // self.vae_scale_factor, width //
self.vae_scale_factor)`.</retdesc></docstring>

Generate a latent mask given a mask prompt, a target prompt, and an image.

<ExampleCodeBlock anchor="diffusers.StableDiffusionDiffEditPipeline.generate_mask.example">

```py
>>> import PIL
>>> import requests
>>> import torch
>>> from io import BytesIO
>>> from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
>>> def download_image(url):
...     response = requests.get(url)
...     return PIL.Image.open(BytesIO(response.content)).convert("RGB")
>>> img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
>>> init_image = download_image(img_url).resize((768, 768))
>>> pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
... )
>>> pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.enable_model_cpu_offload()
>>> mask_prompt = "A bowl of fruits"
>>> prompt = "A bowl of pears"
>>> mask_image = pipeline.generate_mask(image=init_image, source_prompt=prompt, target_prompt=mask_prompt)
>>> image_latents = pipeline.invert(image=init_image, prompt=mask_prompt).latents
>>> image = pipeline(prompt=prompt, mask_image=mask_image, image_latents=image_latents).images[0]
```

</ExampleCodeBlock>
</div>

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>invert</name><anchor>diffusers.StableDiffusionDiffEditPipeline.invert</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12509/src/diffusers/pipelines/stable_diffusion_diffedit/pipeline_stable_diffusion_diffedit.py#L1062</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "image", "val": ": typing.Union[torch.Tensor, PIL.Image.Image] = None"}, {"name": "num_inference_steps", "val": ": int = 50"}, {"name": "inpaint_strength", "val": ": float = 0.8"}, {"name": "guidance_scale", "val": ": float = 7.5"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "decode_latents", "val": ": bool = False"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "callback", "val": ": typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None"}, {"name": "callback_steps", "val": ": typing.Optional[int] = 1"}, {"name": "cross_attention_kwargs", "val": ": typing.Optional[typing.Dict[str, typing.Any]] = None"}, {"name": "lambda_auto_corr", "val": ": float = 20.0"}, {"name": "lambda_kl", "val": ": float = 20.0"}, {"name": "num_reg_steps", "val": ": int = 0"}, {"name": "num_auto_corr_rolls", "val": ": int = 5"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
- **image** (`PIL.Image.Image`) --
`Image` or tensor representing an image batch to produce the inverted latents guided by `prompt`.
- **inpaint_strength** (`float`, *optional*, defaults to 0.8) --
Indicates the extent of the noising process used for latent inversion. Must be between 0 and 1. When
`inpaint_strength` is 1, the inversion process runs for the full number of iterations specified in
`num_inference_steps`. `image` is used as a reference for the inversion process, and a higher
`inpaint_strength` adds more noise. If `inpaint_strength` is 0, no inversion occurs.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide what to not include in image generation. If not defined, you need to
pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
- **generator** (`torch.Generator`, *optional*) --
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
generation deterministic.
- **prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings are generated from the `prompt` input argument.
- **negative_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
- **decode_latents** (`bool`, *optional*, defaults to `False`) --
Whether or not to decode the inverted latents into a generated image. Setting this argument to `True`
decodes all inverted latents for each timestep into a list of generated images.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
Whether or not to return a `~pipelines.stable_diffusion.DiffEditInversionPipelineOutput` instead of a
plain tuple.
- **callback** (`Callable`, *optional*) --
A function that calls every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.Tensor)`.
- **callback_steps** (`int`, *optional*, defaults to 1) --
The frequency at which the `callback` function is called. If not specified, the callback is called at
every step.
- **cross_attention_kwargs** (`dict`, *optional*) --
A kwargs dictionary that if specified is passed along to the
[AttnProcessor](/docs/diffusers/pr_12509/en/api/attnprocessor#diffusers.models.attention_processor.AttnProcessor) as defined in
[`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
- **lambda_auto_corr** (`float`, *optional*, defaults to 20.0) --
Lambda parameter to control auto correction.
- **lambda_kl** (`float`, *optional*, defaults to 20.0) --
Lambda parameter to control Kullback-Leibler divergence output.
- **num_reg_steps** (`int`, *optional*, defaults to 0) --
Number of regularization loss steps.
- **num_auto_corr_rolls** (`int`, *optional*, defaults to 5) --
Number of auto correction roll steps.</paramsdesc><paramgroups>0</paramgroups><retdesc>`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput` or
`tuple`:
If `return_dict` is `True`,
`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput` is
returned, otherwise a `tuple` is returned where the first element is the inverted latents tensors
ordered by increasing noise, and the second is the corresponding decoded images if `decode_latents` is
`True`, otherwise `None`.</retdesc></docstring>

Generate inverted latents given a prompt and image.

<ExampleCodeBlock anchor="diffusers.StableDiffusionDiffEditPipeline.invert.example">

```py
>>> import PIL
>>> import requests
>>> import torch
>>> from io import BytesIO
>>> from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
>>> def download_image(url):
...     response = requests.get(url)
...     return PIL.Image.open(BytesIO(response.content)).convert("RGB")
>>> img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
>>> init_image = download_image(img_url).resize((768, 768))
>>> pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
... )
>>> pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.enable_model_cpu_offload()
>>> prompt = "A bowl of fruits"
>>> inverted_latents = pipeline.invert(image=init_image, prompt=prompt).latents
```

</ExampleCodeBlock>
</div>

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>__call__</name><anchor>diffusers.StableDiffusionDiffEditPipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12509/src/diffusers/pipelines/stable_diffusion_diffedit/pipeline_stable_diffusion_diffedit.py#L1300</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "mask_image", "val": ": typing.Union[torch.Tensor, PIL.Image.Image] = None"}, {"name": "image_latents", "val": ": typing.Union[torch.Tensor, PIL.Image.Image] = None"}, {"name": "inpaint_strength", "val": ": typing.Optional[float] = 0.8"}, {"name": "num_inference_steps", "val": ": int = 50"}, {"name": "guidance_scale", "val": ": float = 7.5"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "num_images_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "eta", "val": ": float = 0.0"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "callback", "val": ": typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None"}, {"name": "callback_steps", "val": ": int = 1"}, {"name": "cross_attention_kwargs", "val": ": typing.Optional[typing.Dict[str, typing.Any]] = None"}, {"name": "clip_skip", "val": ": int = None"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
- **mask_image** (`PIL.Image.Image`) --
`Image` or tensor representing an image batch to mask the generated image. White pixels in the mask are
repainted, while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a
single channel (luminance) before use. If it's a tensor, it should contain one color channel (L)
instead of 3, so the expected shape would be `(B, 1, H, W)`.
- **image_latents** (`PIL.Image.Image` or `torch.Tensor`) --
Partially noised image latents from the inversion process to be used as inputs for image generation.
- **inpaint_strength** (`float`, *optional*, defaults to 0.8) --
Indicates the extent to inpaint the masked area. Must be between 0 and 1. When `inpaint_strength` is 1,
the denoising process is run on the masked area for the full number of iterations specified in
`num_inference_steps`. `image_latents` is used as a reference for the masked area, and a higher
`inpaint_strength` adds more noise to that region. If `inpaint_strength` is 0, no inpainting occurs.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide what to not include in image generation. If not defined, you need to
pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
- **num_images_per_prompt** (`int`, *optional*, defaults to 1) --
The number of images to generate per prompt.
- **eta** (`float`, *optional*, defaults to 0.0) --
Corresponds to parameter eta (η) from the [DDIM](https://huggingface.co/papers/2010.02502) paper. Only
applies to the [DDIMScheduler](/docs/diffusers/pr_12509/en/api/schedulers/ddim#diffusers.DDIMScheduler), and is ignored in other schedulers.
- **generator** (`torch.Generator`, *optional*) --
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
generation deterministic.
- **latents** (`torch.Tensor`, *optional*) --
Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor is generated by sampling using the supplied random `generator`.
- **prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings are generated from the `prompt` input argument.
- **negative_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
Whether or not to return a [StableDiffusionPipelineOutput](/docs/diffusers/pr_12509/en/api/pipelines/stable_diffusion/gligen#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) instead of a
plain tuple.
- **callback** (`Callable`, *optional*) --
A function that calls every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.Tensor)`.
- **callback_steps** (`int`, *optional*, defaults to 1) --
The frequency at which the `callback` function is called. If not specified, the callback is called at
every step.
- **cross_attention_kwargs** (`dict`, *optional*) --
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined in
[`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
- **clip_skip** (`int`, *optional*) --
Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
the output of the pre-final layer will be used for computing the prompt embeddings.</paramsdesc><paramgroups>0</paramgroups><rettype>[StableDiffusionPipelineOutput](/docs/diffusers/pr_12509/en/api/pipelines/stable_diffusion/gligen#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) or `tuple`</rettype><retdesc>If `return_dict` is `True`, [StableDiffusionPipelineOutput](/docs/diffusers/pr_12509/en/api/pipelines/stable_diffusion/gligen#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) is returned,
otherwise a `tuple` is returned where the first element is a list with the generated images and the
second element is a list of `bool`s indicating whether the corresponding generated image contains
"not-safe-for-work" (nsfw) content.</retdesc></docstring>

The call function to the pipeline for generation.

<ExampleCodeBlock anchor="diffusers.StableDiffusionDiffEditPipeline.__call__.example">

```py
>>> import PIL
>>> import requests
>>> import torch
>>> from io import BytesIO
>>> from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
>>> def download_image(url):
...     response = requests.get(url)
...     return PIL.Image.open(BytesIO(response.content)).convert("RGB")
>>> img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
>>> init_image = download_image(img_url).resize((768, 768))
>>> pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
... )
>>> pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.enable_model_cpu_offload()
>>> mask_prompt = "A bowl of fruits"
>>> prompt = "A bowl of pears"
>>> mask_image = pipeline.generate_mask(image=init_image, source_prompt=prompt, target_prompt=mask_prompt)
>>> image_latents = pipeline.invert(image=init_image, prompt=mask_prompt).latents
>>> image = pipeline(prompt=prompt, mask_image=mask_image, image_latents=image_latents).images[0]
```

</ExampleCodeBlock>
</div>

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>encode_prompt</name><anchor>diffusers.StableDiffusionDiffEditPipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12509/src/diffusers/pipelines/stable_diffusion_diffedit/pipeline_stable_diffusion_diffedit.py#L422</source><parameters>[{"name": "prompt", "val": ""}, {"name": "device", "val": ""}, {"name": "num_images_per_prompt", "val": ""}, {"name": "do_classifier_free_guidance", "val": ""}, {"name": "negative_prompt", "val": " = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "lora_scale", "val": ": typing.Optional[float] = None"}, {"name": "clip_skip", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) --
The prompt to be encoded.
- **device** (`torch.device`) --
The torch device on which to compute the embeddings.
- **num_images_per_prompt** (`int`) --
The number of images that should be generated per prompt.
- **do_classifier_free_guidance** (`bool`) --
Whether to use classifier-free guidance or not.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
- **prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
- **negative_prompt_embeds** (`torch.Tensor`, *optional*) --
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
- **lora_scale** (`float`, *optional*) --
A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
- **clip_skip** (`int`, *optional*) --
Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
the output of the pre-final layer will be used for computing the prompt embeddings.</paramsdesc><paramgroups>0</paramgroups></docstring>

Encodes the prompt into text encoder hidden states.
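A minimal usage sketch: pre-computing embeddings once and reusing them across pipeline calls. It assumes `pipeline`, `mask_image`, and `image_latents` were created as in the examples above, and that the method returns a `(prompt_embeds, negative_prompt_embeds)` pair, as in the current diffusers implementation:

```py
>>> # Encode the target and source concepts once; reuse the embeddings across calls.
>>> prompt_embeds, negative_prompt_embeds = pipeline.encode_prompt(
...     "A bowl of pears",
...     device="cuda",
...     num_images_per_prompt=1,
...     do_classifier_free_guidance=True,
...     negative_prompt="A bowl of fruits",
... )
>>> image = pipeline(
...     prompt_embeds=prompt_embeds,
...     negative_prompt_embeds=negative_prompt_embeds,
...     mask_image=mask_image,
...     image_latents=image_latents,
... ).images[0]
```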
</div></div>

## StableDiffusionPipelineOutput[[diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput</name><anchor>diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12509/src/diffusers/pipelines/stable_diffusion/pipeline_output.py#L11</source><parameters>[{"name": "images", "val": ": typing.Union[typing.List[PIL.Image.Image], numpy.ndarray]"}, {"name": "nsfw_content_detected", "val": ": typing.Optional[typing.List[bool]]"}]</parameters><paramsdesc>- **images** (`List[PIL.Image.Image]` or `np.ndarray`) --
List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
num_channels)`.
- **nsfw_content_detected** (`List[bool]`) --
List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or
`None` if safety checking could not be performed.</paramsdesc><paramgroups>0</paramgroups></docstring>

Output class for Stable Diffusion pipelines.
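A short sketch of accessing its fields, assuming `pipeline`, `mask_image`, and `image_latents` from the examples above; with `return_dict=False` the same data comes back as a plain tuple:

```py
>>> output = pipeline(prompt="A bowl of pears", mask_image=mask_image, image_latents=image_latents)
>>> image = output.images[0]  # list of PIL images (or an array, depending on `output_type`)
>>> flagged = output.nsfw_content_detected  # list of bools, or None if safety checking was skipped
>>> images, flagged = pipeline(
...     prompt="A bowl of pears",
...     mask_image=mask_image,
...     image_latents=image_latents,
...     return_dict=False,
... )
```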
</div>

<EditOnGithub source="https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/diffedit.md" />