# BLIP-Diffusion

BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.

The abstract from the paper is:

*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).*

The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.

`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
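Building on the tip above, here is a minimal sketch (not an official recipe) of loading the base pipeline once and passing its already-loaded components to the ControlNet variant; the keyword arguments assume the component names listed in the **Parameters** sections below:

```py
import torch
from diffusers.pipelines import BlipDiffusionPipeline, BlipDiffusionControlNetPipeline

# Load the base subject-driven pipeline once.
base_pipe = BlipDiffusionPipeline.from_pretrained(
    "Salesforce/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

# Reuse its already-loaded components when instantiating the ControlNet variant,
# so only the ControlNet-specific weights still need to be fetched.
controlnet_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
    "Salesforce/blipdiffusion-controlnet",
    vae=base_pipe.vae,
    text_encoder=base_pipe.text_encoder,
    tokenizer=base_pipe.tokenizer,
    unet=base_pipe.unet,
    qformer=base_pipe.qformer,
    image_processor=base_pipe.image_processor,
    torch_dtype=torch.float16,
).to("cuda")
```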
## BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]

#### diffusers.BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L78)

Pipeline for zero-shot subject-driven generation using BLIP-Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12652/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).
#### __call__[[diffusers.BlipDiffusionPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L190)

`__call__(prompt: list, reference_image: Image, source_subject_category: list, target_subject_category: list, latents: torch.Tensor | None = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: torch.Generator | list[torch.Generator] | None = None, neg_prompt: str | None = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: str | None = 'pil', return_dict: bool = True)`

**Parameters:**

- **prompt** (`list[str]`) --
  The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
  The reference image to condition the generation on.
- **source_subject_category** (`list[str]`) --
  The source subject category.
- **target_subject_category** (`list[str]`) --
  The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2
  of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
  The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
  The width of the generated image.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **generator** (`torch.Generator` or `list[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to `""`) --
  The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
  if `guidance_scale` is less than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
  The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
  repeated to amplify it (see the illustrative sketch below).
- **prompt_reps** (`int`, *optional*, defaults to 20) --
  The number of times the prompt is repeated, together with `prompt_strength`, to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
  (`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) instead of a plain tuple.
Function invoked when calling the pipeline for generation.
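The `prompt_strength` and `prompt_reps` arguments amplify the subject prompt by repetition. The snippet below is only an illustrative sketch of that repetition trick, an approximation of the pipeline's internal prompt building rather than a verbatim copy of its code:

```py
# Illustrative approximation of how the subject prompt may be amplified internally.
def build_amplified_prompt(prompt: str, tgt_subject: str, prompt_strength: float = 1.0, prompt_reps: int = 20) -> str:
    # The target subject is prepended to the text prompt, and the resulting phrase
    # is repeated roughly prompt_strength * prompt_reps times.
    phrase = f"a {tgt_subject} {prompt.strip()}"
    return ", ".join([phrase] * int(prompt_strength * prompt_reps))


print(build_amplified_prompt("swimming underwater", "dog", prompt_strength=0.5, prompt_reps=4))
# -> "a dog swimming underwater, a dog swimming underwater"
```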
Examples:

```py
>>> from diffusers.pipelines import BlipDiffusionPipeline
>>> from diffusers.utils import load_image
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
...     "Salesforce/blipdiffusion", torch_dtype=torch.float16
... ).to("cuda")
>>> cond_subject = "dog"
>>> tgt_subject = "dog"
>>> text_prompt_input = "swimming underwater"
>>> cond_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 25
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
...     text_prompt_input,
...     cond_image,
...     cond_subject,
...     tgt_subject,
...     guidance_scale=guidance_scale,
...     num_inference_steps=num_inference_steps,
...     neg_prompt=negative_prompt,
...     height=512,
...     width=512,
... ).images
>>> output[0].save("image.png")
```
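Continuing from the example above, the `generator` argument documented earlier can be used to pin the random state so repeated calls reproduce the same image; a minimal, hedged sketch (the seed value is arbitrary):

```py
# Fix the random state so repeated calls with the same inputs produce the same image.
generator = torch.Generator(device="cuda").manual_seed(0)

output = blip_diffusion_pipe(
    text_prompt_input,
    cond_image,
    cond_subject,
    tgt_subject,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    neg_prompt=negative_prompt,
    generator=generator,
    height=512,
    width=512,
).images
output[0].save("image_seeded.png")
```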
**Parameters:**

- tokenizer (`CLIPTokenizer`) : Tokenizer for the text encoder
- text_encoder (`ContextCLIPTextModel`) : Text encoder to encode the text prompt
- vae ([AutoencoderKL](/docs/diffusers/pr_12652/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : VAE model to map the latents to the image
- unet ([UNet2DConditionModel](/docs/diffusers/pr_12652/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : Conditional U-Net architecture to denoise the image embedding.
- scheduler ([PNDMScheduler](/docs/diffusers/pr_12652/en/api/schedulers/pndm#diffusers.PNDMScheduler)) : A scheduler to be used in combination with `unet` to generate image latents.
- qformer (`Blip2QFormerModel`) : QFormer model to get multi-modal embeddings from the text and image.
- image_processor (`BlipImageProcessor`) : Image Processor to preprocess and postprocess the image.
- ctx_begin_pos (`int`, *optional*, defaults to 2) : Position of the context token in the text encoder.

**Returns:**

[ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) or `tuple`
## BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]

#### diffusers.BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L85)

Pipeline for Canny-edge-based controlled subject-driven generation using BLIP-Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12652/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).
#### __call__[[diffusers.BlipDiffusionControlNetPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L238)

`__call__(prompt: list, reference_image: Image, condtioning_image: Image, source_subject_category: list, target_subject_category: list, latents: torch.Tensor | None = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: torch.Generator | list[torch.Generator] | None = None, neg_prompt: str | None = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: str | None = 'pil', return_dict: bool = True)`

**Parameters:**

- **prompt** (`list[str]`) --
  The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
  The reference image to condition the generation on.
- **condtioning_image** (`PIL.Image.Image`) --
  The conditioning canny edge image to condition the generation on.
- **source_subject_category** (`list[str]`) --
  The source subject category.
- **target_subject_category** (`list[str]`) --
  The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2
  of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
  The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
  The width of the generated image.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **generator** (`torch.Generator` or `list[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to `""`) --
  The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
  if `guidance_scale` is less than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
  The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
  repeated to amplify it.
- **prompt_reps** (`int`, *optional*, defaults to 20) --
  The number of times the prompt is repeated, together with `prompt_strength`, to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
  (`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) instead of a plain tuple.
Function invoked when calling the pipeline for generation.

Examples:

```py
>>> from diffusers.pipelines import BlipDiffusionControlNetPipeline
>>> from diffusers.utils import load_image
>>> from controlnet_aux import CannyDetector
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
...     "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
... ).to("cuda")
>>> style_subject = "flower"
>>> tgt_subject = "teapot"
>>> text_prompt = "on a marble table"
>>> cldm_cond_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
... ).resize((512, 512))
>>> canny = CannyDetector()
>>> cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
>>> style_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 50
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
...     text_prompt,
...     style_image,
...     cldm_cond_image,
...     style_subject,
...     tgt_subject,
...     guidance_scale=guidance_scale,
...     num_inference_steps=num_inference_steps,
...     neg_prompt=negative_prompt,
...     height=512,
...     width=512,
... ).images
>>> output[0].save("image.png")
```
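The example above produces the canny edge conditioning image with `controlnet_aux.CannyDetector`. If that package is not available, a rough, unofficial equivalent can be sketched with OpenCV (assumes `opencv-python`, `numpy`, and `Pillow` are installed; the 30/70 thresholds mirror the example):

```py
import numpy as np
import cv2
from PIL import Image
from diffusers.utils import load_image

raw = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
).resize((512, 512))

# Canny expects a single-channel 8-bit image; use the same low/high thresholds as above.
gray = cv2.cvtColor(np.array(raw), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 30, 70)

# Replicate the edge map to three channels so the pipeline receives an RGB PIL image.
cldm_cond_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
```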
**Parameters:**

- tokenizer (`CLIPTokenizer`) : Tokenizer for the text encoder
- text_encoder (`ContextCLIPTextModel`) : Text encoder to encode the text prompt
- vae ([AutoencoderKL](/docs/diffusers/pr_12652/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : VAE model to map the latents to the image
- unet ([UNet2DConditionModel](/docs/diffusers/pr_12652/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : Conditional U-Net architecture to denoise the image embedding.
- scheduler ([PNDMScheduler](/docs/diffusers/pr_12652/en/api/schedulers/pndm#diffusers.PNDMScheduler)) : A scheduler to be used in combination with `unet` to generate image latents.
- qformer (`Blip2QFormerModel`) : QFormer model to get multi-modal embeddings from the text and image.
- controlnet ([ControlNetModel](/docs/diffusers/pr_12652/en/api/models/controlnet#diffusers.ControlNetModel)) : ControlNet model to get the conditioning image embedding.
- image_processor (`BlipImageProcessor`) : Image Processor to preprocess and postprocess the image.
- ctx_begin_pos (`int`, *optional*, defaults to 2) : Position of the context token in the text encoder.

**Returns:**

[ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) or `tuple`