# BLIP-Diffusion
BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
The abstract from the paper is:
*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).*
The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.
`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).
> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
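As a concrete instance of the scheduler tip, the snippet below is a minimal sketch of swapping the default `PNDMScheduler` for a `DPMSolverMultistepScheduler` rebuilt from the same configuration. Whether BLIP-Diffusion keeps its output quality with this scheduler is an assumption here, not something the paper or checkpoints guarantee.

```py
import torch
from diffusers import DPMSolverMultistepScheduler
from diffusers.pipelines import BlipDiffusionPipeline

pipe = BlipDiffusionPipeline.from_pretrained(
    "Salesforce/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

# Rebuild a different scheduler from the existing scheduler's config and swap it in.
# Whether BLIP-Diffusion preserves quality with this scheduler is an assumption.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```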
## BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]
#### diffusers.BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L80)
Pipeline for zero-shot subject-driven generation using BLIP-Diffusion.
This model inherits from [DiffusionPipeline](/docs/diffusers/pr_11739/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
#### __call__[[diffusers.BlipDiffusionPipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L192)

`__call__(prompt: List[str], reference_image: PIL.Image.Image, source_subject_category: List[str], target_subject_category: List[str], latents: Optional[torch.Tensor] = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, neg_prompt: Optional[str] = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: Optional[str] = 'pil', return_dict: bool = True)`

**Parameters:**
- **prompt** (`List[str]`) --
The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
The reference image to condition the generation on.
- **source_subject_category** (`List[str]`) --
The source subject category.
- **target_subject_category** (`List[str]`) --
The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
Guidance scale as defined in [Classifier-Free Diffusion
Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` in equation 2
of the [Imagen paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting
`guidance_scale > 1`. A higher guidance scale encourages the model to generate images closely linked to
the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
The width of the generated image.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to "") --
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
if `guidance_scale` is less than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
repeated to amplify it (see the illustrative sketch after the returns section below).
- **prompt_reps** (`int`, *optional*, defaults to 20) --
The number of times the prompt is repeated, scaled by `prompt_strength`, to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
(`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_11739/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) instead of a plain tuple.
Function invoked when calling the pipeline for generation.
Examples:
```py
>>> from diffusers.pipelines import BlipDiffusionPipeline
>>> from diffusers.utils import load_image
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
... "Salesforce/blipdiffusion", torch_dtype=torch.float16
... ).to("cuda")
>>> cond_subject = "dog"
>>> tgt_subject = "dog"
>>> text_prompt_input = "swimming underwater"
>>> cond_image = load_image(
... "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 25
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
... text_prompt_input,
... cond_image,
... cond_subject,
... tgt_subject,
... guidance_scale=guidance_scale,
... num_inference_steps=num_inference_steps,
... neg_prompt=negative_prompt,
... height=512,
... width=512,
... ).images
>>> output[0].save("image.png")
```
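Continuing from the example above (and reusing `blip_diffusion_pipe`, `cond_image`, `cond_subject`, `tgt_subject`, `negative_prompt`, `guidance_scale`, and `num_inference_steps` defined there), the `generator` and `latents` arguments can make sampling reproducible and let you tweak the same generation with different prompts. The latent shape used below (4 channels at `height / 8` × `width / 8`) is the usual Stable-Diffusion-style layout and is an assumption here, not something read from the pipeline source.

```py
import torch

# Fix the random source so runs are reproducible.
generator = torch.Generator("cuda").manual_seed(0)

# Pre-generate one latent tensor and reuse it across prompts so only the text changes.
# Shape assumption: (batch, 4, height // 8, width // 8) for a 512x512 output.
latents = torch.randn((1, 4, 64, 64), generator=generator, device="cuda", dtype=torch.float16)

for prompt in ["swimming underwater", "sitting on a beach"]:
    image = blip_diffusion_pipe(
        prompt,
        cond_image,
        cond_subject,
        tgt_subject,
        latents=latents,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        neg_prompt=negative_prompt,
        height=512,
        width=512,
    ).images[0]
    image.save(f"dog_{prompt.replace(' ', '_')}.png")
```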
**Parameters:**
- **tokenizer** (`CLIPTokenizer`) --
Tokenizer for the text encoder.
- **text_encoder** (`ContextCLIPTextModel`) --
Text encoder to encode the text prompt.
- **vae** ([AutoencoderKL](/docs/diffusers/pr_11739/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) --
VAE model to map the latents to the image.
- **unet** ([UNet2DConditionModel](/docs/diffusers/pr_11739/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) --
Conditional U-Net architecture to denoise the image embedding.
- **scheduler** ([PNDMScheduler](/docs/diffusers/pr_11739/en/api/schedulers/pndm#diffusers.PNDMScheduler)) --
A scheduler to be used in combination with `unet` to generate image latents.
- **qformer** (`Blip2QFormerModel`) --
QFormer model to get multi-modal embeddings from the text and image.
- **image_processor** (`BlipImageProcessor`) --
Image processor to preprocess and postprocess the image.
- **ctx_begin_pos** (`int`, *optional*, defaults to 2) --
Position of the context token in the text encoder.
**Returns:**
[ImagePipelineOutput](/docs/diffusers/pr_11739/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) or `tuple`
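The `prompt_strength` and `prompt_reps` arguments act together: the parameter descriptions above say both control how many times the prompt is repeated before encoding, and a reasonable reading is that the effective repetition count is roughly `prompt_strength * prompt_reps`. The sketch below illustrates that reading with a hypothetical helper (`amplify_prompt` is not part of the pipeline API), not the pipeline's actual internal code.

```py
def amplify_prompt(prompt: str, tgt_subject: str, prompt_strength: float = 1.0, prompt_reps: int = 20) -> str:
    # Illustrative only: prepend the target subject, then repeat the prompt
    # roughly prompt_strength * prompt_reps times, joined with commas.
    text = f"a {tgt_subject} {prompt.strip()}"
    return ", ".join([text] * int(prompt_strength * prompt_reps))


print(amplify_prompt("swimming underwater", "dog", prompt_strength=0.5, prompt_reps=4))
# "a dog swimming underwater, a dog swimming underwater"
```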
## BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]
#### diffusers.BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L87)
Pipeline for Canny-edge-based controlled subject-driven generation using BLIP-Diffusion.
This model inherits from [DiffusionPipeline](/docs/diffusers/pr_11739/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
#### __call__[[diffusers.BlipDiffusionControlNetPipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L240)

`__call__(prompt: List[str], reference_image: PIL.Image.Image, condtioning_image: PIL.Image.Image, source_subject_category: List[str], target_subject_category: List[str], latents: Optional[torch.Tensor] = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, neg_prompt: Optional[str] = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: Optional[str] = 'pil', return_dict: bool = True)`

**Parameters:**
- **prompt** (`List[str]`) --
The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
The reference image to condition the generation on.
- **condtioning_image** (`PIL.Image.Image`) --
The conditioning canny edge image to condition the generation on.
- **source_subject_category** (`List[str]`) --
The source subject category.
- **target_subject_category** (`List[str]`) --
The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
Guidance scale as defined in [Classifier-Free Diffusion
Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` in equation 2
of the [Imagen paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting
`guidance_scale > 1`. A higher guidance scale encourages the model to generate images closely linked to
the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
The width of the generated image.
- **seed** (`int`, *optional*, defaults to 42) --
The seed to use for random generation.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to "") --
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
if `guidance_scale` is less than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
repeated to amplify it.
- **prompt_reps** (`int`, *optional*, defaults to 20) --
The number of times the prompt is repeated, scaled by `prompt_strength`, to amplify the prompt.
Function invoked when calling the pipeline for generation.
Examples:
```py
>>> from diffusers.pipelines import BlipDiffusionControlNetPipeline
>>> from diffusers.utils import load_image
>>> from controlnet_aux import CannyDetector
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
... "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
... ).to("cuda")
>>> style_subject = "flower"
>>> tgt_subject = "teapot"
>>> text_prompt = "on a marble table"
>>> cldm_cond_image = load_image(
... "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
... ).resize((512, 512))
>>> canny = CannyDetector()
>>> cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
>>> style_image = load_image(
... "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 50
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
... text_prompt,
... style_image,
... cldm_cond_image,
... style_subject,
... tgt_subject,
... guidance_scale=guidance_scale,
... num_inference_steps=num_inference_steps,
... neg_prompt=negative_prompt,
... height=512,
... width=512,
... ).images
>>> output[0].save("image.png")
```
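The Canny edge map does not have to come from `controlnet_aux`. The snippet below is a minimal sketch that produces a comparable conditioning image with OpenCV, using the same low/high thresholds (30, 70) as the example above; the exact preprocessing `CannyDetector` applies (resizing, channel handling) may differ slightly.

```py
import cv2
import numpy as np
from PIL import Image
from diffusers.utils import load_image

# Load and resize the conditioning image used in the example above.
img = np.array(
    load_image(
        "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
    ).resize((512, 512))
)

# Canny edge detection on the grayscale image with thresholds (30, 70).
edges = cv2.Canny(cv2.cvtColor(img, cv2.COLOR_RGB2GRAY), 30, 70)

# Replicate the single-channel edge map to 3 channels and convert back to PIL.
cldm_cond_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
```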
**Parameters:**
- **tokenizer** (`CLIPTokenizer`) --
Tokenizer for the text encoder.
- **text_encoder** (`ContextCLIPTextModel`) --
Text encoder to encode the text prompt.
- **vae** ([AutoencoderKL](/docs/diffusers/pr_11739/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) --
VAE model to map the latents to the image.
- **unet** ([UNet2DConditionModel](/docs/diffusers/pr_11739/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) --
Conditional U-Net architecture to denoise the image embedding.
- **scheduler** ([PNDMScheduler](/docs/diffusers/pr_11739/en/api/schedulers/pndm#diffusers.PNDMScheduler)) --
A scheduler to be used in combination with `unet` to generate image latents.
- **qformer** (`Blip2QFormerModel`) --
QFormer model to get multi-modal embeddings from the text and image.
- **controlnet** ([ControlNetModel](/docs/diffusers/pr_11739/en/api/models/controlnet#diffusers.ControlNetModel)) --
ControlNet model to get the conditioning image embedding.
- **image_processor** (`BlipImageProcessor`) --
Image processor to preprocess and postprocess the image.
- **ctx_begin_pos** (`int`, *optional*, defaults to 2) --
Position of the context token in the text encoder.
**Returns:**
[ImagePipelineOutput](/docs/diffusers/pr_11739/en/api/pipelines/stable_unclip#diffusers.ImagePipelineOutput) or `tuple`
