# BLIP-Diffusion

BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.

The abstract from the paper is:

*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).*

The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.

`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
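Building on the tip above, here is a minimal sketch (not an official recipe) of loading the base pipeline once and passing its already-loaded components to the ControlNet variant; the keyword arguments assume the component names listed in the **Parameters** sections below:

```py
import torch
from diffusers.pipelines import BlipDiffusionPipeline, BlipDiffusionControlNetPipeline

# Load the base subject-driven pipeline once.
base_pipe = BlipDiffusionPipeline.from_pretrained(
    "Salesforce/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

# Reuse its already-loaded components when instantiating the ControlNet variant,
# so only the ControlNet-specific weights still need to be fetched.
controlnet_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
    "Salesforce/blipdiffusion-controlnet",
    vae=base_pipe.vae,
    text_encoder=base_pipe.text_encoder,
    tokenizer=base_pipe.tokenizer,
    unet=base_pipe.unet,
    qformer=base_pipe.qformer,
    image_processor=base_pipe.image_processor,
    torch_dtype=torch.float16,
).to("cuda")
```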
## BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]

#### diffusers.BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L78)

Pipeline for zero-shot subject-driven generation using BLIP-Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12652/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).
#### __call__[[diffusers.BlipDiffusionPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L190)

`__call__(prompt: list, reference_image: Image, source_subject_category: list, target_subject_category: list, latents: torch.Tensor | None = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: torch.Generator | list[torch.Generator] | None = None, neg_prompt: str | None = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: str | None = 'pil', return_dict: bool = True)`

**Parameters:**

- **prompt** (`list[str]`) --
  The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
  The reference image to condition the generation on.
- **source_subject_category** (`list[str]`) --
  The source subject category.
- **target_subject_category** (`list[str]`) --
  The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2
  of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
  The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
  The width of the generated image.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **generator** (`torch.Generator` or `list[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to `""`) --
  The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
  if `guidance_scale` is less than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
  The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
  repeated to amplify it (see the illustrative sketch below).
- **prompt_reps** (`int`, *optional*, defaults to 20) --
  The number of times the prompt is repeated, together with `prompt_strength`, to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
  (`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) instead of a plain tuple.
Function invoked when calling the pipeline for generation.
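The `prompt_strength` and `prompt_reps` arguments amplify the subject prompt by repetition. The snippet below is only an illustrative sketch of that repetition trick, an approximation of the pipeline's internal prompt building rather than a verbatim copy of its code:

```py
# Illustrative approximation of how the subject prompt may be amplified internally.
def build_amplified_prompt(prompt: str, tgt_subject: str, prompt_strength: float = 1.0, prompt_reps: int = 20) -> str:
    # The target subject is prepended to the text prompt, and the resulting phrase
    # is repeated roughly prompt_strength * prompt_reps times.
    phrase = f"a {tgt_subject} {prompt.strip()}"
    return ", ".join([phrase] * int(prompt_strength * prompt_reps))


print(build_amplified_prompt("swimming underwater", "dog", prompt_strength=0.5, prompt_reps=4))
# -> "a dog swimming underwater, a dog swimming underwater"
```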
Examples:

```py
>>> from diffusers.pipelines import BlipDiffusionPipeline
>>> from diffusers.utils import load_image
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
...     "Salesforce/blipdiffusion", torch_dtype=torch.float16
... ).to("cuda")
>>> cond_subject = "dog"
>>> tgt_subject = "dog"
>>> text_prompt_input = "swimming underwater"
>>> cond_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 25
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
...     text_prompt_input,
...     cond_image,
...     cond_subject,
...     tgt_subject,
...     guidance_scale=guidance_scale,
...     num_inference_steps=num_inference_steps,
...     neg_prompt=negative_prompt,
...     height=512,
...     width=512,
... ).images
>>> output[0].save("image.png")
```
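Continuing from the example above, the `generator` argument documented earlier can be used to pin the random state so repeated calls reproduce the same image; a minimal, hedged sketch (the seed value is arbitrary):

```py
# Fix the random state so repeated calls with the same inputs produce the same image.
generator = torch.Generator(device="cuda").manual_seed(0)

output = blip_diffusion_pipe(
    text_prompt_input,
    cond_image,
    cond_subject,
    tgt_subject,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    neg_prompt=negative_prompt,
    generator=generator,
    height=512,
    width=512,
).images
output[0].save("image_seeded.png")
```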
**Parameters:**

- tokenizer (`CLIPTokenizer`) : Tokenizer for the text encoder
- text_encoder (`ContextCLIPTextModel`) : Text encoder to encode the text prompt
- vae ([AutoencoderKL](/docs/diffusers/pr_12652/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : VAE model to map the latents to the image
- unet ([UNet2DConditionModel](/docs/diffusers/pr_12652/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : Conditional U-Net architecture to denoise the image embedding.
- scheduler ([PNDMScheduler](/docs/diffusers/pr_12652/en/api/schedulers/pndm#diffusers.PNDMScheduler)) : A scheduler to be used in combination with `unet` to generate image latents.
- qformer (`Blip2QFormerModel`) : QFormer model to get multi-modal embeddings from the text and image.
- image_processor (`BlipImageProcessor`) : Image Processor to preprocess and postprocess the image.
- ctx_begin_pos (`int`, *optional*, defaults to 2) : Position of the context token in the text encoder.

**Returns:**

[ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) or `tuple`
## BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]

#### diffusers.BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L85)

Pipeline for Canny-edge-based controlled subject-driven generation using BLIP-Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12652/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).
#### __call__[[diffusers.BlipDiffusionControlNetPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L238)

`__call__(prompt: list, reference_image: Image, condtioning_image: Image, source_subject_category: list, target_subject_category: list, latents: torch.Tensor | None = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: torch.Generator | list[torch.Generator] | None = None, neg_prompt: str | None = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: str | None = 'pil', return_dict: bool = True)`

**Parameters:**

- **prompt** (`list[str]`) --
  The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
  The reference image to condition the generation on.
- **condtioning_image** (`PIL.Image.Image`) --
  The conditioning canny edge image to condition the generation on.
- **source_subject_category** (`list[str]`) --
  The source subject category.
- **target_subject_category** (`list[str]`) --
  The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2
  of the [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
  The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
  The width of the generated image.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **generator** (`torch.Generator` or `list[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to `""`) --
  The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
  if `guidance_scale` is less than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
  The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
  repeated to amplify it.
- **prompt_reps** (`int`, *optional*, defaults to 20) --
  The number of times the prompt is repeated, together with `prompt_strength`, to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
  (`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return an [ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) instead of a plain tuple.
Function invoked when calling the pipeline for generation.

Examples:

```py
>>> from diffusers.pipelines import BlipDiffusionControlNetPipeline
>>> from diffusers.utils import load_image
>>> from controlnet_aux import CannyDetector
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
...     "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
... ).to("cuda")
>>> style_subject = "flower"
>>> tgt_subject = "teapot"
>>> text_prompt = "on a marble table"
>>> cldm_cond_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
... ).resize((512, 512))
>>> canny = CannyDetector()
>>> cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
>>> style_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 50
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
...     text_prompt,
...     style_image,
...     cldm_cond_image,
...     style_subject,
...     tgt_subject,
...     guidance_scale=guidance_scale,
...     num_inference_steps=num_inference_steps,
...     neg_prompt=negative_prompt,
...     height=512,
...     width=512,
... ).images
>>> output[0].save("image.png")
```
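The example above produces the canny edge conditioning image with `controlnet_aux.CannyDetector`. If that package is not available, a rough, unofficial equivalent can be sketched with OpenCV (assumes `opencv-python`, `numpy`, and `Pillow` are installed; the 30/70 thresholds mirror the example):

```py
import numpy as np
import cv2
from PIL import Image
from diffusers.utils import load_image

raw = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
).resize((512, 512))

# Canny expects a single-channel 8-bit image; use the same low/high thresholds as above.
gray = cv2.cvtColor(np.array(raw), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 30, 70)

# Replicate the edge map to three channels so the pipeline receives an RGB PIL image.
cldm_cond_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
```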
**Parameters:**

- tokenizer (`CLIPTokenizer`) : Tokenizer for the text encoder
- text_encoder (`ContextCLIPTextModel`) : Text encoder to encode the text prompt
- vae ([AutoencoderKL](/docs/diffusers/pr_12652/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : VAE model to map the latents to the image
- unet ([UNet2DConditionModel](/docs/diffusers/pr_12652/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : Conditional U-Net architecture to denoise the image embedding.
- scheduler ([PNDMScheduler](/docs/diffusers/pr_12652/en/api/schedulers/pndm#diffusers.PNDMScheduler)) : A scheduler to be used in combination with `unet` to generate image latents.
- qformer (`Blip2QFormerModel`) : QFormer model to get multi-modal embeddings from the text and image.
- controlnet ([ControlNetModel](/docs/diffusers/pr_12652/en/api/models/controlnet#diffusers.ControlNetModel)) : ControlNet model to get the conditioning image embedding.
- image_processor (`BlipImageProcessor`) : Image Processor to preprocess and postprocess the image.
- ctx_begin_pos (`int`, *optional*, defaults to 2) : Position of the context token in the text encoder.

**Returns:**

[ImagePipelineOutput](/docs/diffusers/pr_12652/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) or `tuple`