
	<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"OmniGen","local":"omnigen","sections":[{"title":"Load model checkpoints","local":"load-model-checkpoints","sections":[],"depth":2},{"title":"Text-to-image","local":"text-to-image","sections":[],"depth":2},{"title":"Image edit","local":"image-edit","sections":[],"depth":2},{"title":"Controllable generation","local":"controllable-generation","sections":[],"depth":2},{"title":"ID and object preserving","local":"id-and-object-preserving","sections":[],"depth":2},{"title":"Optimization when using multiple images","local":"optimization-when-using-multiple-images","sections":[],"depth":2}],"depth":1}">
	<link href="/docs/diffusers/pr_12403/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/entry/start.33959e67.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/scheduler.8c3d61f6.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/singletons.46d5608c.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/index.0997d446.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/paths.0dc9c45f.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/entry/app.87796ad1.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/index.da70eac4.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/nodes/0.9198881c.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/each.e59479a4.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/nodes/314.485eeda3.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/CodeBlock.a9c4becf.js">
	<link rel="modulepreload" href="/docs/diffusers/pr_12403/en/_app/immutable/chunks/getInferenceSnippets.ea1775db.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"OmniGen","local":"omnigen","sections":[{"title":"Load model checkpoints","local":"load-model-checkpoints","sections":[],"depth":2},{"title":"Text-to-image","local":"text-to-image","sections":[],"depth":2},{"title":"Image edit","local":"image-edit","sections":[],"depth":2},{"title":"Controllable generation","local":"controllable-generation","sections":[],"depth":2},{"title":"ID and object preserving","local":"id-and-object-preserving","sections":[],"depth":2},{"title":"Optimization when using multiple images","local":"optimization-when-using-multiple-images","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="omnigen" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#omnigen"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>OmniGen</span></h1> <p data-svelte-h="svelte-wv02ap">OmniGen is an image generation model. 
Unlike existing text-to-image models, OmniGen is a single model designed to handle a variety of tasks (e.g., text-to-image, image editing, controllable generation). It has the following features:</p> <ul data-svelte-h="svelte-1e1t33d"><li>Minimalist model architecture, consisting of only a VAE and a transformer module, for joint modeling of text and images.</li> <li>Support for multimodal inputs. It can process any text-image mixed data as instructions for image generation, rather than relying solely on text.</li></ul> <p data-svelte-h="svelte-1frdgki">For more information, please refer to the <a href="https://huggingface.co/papers/2409.11340" rel="nofollow">paper</a>.
	This guide will walk you through using OmniGen for various tasks and use cases.</p> <h2 class="relative group"><a id="load-model-checkpoints" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#load-model-checkpoints"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Load model checkpoints</span></h2> <p data-svelte-h="svelte-ut5cgr">Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the <a href="/docs/diffusers/pr_12403/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained">from_pretrained()</a> method.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path 
d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline

	pipe = OmniGenPipeline.from_pretrained(<span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>, torch_dtype=torch.bfloat16)<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="text-to-image" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#text-to-image"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Text-to-image</span></h2> <p data-svelte-h="svelte-1a87kfl">For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image.
	Set the <code>height</code> and <code>width</code> parameters to generate an image of a different size.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained(
    <span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>,
    torch_dtype=torch.bfloat16
)
pipe.to(<span class="hljs-string">"cuda"</span>)

prompt = <span class="hljs-string">"Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."</span>
image = pipe(
    prompt=prompt,
    height=<span class="hljs-number">1024</span>,
    width=<span class="hljs-number">1024</span>,
    guidance_scale=<span class="hljs-number">3</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">111</span>),
).images[<span class="hljs-number">0</span>]
	image.save(<span class="hljs-string">"output.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-uf526e"><img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png" alt="generated image"></div> <h2 class="relative group"><a id="image-edit" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#image-edit"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Image edit</span></h2> <p data-svelte-h="svelte-53bqqd">OmniGen supports multimodal inputs.
	When the input includes an image, add the placeholder <code>&lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt;</code> to the text prompt to represent the image.
	It is recommended to enable <code>use_input_image_size_as_output</code> to keep the edited image the same size as the original image.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline
<span class="hljs-keyword">from</span> diffusers.utils <span class="hljs-keyword">import</span> load_image

pipe = OmniGenPipeline.from_pretrained(
    <span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>,
    torch_dtype=torch.bfloat16
)
pipe.to(<span class="hljs-string">"cuda"</span>)

<span class="hljs-comment"># The placeholder in the prompt stands for the image passed in input_images.</span>
prompt = <span class="hljs-string">"&lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt; Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."</span>
input_images = [load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png"</span>)]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=<span class="hljs-number">2</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    use_input_image_size_as_output=<span class="hljs-literal">True</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">222</span>)
).images[<span class="hljs-number">0</span>]
	image.save(<span class="hljs-string">"output.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex flex-row gap-4" data-svelte-h="svelte-ptjo10"><div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">edited image</figcaption></div></div> <p data-svelte-h="svelte-1n0on1x">OmniGen has some interesting features, such as visual reasoning, as shown in the example below.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->prompt=<span 
class="hljs-string">"If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. &lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt;"</span>
input_images = [load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"</span>)]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=<span class="hljs-number">2</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    use_input_image_size_as_output=<span class="hljs-literal">True</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">0</span>)
).images[<span class="hljs-number">0</span>]
	image.save(<span class="hljs-string">"output.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-yas3w7"><img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/reasoning.png" alt="generated image"></div> <h2 class="relative group"><a id="controllable-generation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#controllable-generation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Controllable generation</span></h2> <p data-svelte-h="svelte-gken8m">OmniGen can handle several classic computer vision tasks. 
As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline
<span class="hljs-keyword">from</span> diffusers.utils <span class="hljs-keyword">import</span> load_image

pipe = OmniGenPipeline.from_pretrained(
    <span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>,
    torch_dtype=torch.bfloat16
)
pipe.to(<span class="hljs-string">"cuda"</span>)

<span class="hljs-comment"># Step 1: detect the human skeleton in the input image.</span>
prompt = <span class="hljs-string">"Detect the skeleton of human in this image: &lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt;"</span>
input_images = [load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"</span>)]
image1 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=<span class="hljs-number">2</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    use_input_image_size_as_output=<span class="hljs-literal">True</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">333</span>)
).images[<span class="hljs-number">0</span>]
image1.save(<span class="hljs-string">"image1.png"</span>)

<span class="hljs-comment"># Step 2: use a skeleton image as the control condition for a new image.</span>
prompt = <span class="hljs-string">"Generate a new photo using the following picture and text as conditions: &lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt;\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."</span>
input_images = [load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png"</span>)]
image2 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=<span class="hljs-number">2</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    use_input_image_size_as_output=<span class="hljs-literal">True</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">333</span>)
).images[<span class="hljs-number">0</span>]
	image2.save(<span class="hljs-string">"image2.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex flex-row gap-4" data-svelte-h="svelte-n9fk6r"><div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">detected skeleton</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal2img.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">skeleton to image</figcaption></div></div> <p data-svelte-h="svelte-1xps456">OmniGen can also directly use relevant information from input images to generate new images.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform 
-translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline
<span class="hljs-keyword">from</span> diffusers.utils <span class="hljs-keyword">import</span> load_image

pipe = OmniGenPipeline.from_pretrained(
    <span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>,
    torch_dtype=torch.bfloat16
)
pipe.to(<span class="hljs-string">"cuda"</span>)

prompt = <span class="hljs-string">"Following the pose of this image &lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt;, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."</span>
input_images = [load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"</span>)]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=<span class="hljs-number">2</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    use_input_image_size_as_output=<span class="hljs-literal">True</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">0</span>)
).images[<span class="hljs-number">0</span>]
	image.save(<span class="hljs-string">"output.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex flex-row gap-4" data-svelte-h="svelte-17vyas4"><div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/same_pose.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption></div></div> <h2 class="relative group"><a id="id-and-object-preserving" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#id-and-object-preserving"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>ID and object preserving</span></h2> <p data-svelte-h="svelte-18p79vw">OmniGen can generate multiple images based on the people and objects in the input image and supports inputting multiple images simultaneously.
	Additionally, OmniGen can extract desired objects from an image containing multiple objects based on instructions.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline
<span class="hljs-keyword">from</span> diffusers.utils <span class="hljs-keyword">import</span> load_image

pipe = OmniGenPipeline.from_pretrained(
    <span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>,
    torch_dtype=torch.bfloat16
)
pipe.to(<span class="hljs-string">"cuda"</span>)

<span class="hljs-comment"># image_1 and image_2 in the prompt refer to the first and second entries of input_images.</span>
prompt = <span class="hljs-string">"A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in &lt;img&gt;&lt;|image_1|&gt;&lt;/img&gt;. The woman is the woman on the left of &lt;img&gt;&lt;|image_2|&gt;&lt;/img&gt;"</span>
input_image_1 = load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png"</span>)
input_image_2 = load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png"</span>)
input_images = [input_image_1, input_image_2]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    height=<span class="hljs-number">1024</span>,
    width=<span class="hljs-number">1024</span>,
    guidance_scale=<span class="hljs-number">2.5</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">666</span>)
).images[<span class="hljs-number">0</span>]
	image.save(<span class="hljs-string">"output.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex flex-row gap-4" data-svelte-h="svelte-q7ysaf"><div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">input_image_1</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">input_image_2</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/id2.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption></div></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> 
Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> OmniGenPipeline
<span class="hljs-keyword">from</span> diffusers.utils <span class="hljs-keyword">import</span> load_image

pipe = OmniGenPipeline.from_pretrained(
    <span class="hljs-string">"Shitao/OmniGen-v1-diffusers"</span>,
    torch_dtype=torch.bfloat16
)
pipe.to(<span class="hljs-string">"cuda"</span>)

<span class="hljs-comment"># The image_1 and image_2 placeholders in the prompt refer to the entries of input_images, in order</span>
prompt = <span class="hljs-string">"A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <img><|image_1|></img>. The long-sleeve blouse and a pleated skirt are <img><|image_2|></img>."</span>
input_image_1 = load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg"</span>)
input_image_2 = load_image(<span class="hljs-string">"https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg"</span>)
input_images = [input_image_1, input_image_2]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    height=<span class="hljs-number">1024</span>,
    width=<span class="hljs-number">1024</span>,
    guidance_scale=<span class="hljs-number">2.5</span>,
    img_guidance_scale=<span class="hljs-number">1.6</span>,
    <span class="hljs-comment"># Fixed seed so the result is reproducible</span>
    generator=torch.Generator(device=<span class="hljs-string">"cpu"</span>).manual_seed(<span class="hljs-number">666</span>)
).images[<span class="hljs-number">0</span>]
image.save(<span class="hljs-string">"output.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex flex-row gap-4" data-svelte-h="svelte-164h1l6"><div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg"> <figcaption class="mt-2 text-center text-sm text-gray-500">person image</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg"> <figcaption class="mt-2 text-center text-sm text-gray-500">clothing image</figcaption></div> <div class="flex-1"><img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/tryon.png"> <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption></div></div> <h2 class="relative group"><a id="optimization-when-using-multiple-images" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#optimization-when-using-multiple-images"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Optimization when using multiple images</span></h2> <p
data-svelte-h="svelte-1y946ts">For text-to-image generation, OmniGen has modest memory and time requirements (9GB of memory and 31s for a 1024x1024 image on an A800 GPU).
When you pass input images, however, the computational cost increases.</p> <p data-svelte-h="svelte-13qz3jx">The guidelines below can help you reduce the computational cost when using multiple images. The measurements were taken on an A800 GPU with two input images.</p> <p data-svelte-h="svelte-1twry1x">As with other pipelines, you can reduce memory usage by offloading the model with <code>pipe.enable_model_cpu_offload()</code> or <code>pipe.enable_sequential_cpu_offload()</code>.
With OmniGen, you can also lower the computational overhead by reducing <code>max_input_image_size</code>.
The table below shows the memory consumption for different values:</p> <table data-svelte-h="svelte-1qb6le7"><thead><tr><th>Setting</th> <th>Memory usage</th></tr></thead> <tbody><tr><td>max_input_image_size=1024</td> <td>40GB</td></tr> <tr><td>max_input_image_size=512</td> <td>17GB</td></tr> <tr><td>max_input_image_size=256</td> <td>14GB</td></tr></tbody></table> <p></p>
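<p>As a rough illustration of how the two options above can be combined, the sketch below enables model CPU offloading and passes a smaller <code>max_input_image_size</code> to the pipeline call. The helper names <code>pick_max_input_image_size</code> and <code>generate_with_reduced_memory</code>, the 24GB default budget, and the fallback to 256 are assumptions made for this example and are not part of diffusers. The memory figures are copied from the table above, so they only apply to a comparable setup (A800 GPU, two input images).</p>

```python
# Memory figures copied from the table above (A800 GPU, two input images).
MEMORY_GB_BY_MAX_INPUT_IMAGE_SIZE = {1024: 40, 512: 17, 256: 14}


def pick_max_input_image_size(memory_budget_gb):
    """Hypothetical helper: largest listed size whose reported memory fits the budget.

    Returns None when even the smallest listed size exceeds the budget.
    """
    fitting = [
        size
        for size, memory_gb in MEMORY_GB_BY_MAX_INPUT_IMAGE_SIZE.items()
        if memory_gb <= memory_budget_gb
    ]
    return max(fitting) if fitting else None


def generate_with_reduced_memory(prompt, input_images, memory_budget_gb=24):
    """Sketch: combine model CPU offloading with a smaller max_input_image_size."""
    # Imported lazily so the helper above works without torch/diffusers installed.
    import torch
    from diffusers import OmniGenPipeline

    pipe = OmniGenPipeline.from_pretrained(
        "Shitao/OmniGen-v1-diffusers",
        torch_dtype=torch.bfloat16,
    )
    # Offloading is used instead of pipe.to("cuda").
    pipe.enable_model_cpu_offload()

    # Assumption: fall back to the smallest listed size if nothing fits the budget.
    max_size = pick_max_input_image_size(memory_budget_gb) or 256

    return pipe(
        prompt=prompt,
        input_images=input_images,
        height=1024,
        width=1024,
        guidance_scale=2.5,
        img_guidance_scale=1.6,
        max_input_image_size=max_size,
        generator=torch.Generator(device="cpu").manual_seed(666),
    ).images[0]
```

<p>With the default 24GB budget the helper selects <code>max_input_image_size=512</code>. Actual memory use also depends on the offloading mode and the output resolution, so treat the selection as a starting point and measure on your own hardware.</p>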

	<script>
	{
	__sveltekit_g87enx = {
	assets: "/docs/diffusers/pr_12403/en",
	base: "/docs/diffusers/pr_12403/en",
	env: {}
	};

	const element = document.currentScript.parentElement;

	const data = [null,null];

	Promise.all([
	import("/docs/diffusers/pr_12403/en/_app/immutable/entry/start.33959e67.js"),
	import("/docs/diffusers/pr_12403/en/_app/immutable/entry/app.87796ad1.js")
	]).then(([kit, app]) => {
	kit.start(app, element, {
	node_ids: [0, 314],
	data,
	form: null,
	error: null
	});
	});
	}
	</script>
