| <link rel="modulepreload" href="/docs/diffusers/pr_10312/en/_app/immutable/chunks/EditOnGithub.1e64e623.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Marigold Pipelines for Computer Vision Tasks","local":"marigold-pipelines-for-computer-vision-tasks","sections":[{"title":"Depth Prediction Quick Start","local":"depth-prediction-quick-start","sections":[],"depth":3},{"title":"Surface Normals Prediction Quick Start","local":"surface-normals-prediction-quick-start","sections":[],"depth":3},{"title":"Speeding up inference","local":"speeding-up-inference","sections":[],"depth":3},{"title":"Qualitative Comparison with Depth Anything","local":"qualitative-comparison-with-depth-anything","sections":[],"depth":2},{"title":"Maximizing Precision and Ensembling","local":"maximizing-precision-and-ensembling","sections":[],"depth":2},{"title":"Quantitative Evaluation","local":"quantitative-evaluation","sections":[],"depth":2},{"title":"Using Predictive Uncertainty","local":"using-predictive-uncertainty","sections":[],"depth":2},{"title":"Frame-by-frame Video Processing with Temporal Consistency","local":"frame-by-frame-video-processing-with-temporal-consistency","sections":[],"depth":2},{"title":"Marigold for ControlNet","local":"marigold-for-controlnet","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="marigold-pipelines-for-computer-vision-tasks" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#marigold-pipelines-for-computer-vision-tasks"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Marigold Pipelines for Computer Vision Tasks</span></h1> <p data-svelte-h="svelte-rsgfrp"><a href="../api/pipelines/marigold">Marigold</a> is a novel diffusion-based dense prediction approach, and a set of pipelines for various computer vision tasks, such as monocular depth estimation.</p> <p data-svelte-h="svelte-14zd5ma">This guide will show you how to use Marigold to obtain fast and high-quality predictions for images and videos.</p> <p data-svelte-h="svelte-mg9rv6">Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a <em>prediction</em> of the modality of interest, such as a depth map of the input image. | |
Currently, the following tasks are implemented:

| Pipeline | Predicted Modalities | Demos |
|----------|----------------------|:-----:|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |

The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.
These checkpoints are meant to work with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold).
The original code can also be used to train new checkpoints.

| Checkpoint | Modality | Comment |
|------------|----------|---------|
| [prs-eth/marigold-v1-0](https://huggingface.co/prs-eth/marigold-v1-0) | Depth | The first Marigold Depth checkpoint, which predicts *affine-invariant depth* maps. The performance of this checkpoint in benchmarks was studied in the original [paper](https://huggingface.co/papers/2312.02145). Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. Affine-invariant depth prediction has a range of values in each pixel between 0 (near plane) and 1 (far plane); both planes are chosen by the model as part of the inference process. See the `MarigoldImageProcessor` reference for visualization utilities. |
| [prs-eth/marigold-depth-lcm-v1-0](https://huggingface.co/prs-eth/marigold-depth-lcm-v1-0) | Depth | The fast Marigold Depth checkpoint, fine-tuned from `prs-eth/marigold-v1-0`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. |
| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | A preview checkpoint for the Marigold Normals pipeline. Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. The surface normals predictions are unit-length 3D vectors with values in the range from -1 to 1. *This checkpoint will be phased out after the release of the `v1-0` version.* |
| [prs-eth/marigold-normals-lcm-v0-1](https://huggingface.co/prs-eth/marigold-normals-lcm-v0-1) | Normals | The fast Marigold Normals checkpoint, fine-tuned from `prs-eth/marigold-normals-v0-1`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. *This checkpoint will be phased out after the release of the `v1-0` version.* |

The examples below are mostly given for depth prediction, but they can be universally applied to the other supported modalities.
We showcase the predictions using the same input image of Albert Einstein generated by Midjourney.
This makes it easier to compare visualizations of the predictions across various modalities and checkpoints.

<div class="flex gap-4" style="justify-content: center; width: 100%;">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://marigoldmonodepth.github.io/images/einstein.jpg"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Example input image for all Marigold pipelines</figcaption>
  </div>
</div>

### Depth Prediction Quick Start

To get the first depth prediction, load the `prs-eth/marigold-depth-lcm-v1-0` checkpoint into the `MarigoldDepthPipeline`, put the image through the pipeline, and save the predictions:

```python
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image)

vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")

depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
```

The visualization function for depth, `visualize_depth()`, applies one of [matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` depth range into an RGB image.
With the `Spectral` colormap, near pixels are painted red and far pixels are assigned a blue color.
The 16-bit PNG file stores the single-channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`.
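Because the mapping is linear, it is easy to invert when the raw values are needed for downstream processing. A minimal sketch, assuming `numpy` and `Pillow` are available and using the file saved in the snippet above:

```python
import numpy as np
from PIL import Image

# Undo the linear [0, 1] -> [0, 65535] mapping used when exporting the 16-bit PNG.
depth_16bit = np.asarray(Image.open("einstein_depth_16bit.png")).astype(np.float32)
depth_restored = depth_16bit / 65535.0  # affine-invariant depth, back in [0, 1]
```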
Below are the raw and the visualized predictions; as can be seen, dark areas (mustache) are easier to distinguish in the visualization:

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth_16bit.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted depth (16-bit PNG)</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted depth visualization (Spectral)</figcaption>
  </div>
</div>

### Surface Normals Prediction Quick Start

Load the `prs-eth/marigold-normals-lcm-v0-1` checkpoint into the `MarigoldNormalsPipeline`, put the image through the pipeline, and save the predictions:

```python
import diffusers
import torch

pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
    "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

normals = pipe(image)

vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```

The visualization function for normals, `visualize_normals()`, maps the three-dimensional prediction with pixel values in the range `[-1, 1]` into an RGB image.
It supports flipping the surface normals axes to make the visualization compatible with other choices of the frame of reference.
Conceptually, each pixel is painted according to the surface normal vector in a frame of reference where the `X` axis points right, the `Y` axis points up, and the `Z` axis points at the viewer.
Below is the visualized prediction:

<div class="flex gap-4" style="justify-content: center; width: 100%;">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted surface normals visualization</figcaption>
  </div>
</div>

In this example, the nose tip almost certainly has a point on the surface where the normal vector points straight at the viewer, meaning its coordinates are `[0, 0, 1]`.
This vector maps to the RGB `[128, 128, 255]`, which corresponds to a violet-blue color.
Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the red hue.
Points on the shoulders pointing up with a large `Y` component promote a green color.
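The color mapping itself is straightforward to reproduce by hand. A minimal sketch of the `[-1, 1]` to RGB conversion described above, applied to a hypothetical normals array in which every pixel faces the viewer:

```python
import numpy as np

# Hypothetical normals array: every pixel faces the viewer, like the nose tip.
normals = np.zeros((480, 640, 3), dtype=np.float32)
normals[..., 2] = 1.0

# Map each component linearly from [-1, 1] to [0, 255].
rgb = ((normals + 1.0) * 0.5 * 255.0).round().astype(np.uint8)
print(rgb[0, 0])  # [128 128 255], the violet-blue color discussed above
```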
### Speeding up inference

The above quick start snippets are already optimized for speed: they load the LCM checkpoint, use the `fp16` variant of weights and computation, and perform just one denoising diffusion step.
The `pipe(image)` call completes in 280 ms on an RTX 3090 GPU.
Internally, the input image is encoded with the Stable Diffusion VAE encoder, then the U-Net performs one denoising step, and finally, the prediction latent is decoded with the VAE decoder into pixel space.
In this case, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM.
Because Marigold's latent space is compatible with the base Stable Diffusion, it is possible to speed up the pipeline call by more than 3x (to 85 ms on an RTX 3090) by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny):

```diff
  import diffusers
  import torch

  pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
      "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
  ).to("cuda")

+ pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
+     "madebyollin/taesd", torch_dtype=torch.float16
+ ).cuda()

  image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

  depth = pipe(image)
```

As suggested in [Optimizations](../optimization/torch2.0#torch.compile), adding `torch.compile` may squeeze extra performance depending on the target hardware:

```diff
  import diffusers
  import torch

  pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
      "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
  ).to("cuda")

+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

  image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
  depth = pipe(image)
```
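The timings quoted above depend on the GPU, so it is worth measuring on your own hardware. A small benchmarking sketch, reusing `pipe` and `image` from the snippets above (the warm-up and loop counts are arbitrary example values):

```python
import time

import torch

# Warm up first so one-off costs (compilation, allocations) do not skew the numbers.
for _ in range(3):
    pipe(image)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    pipe(image)
torch.cuda.synchronize()  # include pending asynchronous GPU work in the timing
print(f"{(time.perf_counter() - start) / 10 * 1000:.1f} ms per call")
```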
## Qualitative Comparison with Depth Anything

With the above speed optimizations, Marigold delivers more detailed predictions faster than [Depth Anything](https://huggingface.co/docs/transformers/main/en/model_doc/depth_anything), even with the latter's largest checkpoint, [LiheYoung/depth-anything-large-hf](https://huggingface.co/LiheYoung/depth-anything-large-hf):

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold LCM fp16 with Tiny AutoEncoder</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/einstein_depthanything_large.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Depth Anything Large</figcaption>
  </div>
</div>

## Maximizing Precision and Ensembling

Marigold pipelines have a built-in ensembling mechanism that combines multiple predictions from different random latents.
This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion.
The ensembling path is activated automatically when the `ensemble_size` argument is set greater than `1`.
When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`.
The recommended values vary across checkpoints but primarily depend on the scheduler type.
The effect of ensembling is particularly visible with surface normals:

```python
import diffusers

model_path = "prs-eth/marigold-normals-v1-0"

model_paper_kwargs = {
    diffusers.schedulers.DDIMScheduler: {
        "num_inference_steps": 10,
        "ensemble_size": 10,
    },
    diffusers.schedulers.LCMScheduler: {
        "num_inference_steps": 4,
        "ensemble_size": 5,
    },
}

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(model_path).to("cuda")
pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]

normals = pipe(image, **pipe_kwargs)

vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Surface normals, no ensembling</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Surface normals, with ensembling</figcaption>
  </div>
</div>

As can be seen, all areas with fine-grained structures, such as hair, got more conservative and on average more correct predictions.
Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.

## Quantitative Evaluation

To evaluate Marigold quantitatively on standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), follow the evaluation protocol outlined in the paper: load the full-precision fp32 model and use appropriate values for `num_inference_steps` and `ensemble_size`.
Optionally seed randomness to ensure reproducibility. Maximizing `batch_size` will deliver maximum device utilization.

```python
import diffusers
import torch

device = "cuda"
seed = 2024
model_path = "prs-eth/marigold-v1-0"

model_paper_kwargs = {
    diffusers.schedulers.DDIMScheduler: {
        "num_inference_steps": 50,
        "ensemble_size": 10,
    },
    diffusers.schedulers.LCMScheduler: {
        "num_inference_steps": 4,
        "ensemble_size": 10,
    },
}

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

generator = torch.Generator(device=device).manual_seed(seed)
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(model_path).to(device)
pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]

depth = pipe(image, generator=generator, **pipe_kwargs)

# evaluate metrics
```
## Using Predictive Uncertainty

The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random latents.
As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify `ensemble_size` greater than 1 and set `output_uncertainty=True`.
The resulting uncertainty will be available in the `uncertainty` field of the output.
It can be visualized as follows:

```python
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(
    image,
    ensemble_size=10,  # any number greater than 1; higher values yield higher precision
    output_uncertainty=True,
)

uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
uncertainty[0].save("einstein_depth_uncertainty.png")
```

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_depth_uncertainty.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Depth uncertainty</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals_uncertainty.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Surface normals uncertainty</figcaption>
  </div>
</div>

The interpretation of uncertainty is simple: higher values (white) correspond to pixels where the model struggles to make consistent predictions.
Evidently, the depth model is least confident around edges with discontinuities, where the object depth changes drastically.
The surface normals model is least confident in fine-grained structures, such as hair, and in dark areas, such as the collar.
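One practical use of the uncertainty map is gating downstream processing on prediction confidence. A minimal sketch, assuming the `depth` output produced above with the default numpy output type and an arbitrary example threshold:

```python
import numpy as np

# Keep predictions only where the ensemble members agree; the 0.05 threshold
# is an arbitrary example value and should be tuned for the downstream task.
prediction = depth.prediction[0].squeeze()
uncertainty = depth.uncertainty[0].squeeze()
masked_depth = np.where(uncertainty < 0.05, prediction, np.nan)
```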
## Frame-by-frame Video Processing with Temporal Consistency

Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent initialization.
This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the following videos:

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Input video</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption>
  </div>
</div>

To address this issue, it is possible to pass the `latents` argument to the pipelines, which defines the starting point of diffusion.
Empirically, we found that a convex combination of the very same starting-point noise latent and the latent corresponding to the previous frame's prediction gives sufficiently smooth results, as implemented in the snippet below:

```python
import imageio
from PIL import Image
from tqdm import tqdm

import diffusers
import torch

device = "cuda"
path_in = "obama.mp4"
path_out = "obama_depth.gif"

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to(device)
pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to(device)
pipe.set_progress_bar_config(disable=True)

with imageio.get_reader(path_in) as reader:
    size = reader.get_meta_data()["size"]
    last_frame_latent = None
    latent_common = torch.randn(
        (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size)))
    ).to(device=device, dtype=torch.float16)

    out = []
    for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"):
        frame = Image.fromarray(frame)
        latents = latent_common
        if last_frame_latent is not None:
            # Convex combination of the common starting noise and the previous frame's latent.
            latents = 0.9 * latents + 0.1 * last_frame_latent

        depth = pipe(
            frame, match_input_resolution=False, latents=latents, output_latent=True
        )
        last_frame_latent = depth.latent
        out.append(pipe.image_processor.visualize_depth(depth.prediction)[0])

    diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()["fps"])
```

Here, the diffusion process starts from the given computed latent.
The pipeline sets `output_latent=True` to access `depth.latent`, which contributes to the next frame's latent initialization.
The result is much more stable now:

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_consistent.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth with forced latents initialization</figcaption>
  </div>
</div>

## Marigold for ControlNet

A very common application for depth prediction with diffusion models comes in conjunction with ControlNet.
Depth crispness plays a crucial role in obtaining high-quality results from ControlNet.
As seen in comparisons with other methods above, Marigold excels at that task.
The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format:

```python
import torch
import diffusers

device = "cuda"
generator = torch.Generator(device=device).manual_seed(2024)
image = diffusers.utils.load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"
)

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16, variant="fp16"
).to(device)

depth_image = pipe(image, generator=generator).prediction
depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary")
depth_image[0].save("motorcycle_controlnet_depth.png")

controlnet = diffusers.ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16"
).to(device)
pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnet
).to(device)
pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

controlnet_out = pipe(
    prompt="high quality photo of a sports bike, city",
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
    image=depth_image,
    controlnet_conditioning_scale=0.7,
    control_guidance_end=0.7,
    generator=generator,
).images
controlnet_out[0].save("motorcycle_controlnet_out.png")
```

<div class="flex gap-4">
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Input image</figcaption>
  </div>
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_depth.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Depth in the format compatible with ControlNet</figcaption>
  </div>
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_out.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city"</figcaption>
  </div>
</div>

Hopefully, you will find Marigold useful for solving your downstream tasks, be it a part of a broader generative workflow or a perception task, such as 3D reconstruction.