| <link rel="modulepreload" href="/docs/diffusers/pr_11234/en/_app/immutable/chunks/index.5d4ab994.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Marigold Computer Vision","local":"marigold-computer-vision","sections":[{"title":"Depth Prediction","local":"depth-prediction","sections":[],"depth":2},{"title":"Surface Normals Estimation","local":"surface-normals-estimation","sections":[],"depth":2},{"title":"Intrinsic Image Decomposition","local":"intrinsic-image-decomposition","sections":[],"depth":2},{"title":"Speeding up inference","local":"speeding-up-inference","sections":[],"depth":2},{"title":"Maximizing Precision and Ensembling","local":"maximizing-precision-and-ensembling","sections":[],"depth":2},{"title":"Frame-by-frame Video Processing with Temporal Consistency","local":"frame-by-frame-video-processing-with-temporal-consistency","sections":[],"depth":2},{"title":"Marigold for ControlNet","local":"marigold-for-controlnet","sections":[],"depth":2},{"title":"Quantitative Evaluation","local":"quantitative-evaluation","sections":[],"depth":2},{"title":"Using Predictive Uncertainty","local":"using-predictive-uncertainty","sections":[],"depth":2},{"title":"Conclusion","local":"conclusion","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="marigold-computer-vision" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#marigold-computer-vision"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 
1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Marigold Computer Vision</span></h1> <p data-svelte-h="svelte-1m1paos"><strong>Marigold</strong> is a diffusion-based <a href="https://huggingface.co/papers/2312.02145" rel="nofollow">method</a> and a collection of <a href="../api/pipelines/marigold">pipelines</a> designed for | |
dense computer vision tasks, including <strong>monocular depth prediction</strong>, <strong>surface normals estimation</strong>, and <strong>intrinsic
image decomposition</strong>.</p> <p data-svelte-h="svelte-1bjomdr">This guide will walk you through using Marigold to generate fast and high-quality predictions for images and videos.</p> <p data-svelte-h="svelte-d9hh6k">Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a
corresponding prediction.
| Currently, the following computer vision tasks are implemented:</p> <table data-svelte-h="svelte-1puk2lq"><thead><tr><th>Pipeline</th> <th>Recommended Model Checkpoints</th> <th align="center">Spaces (Interactive Apps)</th> <th>Predicted Modalities</th></tr></thead> <tbody><tr><td><a href="https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py" rel="nofollow">MarigoldDepthPipeline</a></td> <td><a href="https://huggingface.co/prs-eth/marigold-depth-v1-1" rel="nofollow">prs-eth/marigold-depth-v1-1</a></td> <td align="center"><a href="https://huggingface.co/spaces/prs-eth/marigold" rel="nofollow">Depth Estimation</a></td> <td><a href="https://en.wikipedia.org/wiki/Depth_map" rel="nofollow">Depth</a>, <a href="https://en.wikipedia.org/wiki/Binocular_disparity" rel="nofollow">Disparity</a></td></tr> <tr><td><a href="https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py" rel="nofollow">MarigoldNormalsPipeline</a></td> <td><a href="https://huggingface.co/prs-eth/marigold-normals-v1-1" rel="nofollow">prs-eth/marigold-normals-v1-1</a></td> <td align="center"><a href="https://huggingface.co/spaces/prs-eth/marigold-normals" rel="nofollow">Surface Normals Estimation</a></td> <td><a href="https://en.wikipedia.org/wiki/Normal_mapping" rel="nofollow">Surface normals</a></td></tr> <tr><td><a href="https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py" rel="nofollow">MarigoldIntrinsicsPipeline</a></td> <td><a href="https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1" rel="nofollow">prs-eth/marigold-iid-appearance-v1-1</a>,<br><a href="https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1" rel="nofollow">prs-eth/marigold-iid-lighting-v1-1</a></td> <td align="center"><a href="https://huggingface.co/spaces/prs-eth/marigold-iid" rel="nofollow">Intrinsic Image Decomposition</a></td> <td><a 
href="https://en.wikipedia.org/wiki/Albedo" rel="nofollow">Albedo</a>, <a href="https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map" rel="nofollow">Materials</a>, <a href="https://en.wikipedia.org/wiki/Diffuse_reflection" rel="nofollow">Lighting</a></td></tr></tbody></table> <p data-svelte-h="svelte-1w30tm5">All original checkpoints are available under the <a href="https://huggingface.co/prs-eth/" rel="nofollow">PRS-ETH</a> organization on Hugging Face. | |
They are designed for use with diffusers pipelines and the <a href="https://github.com/prs-eth/marigold" rel="nofollow">original codebase</a>, which can also be used to train
new model checkpoints.
| The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.</p> <table><thead data-svelte-h="svelte-ubewwm"><tr><th>Checkpoint</th> <th>Modality</th> <th>Comment</th></tr></thead> <tbody><tr data-svelte-h="svelte-ao7el9"><td><a href="https://huggingface.co/prs-eth/marigold-depth-v1-1" rel="nofollow">prs-eth/marigold-depth-v1-1</a></td> <td>Depth</td> <td>Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference.</td></tr> <tr data-svelte-h="svelte-9wucso"><td><a href="https://huggingface.co/prs-eth/marigold-normals-v0-1" rel="nofollow">prs-eth/marigold-normals-v0-1</a></td> <td>Normals</td> <td>The surface normals predictions are unit-length 3D vectors in the screen space camera, with values in the range from -1 to 1.</td></tr> <tr data-svelte-h="svelte-1mu3nd5"><td><a href="https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1" rel="nofollow">prs-eth/marigold-iid-appearance-v1-1</a></td> <td>Intrinsics</td> <td>InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity.</td></tr> <tr><td data-svelte-h="svelte-1u8ffq5"><a href="https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1" rel="nofollow">prs-eth/marigold-iid-lighting-v1-1</a></td> <td data-svelte-h="svelte-pxwxqa">Intrinsics</td> <td>HyperSim decomposition of an image<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>I</mi></mrow><annotation encoding="application/x-tex">I</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.07847em;">I</span></span></span></span><!-- HTML_TAG_END --> is comprised of Albedo<!-- HTML_TAG_START --><span 
class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi></mrow><annotation encoding="application/x-tex">A</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal">A</span></span></span></span><!-- HTML_TAG_END -->, Diffuse shading<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>S</mi></mrow><annotation encoding="application/x-tex">S</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span></span></span></span><!-- HTML_TAG_END -->, and Non-diffuse residual<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi></mrow><annotation encoding="application/x-tex">R</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.00773em;">R</span></span></span></span><!-- HTML_TAG_END -->:<!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>I</mi><mo>=</mo><mi>A</mi><mo>∗</mo><mi>S</mi><mo>+</mo><mi>R</mi></mrow><annotation encoding="application/x-tex">I = A*S+R</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.07847em;">I</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" 
style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal">A</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.7667em;vertical-align:-0.0833em;"></span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.00773em;">R</span></span></span></span><!-- HTML_TAG_END -->.</td></tr></tbody></table> <p data-svelte-h="svelte-r6ainx">The examples below are mostly given for depth prediction, but they can be universally applied to other supported | |
modalities.
We showcase the predictions using the same input image of Albert Einstein, generated by Midjourney.
| This makes it easier to compare visualizations of the predictions across various modalities and checkpoints.</p> <div class="flex gap-4" style="justify-content: center; width: 100%;" data-svelte-h="svelte-130z9iz"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://marigoldmonodepth.github.io/images/einstein.jpg"> <figcaption class="mt-1 text-center text-sm text-gray-500">Example input image for all Marigold pipelines</figcaption></div></div> <h2 class="relative group"><a id="depth-prediction" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#depth-prediction"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Depth Prediction</span></h2> <p data-svelte-h="svelte-1tk9nfy">To get a depth prediction, load the <code>prs-eth/marigold-depth-v1-1</code> checkpoint into <a href="/docs/diffusers/pr_11234/en/api/pipelines/marigold#diffusers.MarigoldDepthPipeline">MarigoldDepthPipeline</a>, | |
| put the image through the pipeline, and save the predictions:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> diffusers | |
<span class="hljs-keyword">import</span> torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    <span class="hljs-string">"prs-eth/marigold-depth-v1-1"</span>, variant=<span class="hljs-string">"fp16"</span>, torch_dtype=torch.float16
).to(<span class="hljs-string">"cuda"</span>)
image = diffusers.utils.load_image(<span class="hljs-string">"https://marigoldmonodepth.github.io/images/einstein.jpg"</span>)
depth = pipe(image)
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[<span class="hljs-number">0</span>].save(<span class="hljs-string">"einstein_depth.png"</span>)
depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[<span class="hljs-number">0</span>].save(<span class="hljs-string">"einstein_depth_16bit.png"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-p0rjqo">The <a href="/docs/diffusers/pr_11234/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth">visualize_depth()</a> function applies one of
<a href="https://matplotlib.org/stable/users/explain/colors/colormaps.html" rel="nofollow">matplotlib’s colormaps</a> (<code>Spectral</code> by default) to map the predicted pixel values from a single-channel <code>[0, 1]</code>
depth range into an RGB image.
With the <code>Spectral</code> colormap, near pixels are painted red and far pixels blue.
The 16-bit PNG file stores the single-channel values mapped linearly from the <code>[0, 1]</code> range into <code>[0, 65535]</code>.
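This linear mapping can be sanity-checked with a tiny round-trip on synthetic values (a sketch of the encoding only, not the exact <code>export_depth_to_16bit_png()</code> implementation):

```python
import numpy as np

# Hypothetical round-trip of the linear 16-bit encoding:
# [0, 1] float depth <-> [0, 65535] integer PNG values.
depth = np.array([[0.0, 0.25], [0.5, 1.0]], dtype=np.float32)
encoded = np.rint(depth * 65535.0).astype(np.uint16)  # encode for a 16-bit PNG
decoded = encoded.astype(np.float32) / 65535.0        # decode back to [0, 1]
```

Decoding simply divides the stored integers by 65535, so the quantization error is at most half a step (about 7.6e-6).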
Below are the raw and the visualized predictions. The darker, closer areas (such as the mustache) are easier to distinguish in
| the visualization.</p> <div class="flex gap-4" data-svelte-h="svelte-16yoeuw"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth_16bit.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted depth (16-bit PNG)</figcaption></div> <div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted depth visualization (Spectral)</figcaption></div></div> <h2 class="relative group"><a id="surface-normals-estimation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#surface-normals-estimation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Surface Normals Estimation</span></h2> <p data-svelte-h="svelte-1gomm07">Load the <code>prs-eth/marigold-normals-v1-1</code> checkpoint into <a 
href="/docs/diffusers/pr_11234/en/api/pipelines/marigold#diffusers.MarigoldNormalsPipeline">MarigoldNormalsPipeline</a>, put the image through the
| pipeline, and save the predictions:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> diffusers | |
<span class="hljs-keyword">import</span> torch
pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
    <span class="hljs-string">"prs-eth/marigold-normals-v1-1"</span>, variant=<span class="hljs-string">"fp16"</span>, torch_dtype=torch.float16
).to(<span class="hljs-string">"cuda"</span>)
image = diffusers.utils.load_image(<span class="hljs-string">"https://marigoldmonodepth.github.io/images/einstein.jpg"</span>)
normals = pipe(image)
vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[<span class="hljs-number">0</span>].save(<span class="hljs-string">"einstein_normals.png"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-nuh5li">The <a href="/docs/diffusers/pr_11234/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals">visualize_normals()</a> function maps the three-dimensional
prediction with pixel values in the range <code>[-1, 1]</code> into an RGB image.
The visualization function supports flipping the surface normals axes to make the visualization compatible with other
choices of the frame of reference.
Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where the <code>X</code> axis
points right, the <code>Y</code> axis points up, and the <code>Z</code> axis points at the viewer.
| Below is the visualized prediction:</p> <div class="flex gap-4" style="justify-content: center; width: 100%;" data-svelte-h="svelte-15wm70y"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted surface normals visualization</figcaption></div></div> <p data-svelte-h="svelte-112j81b">In this example, the nose tip almost certainly has a point on the surface, in which the surface normal vector points | |
straight at the viewer, meaning that its coordinates are <code>[0, 0, 1]</code>.
This vector maps to the RGB <code>[128, 128, 255]</code>, which corresponds to the violet-blue color.
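The linear mapping behind these colors can be sketched as follows (a hypothetical helper; the actual <code>visualize_normals()</code> may additionally flip axes depending on the chosen frame of reference):

```python
import numpy as np

def normal_to_rgb(n):
    """Map a surface normal with components in [-1, 1] linearly to RGB [0, 255]."""
    n = np.asarray(n, dtype=np.float64)
    return np.rint((n + 1.0) / 2.0 * 255.0).astype(np.uint8)

# A normal pointing straight at the viewer gives the violet-blue color.
rgb = normal_to_rgb([0.0, 0.0, 1.0])  # → [128, 128, 255]
```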
Similarly, a surface normal on the cheek in the right part of the image has a large <code>X</code> component, which increases the
red hue.
| Points on the shoulders pointing up with a large <code>Y</code> promote green color.</p> <h2 class="relative group"><a id="intrinsic-image-decomposition" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#intrinsic-image-decomposition"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Intrinsic Image Decomposition</span></h2> <p data-svelte-h="svelte-4kpwil">Marigold provides two models for Intrinsic Image Decomposition (IID): “Appearance” and “Lighting”. | |
| Each model produces Albedo maps, derived from InteriorVerse and Hypersim annotations, respectively.</p> <ul data-svelte-h="svelte-1ryhu3o"><li>The “Appearance” model also estimates Material properties: Roughness and Metallicity.</li> <li>The “Lighting” model generates Diffuse Shading and Non-diffuse Residual.</li></ul> <p data-svelte-h="svelte-5ngr7b">Here is the sample code saving predictions made by the “Appearance” model:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> diffusers | |
<span class="hljs-keyword">import</span> torch
pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
    <span class="hljs-string">"prs-eth/marigold-iid-appearance-v1-1"</span>, variant=<span class="hljs-string">"fp16"</span>, torch_dtype=torch.float16
).to(<span class="hljs-string">"cuda"</span>)
image = diffusers.utils.load_image(<span class="hljs-string">"https://marigoldmonodepth.github.io/images/einstein.jpg"</span>)
intrinsics = pipe(image)
vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
vis[<span class="hljs-number">0</span>][<span class="hljs-string">"albedo"</span>].save(<span class="hljs-string">"einstein_albedo.png"</span>)
vis[<span class="hljs-number">0</span>][<span class="hljs-string">"roughness"</span>].save(<span class="hljs-string">"einstein_roughness.png"</span>)
| vis[<span class="hljs-number">0</span>][<span class="hljs-string">"metallicity"</span>].save(<span class="hljs-string">"einstein_metallicity.png"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1gpdgbg">Another example demonstrating the predictions made by the “Lighting” model:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> diffusers | |
<span class="hljs-keyword">import</span> torch
pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
    <span class="hljs-string">"prs-eth/marigold-iid-lighting-v1-1"</span>, variant=<span class="hljs-string">"fp16"</span>, torch_dtype=torch.float16
).to(<span class="hljs-string">"cuda"</span>)
image = diffusers.utils.load_image(<span class="hljs-string">"https://marigoldmonodepth.github.io/images/einstein.jpg"</span>)
intrinsics = pipe(image)
vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
vis[<span class="hljs-number">0</span>][<span class="hljs-string">"albedo"</span>].save(<span class="hljs-string">"einstein_albedo.png"</span>)
vis[<span class="hljs-number">0</span>][<span class="hljs-string">"shading"</span>].save(<span class="hljs-string">"einstein_shading.png"</span>)
vis[<span class="hljs-number">0</span>][<span class="hljs-string">"residual"</span>].save(<span class="hljs-string">"einstein_residual.png"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1bbpj7y">Both models share the same pipeline while supporting different decomposition types.
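For the “Lighting” model, the components satisfy the decomposition I = A*S + R described above. The relation can be illustrated numerically with synthetic arrays (made-up values, not real pipeline outputs):

```python
import numpy as np

# Hypothetical numeric illustration of the HyperSim decomposition I = A*S + R.
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (2, 2, 3))   # albedo
S = rng.uniform(0.0, 1.0, (2, 2, 3))   # diffuse shading
R = rng.uniform(0.0, 0.1, (2, 2, 3))   # non-diffuse residual
I = A * S + R                          # composed image (linear space)

# Given the image and two components, the third can be recovered,
# e.g. shading where the albedo is non-zero:
S_rec = (I - R) / A
```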
The exact decomposition parameterization (e.g., sRGB vs. linear space) is stored in the
<code>pipe.target_properties</code> dictionary, which is passed into the
<a href="/docs/diffusers/pr_11234/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics">visualize_intrinsics()</a> function.</p> <p data-svelte-h="svelte-120nvgo">Below are some examples showcasing the predicted decomposition outputs.
All modalities can be inspected in the
| <a href="https://huggingface.co/spaces/prs-eth/marigold-iid" rel="nofollow">Intrinsic Image Decomposition</a> Space.</p> <div class="flex gap-4" data-svelte-h="svelte-nydk8v"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/8c7986eaaab5eb9604eb88336311f46a7b0ff5ab/marigold/marigold_einstein_albedo.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted albedo ("Appearance" model)</figcaption></div> <div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/8c7986eaaab5eb9604eb88336311f46a7b0ff5ab/marigold/marigold_einstein_diffuse.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Predicted diffuse shading ("Lighting" model)</figcaption></div></div> <h2 class="relative group"><a id="speeding-up-inference" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#speeding-up-inference"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Speeding up inference</span></h2> <p data-svelte-h="svelte-ldmjw6">The above quick start 
snippets are already optimized for quality and speed: they load the checkpoint, use the | |
| <code>fp16</code> variant of weights and computation, and perform the default number (4) of denoising diffusion steps. | |
| The first step to accelerate inference, at the expense of prediction quality, is to reduce the denoising diffusion | |
| steps to the minimum:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| <span class="hljs-deletion">- depth = pipe(image)</span> | |
| <span class="hljs-addition">+ depth = pipe(image, num_inference_steps=1)</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-17orz4u">With this change, the <code>pipe</code> call completes in 280ms on an RTX 3090 GPU. | |
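Timings like this depend on hardware and can be checked with a simple wall-clock timer. The helper below is a sketch (`time_call` is not part of diffusers); synchronizing CUDA before reading the clock matters because GPU kernels are launched asynchronously:

```python
import time

def time_call(fn, warmup=1, iters=5):
    """Average wall-clock seconds per call; synchronizes CUDA (if available)
    so that asynchronously launched GPU work is included in the measurement."""
    try:
        import torch
        sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
    except ImportError:
        sync = lambda: None
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters

# Hypothetical usage with the pipeline above:
# print(f"{time_call(lambda: pipe(image, num_inference_steps=1)) * 1e3:.0f} ms")
```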
| Internally, the input image is first encoded using the Stable Diffusion VAE encoder, followed by a single denoising | |
| step performed by the U-Net. | |
| Finally, the prediction latent is decoded with the VAE decoder into pixel space. | |
| In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM. | |
| Since Marigold’s latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x, | |
| reducing the call time to 85ms on an RTX 3090, by using a <a href="../api/models/autoencoder_tiny">lightweight replacement of the SD VAE</a>. | |
| Note that using a lightweight VAE may slightly reduce the visual quality of the predictions.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| <span class="hljs-addition">+ pipe.vae = diffusers.AutoencoderTiny.from_pretrained(</span> | |
| <span class="hljs-addition">+ "madebyollin/taesd", torch_dtype=torch.float16</span> | |
| <span class="hljs-addition">+ ).cuda()</span> | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image, num_inference_steps=1)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1qszioo">So far, we have optimized the number of diffusion steps and model components. Self-attention operations account for a | |
| significant portion of computations. | |
| Speeding them up can be achieved by using a more efficient attention processor:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> import diffusers | |
| import torch | |
| <span class="hljs-addition">+ from diffusers.models.attention_processor import AttnProcessor2_0</span> | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| <span class="hljs-addition">+ pipe.vae.set_attn_processor(AttnProcessor2_0()) </span> | |
| <span class="hljs-addition">+ pipe.unet.set_attn_processor(AttnProcessor2_0())</span> | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image, num_inference_steps=1)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1o5e3pm">Finally, as suggested in <a href="../optimization/torch2.0#torch.compile">Optimizations</a>, enabling <code>torch.compile</code> can further enhance performance depending on | |
| the target hardware. | |
| However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when | |
| the same pipeline instance is called repeatedly, such as within a loop.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> import diffusers | |
| import torch | |
| from diffusers.models.attention_processor import AttnProcessor2_0 | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| pipe.vae.set_attn_processor(AttnProcessor2_0()) | |
| pipe.unet.set_attn_processor(AttnProcessor2_0()) | |
| <span class="hljs-addition">+ pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True)</span> | |
| <span class="hljs-addition">+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)</span> | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image, num_inference_steps=1)<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="maximizing-precision-and-ensembling" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#maximizing-precision-and-ensembling"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Maximizing Precision and Ensembling</span></h2> <p data-svelte-h="svelte-3i2mxl">Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents. | |
| This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion. | |
| The ensembling path is activated automatically when the <code>ensemble_size</code> argument is set to a value greater than or equal to <code>3</code>. | |
| When aiming for maximum precision, it makes sense to adjust <code>num_inference_steps</code> simultaneously with <code>ensemble_size</code>. | |
| The recommended values vary across checkpoints but primarily depend on the scheduler type. | |
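The intuition behind ensembling can be shown with a toy simulation: averaging several independent noisy estimates of the same quantity shrinks the error roughly by the square root of the ensemble size. This sketch ignores the affine alignment that the actual depth ensembling performs before merging:

```python
import numpy as np

rng = np.random.default_rng(0)
truth = np.ones(10_000)          # ground-truth values for 10k toy "pixels"
ensemble_size = 5
# Each ensemble member is the truth corrupted by independent noise
preds = truth + 0.1 * rng.standard_normal((ensemble_size, truth.size))

single_err = np.abs(preds[0] - truth).mean()              # error of one prediction
ensemble_err = np.abs(preds.mean(axis=0) - truth).mean()  # error after averaging
print(f"single: {single_err:.4f}, ensemble of {ensemble_size}: {ensemble_err:.4f}")
```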
| The effect of ensembling is particularly well-seen with surface normals:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> import diffusers | |
| pipe = diffusers.MarigoldNormalsPipeline.from_pretrained("prs-eth/marigold-normals-v1-1").to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| <span class="hljs-deletion">- normals = pipe(image)</span> | |
| <span class="hljs-addition">+ normals = pipe(image, num_inference_steps=10, ensemble_size=5)</span> | |
| vis = pipe.image_processor.visualize_normals(normals.prediction) | |
| vis[0].save("einstein_normals.png")<!-- HTML_TAG_END --></pre></div> <div class="flex gap-4" data-svelte-h="svelte-etsikn"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Surface normals, no ensembling</figcaption></div> <div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Surface normals, with ensembling</figcaption></div></div> <p data-svelte-h="svelte-3lgzgw">As can be seen, areas with fine-grained structures, such as hair, receive more conservative and, on average, more | |
| accurate predictions. | |
| Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.</p> <h2 class="relative group"><a id="frame-by-frame-video-processing-with-temporal-consistency" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#frame-by-frame-video-processing-with-temporal-consistency"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Frame-by-frame Video Processing with Temporal Consistency</span></h2> <p data-svelte-h="svelte-1liveae">Due to Marigold’s generative nature, each prediction is unique and defined by the random noise sampled for the latent | |
| initialization. | |
| This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the | |
| following videos:</p> <div class="flex gap-4" data-svelte-h="svelte-1k6s96j"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama.gif"> <figcaption class="mt-1 text-center text-sm text-gray-500">Input video</figcaption></div> <div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"> <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption></div></div> <p data-svelte-h="svelte-o99xnx">To address this issue, it is possible to pass the <code>latents</code> argument to the pipelines, which defines the starting point of | |
| diffusion. | |
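A user-supplied `latents` tensor must match the spatial shape the pipeline works at internally. Assuming the default processing resolution of 768 used in the snippet below and the VAE's 8x spatial downsampling, that shape can be derived as follows (the helper name is hypothetical):

```python
def marigold_latent_hw(width, height, processing_res=768, vae_scale=8):
    """Spatial size of the latent for an input of `width` x `height` pixels.

    The longest image side is resized to `processing_res`, and the VAE then
    downsamples by `vae_scale`, giving a latent of shape
    (batch, 4, latent_h, latent_w).
    """
    longest = max(width, height)
    latent_h = processing_res * height // (vae_scale * longest)
    latent_w = processing_res * width // (vae_scale * longest)
    return latent_h, latent_w

print(marigold_latent_hw(1280, 720))  # a 720p video frame -> (54, 96)
```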
| Empirically, we found that a convex combination of the same starting noise latent and the latent | |
| corresponding to the previous frame’s prediction gives sufficiently smooth results, as implemented in the snippet below:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> imageio | |
| <span class="hljs-keyword">import</span> diffusers | |
| <span class="hljs-keyword">import</span> torch | |
| <span class="hljs-keyword">from</span> diffusers.models.attention_processor <span class="hljs-keyword">import</span> AttnProcessor2_0 | |
| <span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image | |
| <span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm | |
| device = <span class="hljs-string">"cuda"</span> | |
| path_in = <span class="hljs-string">"https://huggingface.co/spaces/prs-eth/marigold-lcm/resolve/c7adb5427947d2680944f898cd91d386bf0d4924/files/video/obama.mp4"</span> | |
| path_out = <span class="hljs-string">"obama_depth.gif"</span> | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| <span class="hljs-string">"prs-eth/marigold-depth-v1-1"</span>, variant=<span class="hljs-string">"fp16"</span>, torch_dtype=torch.float16 | |
| ).to(device) | |
| pipe.vae = diffusers.AutoencoderTiny.from_pretrained( | |
| <span class="hljs-string">"madebyollin/taesd"</span>, torch_dtype=torch.float16 | |
| ).to(device) | |
| pipe.unet.set_attn_processor(AttnProcessor2_0()) | |
| pipe.vae = torch.<span class="hljs-built_in">compile</span>(pipe.vae, mode=<span class="hljs-string">"reduce-overhead"</span>, fullgraph=<span class="hljs-literal">True</span>) | |
| pipe.unet = torch.<span class="hljs-built_in">compile</span>(pipe.unet, mode=<span class="hljs-string">"reduce-overhead"</span>, fullgraph=<span class="hljs-literal">True</span>) | |
| pipe.set_progress_bar_config(disable=<span class="hljs-literal">True</span>) | |
| <span class="hljs-keyword">with</span> imageio.get_reader(path_in) <span class="hljs-keyword">as</span> reader: | |
| size = reader.get_meta_data()[<span class="hljs-string">'size'</span>] | |
| last_frame_latent = <span class="hljs-literal">None</span> | |
| latent_common = torch.randn( | |
| (<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">768</span> * size[<span class="hljs-number">1</span>] // (<span class="hljs-number">8</span> * <span class="hljs-built_in">max</span>(size)), <span class="hljs-number">768</span> * size[<span class="hljs-number">0</span>] // (<span class="hljs-number">8</span> * <span class="hljs-built_in">max</span>(size))) | |
| ).to(device=device, dtype=torch.float16) | |
| out = [] | |
| <span class="hljs-keyword">for</span> frame_id, frame <span class="hljs-keyword">in</span> tqdm(<span class="hljs-built_in">enumerate</span>(reader), desc=<span class="hljs-string">"Processing Video"</span>): | |
| frame = Image.fromarray(frame) | |
| latents = latent_common | |
| <span class="hljs-keyword">if</span> last_frame_latent <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>: | |
| latents = <span class="hljs-number">0.9</span> * latents + <span class="hljs-number">0.1</span> * last_frame_latent | |
| depth = pipe( | |
| frame, | |
| num_inference_steps=<span class="hljs-number">1</span>, | |
| match_input_resolution=<span class="hljs-literal">False</span>, | |
| latents=latents, | |
| output_latent=<span class="hljs-literal">True</span>, | |
| ) | |
| last_frame_latent = depth.latent | |
| out.append(pipe.image_processor.visualize_depth(depth.prediction)[<span class="hljs-number">0</span>]) | |
| diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()[<span class="hljs-string">'fps'</span>])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1dxdmz7">Here, the diffusion process starts from the given computed latent. | |
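The convex combination used for the latent initialization acts as an exponential moving average in latent space: each frame starts from mostly fresh shared noise, with a small pull towards the previous frame's prediction latent. A standalone toy sketch of the update rule (the pipeline operates on torch tensors; NumPy is used here for brevity, and the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def blend_latents(common_noise, last_frame_latent, alpha=0.9):
    """Starting latent for the next frame: mostly the shared noise, plus a
    small contribution from the previous frame's prediction latent."""
    if last_frame_latent is None:
        return common_noise
    return alpha * common_noise + (1.0 - alpha) * last_frame_latent

common = rng.standard_normal((1, 4, 54, 96))  # shared noise for all frames
prev = rng.standard_normal((1, 4, 54, 96))    # previous frame's prediction latent
start = blend_latents(common, prev)
```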
| The call passes <code>output_latent=True</code> to access <code>depth.latent</code> and uses it when forming the next frame’s latent | |
| initialization. | |
| The result is much more stable now:</p> <div class="flex gap-4" data-svelte-h="svelte-u703tp"><div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"> <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption></div> <div style="flex: 1 1 50%; max-width: 50%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_consistent.gif"> <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth with forced latents initialization</figcaption></div></div> <h2 class="relative group"><a id="marigold-for-controlnet" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#marigold-for-controlnet"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Marigold for ControlNet</span></h2> <p data-svelte-h="svelte-12oh700">A very common application for depth prediction with diffusion models comes in conjunction with ControlNet. | |
| Depth crispness plays a crucial role in obtaining high-quality results from ControlNet. | |
| As seen in comparisons with other methods above, Marigold excels at that task. | |
| The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch | |
| <span class="hljs-keyword">import</span> diffusers | |
| device = <span class="hljs-string">"cuda"</span> | |
| generator = torch.Generator(device=device).manual_seed(<span class="hljs-number">2024</span>) | |
| image = diffusers.utils.load_image( | |
| <span class="hljs-string">"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"</span> | |
| ) | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| <span class="hljs-string">"prs-eth/marigold-depth-v1-1"</span>, torch_dtype=torch.float16, variant=<span class="hljs-string">"fp16"</span> | |
| ).to(device) | |
| depth_image = pipe(image, generator=generator).prediction | |
| depth_image = pipe.image_processor.visualize_depth(depth_image, color_map=<span class="hljs-string">"binary"</span>) | |
| depth_image[<span class="hljs-number">0</span>].save(<span class="hljs-string">"motorcycle_controlnet_depth.png"</span>) | |
| controlnet = diffusers.ControlNetModel.from_pretrained( | |
| <span class="hljs-string">"diffusers/controlnet-depth-sdxl-1.0"</span>, torch_dtype=torch.float16, variant=<span class="hljs-string">"fp16"</span> | |
| ).to(device) | |
| pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained( | |
| <span class="hljs-string">"SG161222/RealVisXL_V4.0"</span>, torch_dtype=torch.float16, variant=<span class="hljs-string">"fp16"</span>, controlnet=controlnet | |
| ).to(device) | |
| pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=<span class="hljs-literal">True</span>) | |
| controlnet_out = pipe( | |
| prompt=<span class="hljs-string">"high quality photo of a sports bike, city"</span>, | |
| negative_prompt=<span class="hljs-string">""</span>, | |
| guidance_scale=<span class="hljs-number">6.5</span>, | |
| num_inference_steps=<span class="hljs-number">25</span>, | |
| image=depth_image, | |
| controlnet_conditioning_scale=<span class="hljs-number">0.7</span>, | |
| control_guidance_end=<span class="hljs-number">0.7</span>, | |
| generator=generator, | |
| ).images | |
| controlnet_out[<span class="hljs-number">0</span>].save(<span class="hljs-string">"motorcycle_controlnet_out.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex gap-4" data-svelte-h="svelte-bddy4e"><div style="flex: 1 1 33%; max-width: 33%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Input image</figcaption></div> <div style="flex: 1 1 33%; max-width: 33%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_depth.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Depth in the format compatible with ControlNet</figcaption></div> <div style="flex: 1 1 33%; max-width: 33%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_out.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city"</figcaption></div></div> <h2 class="relative group"><a id="quantitative-evaluation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#quantitative-evaluation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 
0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Quantitative Evaluation</span></h2> <p data-svelte-h="svelte-1r9j6ij">To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), | |
| follow the evaluation protocol outlined in the paper: load the full-precision (fp32) model and use the appropriate values | |
| for <code>num_inference_steps</code> and <code>ensemble_size</code>. | |
| Optionally seed randomness to ensure reproducibility. | |
| Maximizing <code>batch_size</code> will deliver maximum device utilization.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> diffusers | |
| <span class="hljs-keyword">import</span> torch | |
| device = <span class="hljs-string">"cuda"</span> | |
| seed = <span class="hljs-number">2024</span> | |
| generator = torch.Generator(device=device).manual_seed(seed) | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained(<span class="hljs-string">"prs-eth/marigold-depth-v1-1"</span>).to(device) | |
| image = diffusers.utils.load_image(<span class="hljs-string">"https://marigoldmonodepth.github.io/images/einstein.jpg"</span>) | |
| depth = pipe( | |
| image, | |
| num_inference_steps=<span class="hljs-number">4</span>, <span class="hljs-comment"># set according to the evaluation protocol from the paper</span> | |
| ensemble_size=<span class="hljs-number">10</span>, <span class="hljs-comment"># set according to the evaluation protocol from the paper</span> | |
| generator=generator, | |
| ) | |
| <span class="hljs-comment"># evaluate metrics</span><!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="using-predictive-uncertainty" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#using-predictive-uncertainty"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Using Predictive Uncertainty</span></h2> <p data-svelte-h="svelte-yvw9s0">The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random | |
| latents. | |
As a side effect, it can be used to quantify epistemic (model) uncertainty; simply set <code>ensemble_size</code> to 3
or greater and set <code>output_uncertainty=True</code>.
| The resulting uncertainty will be available in the <code>uncertainty</code> field of the output. | |
| It can be visualized as follows:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> diffusers | |
| <span class="hljs-keyword">import</span> torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| <span class="hljs-string">"prs-eth/marigold-depth-v1-1"</span>, variant=<span class="hljs-string">"fp16"</span>, torch_dtype=torch.float16 | |
| ).to(<span class="hljs-string">"cuda"</span>) | |
| image = diffusers.utils.load_image(<span class="hljs-string">"https://marigoldmonodepth.github.io/images/einstein.jpg"</span>) | |
| depth = pipe( | |
| image, | |
| ensemble_size=<span class="hljs-number">10</span>, <span class="hljs-comment"># any number >= 3</span> | |
| output_uncertainty=<span class="hljs-literal">True</span>, | |
| ) | |
| uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty) | |
| uncertainty[<span class="hljs-number">0</span>].save(<span class="hljs-string">"einstein_depth_uncertainty.png"</span>)<!-- HTML_TAG_END --></pre></div> <div class="flex gap-4" data-svelte-h="svelte-a7wlst"><div style="flex: 1 1 33%; max-width: 33%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_depth_uncertainty.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Depth uncertainty</figcaption></div> <div style="flex: 1 1 33%; max-width: 33%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals_uncertainty.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Surface normals uncertainty</figcaption></div> <div style="flex: 1 1 33%; max-width: 33%;"><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/4f83035d84a24e5ec44fdda129b1d51eba12ce04/marigold/marigold_einstein_albedo_uncertainty.png"> <figcaption class="mt-1 text-center text-sm text-gray-500">Albedo uncertainty</figcaption></div></div> <p data-svelte-h="svelte-1kdua20">The interpretation of uncertainty is easy: higher values (white) correspond to pixels, where the model struggles to | |
| make consistent predictions.</p> <ul data-svelte-h="svelte-41v41f"><li>The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly.</li> <li>The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the | |
| collar area.</li> <li>Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel, | |
| unlike depth and surface normals. It is also higher in shaded regions and at discontinuities.</li></ul> <h2 class="relative group"><a id="conclusion" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#conclusion"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Conclusion</span></h2> <p data-svelte-h="svelte-rdb8rm">We hope Marigold proves valuable for your downstream tasks, whether as part of a broader generative workflow or for | |
perception-based applications like 3D reconstruction.</p>