---
license: apache-2.0
language:
- en
library_name: diffusers
pipeline_tag: image-to-image
tags:
- Image-to-Image
- ControlNet
- Diffusers
- QwenImageControlNetPipeline
- Qwen-Image
base_model: Qwen/Qwen-Image
---

# Qwen-Image-ControlNet-Union
This repository provides a unified ControlNet that supports 4 common control types (canny, soft edge, depth, pose) for [Qwen-Image](https://github.com/QwenLM/Qwen-Image).

# Model Cards
- This ControlNet consists of 5 double blocks copied from the pretrained transformer layers.
- We train the model from scratch for 50K steps on a dataset of 10M high-quality general and human images.
- We train at 1328x1328 resolution in BFloat16 with batch size 64 and learning rate 4e-5, and set the text drop ratio to 0.10.
- This model supports multiple control modes, including canny, soft edge, depth, and pose. You can use it just like a normal ControlNet.

# Showcases
<table style="width:100%; table-layout:fixed;">
  <tr>
    <td><img src="./conds/canny1.png" alt="canny"></td>
    <td><img src="./outputs/canny1.png" alt="canny"></td>
  </tr>
  <tr>
    <td><img src="./conds/soft_edge.png" alt="soft_edge"></td>
    <td><img src="./outputs/soft_edge.png" alt="soft_edge"></td>
  </tr>
  <tr>
    <td><img src="./conds/depth.png" alt="depth"></td>
    <td><img src="./outputs/depth.png" alt="depth"></td>
  </tr>
  <tr>
    <td><img src="./conds/pose.png" alt="pose"></td>
    <td><img src="./outputs/pose.png" alt="pose"></td>
  </tr>
</table>

# Inference
```python
import torch
from diffusers.utils import load_image

# https://github.com/huggingface/diffusers/pull/12215
# pip install git+https://github.com/huggingface/diffusers
from diffusers import QwenImageControlNetPipeline, QwenImageControlNetModel

base_model = "Qwen/Qwen-Image"
controlnet_model = "InstantX/Qwen-Image-ControlNet-Union"

controlnet = QwenImageControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.bfloat16)

pipe = QwenImageControlNetPipeline.from_pretrained(
    base_model, controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# canny
# if the image contains text elements, it is highly recommended to spell them out explicitly in the prompt
control_image = load_image("conds/canny.png")
prompt = "Aesthetics art, traditional asian pagoda, elaborate golden accents, sky blue and white color palette, swirling cloud pattern, digital illustration, east asian architecture, ornamental rooftop, intricate detailing on building, cultural representation."
controlnet_conditioning_scale = 1.0

# soft edge
# control_image = load_image("conds/soft_edge.png")
# prompt = "Photograph of a young man with light brown hair jumping mid-air off a large, reddish-brown rock. He's wearing a navy blue sweater, light blue shirt, gray pants, and brown shoes. His arms are outstretched, and he has a slight smile on his face. The background features a cloudy sky and a distant, leafless tree line. The grass around the rock is patchy."
# controlnet_conditioning_scale = 1.0

# depth
# control_image = load_image("conds/depth.png")
# prompt = "A swanky, minimalist living room with a huge floor-to-ceiling window letting in loads of natural light. A beige couch with white cushions sits on a wooden floor, with a matching coffee table in front. The walls are a soft, warm beige, decorated with two framed botanical prints. A potted plant chills in the corner near the window. Sunlight pours through the leaves outside, casting cool shadows on the floor."
# controlnet_conditioning_scale = 1.0

# pose
# control_image = load_image("conds/pose.png")
# prompt = "Photograph of a young man with light brown hair and a beard, wearing a beige flat cap, black leather jacket, gray shirt, brown pants, and white sneakers. He's sitting on a concrete ledge in front of a large circular window, with a cityscape reflected in the glass. The wall is cream-colored, and the sky is clear blue. His shadow is cast on the wall."
# controlnet_conditioning_scale = 1.0

image = pipe(
    prompt=prompt,
    negative_prompt=" ",
    control_image=control_image,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    width=control_image.size[0],
    height=control_image.size[1],
    num_inference_steps=30,
    true_cfg_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("qwenimage_cn_union_result.png")
```

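The pipeline above takes `width` and `height` directly from the control image. Diffusion pipelines generally expect dimensions that are multiples of the VAE downsampling factor, so a control image with odd dimensions can trigger shape errors. Below is a minimal snapping helper — a sketch only; `snap_to_multiple` is a hypothetical name, and the multiple of 16 is an assumption you should check against your pipeline's `vae_scale_factor`:

```python
def snap_to_multiple(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Round (width, height) to the nearest multiple, never below one multiple."""
    def snap(x: int) -> int:
        return max(multiple, round(x / multiple) * multiple)
    return snap(width), snap(height)

# Example: an arbitrarily sized control image
print(snap_to_multiple(1023, 769))  # -> (1024, 768)
```

Resize the control image to the snapped size before passing it to the pipeline, so the condition and the output stay aligned.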
# Inference Setting
You can adjust the control strength via `controlnet_conditioning_scale`.
- Canny: use cv2.Canny, set `controlnet_conditioning_scale` in [0.8, 1.0]
- Soft Edge: use [AnylineDetector](https://github.com/huggingface/controlnet_aux), set `controlnet_conditioning_scale` in [0.8, 1.0]
- Depth: use [depth-anything](https://github.com/DepthAnything/Depth-Anything-V2), set `controlnet_conditioning_scale` in [0.8, 1.0]
- Pose: use [DWPose](https://github.com/IDEA-Research/DWPose/tree/onnx), set `controlnet_conditioning_scale` in [0.8, 1.0]

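For canny, the condition image is simply an edge map of a reference image; cv2.Canny is the recommended detector. When OpenCV is not available, Pillow's FIND_EDGES filter gives a rough stand-in for quick experiments — a sketch only, not a substitute for proper canny preprocessing; `make_edge_condition` and `reference.png` are illustrative names:

```python
from PIL import Image, ImageFilter, ImageOps

def make_edge_condition(img: Image.Image) -> Image.Image:
    """Rough canny-style condition: grayscale -> edge filter -> contrast stretch -> RGB."""
    gray = img.convert("L")
    edges = gray.filter(ImageFilter.FIND_EDGES)
    return ImageOps.autocontrast(edges).convert("RGB")

# Usage with a hypothetical reference image:
# cond = make_edge_condition(Image.open("reference.png"))
# cond.save("conds/canny.png")
```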
|
| | We strongly recommend using detailed prompts, especially when include text elements. For example, use "a poster with text 'InstantX Team' on the top" instead of "a poster". |
| |
|
| | For multiple conditions inference, please refer to [PR](https://github.com/huggingface/diffusers/pull/12215). |
| |
|
| | # ComfyUI Support |
| | [ComfyUI](https://www.comfy.org/) offers native support for Qwen-Image-ControlNet-Union. Check the [blog](https://blog.comfy.org/p/day-1-support-of-qwen-image-instantx) for more details. |
| |
|
# Community Support
[Liblib AI](https://www.liblib.art/) offers native support for Qwen-Image-ControlNet-Union. [Visit](https://www.liblib.art/sd) for online inference.

# Limitations
We find that the model may fail to preserve fine details, such as small-font text, unless the text is spelled out explicitly in the prompt.

# Acknowledgements
This model is developed by the InstantX Team. All rights reserved.