| --- |
| license: apache-2.0 |
| pipeline_tag: image-to-image |
| tags: |
| - comfyui |
| - image-editing |
| - joyai |
| - multi-image |
| --- |
| |
| # JoyAI-Image-Edit-Plus (ComfyUI weights) |
|
|
| Single-file `.safetensors` checkpoints of [JoyAI-Image-Edit-Plus](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers), repackaged for **native ComfyUI** support (no custom node required). |
|
|
| JoyAI-Image-Edit-Plus is the multi-image instruction-guided editing model of the [JoyAI-Image](https://github.com/jd-opensource/JoyAI-Image) family. It accepts **1–6 reference images** and a text instruction, and generates a new image that combines elements from the references according to the instruction. |
|
|
| ## Files |
|
|
| | File | Size | Goes into | Component | |
| |------|------|-----------|-----------| |
| | `diffusion_models/joy_image_edit_plus_bf16.safetensors` | ~31 GB | `ComfyUI/models/diffusion_models/` | `JoyImageEditPlusTransformer3DModel` (bf16) | |
| | `text_encoders/qwen3vl_joyimage_bf16.safetensors` | ~17 GB | `ComfyUI/models/text_encoders/` | Qwen3-VL-8B text encoder (bf16) | |
| | `vae/joy_image_edit_vae.safetensors` | ~243 MB | `ComfyUI/models/vae/` | `AutoencoderKLWan` | |
|
|
| The repo layout already matches `ComfyUI/models/`, so a single `hf download` into your models root drops every file where it needs to go. |
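
After downloading, it can help to confirm that every file from the table landed in the right subfolder. The following is a minimal sketch using only the Python standard library; `missing_files` is a hypothetical helper name, and the relative paths are copied from the table above.

```python
from pathlib import Path

# Relative paths copied from the Files table; they mirror ComfyUI/models/.
EXPECTED_FILES = (
    "diffusion_models/joy_image_edit_plus_bf16.safetensors",
    "text_encoders/qwen3vl_joyimage_bf16.safetensors",
    "vae/joy_image_edit_vae.safetensors",
)


def missing_files(models_root):
    """Return the expected relative paths that are absent under models_root."""
    root = Path(models_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("/path/to/ComfyUI/models")` returns an empty list when all three checkpoints are in place.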
|
|
| ## Model architecture |
|
|
| - **Transformer**: 40-layer DiT, hidden size 4096, 32 heads, in/out channels 16, patch size `[1, 2, 2]`, 3D RoPE (`rope_dim_list = [16, 56, 56]`, theta 10000). Each reference image is patchified independently and concatenated on the sequence dimension with a per-image temporal offset in the 3D RoPE grid, so references may differ in resolution. |
| - **Text encoder**: `Qwen3VLForConditionalGeneration` (text dim 4096). The instruction is wrapped with one `<|vision_start|><|image_pad|><|vision_end|>` block per reference image. |
| - **VAE**: `AutoencoderKLWan` (z_dim 16, spatial downscale 8, temporal downscale 4), the same VAE used by the single-image edit model. |
| - **Scheduler**: FlowMatch (Euler), sampling shift 1.5. |
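
The numbers above fit together arithmetically: the three RoPE sub-dimensions sum to the per-head width, and the VAE downscale times the spatial patch size gives the pixel stride per token. The sketch below works this out. It assumes that `rope_dim_list` is ordered (time, height, width) like the patch size, that a still image yields a single latent frame, and that image sides are multiples of the stride. The function names are illustrative only.

```python
HIDDEN_SIZE = 4096
NUM_HEADS = 32
ROPE_DIM_LIST = (16, 56, 56)   # assumed order: (time, height, width)
PATCH_SIZE = (1, 2, 2)         # (time, height, width)
VAE_SPATIAL_DOWNSCALE = 8


def head_dim():
    """Width of one attention head."""
    return HIDDEN_SIZE // NUM_HEADS


def tokens_per_reference(height, width):
    """Transformer tokens for one still reference of height x width pixels."""
    stride_h = VAE_SPATIAL_DOWNSCALE * PATCH_SIZE[1]  # 16 pixels per token row
    stride_w = VAE_SPATIAL_DOWNSCALE * PATCH_SIZE[2]  # 16 pixels per token column
    if height % stride_h or width % stride_w:
        raise ValueError("sides must be multiples of the token stride")
    return (height // stride_h) * (width // stride_w)


def total_reference_tokens(sizes):
    """Sequence length contributed by references of possibly different sizes."""
    return sum(tokens_per_reference(h, w) for h, w in sizes)
```

For instance, a 1024x1024 reference contributes 64 x 64 = 4096 tokens. Because references are concatenated on the sequence dimension, adding a 768x1280 reference simply appends another 48 x 80 = 3840 tokens.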
| |
| Weight names are byte-identical to the diffusers checkpoint (894 transformer keys, zero renaming); ComfyUI auto-detects the model as `joyimage`. |
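
One way to check the key count without loading 31 GB of weights is to read only the safetensors header, which is an 8-byte little-endian length followed by a JSON object. The sketch below uses only the standard library; `safetensors_keys` is a hypothetical helper, and comparing its length to 894 assumes the diffusion-model file holds nothing but transformer tensors.

```python
import json
import struct


def safetensors_keys(path):
    """List tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(k for k in header if k != "__metadata__")
```

Running `len(safetensors_keys(".../joy_image_edit_plus_bf16.safetensors"))` and comparing the sorted names against the same call on the diffusers shards would confirm the zero-renaming claim.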
| |
| ## Installation |
| |
| The model needs no custom nodes, but native ComfyUI support is still under review upstream in [Comfy-Org/ComfyUI#14428](https://github.com/Comfy-Org/ComfyUI/pull/14428). Until that PR is merged, install the fork branch: |
| |
| ```bash |
| git clone -b joyimage-edit-pr https://github.com/feice-huang/ComfyUI.git |
| cd ComfyUI |
| pip install -r requirements.txt |
| ``` |
| |
| Once the PR is merged upstream, the stock ComfyUI release will run these weights with no fork needed. |
| |
| After installing ComfyUI, download the weights straight into `ComfyUI/models/`: |
| |
| ```bash |
| hf download jdopensource/JoyAI-Image-Edit-Plus-ComfyUI \ |
| --local-dir /path/to/ComfyUI/models |
| ``` |
| |
| Restart ComfyUI. |
| |
| ## Usage |
| |
| Example workflow: [workflow_joyimage_edit_plus.json](https://github.com/user-attachments/files/29588811/workflow_joyimage_edit_plus.json) |
| |
| Build the graph from these native nodes: |
| |
| 1. **Load Diffusion Model** (`UNETLoader`) → `diffusion_models/joy_image_edit_plus_bf16.safetensors` |
| 2. **Load CLIP** (`CLIPLoader`) → `text_encoders/qwen3vl_joyimage_bf16.safetensors`, type `joyimage` |
| 3. **Load VAE** (`VAELoader`) → `vae/joy_image_edit_vae.safetensors` |
| 4. **Load Image** (`LoadImage`) for each reference (1–6) |
| 5. **TextEncodeJoyImageEditPlus**: feed `clip`, `vae`, the instruction, and the reference images into `image1`…`image6`. Wire one instance for the positive prompt and one (empty prompt, same images) for the negative. Each node bucket-resizes the references to the 1024-base buckets, VAE-encodes them, and appends the reference latents to the conditioning; its `image` output feeds `VAEDecode` / empty-latent sizing. |
| 6. **KSampler** → **VAEDecode** → **SaveImage** |
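
When generating the graph from a script, the reference images must be assigned to the `image1`…`image6` slots in order. The helper below is a small sketch of that mapping. `reference_inputs` is a hypothetical name, and leaving unused slots out of the result assumes the node treats them as optional.

```python
MAX_REFERENCES = 6


def reference_inputs(references):
    """Map 1-6 references onto the image1..image6 input names, in order."""
    refs = list(references)
    if not 1 <= len(refs) <= MAX_REFERENCES:
        raise ValueError(f"expected 1-{MAX_REFERENCES} references, got {len(refs)}")
    # Slots are numbered from 1; unused slots are omitted (assumed optional).
    return {f"image{i}": ref for i, ref in enumerate(refs, start=1)}
```

The same mapping should be passed to both the positive and the negative encoder instance, since both receive the same images.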
|
|
| ## Recommended parameters |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Steps | 30 | |
| | CFG | 4.0 | |
| | Sampler | `euler` | |
| | Scheduler | `simple` | |
| | dtype | bf16 | |
| | Resolution | auto (1024-base buckets, per reference) | |
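
To get a feel for what 30 steps with a sampling shift of 1.5 means, the sketch below computes a shifted noise schedule using the time-shift formula common to flow-matching samplers, `sigma = shift * t / (1 + (shift - 1) * t)`. That ComfyUI's `joyimage` model and the `simple` scheduler reduce to exactly this uniform grid is an assumption, and `shifted_sigmas` is an illustrative name rather than a ComfyUI function.

```python
def shifted_sigmas(steps=30, shift=1.5):
    """Noise levels from 1.0 down to 0.0 with a flow-matching time shift."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    sigmas = []
    for i in range(steps + 1):
        t = 1.0 - i / steps  # uniform grid from 1.0 to 0.0 (assumed)
        sigmas.append(shift * t / (1.0 + (shift - 1.0) * t))
    return sigmas
```

With a shift above 1, the midpoint of the grid maps to a noise level above 0.5, so more of the step budget is spent at high noise levels.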
|
|
| ## Example |
|
|
| **Prompt:** "The woman is lovingly holding the cute puppy in her arms" |
|
|
| | Input 0 | Input 1 | Output | |
| |---------|---------|--------| |
| |  |  |  | |
|
|
| ## Model details |
|
|
| - **Developed by**: JD.com |
| - **License**: Apache-2.0 |
| - **Framework**: PyTorch / ComfyUI |
|
|
| ## Links |
|
|
| - Source code and documentation: [github.com/jd-opensource/JoyAI-Image](https://github.com/jd-opensource/JoyAI-Image) |
| - Original Diffusers-format weights: [jdopensource/JoyAI-Image-Edit-Plus-Diffusers](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers) |
| - Single-image edit model (ComfyUI): [jdopensource/JoyAI-Image-Edit-ComfyUI](https://huggingface.co/jdopensource/JoyAI-Image-Edit-ComfyUI) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{joyai-image-2025, |
| title={JoyAI-Image: A Unified Multimodal Foundation Model for Image Understanding, Generation, and Editing}, |
| author={Joy Future Academy, JD}, |
| year={2025}, |
| url={https://github.com/jd-opensource/JoyAI-Image} |
| } |
| ``` |
|
|