Add pipeline tag and improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -444
README.md CHANGED
@@ -1,14 +1,15 @@
1
  ---
 
 
 
2
  language:
3
- - en
4
  tags:
5
- - video-generation
6
- - video-editing
7
- - multi-modal
8
- - diffusion
9
- base_model:
10
- - Qwen/Qwen3-VL-8B-Instruct
11
- - Wan-AI/Wan2.2-TI2V-5B
12
  ---
13
 
14
  <p align="center">
@@ -25,6 +26,8 @@ base_model:
25
  <a href="https://msalab-pku.github.io/projects/LoomVideo/index.html" target="_blank"><img src="https://img.shields.io/badge/Project%20Page-333399.svg?logo=homepage" height="22px"></a>
26
  </p>
27
 
 
 
28
  # 🔥 News
29
 
30
  - [2026-06-05] We release LoomVideo [paper](https://arxiv.org/abs/2606.06042) on Arxiv!
@@ -33,20 +36,17 @@ base_model:
33
 
34
  # 📌 TL;DR
35
 
36
- **The Problem:** Existing unified video generation & editing models are massive (13B+) and rely on token concatenation for source conditioning — doubling sequence length and quadrupling attention cost.
 
 
 
37
 
38
- **The Method:** We present **LoomVideo**, a compact **5B-parameter** unified architecture built on MLLM + DiT that introduces three key designs:
39
- - **Deepstack Injection** — extracts features from every MLLM layer and injects them into corresponding DiT layers via cross-attention, enabling rich multi-granular semantic guidance.
40
- - **Scale-and-Add Conditioning** — a zero-overhead approach that scales the clean source latent by the current timestep and directly adds it to the noised target, completely bypassing token concatenation.
41
- - **Negative Temporal RoPE** — assigns negative temporal indices to reference images, seamlessly integrating multi-image conditions without architectural modification.
42
-
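The cost argument behind Scale-and-Add Conditioning can be sketched in a few lines. This is an illustration only, not the authors' implementation: it assumes full self-attention cost grows with the square of the token count, and it assumes the scale factor is the raw timestep `t` in [0, 1] (the paper may use a different schedule). The names `attention_pairs` and `scale_and_add` are hypothetical.

```python
from typing import List


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def scale_and_add(noised_target: List[float], source_latent: List[float], t: float) -> List[float]:
    """Add the source latent, scaled by timestep t, to the noised target.

    Assumption: the scale is t itself. The output has the same length as the
    target, so conditioning adds no extra tokens.
    """
    if len(noised_target) != len(source_latent):
        raise ValueError("source and target latents must have the same shape")
    return [x + t * s for x, s in zip(noised_target, source_latent)]


target_tokens = 1000  # hypothetical token count for one target video
concat_cost = attention_pairs(2 * target_tokens)  # source tokens concatenated to the sequence
add_cost = attention_pairs(target_tokens)         # source added element-wise instead
cost_ratio = concat_cost / add_cost               # doubling the sequence quadruples the pairs

conditioned = scale_and_add([0.5, -1.0, 2.0], [1.0, 1.0, 1.0], t=0.5)
```

The ratio is independent of the token count chosen, which is why concatenation costs roughly four times the attention compute regardless of resolution.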
43
- **The Result:** Our 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with at least **5.41×** inference speedup over models of similar capabilities — demonstrating that efficiency and quality can coexist.
44
 
45
  <p align="center">
46
  <img src="assets/architecture.png" width="90%">
47
  </p>
48
 
49
-
50
  # 🎯 Supported Tasks
51
 
52
  LoomVideo supports **four** unified video generation and editing tasks within a single model:
@@ -60,54 +60,24 @@ LoomVideo supports **four** unified video generation and editing tasks within a
60
 
61
  # 🔧 Preparation
62
 
63
- ## Step 1: Clone the Repository
64
 
65
  ```bash
66
  git clone https://github.com/MSALab-PKU/LoomVideo
67
  cd LoomVideo
68
  ```
69
 
70
- ## Step 2: Install Dependencies
71
-
72
- We recommend using [uv](https://github.com/astral-sh/uv) for a fast and fully reproducible environment setup.
73
 
74
  ```bash
75
  uv sync
76
  source .venv/bin/activate
77
-
78
- # (Optional) Include evaluation dependencies
79
- uv sync --extra eval
80
- ```
81
-
82
- Additionally, install [Flash Attention](https://github.com/Dao-AILab/flash-attention) for faster inference and reduced GPU memory consumption (for reference, our environment uses v2.7.4).
83
-
84
- ## Step 3: Download Model Weights
85
-
86
- Download the pretrained LoomVideo checkpoint from [Hugging Face](https://huggingface.co/MSALab/LoomVideo) and place it under `checkpoints/LoomVideo/`:
87
-
88
- ```
89
- checkpoints/LoomVideo/
90
- └── gen_model.pth
91
  ```
92
 
93
- We provide a helper script to download the weights automatically:
94
-
95
- ```bash
96
- python hf_download.py
97
- ```
98
-
99
- You can also specify a custom path via the `--ckpt_path` argument at inference time.
100
-
101
- > 💡 Stage 3 model weights are now available. Higher-performance post-trained weights will be released as soon as possible!
102
-
103
  # 🎬 Inference
104
- LoomVideo provides a unified inference script that supports **four generation tasks** through a single entry point. Each task is selected via the `--task` flag.
105
 
106
- ### 1. Text-to-Video / Text-to-Image (`t2v`)
107
-
108
- Generate a video from a text description. Default resolution is **480×832** at **81 frames**. When `--num_frames` is set to `1`, the pipeline automatically switches to **image generation** mode and saves the output as a `.jpg` file.
109
-
110
- **Required:** `--prompt`
111
 
112
  ```bash
113
  NUM_GPUS=1
@@ -117,409 +87,17 @@ accelerate launch --num_processes=${NUM_GPUS} \
117
  --config_path configs/inference/generation.yaml \
118
  --ckpt_path checkpoints/LoomVideo \
119
  --task t2v \
120
- --prompt "Your prompt here" \
121
  --height 480 \
122
  --width 832 \
123
  --num_frames 97 \
124
  --num_inference_steps 50 \
125
  --seed 0 \
126
- --output_path outputs/t2v.mp4
127
- ```
128
-
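For scripted runs, the t2v command above can be assembled programmatically. The sketch below is a hypothetical helper, not part of the repository: the flags and paths are copied from the command shown here, while `build_t2v_command` and the output file name are assumptions. Whether the script rewrites the file extension itself in image mode is not stated, so the sketch sets it explicitly.

```python
import shlex
from typing import List


def build_t2v_command(prompt: str, num_frames: int = 81, num_gpus: int = 1) -> List[str]:
    """Assemble the t2v launch command as an argument list."""
    # A single frame switches the pipeline to image mode, which saves a .jpg
    extension = "jpg" if num_frames == 1 else "mp4"
    return [
        "accelerate", "launch", f"--num_processes={num_gpus}",
        "scripts/inference/generate.py",
        "--config_path", "configs/inference/generation.yaml",
        "--ckpt_path", "checkpoints/LoomVideo",
        "--task", "t2v",
        "--prompt", prompt,
        "--height", "480",
        "--width", "832",
        "--num_frames", str(num_frames),
        "--num_inference_steps", "50",
        "--seed", "0",
        "--output_path", f"outputs/t2v.{extension}",  # hypothetical file name
    ]


image_command = build_t2v_command("A red fox in the snow", num_frames=1)
video_command = build_t2v_command("A red fox in the snow")
print(shlex.join(image_command))  # shell-quoted, ready to paste into a terminal
```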
129
- ### 2. Instruction Editing (`edit`)
130
-
131
- Edit an existing image or video based on a text instruction. The source can be either an image file (`.jpg`, `.png`, etc.) or a video file (`.mp4`). Resolution and frame count are automatically inferred from the source when not specified.
132
-
133
- **Required:** `--prompt` `--source_video_path`
134
-
135
- ```bash
136
- NUM_GPUS=1
137
-
138
- accelerate launch --num_processes=${NUM_GPUS} \
139
- scripts/inference/generate.py \
140
- --config_path configs/inference/generation.yaml \
141
- --ckpt_path checkpoints/LoomVideo \
142
- --task edit \
143
- --prompt "Your editing instruction here" \
144
- --source_video_path /path/to/source_video.mp4 \
145
- --num_inference_steps 50 \
146
- --seed 0 \
147
- --output_path outputs/edit.mp4
148
  ```
149
 
150
- ### 3. Instruction-Image Editing (`ref_edit`)
151
-
152
- Edit a source video with guidance from one or more reference images along with a text instruction.
153
-
154
- **Required:** `--prompt` `--source_video_path` `--ref_image_paths`
155
-
156
- ```bash
157
- NUM_GPUS=1
158
-
159
- accelerate launch --num_processes=${NUM_GPUS} \
160
- scripts/inference/generate.py \
161
- --config_path configs/inference/generation.yaml \
162
- --ckpt_path checkpoints/LoomVideo \
163
- --task ref_edit \
164
- --prompt "Your editing instruction" \
165
- --source_video_path /path/to/source_video.mp4 \
166
- --ref_image_paths /path/to/ref1.jpg /path/to/ref2.jpg \
167
- --num_inference_steps 50 \
168
- --seed 0 \
169
- --output_path outputs/ref_edit.mp4
170
- ```
171
-
172
- ### 4. Multi-Image-to-Video (`mi2v`)
173
-
174
- Generate a video conditioned on multiple reference images and a text prompt. We recommend using `@Image N` in the prompt to reference specific input images.
175
-
176
- **Required:** `--prompt` `--ref_image_paths`
177
-
178
- ```bash
179
- NUM_GPUS=1
180
-
181
- accelerate launch --num_processes=${NUM_GPUS} \
182
- scripts/inference/generate.py \
183
- --config_path configs/inference/generation.yaml \
184
- --ckpt_path checkpoints/LoomVideo \
185
- --task mi2v \
186
- --prompt "Your prompt here" \
187
- --ref_image_paths /path/to/img1.jpg /path/to/img2.jpg /path/to/img3.jpg \
188
- --num_frames 97 \
189
- --num_inference_steps 50 \
190
- --seed 0 \
191
- --output_path outputs/mi2v.mp4
192
- ```
193
-
194
-
195
- ## Additional Arguments
196
-
197
- The following arguments can be appended to any task command for further customization:
198
-
199
- ### Generation Control
200
-
201
- <table>
202
- <thead>
203
- <tr><th>Argument</th><th>Type</th><th>Default</th><th>Description</th></tr>
204
- </thead>
205
- <tbody>
206
- <tr><td nowrap><code>--num_inference_steps</code></td><td>int</td><td><code>50</code></td><td>Number of denoising steps.</td></tr>
207
- <tr><td nowrap><code>--guidance_scale</code></td><td>float</td><td><code>5.0</code> / <code>2.5</code></td><td>Text CFG scale. <code>5.0</code> for t2v/mi2v, <code>2.5</code> for edit/ref_edit.</td></tr>
208
- <tr><td nowrap><code>--guidance_scale_visual</code></td><td>float</td><td><code>1.5</code></td><td>Visual CFG scale for source/reference conditioning.</td></tr>
209
- <tr><td nowrap><code>--negative_prompt</code></td><td>str</td><td><em>(from config)</em></td><td>Negative prompt for quality improvement.</td></tr>
210
- <tr><td nowrap><code>--seed</code></td><td>int</td><td><code>0</code></td><td>Random seed. Set to <code>-1</code> for random generation.</td></tr>
211
- </tbody>
212
- </table>
213
-
214
- ### Resolution & Frames
215
-
216
- <table>
217
- <thead>
218
- <tr><th>Argument</th><th>Type</th><th>Default</th><th>Description</th></tr>
219
- </thead>
220
- <tbody>
221
- <tr><td nowrap><code>--height</code></td><td>int</td><td><em>auto</em></td><td>Output height. <code>480</code> for t2v; inferred from source for edit.</td></tr>
222
- <tr><td nowrap><code>--width</code></td><td>int</td><td><em>auto</em></td><td>Output width. <code>832</code> for t2v; inferred from source for edit.</td></tr>
223
- <tr><td nowrap><code>--num_frames</code></td><td>int</td><td><em>auto</em></td><td>Output frames. <code>81</code> for t2v/mi2v; inferred for edit.</td></tr>
224
- <tr><td nowrap><code>--fps</code></td><td>int</td><td><code>24</code></td><td>Output video FPS.</td></tr>
225
- </tbody>
226
- </table>
227
-
228
-
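The per-task required flags and guidance defaults described above can be collected into a small lookup, which is handy for checking a command before launching it. This is a sketch under assumptions: the helper names `missing_args` and `default_guidance_scale` are hypothetical and do not exist in the repository; only the flag names and values come from this README.

```python
from typing import Dict, Iterable, List

# Required flags per task, as listed under each task heading
REQUIRED_ARGS: Dict[str, List[str]] = {
    "t2v": ["prompt"],
    "edit": ["prompt", "source_video_path"],
    "ref_edit": ["prompt", "source_video_path", "ref_image_paths"],
    "mi2v": ["prompt", "ref_image_paths"],
}

# Default text CFG scale per task, from the Generation Control table
DEFAULT_GUIDANCE_SCALE: Dict[str, float] = {
    "t2v": 5.0,
    "mi2v": 5.0,
    "edit": 2.5,
    "ref_edit": 2.5,
}


def missing_args(task: str, provided: Iterable[str]) -> List[str]:
    """Return the required flags for `task` that are absent from `provided`."""
    if task not in REQUIRED_ARGS:
        raise ValueError(f"unknown task: {task}")
    given = set(provided)
    return [name for name in REQUIRED_ARGS[task] if name not in given]


def default_guidance_scale(task: str) -> float:
    """Look up the default text CFG scale for `task`."""
    return DEFAULT_GUIDANCE_SCALE[task]


gaps = missing_args("ref_edit", ["prompt", "source_video_path"])
```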
229
- # 📦 Data Preparation
230
-
231
- Since our training relies heavily on proprietary datasets, we are unable to release the original data directly. However, we provide a **flexible data organization framework** that makes it easy to plug in your own data or publicly available datasets.
232
-
233
- ## Open-Source Datasets
234
-
235
- Below are the open-source datasets used in our training. You can download them or substitute with your own data:
236
-
237
- | Category | Dataset |
238
- |---|---|
239
- | Video Generation | [Koala-36M](https://huggingface.co/datasets/Koala-36M/Koala-36M-v1), [OpenVid-1M](https://huggingface.co/datasets/nkp37/OpenVid-1M) |
240
- | Image Editing | [CrispEdit-2M](https://huggingface.co/datasets/WeiChow/CrispEdit-2M), [OmniGen-2-Edit](https://huggingface.co/OmniGen2), [GPT-Image-Edit-1.5M](https://huggingface.co/datasets/UCSC-VLAA/GPT-Image-Edit-1.5M), [NHR-Edit](https://huggingface.co/datasets/iitolstykh/NHR-Edit), [Pico-Banana](https://github.com/apple/pico-banana-400k), [ShareGPT-4o-Image](https://huggingface.co/datasets/FreedomIntelligence/ShareGPT-4o-Image) |
241
- | Video Editing | [KIWI-Edit](https://huggingface.co/datasets/linyq/kiwi_edit_training_data) |
242
- | Video Ref Editing / MI2V | [RefVIE](https://huggingface.co/datasets/linyq/kiwi_edit_training_data), [Phantom-Data](https://huggingface.co/datasets/ZhuoweiChen/Phantom-data-Koala36M) |
243
-
244
- ## Organize Data as Single JSON Files
245
-
246
- Each data sample should be stored as an **individual JSON file**, placed in a single directory (e.g., `single_jsons/`), and named sequentially starting from `0.json`:
247
-
248
- ```
249
- your_dataset/
250
- └── single_jsons/
251
- ├── 0.json
252
- ├── 1.json
253
- ├── 2.json
254
- ├── ...
255
- ```
256
-
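One way to produce this layout from an in-memory list of samples is sketched below. The function `write_single_jsons` is a hypothetical helper, not something shipped with the repository, and the demo sample uses the Text-to-Video keys (`text`, `path`) from this README with made-up values.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def write_single_jsons(samples: Iterable[Dict], dataset_dir: str) -> int:
    """Write each sample to <dataset_dir>/single_jsons/<index>.json, starting at 0."""
    out_dir = Path(dataset_dir) / "single_jsons"
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for index, sample in enumerate(samples):
        with open(out_dir / f"{index}.json", "w", encoding="utf-8") as handle:
            json.dump(sample, handle, ensure_ascii=False, indent=2)
        count = index + 1
    return count


# Demo in a temporary directory so nothing is left behind
demo_samples = [
    {"text": "A dog running on a beach.", "path": "videos/dog.mp4"},
    {"text": "Rain falling on a window.", "path": "videos/rain.mp4"},
]
with tempfile.TemporaryDirectory() as tmp:
    num_written = write_single_jsons(demo_samples, tmp)
    written_names = sorted(p.name for p in (Path(tmp) / "single_jsons").iterdir())
    with open(Path(tmp) / "single_jsons" / "0.json", encoding="utf-8") as handle:
        first_sample = json.load(handle)
```

The returned count is the value to put in `num_samples` if the dataset config expects the total number of files in the directory.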
257
- ## JSON Format for Each Task
258
-
259
- Each task type expects a specific set of keys in its JSON file. Below are the templates — fill in according to your data:
260
-
261
- **Text-to-Video** (`process_t2v_data`):
262
- ```json
263
- {
264
- "text": "A caption describing the video content.",
265
- "path": "relative/path/to/video.mp4"
266
- }
267
- ```
268
-
269
- **Text-to-Image** (`process_t2i_data`):
270
- ```json
271
- {
272
- "caption": "A caption describing the image content.",
273
- "image_path": "relative/path/to/image.jpg"
274
- }
275
- ```
276
-
277
- **Video Editing** (`process_video_edit_data`):
278
- ```json
279
- {
280
- "source_video_path": "relative/path/to/source_video.mp4",
281
- "instruction": "The editing instruction.",
282
- "target_video_path": "relative/path/to/target_video.mp4"
283
- }
284
- ```
285
-
286
- **Image Editing** (`process_image_edit_data`):
287
- ```json
288
- {
289
- "source_image_path": "relative/path/to/source_image.jpg",
290
- "instruction": "The editing instruction.",
291
- "target_image_path": "relative/path/to/target_image.jpg"
292
- }
293
- ```
294
-
295
- **Multi-Image-to-Video** (`process_t2v_data_withref`):
296
- ```json
297
- {
298
- "instruction": "A prompt describing the video to generate with reference images.",
299
- "reference_image_paths": [
300
- "relative/path/to/ref1.jpg",
301
- "relative/path/to/ref2.jpg"
302
- ],
303
- "target_video_path": "relative/path/to/target_video.mp4"
304
- }
305
- ```
306
-
307
- **Reference-Guided Video Editing** (`process_video_edit_data_withref`):
308
- ```json
309
- {
310
- "source_video_path": "relative/path/to/source_video.mp4",
311
- "reference_image_paths": [
312
- "relative/path/to/ref1.jpg"
313
- ],
314
- "instruction": "The editing instruction with reference guidance.",
315
- "target_video_path": "relative/path/to/target_video.mp4"
316
- }
317
- ```
318
-
319
- > 💡 All paths in JSON files are **relative** to the `data_root` specified in the dataset config.
320
-
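Before training, it can help to check that every JSON sample carries the keys its process function expects. The sketch below encodes the templates above as a lookup table; `missing_keys` is a hypothetical helper and is not part of `src/dataset/processors.py`.

```python
from typing import Dict, List, Set

# Expected keys per process function, taken from the templates above
REQUIRED_KEYS: Dict[str, Set[str]] = {
    "process_t2v_data": {"text", "path"},
    "process_t2i_data": {"caption", "image_path"},
    "process_video_edit_data": {"source_video_path", "instruction", "target_video_path"},
    "process_image_edit_data": {"source_image_path", "instruction", "target_image_path"},
    "process_t2v_data_withref": {"instruction", "reference_image_paths", "target_video_path"},
    "process_video_edit_data_withref": {
        "source_video_path", "reference_image_paths", "instruction", "target_video_path",
    },
}


def missing_keys(process_func_name: str, sample: Dict) -> List[str]:
    """Return the sorted list of expected keys that `sample` lacks."""
    if process_func_name not in REQUIRED_KEYS:
        raise ValueError(f"no template known for {process_func_name}")
    return sorted(REQUIRED_KEYS[process_func_name] - set(sample))


incomplete = {"source_video_path": "clips/a.mp4", "instruction": "Make it snow."}
problems = missing_keys("process_video_edit_data", incomplete)
```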
321
- ## Custom Process Functions (Optional)
322
-
323
- You may also organize your JSON files in any format you prefer, as long as you implement a corresponding `process_*` function. We provide several reference implementations in `src/dataset/processors.py`. Each process function takes `(dataset_info, data_info)` and returns a list of segments describing the data flow. See the existing functions for examples.
324
-
325
- ## Dataset Config
326
-
327
- Create a YAML config file to register your datasets. See `configs/dataset/train_demo.yaml` as a reference. The config is organized into `train`, `val`, and `eval` sections, each containing dataset entries with the following arguments:
328
-
329
- | Argument | Description |
330
- |---|---|
331
- | `task_weight` | Controls the sampling probability of this task group relative to others during training. |
332
- | `process_func_name` | Name of the processing function in `src/dataset/processors.py` that parses each JSON sample. |
333
- | `data_root` | Base directory for resolving relative paths in JSON files. |
334
- | `data_json_dir` | Directory containing the JSON files (`0.json`, `1.json`, ...). |
335
- | `num_samples` | Total number of samples in the directory. |
336
- | `sample_weight` | Sampling weight of this dataset within its task group. |
337
-
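The two weights suggest a two-level draw: first a task group by `task_weight`, then a dataset inside that group by `sample_weight`. The sketch below illustrates that reading only. The nested dictionary layout and the function `sample_dataset` are assumptions made for the example, and the actual sampler in the training code may differ.

```python
import random
from typing import Dict, Tuple


def sample_dataset(task_groups: Dict[str, Dict], rng: random.Random) -> Tuple[str, str]:
    """Pick a task group by task_weight, then a dataset in it by sample_weight."""
    group_names = list(task_groups)
    group_weights = [task_groups[name]["task_weight"] for name in group_names]
    group = rng.choices(group_names, weights=group_weights, k=1)[0]

    datasets = task_groups[group]["datasets"]  # hypothetical: name -> sample_weight
    dataset_names = list(datasets)
    dataset_weights = [datasets[name] for name in dataset_names]
    dataset = rng.choices(dataset_names, weights=dataset_weights, k=1)[0]
    return group, dataset


# Hypothetical config: zero weights make the outcome deterministic for the demo
demo_groups = {
    "video_generation": {"task_weight": 1.0, "datasets": {"openvid": 1.0, "koala": 0.0}},
    "video_editing": {"task_weight": 0.0, "datasets": {"kiwi_edit": 1.0}},
}
rng = random.Random(0)
draws = {sample_dataset(demo_groups, rng) for _ in range(100)}
```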
338
-
339
- # 🏋️ Training
340
-
341
- ## Training Config
342
-
343
- The training behavior is fully controlled by a YAML config file (e.g., `configs/train/stage3.yaml`).
344
-
345
- **Key arguments:**
346
-
347
- | Argument | Description |
348
- |---|---|
349
- | `log_dir` | Directory for saving logs, checkpoints, and generated samples. |
350
- | `dataset_config_path` | Path to the dataset config YAML file. |
351
- | `train_steps` | Total number of training iterations. |
352
- | `checkpointing_interval` | Save a checkpoint every N steps. |
353
- | `validation_interval` | Run validation every N steps. |
354
- | `evaluation_interval` | Run evaluation benchmarks every N steps. |
355
-
356
- **Model settings:**
357
-
358
- | Argument | Description |
359
- |---|---|
360
- | `model.trainable_modules.gen_model` | Which modules to train. `"all"` trains the full generation model. |
361
- | `model.gradient_checkpointing` | Enable gradient checkpointing to reduce GPU memory usage. |
362
- | `model.und.pretrained_model_path` | Path to the pretrained understanding backbone. |
363
- | `model.gen.pretrained_model_path` | Path to the pretrained generation backbone. |
364
- | `model.pretrained_ckpt_path` | *(Optional)* Load weights from a previous training stage for continued training. |
365
-
366
- **Data settings:**
367
-
368
- | Argument | Description |
369
- |---|---|
370
- | `data.train.resolution_buckets` | List of resolution buckets for dynamic batching. |
371
- | `data.train.num_frames` | Number of frames per training sample. |
372
- | `data.train.fps` | Video FPS for frame sampling. |
373
- | `data.train.all_dropout_rate` | Probability of dropping all conditions (for unconditional training). |
374
- | `data.train.text_dropout_rate` | Probability of dropping text condition (for classifier-free guidance). |
375
-
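The two dropout rates are the usual ingredients of classifier-free guidance training. A minimal sketch of how they might be applied to one sample follows. It is an assumption-laden illustration, not the repository's code: the sample keys (`text`, `source`, `references`), the empty-string placeholder for dropped text, and the order in which the two rates are checked are all hypothetical.

```python
import random
from typing import Any, Dict


def drop_conditions(sample: Dict[str, Any], all_dropout_rate: float,
                    text_dropout_rate: float, rng: random.Random) -> Dict[str, Any]:
    """Return a copy of `sample` with conditions randomly removed."""
    out = dict(sample)
    if rng.random() < all_dropout_rate:
        # Unconditional training: remove text and every visual condition
        out["text"] = ""
        out["source"] = None
        out["references"] = []
    elif rng.random() < text_dropout_rate:
        # Text-free training: keep visual conditions, drop only the text
        out["text"] = ""
    return out


demo = {"text": "Make it snow.", "source": "clips/a.mp4", "references": ["ref1.jpg"]}
rng = random.Random(0)
kept = drop_conditions(demo, all_dropout_rate=0.0, text_dropout_rate=0.0, rng=rng)
text_free = drop_conditions(demo, all_dropout_rate=0.0, text_dropout_rate=1.0, rng=rng)
unconditional = drop_conditions(demo, all_dropout_rate=1.0, text_dropout_rate=0.0, rng=rng)
```

Training on all three variants is what lets inference use separate text and visual guidance scales.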
376
- ## Launch Training
377
-
378
- Once the data and configs are ready, you can simply start training with:
379
-
380
- ```bash
381
- NUM_GPUS=8
382
-
383
- accelerate launch --num_processes=${NUM_GPUS} \
384
- -m scripts.train.train \
385
- --config_path path/to/your/config.yaml
386
- ```
387
-
388
- > 💡 All training outputs — including checkpoints, EMA weights, logs, and generated samples — are saved under the `log_dir` directory specified in the config.
389
-
390
-
391
- # 📊 Evaluation
392
-
393
- ## Environment Setup
394
-
395
- ### Step 1: Prepare Benchmark Data
396
-
397
- We evaluate on the following benchmarks. Download each dataset and organize it into the same **single JSON** format used for training data (see [Data Preparation](#-data-preparation)):
398
-
399
- | Benchmark | Category | Samples |
400
- |---|---|---|
401
- | [GenEval](https://github.com/djghosh13/geneval) | Image Generation | 553 |
402
- | [ImgEdit-Bench](https://github.com/pku-yuangroup/imgedit) | Image Editing | 737 |
403
- | [VBench](https://github.com/Vchitect/VBench) | Video Generation | 165 |
404
- | [OpenVE-Bench](https://huggingface.co/datasets/Lewandofski/OpenVE-Bench) | Video Editing | 431 |
405
- | [RefVIE-Bench](https://huggingface.co/datasets/linyq/RefVIE-Bench) | Reference Video Editing | 120 |
406
- | [Intelligent-VBench-MI2V](https://github.com/Tencent-Hunyuan/OmniWeaving) | Multi-Image-to-Video | 320 |
407
- | [Intelligent-VBench-TIV2V](https://github.com/Tencent-Hunyuan/OmniWeaving) | Text-Image-Video-to-Video | 210 |
408
-
409
- > 💡 For **Intelligent-VBench**, we split the original benchmark into two subsets based on task type — **MI2V** and **TIV2V**. Their JSON files should be placed in separate directories.
410
-
411
- After downloading, update the `data_root` and `data_json_dir` paths in `configs/dataset/benchmarks.yaml` to point to your local directories.
412
-
413
- ### Step 2: Install Evaluation Dependencies
414
-
415
- **VBench:**
416
-
417
- ```bash
418
- mkdir -p libs && cd libs
419
- git clone https://github.com/Vchitect/VBench.git
420
- ```
421
-
422
- Add the following to `libs/VBench/vbench/__init__.py`:
423
-
424
- ```python
425
- import sys, os
426
- local_lib_path = os.path.abspath("libs/VBench")
427
- if local_lib_path not in sys.path:
428
- sys.path.append(local_lib_path)
429
- ```
430
-
431
- If you encounter a NumPy 2.0 compatibility error (`np.sctypes was removed`), modify lines 45–47 of `[YOUR_PYTHON_LIBS]/imgaug/imgaug.py`:
432
-
433
- ```python
434
- # Replace:
435
- # NP_FLOAT_TYPES = set(np.sctypes["float"])
436
- # NP_INT_TYPES = set(np.sctypes["int"])
437
- # NP_UINT_TYPES = set(np.sctypes["uint"])
438
-
439
- # With:
440
- NP_FLOAT_TYPES = {np.float16, np.float32, np.float64, np.longdouble}
441
- NP_INT_TYPES = {np.int8, np.int16, np.int32, np.int64, np.longlong}
442
- NP_UINT_TYPES = {np.uint8, np.uint16, np.uint32, np.uint64, np.ulonglong}
443
- ```
444
-
445
- To save disk space, remove unnecessary files:
446
-
447
- ```bash
448
- rm -rf libs/VBench/VBench-2.0 libs/VBench/.git libs/VBench/asset libs/VBench/vbench2_beta_trustworthiness
449
- ```
450
-
451
- **GenEval:**
452
-
453
- ```bash
454
- cd libs
455
- git clone https://github.com/djghosh13/geneval.git
456
- cd geneval
457
- ./evaluation/download_models.sh "../../checkpoints/"
458
-
459
- cd ..
460
- pip install mmcv-full
461
- git clone https://github.com/open-mmlab/mmdetection.git
462
- cd mmdetection && git checkout 2.x
463
- pip install -v -e . --no-build-isolation
464
- ```
465
-
466
- The GenEval model paths are configured in `configs/evaluation/evaluation.yaml` under `model.evaluation.geneval`:
467
-
468
- ```yaml
469
- model:
470
- evaluation:
471
- geneval:
472
- model_path: checkpoints/evaluation/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.pth
473
- model_config_path: libs/mmdetection/configs/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
474
- clip_path: checkpoints/evaluation/ViT-L-14.pt
475
- ```
476
-
477
- ### Step 3: Configure API Keys
478
-
479
- Some benchmarks (OpenVE-Bench, RefVIE-Bench, ImgEdit-Bench, Intelligent-VBench) require LLM API calls for metric computation. Configure your API keys in `configs/evaluation/evaluation.yaml` under `model.evaluation`:
480
-
481
- ```yaml
482
- model:
483
- evaluation:
484
- # For OpenVE-Bench, RefVIE-Bench, Intelligent-VBench
485
- gemini:
486
- api_key: "YOUR_GEMINI_API_KEY"
487
- base_url: "YOUR_GEMINI_BASE_URL"
488
- model: "gemini-2.5-pro-06-17"
489
- # For ImgEdit-Bench
490
- openai:
491
- api_key: "YOUR_OPENAI_API_KEY"
492
- base_url: "YOUR_OPENAI_BASE_URL"
493
- model: "gpt-4.1"
494
- ```
495
-
496
-
497
- ## Run Evaluation
498
-
499
- Once the environment is set up, you can simply run evaluation with:
500
-
501
- ```bash
502
- NUM_GPUS=8
503
-
504
- accelerate launch --num_processes=${NUM_GPUS} \
505
- -m scripts.evaluation.evaluate \
506
- --config configs/evaluation/evaluation.yaml \
507
- --checkpoint_dir checkpoints/LoomVideo \
508
- --generation_configs configs/dataset/benchmarks.yaml \
509
- --output_dir results/evaluation \
510
- --calculate_metrics
511
- ```
512
-
513
-
514
- # 📧 Contact
515
-
516
- Jianzong Wu (吴健宗): jzwu@stu.pku.edu.cn
517
-
518
-
519
  # 📄 Citation
520
 
521
- If you find our work helpful, please consider giving us a ⭐ on this repo and citing our paper as follows:
522
-
523
  ```bibtex
524
  @article{wu2026loomvideo,
525
  title={LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing},
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen3-VL-8B-Instruct
4
+ - Wan-AI/Wan2.2-TI2V-5B
5
  language:
6
+ - en
7
  tags:
8
+ - video-generation
9
+ - video-editing
10
+ - multi-modal
11
+ - diffusion
12
+ pipeline_tag: text-to-video
 
 
13
  ---
14
 
15
  <p align="center">
 
26
  <a href="https://msalab-pku.github.io/projects/LoomVideo/index.html" target="_blank"><img src="https://img.shields.io/badge/Project%20Page-333399.svg?logo=homepage" height="22px"></a>
27
  </p>
28
 
29
+ This repository contains the weights for **LoomVideo**, a compact 5B-parameter unified architecture for both video generation and editing. For more details, see the paper: [LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing](https://arxiv.org/abs/2606.06042).
30
+
31
  # 🔥 News
32
 
33
  - [2026-06-05] We release LoomVideo [paper](https://arxiv.org/abs/2606.06042) on Arxiv!
 
36
 
37
  # 📌 TL;DR
38
 
39
+ LoomVideo is a compact **5B-parameter** unified architecture built on MLLM + DiT that introduces three key designs:
40
+ - **Deepstack Injection** — extracts features from every MLLM layer and injects them into corresponding DiT layers via cross-attention.
41
+ - **Scale-and-Add Conditioning** — a zero-overhead approach for video editing that eliminates the need for token concatenation.
42
+ - **Negative Temporal RoPE** — seamlessly integrates multiple reference images without architectural modification.
43
 
44
+ Our 5B model achieves state-of-the-art performance across benchmarks, with at least **5.41×** inference speedup over models of similar capabilities.
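The Negative Temporal RoPE design can be pictured as a choice of temporal position indices. The sketch below is an illustration under assumptions, not the model's code: it assumes `K` reference images receive indices `-K` to `-1` and video frames keep `0` to `T-1`, and `temporal_position_ids` is a hypothetical name. The exact indexing and whether it applies to latent frames are not specified here.

```python
from typing import List


def temporal_position_ids(num_ref_images: int, num_video_frames: int) -> List[int]:
    """Temporal indices for the sequence [reference images..., video frames...].

    Assumption: reference images sit at negative indices so that video frames
    keep the same positions they would have without any references.
    """
    ref_ids = list(range(-num_ref_images, 0))
    frame_ids = list(range(num_video_frames))
    return ref_ids + frame_ids


ids = temporal_position_ids(num_ref_images=2, num_video_frames=4)
no_ref_ids = temporal_position_ids(num_ref_images=0, num_video_frames=4)
```

Under this scheme, adding or removing reference images never shifts the indices of the video frames, which is consistent with the claim that no architectural modification is needed.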
 
 
 
 
 
45
 
46
  <p align="center">
47
  <img src="assets/architecture.png" width="90%">
48
  </p>
49
 
 
50
  # 🎯 Supported Tasks
51
 
52
  LoomVideo supports **four** unified video generation and editing tasks within a single model:
 
60
 
61
  # 🔧 Preparation
62
 
63
+ ### 1. Clone the Repository
64
 
65
  ```bash
66
  git clone https://github.com/MSALab-PKU/LoomVideo
67
  cd LoomVideo
68
  ```
69
 
70
+ ### 2. Install Dependencies
 
 
71
 
72
  ```bash
73
  uv sync
74
  source .venv/bin/activate
75
+ pip install flash-attn --no-build-isolation
 
 
 
 
 
 
 
 
 
 
 
 
 
76
  ```
77
 
 
 
 
 
 
 
 
 
 
 
78
  # 🎬 Inference
 
79
 
80
+ LoomVideo provides a unified inference script. Below is an example for **Text-to-Video** generation. For other tasks (editing, reference-guided editing), please refer to the [GitHub README](https://github.com/MSALab-PKU/LoomVideo).
 
 
 
 
81
 
82
  ```bash
83
  NUM_GPUS=1
 
87
  --config_path configs/inference/generation.yaml \
88
  --ckpt_path checkpoints/LoomVideo \
89
  --task t2v \
90
+ --prompt "Vampire makeup face of beautiful girl, red contact lenses." \
91
  --height 480 \
92
  --width 832 \
93
  --num_frames 97 \
94
  --num_inference_steps 50 \
95
  --seed 0 \
96
+ --output_path outputs/t2v_demo.mp4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
97
  ```
98
 
99
  # 📄 Citation
100
 
 
 
101
  ```bibtex
102
  @article{wu2026loomvideo,
103
  title={LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing},