---
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
license: mit
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

[[Paper](https://huggingface.co/papers/2603.05969)] [[Code](https://github.com/BlueberryOreo/ProCap)]

ProCap is a framework that reformulates change modeling from static image comparison to dynamic procedure modeling. It features a two-stage design:
1. **Explicit Procedure Modeling**: Trains a procedure encoder to learn the change procedure from a sparse set of keyframes.
2. **Implicit Procedure Captioning**: Integrates the trained encoder within an encoder-decoder model for captioning using learnable procedure queries.
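
The exact Stage 2 architecture lives in the code repository; purely as an illustration, the role of learnable procedure queries can be sketched as cross-attention over keyframe features. All shapes, weights, and function names below are assumptions for the sketch, not the repository's API:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def procedure_query_attention(keyframe_feats, queries, rng):
    """Cross-attention of procedure queries over keyframe features.

    keyframe_feats: (T, d) features of T pseudo-sequence keyframes
    queries:        (Q, d) learnable procedure queries
    Returns (Q, d) tokens summarizing the change procedure.
    """
    d = keyframe_feats.shape[1]
    # Hypothetical random projections; in the real model these are learned.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = queries @ Wq, keyframe_feats @ Wk, keyframe_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))  # (Q, T) attention weights
    return attn @ v                       # (Q, d) procedure tokens

rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 16))   # 5 keyframes, 16-dim features
queries = rng.standard_normal((4, 16))  # 4 procedure queries
print(procedure_query_attention(frames, queries, rng).shape)  # (4, 16)
```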

This repository provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [Citation](#citation)
- [License](#license)

---

## Data

All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer).
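
The `.h5` containers can be inspected with `h5py`. A minimal sketch below writes and reads a toy file; the key name `"features"` and the shapes are placeholders, not the actual layout of the provided files:

```python
import h5py
import numpy as np

# Write a toy pseudo-sequence file to illustrate the container format.
# The real key names and feature shapes come from the preprocessing
# scripts; "features" and (8, 7, 7, 32) are made-up placeholders.
with h5py.File("toy_pseudo_seq.h5", "w") as f:
    f.create_dataset("features", data=np.zeros((8, 7, 7, 32), dtype=np.float32))

# Inspecting one of the provided .h5 files works the same way.
with h5py.File("toy_pseudo_seq.h5", "r") as f:
    print(list(f.keys()))        # ['features']
    print(f["features"].shape)   # (8, 7, 7, 32)
```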

### Included Datasets

- **`CLEVR-data`**
  Processed pseudo-sequences for the **CLEVR-Change** dataset

- **`edit-data`**
  Processed pseudo-sequences for the **Image-Editing-Request** dataset

- **`spot-data`**
  Processed pseudo-sequences for the **Spot-the-Diff** dataset

- **`filter_files`**
  Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)

- **`filtered-spot-captions`**
  Refined captions for the Spot-the-Diff dataset

---

## Model Weights

This repository provides pre-trained weights for both stages in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.

---

## Evaluation

- **`densevid_eval`**
  Evaluation tools used for quantitative assessment
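
Reported numbers should be produced with these scripts. Purely to illustrate what a precision-based captioning metric measures, here is a simplified single-reference BLEU-1; the official implementation additionally handles multiple references and higher-order n-grams, so this is not a substitute for `densevid_eval`:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Simplified single-reference BLEU-1: clipped unigram precision
    times a brevity penalty. For illustration only."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word count by its count in the reference.
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = overlap / max(len(cand), 1)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

print(bleu1("the red cube moved", "the red cube moved to the left"))  # ≈ 0.472
```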

---

## Usage

### 1. Data Preparation

1. Move the caption files in `filtered-spot-captions` to the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:

   | Dataset | Folder | Rename To |
   |------|------|------|
   | CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
   | Image-Editing-Request | `edit-data` | `edit_processed` |
   | Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory.
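
The copy-and-rename step can be scripted. A minimal sketch, where the `download_dir`/`dataset_root` arguments and the `prepare_data` helper are illustrative rather than part of the repository:

```python
import os
import shutil

# Mapping from downloaded folder names to the names the code expects.
RENAMES = {
    "CLEVR-data": "CLEVR_processed",
    "edit-data": "edit_processed",
    "spot-data": "spot_processed",
}

def prepare_data(download_dir, dataset_root):
    """Copy each processed data folder into the dataset root under its new name."""
    for src_name, dst_name in RENAMES.items():
        src = os.path.join(download_dir, src_name)
        dst = os.path.join(dataset_root, dst_name)
        if os.path.isdir(src) and not os.path.exists(dst):
            shutil.copytree(src, dst)
```

Usage would look like `prepare_data("/path/to/download", "/path/to/dataset/root")`, with both paths adjusted to your setup.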

---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts as:

  ```bash
  symlink_path="/path/to/stage1/weight/dalle.pt"
  ```

- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts as:

  ```bash
  resume_path="/path/to/pretrained/model/model.chkpt"
  ```
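
Since both settings are plain path strings, a small preflight check can catch typos before a long run. The `check_paths` helper below is not part of the repository; the keyword names merely mirror the script settings above:

```python
import os

def check_paths(**paths):
    """Verify that every configured checkpoint path exists before launching."""
    missing = [name for name, p in paths.items() if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing checkpoint(s): {', '.join(missing)}")
    return True

# Example (paths are placeholders):
# check_paths(symlink_path="/path/to/stage1/weight/dalle.pt",
#             resume_path="/path/to/pretrained/model/model.chkpt")
```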

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before running evaluation.

## Citation

If you find our work or this repository useful, please consider citing our paper:

```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
}
```

---

## License

This repository is released under the MIT License.