---
datasets:
- clevr-change
- image-editing-request
- spot-the-diff
license: mit
metrics:
- bleu
- meteor
- rouge
pipeline_tag: image-to-text
tags:
- change captioning
- vision-language
- image-to-text
- procedural reasoning
- multimodal
- pytorch
---

# ProCap: Experiment Materials

This repository contains the **official experimental materials** for the paper:

> **Imagine How to Change: Explicit Procedure Modeling for Change Captioning**

[[Paper](https://huggingface.co/papers/2603.05969)] [[Code](https://github.com/BlueberryOreo/ProCap)]

ProCap is a framework that reformulates change modeling from static image comparison to dynamic procedure modeling. It features a two-stage design:
1. **Explicit Procedure Modeling**: Trains a procedure encoder to learn the change procedure from a sparse set of keyframes.
2. **Implicit Procedure Captioning**: Integrates the trained encoder within an encoder-decoder model for captioning using learnable procedure queries.
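
The exact Stage 2 architecture lives in the code repository; purely as an illustration, the role of learnable procedure queries can be sketched as cross-attention over keyframe features. All shapes, weights, and function names below are assumptions for the sketch, not the repository's API:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def procedure_query_attention(keyframe_feats, queries, rng):
    """Cross-attention of procedure queries over keyframe features.

    keyframe_feats: (T, d) features of T pseudo-sequence keyframes
    queries:        (Q, d) learnable procedure queries
    Returns (Q, d) tokens summarizing the change procedure.
    """
    d = keyframe_feats.shape[1]
    # Hypothetical random projections; in the real model these are learned.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = queries @ Wq, keyframe_feats @ Wk, keyframe_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))  # (Q, T) attention weights
    return attn @ v                       # (Q, d) procedure tokens

rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 16))   # 5 keyframes, 16-dim features
queries = rng.standard_normal((4, 16))  # 4 procedure queries
print(procedure_query_attention(frames, queries, rng).shape)  # (4, 16)
```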

This repository provides **processed datasets**, **pre-trained model weights**, and **evaluation tools** for reproducing the results reported in the paper.

📦 All materials are also available via [Baidu Netdisk](https://pan.baidu.com/s/1t_YXB6J_vkuPxByn2hat2A)
**Extraction Code:** `5h7w`

---

## Contents

- [Data](#data)
- [Model Weights](#model-weights)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [Citation](#citation)
- [License](#license)

---

## Data

All datasets are preprocessed into **pseudo-sequence format** (`.h5` files) generated by [VFIformer](https://github.com/JIA-Lab-research/VFIformer).
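
The `.h5` containers can be inspected with `h5py`. A minimal sketch below writes and reads a toy file; the key name `"features"` and the shapes are placeholders, not the actual layout of the provided files:

```python
import h5py
import numpy as np

# Write a toy pseudo-sequence file to illustrate the container format.
# The real key names and feature shapes come from the preprocessing
# scripts; "features" and (8, 7, 7, 32) are made-up placeholders.
with h5py.File("toy_pseudo_seq.h5", "w") as f:
    f.create_dataset("features", data=np.zeros((8, 7, 7, 32), dtype=np.float32))

# Inspecting one of the provided .h5 files works the same way.
with h5py.File("toy_pseudo_seq.h5", "r") as f:
    print(list(f.keys()))        # ['features']
    print(f["features"].shape)   # (8, 7, 7, 32)
```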

### Included Datasets

- **`CLEVR-data`**
  Processed pseudo-sequences for the **CLEVR-Change** dataset

- **`edit-data`**
  Processed pseudo-sequences for the **Image-Editing-Request** dataset

- **`spot-data`**
  Processed pseudo-sequences for the **Spot-the-Diff** dataset

- **`filter_files`**
  Confidence scores computed using [CLIP4IDC](https://github.com/sushizixin/CLIP4IDC)

- **`filtered-spot-captions`**
  Refined captions for the Spot-the-Diff dataset

---

## Model Weights

This repository provides pre-trained weights for both stages in the paper.

### Explicit Procedure Modeling (Stage 1)

- `pretrained_vqgan` – VQGAN models for each dataset
- `stage1_clevr_best`
- `stage1_edit_best`
- `stage1_spot_best`

### Implicit Procedure Captioning (Stage 2)

- `clevr_best`
- `edit_best`
- `spot_best`

> **Note:** Stage 1 checkpoints can be directly reused to initialize Stage 2 training.

---

## Evaluation

- **`densevid_eval`**
  Evaluation tools used for quantitative assessment
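
Reported numbers should be produced with these scripts. Purely to illustrate what a precision-based captioning metric measures, here is a simplified single-reference BLEU-1; the official implementation additionally handles multiple references and higher-order n-grams, so this is not a substitute for `densevid_eval`:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Simplified single-reference BLEU-1: clipped unigram precision
    times a brevity penalty. For illustration only."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each candidate word count by its count in the reference.
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = overlap / max(len(cand), 1)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

print(bleu1("the red cube moved", "the red cube moved to the left"))  # ≈ 0.472
```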

---

## Usage

### 1. Data Preparation

1. Move the caption files in `filtered-spot-captions` to the original caption directory of the **Spot-the-Diff** dataset.
2. Copy the processed data folders to the original dataset root and rename them as follows:

   | Dataset | Folder | Rename To |
   |------|------|------|
   | CLEVR-Change | `CLEVR-data` | `CLEVR_processed` |
   | Image-Editing-Request | `edit-data` | `edit_processed` |
   | Spot-the-Diff | `spot-data` | `spot_processed` |

3. Place `filter_files` in the project root directory.
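
The copy-and-rename step can be scripted. A minimal sketch, where the `download_dir`/`dataset_root` arguments and the `prepare_data` helper are illustrative rather than part of the repository:

```python
import os
import shutil

# Mapping from downloaded folder names to the names the code expects.
RENAMES = {
    "CLEVR-data": "CLEVR_processed",
    "edit-data": "edit_processed",
    "spot-data": "spot_processed",
}

def prepare_data(download_dir, dataset_root):
    """Copy each processed data folder into the dataset root under its new name."""
    for src_name, dst_name in RENAMES.items():
        src = os.path.join(download_dir, src_name)
        dst = os.path.join(dataset_root, dst_name)
        if os.path.isdir(src) and not os.path.exists(dst):
            shutil.copytree(src, dst)
```

Usage would look like `prepare_data("/path/to/download", "/path/to/dataset/root")`, with both paths adjusted to your setup.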

---

### 2. Model Weights

- Place `pretrained_vqgan` in the project root directory.
- To reuse Stage 1 weights during training, set `symlink_path` in the training scripts as:

  ```bash
  symlink_path="/path/to/stage1/weight/dalle.pt"
  ```

- To evaluate with pre-trained checkpoints, set `resume_path` in the evaluation scripts as:

  ```bash
  resume_path="/path/to/pretrained/model/model.chkpt"
  ```
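
Since both settings are plain path strings, a small preflight check can catch typos before a long run. The `check_paths` helper below is not part of the repository; the keyword names merely mirror the script settings above:

```python
import os

def check_paths(**paths):
    """Verify that every configured checkpoint path exists before launching."""
    missing = [name for name, p in paths.items() if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing checkpoint(s): {', '.join(missing)}")
    return True

# Example (paths are placeholders):
# check_paths(symlink_path="/path/to/stage1/weight/dalle.pt",
#             resume_path="/path/to/pretrained/model/model.chkpt")
```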

### 3. Evaluation Tool

Place the `densevid_eval` directory in the project root before running evaluation.

## Citation

If you find our work or this repository useful, please consider citing our paper:

```bibtex
@inproceedings{sun2026imagine,
  title={Imagine How To Change: Explicit Procedure Modeling for Change Captioning},
  author={Sun, Jiayang and Guo, Zixin and Cao, Min and Zhu, Guibo and Laaksonen, Jorma},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
}
```

---

## License

This repository is released under the MIT License.