# ReMoMask: Retrieval-Augmented Masked Motion Generation
This is the official repository for the paper:
> **ReMoMask: Retrieval-Augmented Masked Motion Generation**
>
> Zhengdao Li\*, Siheng Wang\*, [Zeyu Zhang](https://steve-zeyu-zhang.github.io/)\*<sup>β </sup>, and [Hao Tang](https://ha0tang.github.io/)<sup>#</sup>
>
> \*Equal contribution. <sup>β </sup>Project lead. <sup>#</sup>Corresponding author.
>
> ### [Paper](https://arxiv.org/abs/2508.02605) | [Website](https://aigeeksgroup.github.io/ReMoMask) | [Model](https://huggingface.co/lycnight/ReMoMask) | [HF Paper](https://huggingface.co/papers/2508.02605)
# βοΈ Citation
```
@article{li2025remomask,
title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
journal={arXiv preprint arXiv:2508.02605},
year={2025}
}
```
---
# π Introduction
Retrieval-Augmented Text-to-Motion (RAG-T2M) models have demonstrated superior performance over conventional T2M approaches, particularly in handling uncommon and complex textual descriptions by leveraging external motion knowledge. Despite these gains, existing RAG-T2M models remain limited by two closely related factors: (1) coarse-grained text-motion retrieval that overlooks the hierarchical structure of human motion, and (2) underexplored mechanisms for effectively fusing retrieved information into the generative process. In this work, we present **ReMoMask**, a structure-aware RAG framework for text-to-motion generation that addresses these limitations. To improve retrieval, we propose **Hierarchical Bidirectional Momentum** (HBM) Contrastive Learning, which employs dual contrastive objectives to jointly align global motion semantics and fine-grained part-level motion features with text. To address the fusion gap, we first conduct a systematic study on motion representations and information fusion strategies in RAG-T2M, revealing that a 2D motion representation combined with cross-attention-based fusion yields superior performance. Based on these findings, we design **Semantic Spatial-Temporal Attention** (SSTA), a motion-tailored fusion module that more effectively integrates retrieved motion knowledge into the generative backbone. Extensive experiments on HumanML3D, KIT-ML, and SnapMoGen demonstrate that ReMoMask consistently outperforms prior methods on both text-motion retrieval and text-to-motion generation benchmarks.
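To illustrate the bidirectional (text-to-motion and motion-to-text) contrastive alignment at the heart of HBM, here is a minimal symmetric InfoNCE sketch in NumPy. All names, shapes, and the temperature value are illustrative assumptions, not the repository's actual implementation (which additionally uses momentum encoders and part-level objectives):

```python
import numpy as np

def symmetric_infonce(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/motion embeddings.

    text_emb, motion_emb: (N, D) arrays; row i of each forms a positive pair.
    Returns the mean of the text->motion and motion->text losses.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))       # positives lie on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # bidirectional: align text->motion and motion->text jointly
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 32))
loss = symmetric_infonce(t, t + 0.01 * rng.normal(size=(8, 32)))
```

Well-aligned pairs put the largest similarity on the diagonal, driving the loss toward zero; mismatched pairs leave it near `log N`.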
## TODO List
- [x] Upload our paper to arXiv and build project pages.
- [x] Upload the code.
- [x] Release TMR model.
- [x] Release T2M model.
# π€ Prerequisite
<details>
<summary>details</summary>
## Environment
```bash
conda create -n remomask python=3.10
conda activate remomask
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
We tested this environment on NVIDIA A800 and H20 GPUs.
## Dependencies
### 1. Pretrained models
Download the models from [HuggingFace](https://huggingface.co/lycnight/ReMoMask) and place them as follows:
```
remomask_models.zip
βββ checkpoints/ # Evaluation models and GloVe word embeddings
βββ Part_TMR/
β βββ checkpoints/ # RAG pretrained checkpoints
βββ logs/ # T2M pretrained checkpoints
βββ database/ # RAG database
βββ ViT-B-32.pt # CLIP model
```
### 2. Prepare training dataset
Follow the instructions in [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git), then place the resulting dataset in `./dataset/HumanML3D`.
</details>
# π Demo
<details>
<summary>details</summary>
```bash
python demo.py \
--gpu_id 0 \
--ext exp_demo \
--text_prompt "A person is playing the drum set." \
--checkpoints_dir logs \
--dataset_name humanml3d \
--mtrans_name pretrain_mtrans \
--rtrans_name pretrain_rtrans
# after training, replace pretrain_mtrans and pretrain_rtrans with your own model names
```
Additional options:
* `--repeat_times`: number of generations per prompt, default `1`.
* `--motion_length`: number of poses to generate.

Output will be written to `./outputs/`.
</details>
# π οΈ Train your own models
<details>
<summary>details</summary>
## Stage1: train a Motion Retriever
```bash
python Part_TMR/scripts/train.py \
device=cuda:0 \
train=train \
dataset.train_split_filename=train.txt \
exp_name=exp \
train.optimizer.motion_lr=1.0e-05 \
train.optimizer.text_lr=1.0e-05 \
train.optimizer.head_lr=1.0e-05
# replace exp with your own RAG experiment name
```
Then build a RAG database for training the T2M model:
```bash
python build_rag_database.py \
--config-name=config \
device=cuda:0 \
train=train \
dataset.train_split_filename=train.txt \
exp_name=exp_for_mtrans
```
This produces the `./database` directory.
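At generation time, this database is queried with a text embedding to fetch the most similar motion entries. A minimal cosine-similarity top-k lookup sketch in NumPy (function name, shapes, and dummy data are illustrative assumptions, not the repository's API):

```python
import numpy as np

def retrieve_top_k(query, database_emb, k=3):
    """Return indices of the k database entries most similar to `query`.

    query: (D,) text embedding; database_emb: (N, D) stored embeddings.
    """
    q = query / np.linalg.norm(query)
    db = database_emb / np.linalg.norm(database_emb, axis=1, keepdims=True)
    sims = db @ q                     # cosine similarity to every entry
    return np.argsort(-sims)[:k]      # indices of the k best matches

db = np.eye(4)                        # 4 orthogonal dummy entries
idx = retrieve_top_k(np.array([0.1, 0.9, 0.0, 0.0]), db, k=2)
```

The retrieved indices would then be used to look up the corresponding motion sequences that condition the generator.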
## Stage2: train a Retrieval Augmented Mask Model
### train a 2D RVQ-VAE Quantizer
```bash
bash run_rvq.sh \
vq \
0 \
humanml3d \
--batch_size 256 \
--num_quantizers 6 \
--max_epoch 50 \
--quantize_dropout_prob 0.2 \
--gamma 0.1 \
--code_dim2d 1024 \
--nb_code2d 256
# vq is the save directory name
# 0 means using gpu_0
# humanml3d selects the dataset
# replace vq with your own VQ model name
```
### train a 2D Retrieval-Augmented Mask Transformer
```bash
bash run_mtrans.sh \
mtrans \
1 \
0 \
11247 \
humanml3d \
--vq_name pretrain_vq \
--batch_size 64 \
--max_epoch 2000 \
--attnj \
--attnt \
--latent_dim 512 \
--n_heads 8 \
--train_split train.txt \
--val_split val.txt
# 1 means using one GPU
# 0 means using gpu_0
# 11247 is the DDP master port
# replace mtrans with your own mtrans name
```
### train a 2D Retrieval-Augmented Residual Transformer
```bash
bash run_rtrans.sh \
rtrans \
2 \
humanml3d \
--batch_size 64 \
--vq_name pretrain_vq \
--cond_drop_prob 0.01 \
--share_weight \
--max_epoch 2000 \
--attnj \
--attnt
# here, 2 means using two GPUs (cuda:0,1)
# --vq_name: the VQ model to use
# replace rtrans with your own rtrans name
```
</details>
# πͺ Evaluation
<details>
<summary>details</summary>
## Evaluate the RAG
```bash
python Part_TMR/scripts/test.py \
device=cuda:0 \
train=train \
exp_name=exp_pretrain
# replace exp_pretrain with your own RAG model name
```
## Evaluate the T2M
### 1. Evaluate the 2D RVQ-VAE Quantizer
```bash
python eval_vq.py \
--gpu_id 0 \
--name pretrain_vq \
--dataset_name humanml3d \
--ext eval \
--which_epoch net_best_fid.tar
# change pretrain_vq to your vq
```
### 2. Evaluate the 2D Retrieval-Augmented Masked Transformer
```bash
python eval_mask.py \
--dataset_name humanml3d \
--mtrans_name pretrain_mtrans \
--gpu_id 0 \
--cond_scale 4 \
--time_steps 10 \
--ext eval \
--repeat_times 1 \
--which_epoch net_best_fid.tar
# change pretrain_mtrans to your mtrans
```
### 3. Evaluate the 2D Residual Transformer
HumanML3D:
```bash
python eval_res.py \
--gpu_id 0 \
--dataset_name humanml3d \
--mtrans_name pretrain_mtrans \
--rtrans_name pretrain_rtrans \
--cond_scale 4 \
--time_steps 10 \
--ext eval \
--which_ckpt net_best_fid.tar \
--which_epoch fid \
--traverse_res
# change pretrain_mtrans and pretrain_rtrans to your mtrans and rtrans
```
</details>
# π€ Visualization
<details>
<summary>details</summary>
## 1. download and set up blender
<details>
<summary>details</summary>
You can download Blender from the [download page](https://www.blender.org/download/lts/2-93/). Please install exactly this version; for our paper, we use `blender-2.93.18-linux-x64`.
### a. unzip it:
```bash
tar -xvf blender-2.93.18-linux-x64.tar.xz
```
### b. check if you have installed the blender successfully or not:
```bash
cd blender-2.93.18-linux-x64
./blender --background --version
```
you should see: `Blender 2.93.18 (hash cb886axxxx built 2023-05-22 23:33:27)`
```bash
./blender --background --python-expr "import sys; import os; print('\nThe version of python is ' + sys.version.split(' ')[0])"
```
you should see: `The version of python is 3.9.2`
### c. get the blender-python path
```bash
./blender --background --python-expr "import sys; import os; print('\nThe path to the installation of python is\n' + sys.executable)"
```
you should see: `The path to the installation of python is /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9`
### d. install pip for blender-python
```bash
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m ensurepip --upgrade
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install --upgrade pip
```
### e. prepare env for blender-python
```bash
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install numpy==2.0.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install matplotlib==3.9.4
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra-core==1.3.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra_colorlog==1.2.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install moviepy==1.0.3
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install shortuuid==1.0.13
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install natsort==8.4.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install pytest-shutil==1.8.1
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==4.67.1
```
</details>
## 2. calculate SMPL mesh:
```bash
python -m fit --dir new_test_npy --save_folder new_temp_npy --cuda cuda:0
```
## 3. render to video or sequence
```bash
/xxx/blender-2.93.18-linux-x64/blender --background --python render.py -- --cfg=./configs/render_mld.yaml --dir=test_npy --mode=video --joint_type=HumanML3D
```
- `--mode=video`: render to mp4 video
- `--mode=sequence`: render to a single PNG image, called a sequence.
</details>
# π Acknowledgements
We sincerely thank the authors of the following open-source works, on which our code is based:
[MoMask](https://github.com/EricGuo5513/momask-codes),
[MoGenTS](https://github.com/weihaosky/mogents),
[ReMoDiffuse](https://github.com/mingyuan-zhang/ReMoDiffuse),
[MDM](https://github.com/GuyTevet/motion-diffusion-model),
[TMR](https://github.com/Mathux/TMR),
[ReMoGPT](https://ojs.aaai.org/index.php/AAAI/article/view/33044)
## π License
This code is distributed under a [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license.
Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.