# ReMoMask: Retrieval-Augmented Masked Motion Generation

This is the official repository for the paper:
> **ReMoMask: Retrieval-Augmented Masked Motion Generation**
>
> Zhengdao Li\*, Siheng Wang\*, [Zeyu Zhang](https://steve-zeyu-zhang.github.io/)\*<sup>†</sup>, and [Hao Tang](https://ha0tang.github.io/)<sup>#</sup>
>
> \*Equal contribution. <sup>†</sup>Project lead. <sup>#</sup>Corresponding author.
>
> ### [Paper](https://arxiv.org/abs/2508.02605) | [Website](https://aigeeksgroup.github.io/ReMoMask) | [Model](https://huggingface.co/lycnight/ReMoMask) | [HF Paper](https://huggingface.co/papers/2508.02605)


# ✏️ Citation

```
@article{li2025remomask,
  title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
  author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2508.02605},
  year={2025}
}
```

---

# 👋 Introduction
Retrieval-Augmented Text-to-Motion (RAG-T2M) models have demonstrated superior performance over conventional T2M approaches, particularly in handling uncommon and complex textual descriptions by leveraging external motion knowledge. Despite these gains, existing RAG-T2M models remain limited by two closely related factors: (1) coarse-grained text-motion retrieval that overlooks the hierarchical structure of human motion, and (2) underexplored mechanisms for effectively fusing retrieved information into the generative process. In this work, we present **ReMoMask**, a structure-aware RAG framework for text-to-motion generation that addresses these limitations. To improve retrieval, we propose **Hierarchical Bidirectional Momentum** (HBM) Contrastive Learning, which employs dual contrastive objectives to jointly align global motion semantics and fine-grained part-level motion features with text. To address the fusion gap, we first conduct a systematic study on motion representations and information fusion strategies in RAG-T2M, revealing that a 2D motion representation combined with cross-attention-based fusion yields superior performance. Based on these findings, we design **Semantic Spatial-Temporal Attention** (SSTA), a motion-tailored fusion module that more effectively integrates retrieved motion knowledge into the generative backbone. Extensive experiments on HumanML3D, KIT-ML, and SnapMoGen demonstrate that ReMoMask consistently outperforms prior methods on both text-motion retrieval and text-to-motion generation benchmarks.
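For intuition, the dual objectives in HBM can be sketched as two InfoNCE terms: one over global motion embeddings and one averaged over part-level embeddings. The following is a minimal, dependency-free illustration only; the function names, the weighting `lam`, and the omission of the momentum encoders and negative queues are simplifications, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(text_embs, motion_embs, temperature=0.07):
    """Batch InfoNCE: pair i is the positive for text i,
    every other motion in the batch acts as a negative."""
    loss = 0.0
    for i, t in enumerate(text_embs):
        logits = [cosine(t, m) / temperature for m in motion_embs]
        log_z = math.log(sum(math.exp(x) for x in logits))
        loss += log_z - logits[i]  # -log softmax at the positive pair
    return loss / len(text_embs)

def hbm_loss(text_embs, global_motion_embs, part_motion_embs, lam=0.5):
    """Global text-motion alignment plus the mean of per-part terms."""
    global_term = info_nce(text_embs, global_motion_embs)
    part_term = sum(info_nce(text_embs, p)
                    for p in part_motion_embs) / len(part_motion_embs)
    return global_term + lam * part_term
```

In the full method, the momentum branch maintains a slowly updated copy of each encoder and a queue of past embeddings as extra negatives, which this sketch leaves out.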



## TODO List

- [x] Upload our paper to arXiv and build project pages.
- [x] Upload the code.
- [x] Release TMR model.
- [x] Release T2M model.

# 🤗 Prerequisites
<details> 
<summary>details</summary>
  
## Environment
```bash
conda create -n remomask python=3.10
conda activate remomask
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
We tested this environment on both NVIDIA A800 and H20 GPUs.

## Dependencies
### 1. Pretrained models
Download the models from [HuggingFace](https://huggingface.co/lycnight/ReMoMask) and place them as follows:

```
remomask_models.zip
    β”œβ”€β”€ checkpoints/              #   Evaluation Models and Gloves
    β”œβ”€β”€ Part_TMR/
    β”‚   └── checkpoints/        # RAG pretrained checkpoints
    β”œβ”€β”€ logs/                   # T2M pretrained checkpoints
    β”œβ”€β”€ database/               # RAG database
    └── ViT-B-32.pt            # CLIP model
```

### 2. Prepare the training dataset
Follow the instructions in [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git), then place the resulting dataset in `./dataset/HumanML3D`.
</details>

# 🚀 Demo
<details> 
<summary>details</summary>
  
```bash
python demo.py \
    --gpu_id 0 \
    --ext exp_demo \
    --text_prompt "A person is playing the drum set." \
    --checkpoints_dir logs \
    --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans \
    --rtrans_name pretrain_rtrans
# after training, replace pretrain_mtrans and pretrain_rtrans with your own mtrans and rtrans names
```
Options:
* `--repeat_times`: number of repetitions per generation; default `1`.
* `--motion_length`: number of poses to generate.

Output will be written to `./outputs/`.
</details> 


# πŸ› οΈ Train your own models
<details>
<summary>details</summary>
  
## Stage 1: train a Motion Retriever
```bash
python Part_TMR/scripts/train.py \
    device=cuda:0 \
    train=train \
    dataset.train_split_filename=train.txt \
    exp_name=exp \
    train.optimizer.motion_lr=1.0e-05 \
    train.optimizer.text_lr=1.0e-05 \
    train.optimizer.head_lr=1.0e-05
# change exp_name to your RAG experiment name
```
Then build a RAG database for training the T2M model:
```bash
python build_rag_database.py \
    --config-name=config \
    device=cuda:0 \
    train=train \
    dataset.train_split_filename=train.txt \
    exp_name=exp_for_mtrans
```
This produces `./database`.
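Conceptually, the resulting database supports nearest-neighbor lookup: a query text embedding is scored against the stored motion embeddings, and the top-k motions are retrieved to condition generation. A minimal, dependency-free sketch (the function name and list-based storage are our own; the actual database format may differ):

```python
import math

def top_k_retrieve(query_emb, database_embs, k=2):
    """Score each database motion embedding against the query by cosine
    similarity and return the indices of the k best matches."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    ranked = sorted(range(len(database_embs)),
                    key=lambda i: cos(query_emb, database_embs[i]),
                    reverse=True)
    return ranked[:k]
```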


## Stage 2: train a Retrieval-Augmented Mask Model

### Train a 2D RVQ-VAE Quantizer
```bash
bash run_rvq.sh \
    vq \
    0 \
    humanml3d \
    --batch_size 256 \
    --num_quantizers 6 \
    --max_epoch 50 \
    --quantize_dropout_prob 0.2 \
    --gamma 0.1 \
    --code_dim2d 1024 \
    --nb_code2d 256
# vq is the save directory name
# 0 selects gpu 0
# humanml3d is the dataset name
# change vq to your own vq name
```
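Residual vector quantization (cf. `--num_quantizers 6`) encodes a latent with a stack of codebooks, each level quantizing the residual left by the previous one. A toy sketch with fixed, hand-written codebooks (in the real model the codebooks are learned, and the quantizer operates on 2D spatio-temporal tokens):

```python
def residual_quantize(x, codebooks):
    """Quantize vector x with a stack of codebooks: each level picks its
    nearest code for the residual left by the previous level."""
    residual = list(x)
    codes, recon = [], [0.0] * len(x)
    for book in codebooks:
        # nearest code by squared distance to the current residual
        idx = min(range(len(book)),
                  key=lambda j: sum((r - c) ** 2
                                    for r, c in zip(residual, book[j])))
        codes.append(idx)
        recon = [a + b for a, b in zip(recon, book[idx])]
        residual = [a - b for a, b in zip(residual, book[idx])]
    return codes, recon
```

Each added level shrinks the reconstruction error, which is why dropping later quantizers (`--quantize_dropout_prob`) still leaves a usable coarse code.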

### Train a 2D Retrieval-Augmented Mask Transformer
```bash
bash run_mtrans.sh \
    mtrans \
    1 \
    0 \
    11247 \
    humanml3d \
    --vq_name pretrain_vq \
    --batch_size 64 \
    --max_epoch 2000 \
    --attnj \
    --attnt \
    --latent_dim 512 \
    --n_heads 8 \
    --train_split train.txt \
    --val_split val.txt
# 1 means using one gpu
# 0 means using gpu 0
# 11247 is the DDP master port
# change mtrans to your own mtrans name
```
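The mask transformer is trained with a BERT-style masked-token objective over the quantized motion tokens: a subset of tokens is hidden, and the loss is applied only at the hidden positions. A rough, framework-free sketch (`MASK_ID` and the masking schedule here are illustrative, not the repository's implementation):

```python
import random

MASK_ID = -1  # hypothetical id for the [MASK] token

def mask_tokens(tokens, mask_ratio, rng):
    """Randomly replace a fraction of token ids with MASK_ID.
    Training applies cross-entropy only at the masked positions."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in positions else t
              for i, t in enumerate(tokens)]
    return masked, positions
```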


### Train a 2D Retrieval-Augmented Residual Transformer
```bash
bash run_rtrans.sh \
    rtrans \
    2 \
    humanml3d \
    --batch_size 64 \
    --vq_name pretrain_vq \
    --cond_drop_prob 0.01 \
    --share_weight \
    --max_epoch 2000 \
    --attnj \
    --attnt
# 2 means using two gpus (cuda:0,1)
# --vq_name: the vq model to use
# change rtrans to your own rtrans name
```

</details>



# 💪 Evaluation
<details>
<summary>details</summary>
  
## Evaluate the RAG  
```bash
python Part_TMR/scripts/test.py \
    device=cuda:0 \
    train=train \
    exp_name=exp_pretrain
# change exp_pretrain to your own RAG model name
```


## Evaluate the T2M

### 1. Evaluate the 2D RVQ-VAE Quantizer
```bash
python eval_vq.py \
    --gpu_id 0 \
    --name pretrain_vq \
    --dataset_name humanml3d \
    --ext eval \
    --which_epoch net_best_fid.tar
# change pretrain_vq to your own vq name
```

### 2. Evaluate the 2D Retrieval-Augmented Masked Transformer
```bash
python eval_mask.py \
    --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans \
    --gpu_id 0 \
    --cond_scale 4 \
    --time_steps 10 \
    --ext eval \
    --repeat_times 1 \
    --which_epoch net_best_fid.tar
# change pretrain_mtrans to your own mtrans name
```
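The `--cond_scale` flag controls classifier-free guidance at sampling time, and `--time_steps` sets the number of iterative unmasking rounds. One common guidance formulation (a sketch; the repository's exact blending may differ) mixes the conditional and unconditional logits:

```python
def guided_logits(cond_logits, uncond_logits, cond_scale):
    """Blend conditional and unconditional logits: a scale above 1
    pushes predictions further toward the text-conditioned branch."""
    return [u + cond_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

At `cond_scale=1` this reduces to the plain conditional logits; larger values trade diversity for text faithfulness.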


### 3. Evaluate the 2D Residual Transformer
HumanML3D:
```bash
python eval_res.py \
    --gpu_id 0 \
    --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans \
    --rtrans_name pretrain_rtrans \
    --cond_scale 4 \
    --time_steps 10 \
    --ext eval \
    --which_ckpt net_best_fid.tar \
    --which_epoch fid \
    --traverse_res
# change pretrain_mtrans and pretrain_rtrans to your own mtrans and rtrans names
```
</details>



# 🤖 Visualization
<details>
<summary>details</summary>
  
## 1. Download and set up Blender
<details>
<summary>details</summary>
You can download Blender from the [official 2.93 LTS page](https://www.blender.org/download/lts/2-93/). Please install exactly this version: for our paper, we use `blender-2.93.18-linux-x64`.
### a. Unzip it:
```bash
tar -xvf blender-2.93.18-linux-x64.tar.xz
```

### b. Check that Blender is installed successfully:
```bash
cd blender-2.93.18-linux-x64
./blender --background --version
```
You should see: `Blender 2.93.18 (hash cb886axxxx built 2023-05-22 23:33:27)`
```bash
./blender --background --python-expr "import sys; import os; print('\nThe version of python is ' + sys.version.split(' ')[0])"
```
You should see: `The version of python is 3.9.2`

### c. Get the blender-python path
```bash
./blender --background --python-expr "import sys; import os; print('\nThe path to the installation of python is\n' + sys.executable)"
```
You should see: `The path to the installation of python is /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9`

### d. Install pip for blender-python
```bash
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m ensurepip --upgrade
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install --upgrade pip
```

### e. Prepare the env for blender-python
```bash 
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install numpy==2.0.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install matplotlib==3.9.4
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra-core==1.3.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra_colorlog==1.2.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install moviepy==1.0.3
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install shortuuid==1.0.13
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install natsort==8.4.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install pytest-shutil==1.8.1
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==4.67.1
```
</details>


## 2. Calculate the SMPL mesh:
```bash
python -m fit --dir new_test_npy --save_folder new_temp_npy --cuda cuda:0
```

## 3. Render to video or sequence
```bash
/xxx/blender-2.93.18-linux-x64/blender --background --python render.py -- --cfg=./configs/render_mld.yaml --dir=test_npy --mode=video --joint_type=HumanML3D
```
- `--mode=video`: render to an mp4 video
- `--mode=sequence`: render the motion to a single PNG image, called a sequence

</details>

# πŸ‘ Acknowlegements
We sincerely thank the open-sourcing of these works where our code is based on:

[MoMask](https://github.com/EricGuo5513/momask-codes),
[MoGenTS](https://github.com/weihaosky/mogents),
[ReMoDiffuse](https://github.com/mingyuan-zhang/ReMoDiffuse),
[MDM](https://github.com/GuyTevet/motion-diffusion-model),
[TMR](https://github.com/Mathux/TMR),
[ReMoGPT](https://ojs.aaai.org/index.php/AAAI/article/view/33044)

## 🔒 License
This code is distributed under a [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license.

Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, and PyTorch3D, and uses datasets that each have their own licenses, which must also be followed.