| # P2DFlow | |
| P2DFlow is a protein ensemble generative model with SE(3) flow matching, based on ESMFold. The ensembles generated by P2DFlow can aid in understanding protein function across various scenarios. | |
| Technical details and evaluation results are provided in our paper: | |
| * [P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching](https://pubs.acs.org/doi/abs/10.1021/acs.jctc.4c01620) (JCTC) | |
| * [P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching](https://arxiv.org/abs/2411.17196) (arXiv) | |
| ## Table of Contents | |
| 1. [Installation](#Installation) | |
| 2. [Prepare Dataset](#Prepare-Dataset) | |
| 3. [Model weights](#Model-weights) | |
| 4. [Training](#Training) | |
| 5. [Inference](#Inference) | |
| 6. [Evaluation](#Evaluation) | |
| 7. [License](#License) | |
| 8. [Citation](#Citation) | |
| ## Installation | |
| In an environment with CUDA 11.7, run: | |
| ``` | |
| conda env create -f environment.yml | |
| ``` | |
| To activate the environment, run: | |
| ``` | |
| conda activate P2DFlow | |
| ``` | |
| ## Prepare Dataset | |
| #### (Tip: to use the data we have preprocessed, go directly to `3. Process selected dataset`; to process the data from scratch or to work with your own data, start from step 1.) | |
| #### 1. Download raw ATLAS dataset | |
| (i) Download the `Analysis & MDs` dataset from [ATLAS](https://www.dsimb.inserm.fr/ATLAS/), or you can use `./dataset/download.py` by running: | |
| ``` | |
| python ./dataset/download.py | |
| ``` | |
| We will use `.pdb` and `.xtc` files for the following calculation. | |
| #### 2. Calculate the 'approximate energy' and select representative structures | |
| (i) Use `gaussian_kde` to calculate the 'approximate energy'. First put all the files above in `./dataset`, following the layout of `ATLAS_init_example` in [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing), then run: | |
| ``` | |
| python ./dataset/traj_analyse_select.py | |
| ``` | |
| This produces the selected representative structures in the `select` directory and `traj_info_select.csv`, which records the 'approximate energy'. | |
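As a rough illustration of the idea behind this step, the sketch below estimates a density over a one-dimensional per-frame feature with a Gaussian kernel density estimate, converts it to a relative 'approximate energy' as `-ln p` (in units of kT), and picks the lowest-energy frames. It is written in plain Python so that it runs without dependencies. The choice of feature, the Scott's-rule bandwidth, the `-ln p` conversion, and all function names are assumptions made for illustration; this is not a description of what `traj_analyse_select.py` actually does.

```python
import math


def gaussian_kde_1d(samples):
    """Return a function that estimates the density of `samples` (1-D Gaussian KDE)."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample standard deviation (ddof=1).
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    # Scott's rule in one dimension: h = std * n^(-1/5). Assumed here, not taken from the repo.
    bandwidth = std * n ** (-1.0 / 5.0)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def approximate_energy(samples):
    """Relative energy per sample in units of kT: E_i = -ln p(x_i), shifted so the minimum is 0."""
    density = gaussian_kde_1d(samples)
    energies = [-math.log(density(x)) for x in samples]
    e_min = min(energies)
    return [e - e_min for e in energies]


def select_lowest_energy(samples, k):
    """Indices of the k samples with the lowest approximate energy."""
    energies = approximate_energy(samples)
    return sorted(range(len(samples)), key=lambda i: energies[i])[:k]


# Toy per-frame feature values (made up): five frames in one basin and one outlier.
feature = [1.0, 1.1, 0.9, 1.05, 0.95, 3.0]
energies = approximate_energy(feature)
```

In this toy example the outlying frame receives the highest energy, so `select_lowest_energy(feature, 5)` keeps only the five frames in the densely populated basin.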
| #### 3. Process selected dataset | |
| (i) Download the selected dataset (or get it from the two steps above) from [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing) whose filename is `selected_dataset_v1.tar` or `selected_dataset_v2.tar` ('v1' selects ~10 structures from MD, 'v2' selects ~100 structures from MD), and decompress it using: | |
| ``` | |
| tar -xvf selected_dataset_v1.tar | |
| ``` | |
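If `tar` is not available, or it is unclear whether the downloaded archive is compressed, the extraction can also be sketched with Python's standard `tarfile` module, which detects the compression on its own. The helper name `extract_archive` and the destination directory are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive and return the names of the top-level entries in dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip, bz2, xz, or no compression transparently.
    with tarfile.open(archive_path, mode="r:*") as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

For example, `extract_archive("selected_dataset_v1.tar")` would unpack the archive into the current directory. Only extract archives from a source you trust.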
| (ii) Preprocess `.pdb` files to get `.pkl` files, compute node representation and pair representation using ESM-2, predict static structure using ESMFold, and get merged `.csv` file: | |
| ``` | |
| python ./data/process_pdb_files.py --pdb_dir ${pdb_dir} --write_dir ${write_dir} --traj_info_file ${traj_info_file} --valid_seq_file ${valid_seq_file} --merged_output_file ${merged_output_file} | |
| ``` | |
| This produces the `.pkl` files (large file size) and `metadata_merged.csv`. If you are using your own data, first split the dataset to obtain a validation set and pass it as `${valid_seq_file}`; see `./inference/valid_seq.csv` for an example. | |
| Processed data will be similar to `ATLAS_processed_example.tar.gz` in [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing). | |
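For your own data, the validation split mentioned above could be produced along the following lines. This is a minimal sketch that holds out whole proteins (rather than individual frames) so that no protein appears in both sets. The single `name` column, the 10% fraction, and both function names are hypothetical; check `./inference/valid_seq.csv` for the layout the pipeline actually expects.

```python
import csv
import random


def split_validation(names, valid_fraction=0.1, seed=0):
    """Shuffle unique protein names reproducibly and hold out a fraction for validation."""
    shuffled = sorted(set(names))
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def write_name_csv(names, path, column="name"):
    """Write one name per row under a single header column (hypothetical layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        writer.writerows([n] for n in names)


# Made-up identifiers standing in for your own protein names.
names = [f"protein_{i:02d}" for i in range(20)]
train_names, valid_names = split_validation(names, valid_fraction=0.1)
```

Calling `write_name_csv(valid_names, "valid_seq.csv")` then writes the held-out names to a file that can be adapted to the expected format.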
| ## Model weights | |
| Download the pretrained checkpoint `pretrained.ckpt` from [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing) and put it into the `./weights` folder. You can use these pretrained weights for inference. | |
| ## Training | |
| To train P2DFlow, first make sure you have prepared the dataset as described in `Prepare Dataset` and placed it in the right folder, then modify `./configs/base.yaml` (especially `csv_path`). After this, run: | |
| ``` | |
| python experiments/train_se3_flows.py | |
| ``` | |
| The checkpoints will be saved in `./ckpt`. | |
| ## Inference | |
| To run inference for specified protein sequences, first modify `./configs/inference.yaml` (especially `ckpt_path` and `validset_path`), then run: | |
| ``` | |
| python experiments/inference_se3_flows.py | |
| ``` | |
| The results will be written to `./inference_outputs/weights/`. | |
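Before launching inference, a quick check that the configured inputs exist can save a failed run. The sketch below uses paths mentioned in this README as example values; that `validset_path` should point at `./inference/valid_seq.csv` is an assumption, and `find_missing` is a hypothetical helper, so adjust both to match your `./configs/inference.yaml`.

```python
from pathlib import Path


def find_missing(required):
    """Return the entries of `required` (label -> path) that do not exist on disk."""
    return {label: path for label, path in required.items() if not Path(path).exists()}


# Example values only; they should mirror what is set in ./configs/inference.yaml.
required = {
    "ckpt_path": "./weights/pretrained.ckpt",
    "validset_path": "./inference/valid_seq.csv",
}
for label, path in find_missing(required).items():
    print(f"{label}: {path} not found")
```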
| ## Evaluation | |
| To evaluate metrics related to validity, fidelity and dynamics, run: | |
| ``` | |
| python ./analysis/eval_result.py --pred_org_dir ${pred_org_dir} --valid_csv_file ${valid_csv_file} --pred_merge_dir ${pred_merge_dir} --target_dir ${target_dir} --crystal_dir ${crystal_dir} | |
| ``` | |
| To evaluate PCA, run: | |
| ``` | |
| python ./analysis/pca_analyse.py --pred_pdb_dir ${pred_pdb_dir} --target_dir ${target_dir} --crystal_dir ${crystal_dir} | |
| ``` | |
| Evaluation results will be similar to `evaluation_example` in [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing). | |
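When evaluating several runs, it can be convenient to assemble the `eval_result.py` command programmatically. The sketch below builds the argument list from the flags shown above and fails early if one is missing. The helper name `build_eval_command` and the directory values are placeholders, not part of the repository.

```python
import shlex
import sys

# Flags accepted by ./analysis/eval_result.py, as listed in this README.
EVAL_FLAGS = ("pred_org_dir", "valid_csv_file", "pred_merge_dir", "target_dir", "crystal_dir")


def build_eval_command(paths, script="./analysis/eval_result.py"):
    """Build the argument list for the evaluation script from a flag -> path mapping."""
    missing = [flag for flag in EVAL_FLAGS if flag not in paths]
    if missing:
        raise ValueError(f"missing arguments: {', '.join(missing)}")
    command = [sys.executable, script]
    for flag in EVAL_FLAGS:
        command += [f"--{flag}", str(paths[flag])]
    return command


# Placeholder paths; replace them with your own directories and files.
example_paths = {
    "pred_org_dir": "path/to/pred_org",
    "valid_csv_file": "path/to/valid.csv",
    "pred_merge_dir": "path/to/pred_merge",
    "target_dir": "path/to/target",
    "crystal_dir": "path/to/crystal",
}
command = build_eval_command(example_paths)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.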
| ## License | |
| This project is licensed under the terms of the GPL-3.0 license. | |
| ## Citation | |
| ``` | |
| @article{jin2025p2dflow, | |
| title={P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching}, | |
| author={Yaowei Jin and Qi Huang and Ziyang Song and Mingyue Zheng and Dan Teng and Qian Shi}, | |
| journal={Journal of Chemical Theory and Computation}, | |
| year={2025} | |
| } | |
| ``` | |