weicaijaden committed
Commit 6d6b8ca · verified · 1 parent: 3584099

Update README.md

Files changed (1): README.md (+204 −3)
---
license: mit
---

# ContextTAD

ContextTAD is a deep-learning TAD caller that learns boundary evidence from broader local Hi-C windows that capture TAD-scale structural context. Instead of treating boundary prediction as an isolated per-bin classification problem, ContextTAD uses a context-aware representation to produce left- and right-boundary tracks that are explicitly optimized for downstream TAD assembly.

GitHub repository: https://github.com/ai4nucleome/ContextTAD

## Environment setup

Create a conda environment named `contexttad`:

```bash
conda create -n contexttad python=3.12 -y
conda activate contexttad
pip install -r requirements.txt
```

`requirements.txt` was exported from the working training environment (`3dgenome`) and is provided at the repository root:

- `requirements.txt`

Some evaluation/plotting scripts additionally require external tools:

- `Rscript` (for structural protein enrichment, `exp2_struct_protein`)
- `coolpup.py` (for coolpup pileups, `exp5_coolpup`)
- `pyGenomeTracks` (for genome track visualizations)
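Before running the evaluation steps, it can save time to confirm these external tools are actually on `PATH`. A small optional check (not part of the repository's scripts):

```bash
# Optional sanity check: report which external tools are installed.
for tool in Rscript coolpup.py pyGenomeTracks; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```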

### Download SAM3 configuration and weights

Download the SAM3 model files from:

- https://huggingface.co/facebook/sam3/tree/main

## Data preparation

**Note: Most of our data have been uploaded to Zenodo (https://doi.org/10.5281/zenodo.19062598); you only need to download the `.mcool` data from 4DN.**

The detailed data layout is documented in:

- `0-data/README.md`

The pipeline expects:

- `0-data/1_dp_train_infer_data` (training/inference arrays and labels)
- `0-data/2_eval_tads_data` (evaluation assets)

### Step 1: build GM12878 training/inference arrays

```bash
export TAD_DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
export MCOOL_TEMPLATE="/path/to/mcool/4DNFIXP4QG5B_Rao2014_GM12878_frac{frac}.mcool"

python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py
```
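The `{frac}` placeholder in `MCOOL_TEMPLATE` is filled in per downsampling fraction by the preparation script. A minimal shell sketch of that substitution (the fraction values here are illustrative, not the script's actual list):

```bash
# Illustrative only: expand the {frac} placeholder for a few example fractions.
MCOOL_TEMPLATE="/path/to/mcool/4DNFIXP4QG5B_Rao2014_GM12878_frac{frac}.mcool"
for frac in 1 0.5 0.25; do
  echo "$MCOOL_TEMPLATE" | sed "s/{frac}/$frac/"
done
# prints ..._frac1.mcool, ..._frac0.5.mcool, ..._frac0.25.mcool
```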
58
+
59
+ Optional modes:
60
+
61
+ ```bash
62
+ python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py --only-4000M
63
+ python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py --skip-4000M
64
+ ```
65
+
66
+ ### Step 2: build other-celltype inference windows (optional, for cross-cell evaluation)
67
+
68
+ ```bash
69
+ python 1-prepare_data/step1_process_data/scripts/prepare_othercell_inference_data.py \
70
+ --mcool /path/to/K562_or_IMR90.mcool::/resolutions/5000 \
71
+ --out_data_dir /path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data/other_celltypes/K562 \
72
+ --coverage_tag K562
73
+ ```
74
+
75
+ Repeat for `IMR90` with `--coverage_tag IMR90`.
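To prepare both cell types in one go, the two invocations can be wrapped in a loop (a convenience sketch; the `/path/to/...` values are placeholders, exactly as in the example above):

```bash
# Sketch: run the preparation script for both cell types in one loop.
DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
for ct in K562 IMR90; do
  python 1-prepare_data/step1_process_data/scripts/prepare_othercell_inference_data.py \
    --mcool "/path/to/${ct}.mcool::/resolutions/5000" \
    --out_data_dir "${DATA_DIR}/other_celltypes/${ct}" \
    --coverage_tag "${ct}"
done
```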
76
+
77
+ ### Step 3: build merged GT BED from labels
78
+
79
+ ```bash
80
+ export TAD_DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
81
+ python 1-prepare_data/step3_build_gt/scripts/build_ground_truth.py
82
+ ```
83
+
84
+ ## Data sources and accessions
85
+
86
+ Reference paper used to align data sourcing style:
87
+
88
+ - RefHiC: https://www.nature.com/articles/s41467-022-35231-3
89
+
90
+ The following identifiers/files are used in this project data tree.
91
+
92
+ | Category | Dataset / Cell line | Identifier or file used | Source |
93
+ |---|---|---|---|
94
+ | Hi-C mcool | GM12878 (Rao2014) | `4DNFIXP4QG5B_Rao2014_GM12878_frac1.mcool` (+ downsampled fractions) | 4DN Data Portal |
95
+ | Hi-C mcool | K562 (Rao2014) | `4DNFI4DGNY7J_Rao2014_K562_300M.mcool` | 4DN Data Portal |
96
+ | Hi-C mcool | IMR90 (Rao2014) | `4DNFIJTOIGOI_Rao2014_IMR90_1000M.mcool` | 4DN Data Portal |
97
+ | CTCF ChIP-seq | GM12878 | `ENCFF796WRU_GM12878.bed_CTCF_5kb+.bed`, `ENCFF796WRU_GM12878.bed_CTCF_5kb-.bed` | ENCODE |
98
+ | CTCF ChIP-seq | K562 | `ENCFF901CBP_K562.bed_CTCF_5kb+.bed`, `ENCFF901CBP_K562.bed_CTCF_5kb-.bed` | ENCODE |
99
+ | CTCF ChIP-seq | IMR90 | `ENCFF203SRF_IMR90.bed_CTCF_5kb+.bed`, `ENCFF203SRF_IMR90.bed_CTCF_5kb-.bed` | ENCODE |
100
+ | CTCF ChIA-PET | GM12878 | `gm12878.tang.ctcf-chiapet.hg38.bedpe` | Processed benchmark resource / ENCODE |
101
+ | CTCF ChIA-PET | K562 | `k562.encode.ctcf-chiapet.5k.hg38.bedpe` | ENCODE |
102
+ | CTCF ChIA-PET | IMR90 | `imr90_ctcf_chiapet_hg38_ENCFF682YFU.bedpe` | ENCODE |
103
+ | Structural protein peaks | GM12878 | `CTCF_peaks.bed`, `RAD21_peaks.bed`, `SMC3_peaks.bed` | TAD benchmarking resources |
104
+
105
+ ## How to run (step-by-step)
106
+
107
+ ### 1) Train ContextTAD base model
108
+
109
+ ```bash
110
+ bash 2-training/step1_train/scripts/run_train_base.sh \
111
+ 0 \
112
+ train_base_$(date +%Y%m%d_%H%M%S) \
113
+ none \
114
+ 10 \
115
+ 2
116
+ ```
117
+
118
+ Output:
119
+
120
+ - `2-training/step1_train/outputs/<run_id>/train_outputs/`
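The examples here generate a timestamped run id inline with `$(date ...)`. Capturing it in a variable first makes the output directory predictable afterwards (a convenience sketch, not a repository requirement; the positional arguments mirror the example above):

```bash
# Fix the run id once so the matching output path can be reconstructed later.
RUN_ID=train_base_$(date +%Y%m%d_%H%M%S)
bash 2-training/step1_train/scripts/run_train_base.sh 0 "$RUN_ID" none 10 2
echo "outputs: 2-training/step1_train/outputs/${RUN_ID}/train_outputs/"
```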

### 2) Inference + decode on GM12878

```bash
bash 2-training/step2_infer_decode/scripts/run_infer_decode_gm12878.sh \
    /path/to/checkpoint_epoch_005.pt \
    0 \
    infer_gm12878_$(date +%Y%m%d_%H%M%S) \
    auto \
    default
```

Output:

- `2-training/step2_infer_decode/outputs/<run_id>/beds/`

### 3) Inference + decode on K562/IMR90 (optional)

```bash
bash 2-training/step2_infer_decode/scripts/run_infer_decode_othercell.sh \
    /path/to/checkpoint_epoch_005.pt \
    0 \
    infer_othercell_$(date +%Y%m%d_%H%M%S) \
    auto \
    default
```

### 4) Evaluation

Main results:

```bash
bash 3-evaluation/step1_main_results_vs_tools/scripts/run_main_results.sh \
    /path/to/gm12878_beds_dir \
    /path/to/othercell_beds_dir \
    main_results_$(date +%Y%m%d_%H%M%S)
```

Model-ablation-style evaluation (ours-focused):

```bash
bash 3-evaluation/step2_model_ablation_ours_only/scripts/run_model_ablation_eval.sh \
    /path/to/gm12878_beds_dir \
    ablation_eval_$(date +%Y%m%d_%H%M%S)
```

## 4-pipeline one-command run

In this snapshot the directory is still named `5-fullpipeline`; it will be renamed to `4-pipeline`.

Default run (`exp1/exp3/exp4/exp6`):

```bash
bash 5-fullpipeline/run_full_pipeline.sh 0 0
```
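Because of the pending rename, a wrapper can tolerate either directory name (a defensive sketch, not part of the repository):

```bash
# Defensive sketch: use 4-pipeline if it exists, else fall back to 5-fullpipeline.
PIPE_DIR=4-pipeline
[ -d "$PIPE_DIR" ] || PIPE_DIR=5-fullpipeline
bash "$PIPE_DIR/run_full_pipeline.sh" 0 0
```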

Full run (all experiments):

```bash
bash 5-fullpipeline/run_full_pipeline.sh 0 0 full_$(date +%Y%m%d_%H%M%S) 2 29600 --all-exps
```

## Ablation usage (module and loss only)

Module ablations (examples):

```bash
bash 2-training/step1_train/scripts/run_train_no_tofe.sh 0
bash 2-training/step1_train/scripts/run_train_no_text.sh 0
bash 2-training/step1_train/scripts/run_train_no_pairloss.sh 0
bash 2-training/step1_train/scripts/run_train_obs_input.sh 0
```

Loss ablation (count loss off):

```bash
bash 2-training/step1_train/scripts/run_train_experiment.sh \
    no_count \
    0 \
    train_no_count_$(date +%Y%m%d_%H%M%S) \
    none \
    10 \
    2
```