---
license: mit
---
# ContextTAD
ContextTAD is a deep-learning TAD caller that learns boundary evidence from broader local Hi-C windows that capture TAD-scale structural context. Instead of treating boundary prediction as an isolated per-bin classification problem, ContextTAD uses a context-aware representation to produce left- and right-boundary tracks that are explicitly optimized for downstream TAD assembly.
Our GitHub repository: https://github.com/ai4nucleome/ContextTAD
## Environment setup
Create a conda environment named `contexttad`.
```bash
conda create -n contexttad python=3.12 -y
conda activate contexttad
pip install -r requirements.txt
```
`requirements.txt` (at the repository root) is exported from the working training environment (`3dgenome`).
Additional external tools required by some evaluation/plotting scripts:
- `Rscript` (for structural protein enrichment, `exp2_struct_protein`)
- `coolpup.py` (for coolpup pileups, `exp5_coolpup`)
- `pyGenomeTracks` (for genome track visualizations)
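Before running the evaluation/plotting scripts, it can help to confirm these tools are on `PATH`. This check is not part of the repository; it is a minimal sketch using the standard `command -v` lookup:

```bash
# Hypothetical pre-flight check (not a repo script): report which of the
# optional external tools are currently available on PATH.
for tool in Rscript coolpup.py pyGenomeTracks; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```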
### Download SAM3 configuration and weights
Download SAM3 model files from:
- https://huggingface.co/facebook/sam3/tree/main
## Data preparation
**Note: Most of our data have been uploaded to Zenodo (https://doi.org/10.5281/zenodo.19062598); you only need to download the `.mcool` data from 4DN.**
Detailed data layout is documented in:
- `0-data/README.md`
The pipeline expects:
- `0-data/1_dp_train_infer_data` (training/inference arrays and labels)
- `0-data/2_eval_tads_data` (evaluation assets)
### Step 1: build GM12878 training/inference arrays
```bash
export TAD_DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
export MCOOL_TEMPLATE="/path/to/mcool/4DNFIXP4QG5B_Rao2014_GM12878_frac{frac}.mcool"
python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py
```
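The `{frac}` placeholder in `MCOOL_TEMPLATE` is filled in per coverage fraction to locate each downsampled `.mcool`. A minimal sketch of that expansion using bash pattern substitution; the fraction values below are purely illustrative, not the repo's actual list:

```bash
# Illustrative only: show how a {frac} placeholder expands to concrete paths.
MCOOL_TEMPLATE="/path/to/mcool/4DNFIXP4QG5B_Rao2014_GM12878_frac{frac}.mcool"
for frac in 1 0.5 0.25; do   # hypothetical fractions
  echo "${MCOOL_TEMPLATE//\{frac\}/$frac}"
done
```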
Optional modes:
```bash
python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py --only-4000M
python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py --skip-4000M
```
### Step 2: build other-celltype inference windows (optional, for cross-cell evaluation)
```bash
python 1-prepare_data/step1_process_data/scripts/prepare_othercell_inference_data.py \
--mcool /path/to/K562_or_IMR90.mcool::/resolutions/5000 \
--out_data_dir /path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data/other_celltypes/K562 \
--coverage_tag K562
```
Repeat for `IMR90` with `--coverage_tag IMR90`.
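Since the two cell types differ only in the mcool path, output directory, and coverage tag, both invocations can be generated from one loop. A dry-run sketch (the loop is not a repo script; the flags are taken from the command above, and `echo` is used so nothing executes):

```bash
# Hypothetical convenience loop (dry run): print the per-cell-type commands.
DATA_ROOT=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
for cell in K562 IMR90; do
  echo python 1-prepare_data/step1_process_data/scripts/prepare_othercell_inference_data.py \
    --mcool "/path/to/${cell}.mcool::/resolutions/5000" \
    --out_data_dir "${DATA_ROOT}/other_celltypes/${cell}" \
    --coverage_tag "${cell}"
done
```

Drop the `echo` to actually run both preparations.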
### Step 3: build merged GT BED from labels
```bash
export TAD_DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
python 1-prepare_data/step3_build_gt/scripts/build_ground_truth.py
```
## Data sources and accessions
Data sourcing follows the style of the RefHiC paper:
- RefHiC: https://www.nature.com/articles/s41467-022-35231-3
The following identifiers/files are used in this project's data tree:
| Category | Dataset / Cell line | Identifier or file used | Source |
|---|---|---|---|
| Hi-C mcool | GM12878 (Rao2014) | `4DNFIXP4QG5B_Rao2014_GM12878_frac1.mcool` (+ downsampled fractions) | 4DN Data Portal |
| Hi-C mcool | K562 (Rao2014) | `4DNFI4DGNY7J_Rao2014_K562_300M.mcool` | 4DN Data Portal |
| Hi-C mcool | IMR90 (Rao2014) | `4DNFIJTOIGOI_Rao2014_IMR90_1000M.mcool` | 4DN Data Portal |
| CTCF ChIP-seq | GM12878 | `ENCFF796WRU_GM12878.bed_CTCF_5kb+.bed`, `ENCFF796WRU_GM12878.bed_CTCF_5kb-.bed` | ENCODE |
| CTCF ChIP-seq | K562 | `ENCFF901CBP_K562.bed_CTCF_5kb+.bed`, `ENCFF901CBP_K562.bed_CTCF_5kb-.bed` | ENCODE |
| CTCF ChIP-seq | IMR90 | `ENCFF203SRF_IMR90.bed_CTCF_5kb+.bed`, `ENCFF203SRF_IMR90.bed_CTCF_5kb-.bed` | ENCODE |
| CTCF ChIA-PET | GM12878 | `gm12878.tang.ctcf-chiapet.hg38.bedpe` | Processed benchmark resource / ENCODE |
| CTCF ChIA-PET | K562 | `k562.encode.ctcf-chiapet.5k.hg38.bedpe` | ENCODE |
| CTCF ChIA-PET | IMR90 | `imr90_ctcf_chiapet_hg38_ENCFF682YFU.bedpe` | ENCODE |
| Structural protein peaks | GM12878 | `CTCF_peaks.bed`, `RAD21_peaks.bed`, `SMC3_peaks.bed` | TAD benchmarking resources |
## How to run (step-by-step)
### 1) Train ContextTAD base model
```bash
bash 2-training/step1_train/scripts/run_train_base.sh \
0 \
train_base_$(date +%Y%m%d_%H%M%S) \
none \
10 \
2
```
Output:
- `2-training/step1_train/outputs/<run_id>/train_outputs/`
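The inference steps below take a checkpoint path as their first argument. A hypothetical helper for locating the newest checkpoint inside a run's output directory; the `checkpoint_epoch_*.pt` pattern matches the filename used in the inference commands, but the exact naming scheme is an assumption:

```bash
# Hypothetical helper (not a repo script): pick the newest checkpoint from a
# training run's outputs. Replace <run_id> with your actual run id.
TRAIN_OUT="2-training/step1_train/outputs/<run_id>/train_outputs"
CKPT=$(ls -t "${TRAIN_OUT}"/checkpoint_epoch_*.pt 2>/dev/null | head -n 1)
echo "checkpoint: ${CKPT:-none found}"
```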
### 2) Inference + decode on GM12878
```bash
bash 2-training/step2_infer_decode/scripts/run_infer_decode_gm12878.sh \
/path/to/checkpoint_epoch_005.pt \
0 \
infer_gm12878_$(date +%Y%m%d_%H%M%S) \
auto \
default
```
Output:
- `2-training/step2_infer_decode/outputs/<run_id>/beds/`
### 3) Inference + decode on K562/IMR90 (optional)
```bash
bash 2-training/step2_infer_decode/scripts/run_infer_decode_othercell.sh \
/path/to/checkpoint_epoch_005.pt \
0 \
infer_othercell_$(date +%Y%m%d_%H%M%S) \
auto \
default
```
### 4) Evaluation
Main results:
```bash
bash 3-evaluation/step1_main_results_vs_tools/scripts/run_main_results.sh \
/path/to/gm12878_beds_dir \
/path/to/othercell_beds_dir \
main_results_$(date +%Y%m%d_%H%M%S)
```
Model-ablation-style evaluation (ours-focused):
```bash
bash 3-evaluation/step2_model_ablation_ours_only/scripts/run_model_ablation_eval.sh \
/path/to/gm12878_beds_dir \
ablation_eval_$(date +%Y%m%d_%H%M%S)
```
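The beds directories passed to both evaluation scripts come from the step-2 inference outputs. A dry-run sketch of that wiring (the run id below is made up; substitute your own, and drop `echo` to execute):

```bash
# Hypothetical glue (dry run): point the evaluation at a step-2 beds directory.
INFER_RUN="infer_gm12878_20250101_000000"   # made-up run id
GM_BEDS="2-training/step2_infer_decode/outputs/${INFER_RUN}/beds"
echo bash 3-evaluation/step1_main_results_vs_tools/scripts/run_main_results.sh \
  "$GM_BEDS" \
  /path/to/othercell_beds_dir \
  "main_results_$(date +%Y%m%d_%H%M%S)"
```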
## 4-pipeline one-command run
In this snapshot, the directory is currently named `5-fullpipeline` and will be renamed to `4-pipeline`.
Default run (`exp1/exp3/exp4/exp6`):
```bash
bash 5-fullpipeline/run_full_pipeline.sh 0 0
```
Full run (all experiments):
```bash
bash 5-fullpipeline/run_full_pipeline.sh 0 0 full_$(date +%Y%m%d_%H%M%S) 2 29600 --all-exps
```
## Ablation usage (module and loss only)
Module ablations (examples):
```bash
bash 2-training/step1_train/scripts/run_train_no_tofe.sh 0
bash 2-training/step1_train/scripts/run_train_no_text.sh 0
bash 2-training/step1_train/scripts/run_train_no_pairloss.sh 0
bash 2-training/step1_train/scripts/run_train_obs_input.sh 0
```
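All four module ablations share the same single GPU-id argument, so they can be queued from one loop. A dry-run sketch (the wrapper loop is not a repo script; drop `echo` to launch them sequentially):

```bash
# Hypothetical wrapper (dry run): print the launch command for each ablation.
for script in run_train_no_tofe run_train_no_text run_train_no_pairloss run_train_obs_input; do
  echo bash "2-training/step1_train/scripts/${script}.sh" 0
done
```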
Loss ablation (count loss off):
```bash
bash 2-training/step1_train/scripts/run_train_experiment.sh \
no_count \
0 \
train_no_count_$(date +%Y%m%d_%H%M%S) \
none \
10 \
2
``` |