ContextTAD

ContextTAD is a deep-learning TAD caller that learns boundary evidence from broader local Hi-C windows that capture TAD-scale structural context. Instead of treating boundary prediction as an isolated per-bin classification problem, ContextTAD uses a context-aware representation to produce left- and right-boundary tracks that are explicitly optimized for downstream TAD assembly.
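
To make the idea of assembling TADs from separate left- and right-boundary tracks concrete, here is a toy sketch. It is not ContextTAD's decoder: the function name, the threshold, the size limits, and the product score are all hypothetical choices made for illustration only.

```python
def assemble_tads(left, right, threshold=0.5, min_bins=2, max_bins=400):
    """Pair left- and right-boundary bins into candidate TADs (toy example).

    left, right: per-bin boundary scores of equal length.
    Returns (left_bin, right_bin, score) tuples, highest score first.
    All parameters are illustrative assumptions, not ContextTAD settings.
    """
    left_bins = [i for i, s in enumerate(left) if s >= threshold]
    right_bins = [j for j, s in enumerate(right) if s >= threshold]
    candidates = []
    for i in left_bins:
        for j in right_bins:
            # Keep pairs whose span (in bins) is within the allowed range.
            if min_bins <= j - i <= max_bins:
                candidates.append((i, j, left[i] * right[j]))
    return sorted(candidates, key=lambda t: t[2], reverse=True)


if __name__ == "__main__":
    left = [0.9, 0, 0, 0, 0, 0.8, 0, 0, 0, 0]
    right = [0, 0, 0, 0, 0.7, 0, 0, 0, 0, 0.95]
    for start_bin, end_bin, score in assemble_tads(left, right):
        print(start_bin, end_bin, round(score, 3))
```

The point of the sketch is only that two directional tracks give a decoder explicit start and end candidates to pair, which a single undirected boundary track does not.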

GitHub repository: https://github.com/ai4nucleome/ContextTAD

Environment setup

Create a conda environment named contexttad.

conda create -n contexttad python=3.12 -y
conda activate contexttad
pip install -r requirements.txt

requirements.txt is exported from the working training environment (3dgenome), and is provided at:

requirements.txt

Additional external tools required by some evaluation/plotting scripts:

Rscript (for structural protein enrichment, exp2_struct_protein)
coolpup.py (for coolpup pileups, exp5_coolpup)
pyGenomeTracks (for genome track visualizations)
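
Before running the evaluation or plotting scripts, it may help to confirm that these tools are on your PATH. The following is a minimal sketch, not part of the repository; it assumes the executables are named exactly as listed above.

```python
import shutil

# Executable names as listed above; adjust if your installation differs.
REQUIRED_TOOLS = ["Rscript", "coolpup.py", "pyGenomeTracks"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found on PATH.")
```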

Download SAM3 configuration and weights

Download SAM3 model files from:

https://huggingface.co/facebook/sam3/tree/main

Data preparation

Note: Most of our data have been uploaded to Zenodo (https://doi.org/10.5281/zenodo.19062598); you only need to download the .mcool data from 4DN.

Detailed data layout is documented in:

0-data/README.md

The pipeline expects:

0-data/1_dp_train_infer_data (training/inference arrays and labels)
0-data/2_eval_tads_data (evaluation assets)
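
A quick way to confirm the two expected directories are in place is sketched below. This helper is hypothetical and not shipped with the repository; it checks only that the two directories named above exist, not their contents.

```python
from pathlib import Path

# Sub-directories of 0-data that the pipeline expects (from the list above).
EXPECTED_SUBDIRS = ["1_dp_train_infer_data", "2_eval_tads_data"]


def missing_data_dirs(data_root):
    """Return the expected sub-directories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "0-data" is resolved relative to the current working directory.
    missing = missing_data_dirs("0-data")
    print("Missing:", missing if missing else "none")
```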

Step 1: build GM12878 training/inference arrays

export TAD_DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
export MCOOL_TEMPLATE="/path/to/mcool/4DNFIXP4QG5B_Rao2014_GM12878_frac{frac}.mcool"

python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py

Optional modes:

python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py --only-4000M
python 1-prepare_data/step2_prepare_labels/scripts/prepare_data.py --skip-4000M
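
MCOOL_TEMPLATE contains a {frac} placeholder that stands for the downsampling fraction in the file name. The sketch below shows one way to expand it and check which files exist before launching prepare_data.py. How the script itself fills the placeholder is an assumption here (Python's str.format is used for illustration), and only the label "1" is taken from this README (the frac1 file); any other labels are yours to supply.

```python
import os

DEFAULT_TEMPLATE = "/path/to/mcool/4DNFIXP4QG5B_Rao2014_GM12878_frac{frac}.mcool"


def expand_mcool_template(template, fracs):
    """Map each fraction label to the path obtained by filling {frac}."""
    return {frac: template.format(frac=frac) for frac in fracs}


if __name__ == "__main__":
    template = os.environ.get("MCOOL_TEMPLATE", DEFAULT_TEMPLATE)
    # "1" matches the frac1 file named in this README; add your own labels.
    for frac, path in expand_mcool_template(template, ["1"]).items():
        status = "found" if os.path.exists(path) else "missing"
        print(f"frac={frac}: {path} ({status})")
```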

Step 2: build other-celltype inference windows (optional, for cross-cell evaluation)

python 1-prepare_data/step1_process_data/scripts/prepare_othercell_inference_data.py \
  --mcool /path/to/K562_or_IMR90.mcool::/resolutions/5000 \
  --out_data_dir /path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data/other_celltypes/K562 \
  --coverage_tag K562

Repeat for IMR90 with --coverage_tag IMR90.
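
The --mcool value above is a cooler-style URI: a file path, then ::, then the group path of one resolution inside the multi-resolution file. The sketch below splits such a URI with plain string operations so you can verify the path and resolution before running the script. The function name is hypothetical, and the sketch does not open the file.

```python
def split_cooler_uri(uri):
    """Split 'file.mcool::/resolutions/5000' into (file_path, group_path, resolution).

    resolution is an int when the group path has the form /resolutions/<N>,
    otherwise None.
    """
    file_path, sep, group_path = uri.partition("::")
    if not sep:
        # No group suffix: treat the data as living at the file root.
        return file_path, "/", None
    parts = [p for p in group_path.split("/") if p]
    resolution = None
    if len(parts) == 2 and parts[0] == "resolutions" and parts[1].isdigit():
        resolution = int(parts[1])
    return file_path, "/" + "/".join(parts), resolution


if __name__ == "__main__":
    print(split_cooler_uri("/path/to/K562_or_IMR90.mcool::/resolutions/5000"))
    # → ('/path/to/K562_or_IMR90.mcool', '/resolutions/5000', 5000)
```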

Step 3: build merged ground-truth (GT) BED from labels

export TAD_DATA_DIR=/path/to/TADAnno_for_publish/0-data/1_dp_train_infer_data
python 1-prepare_data/step3_build_gt/scripts/build_ground_truth.py

Data sources and accessions

Data sourcing follows the style of this reference paper:

RefHiC: https://www.nature.com/articles/s41467-022-35231-3

The following identifiers/files are used in this project data tree.

Category	Dataset / Cell line	Identifier or file used	Source
Hi-C mcool	GM12878 (Rao2014)	`4DNFIXP4QG5B_Rao2014_GM12878_frac1.mcool` (+ downsampled fractions)	4DN Data Portal
Hi-C mcool	K562 (Rao2014)	`4DNFI4DGNY7J_Rao2014_K562_300M.mcool`	4DN Data Portal
Hi-C mcool	IMR90 (Rao2014)	`4DNFIJTOIGOI_Rao2014_IMR90_1000M.mcool`	4DN Data Portal
CTCF ChIP-seq	GM12878	`ENCFF796WRU_GM12878.bed_CTCF_5kb+.bed`, `ENCFF796WRU_GM12878.bed_CTCF_5kb-.bed`	ENCODE
CTCF ChIP-seq	K562	`ENCFF901CBP_K562.bed_CTCF_5kb+.bed`, `ENCFF901CBP_K562.bed_CTCF_5kb-.bed`	ENCODE
CTCF ChIP-seq	IMR90	`ENCFF203SRF_IMR90.bed_CTCF_5kb+.bed`, `ENCFF203SRF_IMR90.bed_CTCF_5kb-.bed`	ENCODE
CTCF ChIA-PET	GM12878	`gm12878.tang.ctcf-chiapet.hg38.bedpe`	Processed benchmark resource / ENCODE
CTCF ChIA-PET	K562	`k562.encode.ctcf-chiapet.5k.hg38.bedpe`	ENCODE
CTCF ChIA-PET	IMR90	`imr90_ctcf_chiapet_hg38_ENCFF682YFU.bedpe`	ENCODE
Structural protein peaks	GM12878	`CTCF_peaks.bed`, `RAD21_peaks.bed`, `SMC3_peaks.bed`	TAD benchmarking resources

How to run (step-by-step)

1) Train ContextTAD base model

bash 2-training/step1_train/scripts/run_train_base.sh \
  0 \
  train_base_$(date +%Y%m%d_%H%M%S) \
  none \
  10 \
  2

Output:

2-training/step1_train/outputs/<run_id>/train_outputs/

2) Inference + decode on GM12878

bash 2-training/step2_infer_decode/scripts/run_infer_decode_gm12878.sh \
  /path/to/checkpoint_epoch_005.pt \
  0 \
  infer_gm12878_$(date +%Y%m%d_%H%M%S) \
  auto \
  default

Output:

2-training/step2_infer_decode/outputs/<run_id>/beds/
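
To sanity-check a decoded BED file, you could count the intervals and look at their sizes. The sketch below is not part of the repository and assumes only that the first three whitespace-separated columns are chromosome, start, and end, as in standard BED; the exact column layout of ContextTAD's output files is not documented in this README.

```python
import statistics


def summarize_bed(lines):
    """Count intervals and report the median size from BED-formatted lines.

    Assumes columns 1-3 are chrom, start, end; extra columns are ignored.
    """
    sizes = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and UCSC-style header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        sizes.append(end - start)
    if not sizes:
        return {"count": 0, "median_size": None}
    return {"count": len(sizes), "median_size": statistics.median(sizes)}


if __name__ == "__main__":
    example = ["chr1\t0\t100000", "chr1\t100000\t350000"]
    print(summarize_bed(example))
```

For a real file, pass an open file handle, for example `summarize_bed(open(path))`, where `path` points to one of the files under beds/.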

3) Inference + decode on K562/IMR90 (optional)

bash 2-training/step2_infer_decode/scripts/run_infer_decode_othercell.sh \
  /path/to/checkpoint_epoch_005.pt \
  0 \
  infer_othercell_$(date +%Y%m%d_%H%M%S) \
  auto \
  default

4) Evaluation

Main results:

bash 3-evaluation/step1_main_results_vs_tools/scripts/run_main_results.sh \
  /path/to/gm12878_beds_dir \
  /path/to/othercell_beds_dir \
  main_results_$(date +%Y%m%d_%H%M%S)

Model-ablation-style evaluation (ContextTAD outputs only):

bash 3-evaluation/step2_model_ablation_ours_only/scripts/run_model_ablation_eval.sh \
  /path/to/gm12878_beds_dir \
  ablation_eval_$(date +%Y%m%d_%H%M%S)

4-pipeline one-command run

In this snapshot, the directory is currently named 5-fullpipeline and will be renamed to 4-pipeline.

Default run (exp1/exp3/exp4/exp6):

bash 5-fullpipeline/run_full_pipeline.sh 0 0

Full run (all experiments):

bash 5-fullpipeline/run_full_pipeline.sh 0 0 full_$(date +%Y%m%d_%H%M%S) 2 29600 --all-exps

Ablation usage (module and loss only)

Module ablations (examples):

bash 2-training/step1_train/scripts/run_train_no_tofe.sh 0
bash 2-training/step1_train/scripts/run_train_no_text.sh 0
bash 2-training/step1_train/scripts/run_train_no_pairloss.sh 0
bash 2-training/step1_train/scripts/run_train_obs_input.sh 0

Loss ablation (count loss off):

bash 2-training/step1_train/scripts/run_train_experiment.sh \
  no_count \
  0 \
  train_no_count_$(date +%Y%m%d_%H%M%S) \
  none \
  10 \
  2
