COSTE + DST-GNN Manuscript Code
The repository path hutaobo/ccst-spatial-clustering is retained for continuity from an earlier private working name. The slug is historical; the public contents in this release are a cleaned DST-GNN implementation for the lung fibrosis manuscript workflow and are not a CCST release.
Overview
This repository contains a manuscript-aligned public implementation of the diffusion-based spatio-temporal graph neural network (DST-GNN) analysis described in:
- Cophenetic Spatial Topology Embedding reveals multiscale tissue architecture in spatial omics
The implementation was reconstructed from the original analysis notebooks and cross-checked against the manuscript text. In particular, the public code follows the manuscript-level assumptions that:
- the cohort contains 45 tissue samples grouped into three stages: T1 = HD, T2 = LA, T3 = MA
- graphs are defined over 47 predefined cell types
- edge values are COSTE/SSS-style spatial separation scores in [0, 1], where smaller values indicate closer proximity
- missing cell-type pairs are treated as absent and filled with SSS = 1.0
- node features are one-hot identity vectors
- optimization uses Adam with an MSE objective
- explainability is performed with PyTorch Geometric's GNNExplainer
What Is In This Repository
- data/inputs/: bundled direct DST-GNN input CSV for the lung fibrosis cohort
- data/outputs/formal_release/: bundled formal release outputs generated by the cleaned public pipeline
- data/DATA_PROVENANCE.md: provenance notes for the bundled input and output data
- src/dst_gnn/: cleaned Python package for data loading, temporal graph construction, model definition, training, and explainability
- scripts/run_dst_gnn.py: end-to-end CLI that builds stage graphs, trains DST-GNN, ranks dynamic nodes and edges, and optionally runs GNNExplainer
- scripts/verify_repro.py: output comparison helper for checking a rerun against the bundled formal release
- references/RECOVERY_NOTES.md: provenance notes describing the recovered notebooks and the manuscript-alignment corrections applied here
- upload_to_hf.py: helper for synchronizing this folder to the Hugging Face Hub
Bundled Data
The repository now includes the direct DST-GNN input table and a formal output release so users can rerun and verify the pipeline without searching for additional intermediate files.
- Input: data/inputs/cophenetic_distances_searcher_D_score_in_all_samples.csv
- Formal release outputs: data/outputs/formal_release/
The input CSV is the recovered cohort-level COSTE/SSS table with 45 samples, 47 cell types, and stage labels mapped as HD -> T1, LA -> T2, MA -> T3.
See data/DATA_PROVENANCE.md for the input lineage and the exact formal release settings.
Public Implementation Choices
The recovered notebook material contained exploratory and partially duplicated cells. This public release keeps the original analytical intent but makes several choices explicit:
- Sample-level COSTE/SSS matrices are reconstructed into full 47 x 47 directed graphs per sample.
- Missing pairs are filled with 1.0 before aggregation, matching the manuscript description of absent spatial associations.
- Stage graphs are formed by averaging per-sample matrices within HD, LA, and MA.
- Message passing uses an affinity transform 1 - SSS, so stronger spatial association yields stronger graph connectivity.
- Temporal structure is modeled explicitly as the ordered sequence T1 -> T2 -> T3, with a GCN-based encoder, GRU-style temporal state update, and pairwise decoder trained to predict later-stage spatial relationships.
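The first four choices in this list can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in src/dst_gnn/: the function names (build_sample_matrix, average_matrices, to_affinity, stage_affinities) and the three-cell-type toy data are hypothetical, and the diagonal is filled like any other missing pair because this README does not say how self-pairs are handled.

```python
from collections import defaultdict

MISSING_SSS = 1.0  # absent pairs are treated as maximally separated


def build_sample_matrix(records, cell_types):
    """Build a dense directed SSS matrix for one sample.

    records: iterable of (row, column, value) tuples for a single sample.
    Pairs that never appear in records keep the fill value MISSING_SSS.
    """
    index = {name: i for i, name in enumerate(cell_types)}
    n = len(cell_types)
    matrix = [[MISSING_SSS] * n for _ in range(n)]
    for row, column, value in records:
        matrix[index[row]][index[column]] = float(value)
    return matrix


def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices."""
    n = len(matrices[0])
    count = len(matrices)
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(n)]
        for i in range(n)
    ]


def to_affinity(sss_matrix):
    """Apply 1 - SSS so that closer pairs get stronger edges."""
    return [[1.0 - value for value in row] for row in sss_matrix]


def stage_affinities(rows, cell_types):
    """rows: iterable of (row, column, value, sample, group) tuples."""
    per_sample = defaultdict(list)
    sample_stage = {}
    for row, column, value, sample, group in rows:
        per_sample[sample].append((row, column, value))
        sample_stage[sample] = group
    per_stage = defaultdict(list)
    for sample, records in per_sample.items():
        # Missing pairs are filled inside build_sample_matrix,
        # i.e. before the per-stage averaging below.
        per_stage[sample_stage[sample]].append(
            build_sample_matrix(records, cell_types)
        )
    return {
        stage: to_affinity(average_matrices(mats))
        for stage, mats in per_stage.items()
    }


# Toy example: three hypothetical cell types and two HD samples.
cell_types = ["A", "B", "C"]
rows = [
    ("A", "B", 0.2, "s1", "HD"),
    ("A", "B", 0.4, "s2", "HD"),
    ("B", "C", 0.5, "s2", "HD"),
]
affinity = stage_affinities(rows, cell_types)
```

Filling before averaging matters: in the toy data the pair (B, C) is observed in only one of two samples, so its stage-level SSS is the mean of 0.5 and the fill value 1.0 rather than 0.5 alone.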
Expected Input
scripts/run_dst_gnn.py expects a CSV with these columns:
- row
- column
- value
- sample
- group
This matches the recovered manuscript input table cophenetic_distances_searcher_D_score_in_all_samples.csv.
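Before running the pipeline on a different table, a quick header and range check can catch format problems early. The sketch below is hypothetical and not part of the repository: validate_table is an illustrative name, and it assumes that value holds an SSS score in [0, 1] as described in the Overview. Extra columns are tolerated.

```python
import csv
import io

REQUIRED_COLUMNS = ("row", "column", "value", "sample", "group")


def validate_table(handle):
    """Return the number of data rows, raising ValueError on bad input."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    count = 0
    # Data starts on line 2 because line 1 is the header.
    for line_number, record in enumerate(reader, start=2):
        value = float(record["value"])
        if not 0.0 <= value <= 1.0:
            raise ValueError(
                f"line {line_number}: value {value} outside [0, 1]"
            )
        count += 1
    return count


# In-memory toy table; a real call would pass an open file handle instead.
example = io.StringIO(
    "row,column,value,sample,group\n"
    "A,B,0.25,s1,HD\n"
    "B,A,0.75,s1,HD\n"
)
n_rows = validate_table(example)
```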
Usage
Install PyTorch and PyTorch Geometric with versions appropriate for your CPU/CUDA environment, then install the remaining dependencies:
pip install -r requirements.txt
Run the end-to-end analysis:
python scripts/run_dst_gnn.py \
--csv /path/to/cophenetic_distances_searcher_D_score_in_all_samples.csv \
--output-dir outputs/lung_fibrosis_dst_gnn \
--epochs 400 \
--run-explainer
The script writes stage-level matrices, training history, predicted next-stage graphs, top changing nodes, top changing edges, and optional explainer outputs.
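One plausible way to rank changing edges is by the absolute difference between two stage matrices. The sketch below illustrates that idea only. This README does not state the ranking criterion used by scripts/run_dst_gnn.py, so the absolute-difference rule, the exclusion of self-pairs, and the name top_changing_edges are all assumptions.

```python
def top_changing_edges(earlier, later, cell_types, top_k=20):
    """Rank directed edges by absolute change between two stage matrices.

    earlier, later: square nested lists indexed in cell_types order.
    Returns up to top_k tuples of (source, target, signed_change).
    """
    n = len(cell_types)
    changes = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-pairs are not ranked
            delta = later[i][j] - earlier[i][j]
            changes.append((cell_types[i], cell_types[j], delta))
    changes.sort(key=lambda item: abs(item[2]), reverse=True)
    return changes[:top_k]


# Toy stage matrices for three hypothetical cell types.
cell_types = ["A", "B", "C"]
earlier = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
]
later = [
    [0.0, 0.7, 0.8],
    [0.3, 0.0, 0.5],
    [0.9, 0.1, 0.0],
]
ranked = top_changing_edges(earlier, later, cell_types, top_k=2)
```

Keeping the signed change in the output lets a reader distinguish pairs that moved apart (positive change in SSS) from pairs that moved closer (negative change).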
Reproducibility
The bundled formal release was generated with the cleaned public implementation using:
- device=cpu
- seed=0
- hidden_channels=32
- dropout=0.0
- lr=0.01
- weight_decay=5e-4
- epochs=400
- top_k=20
- run_explainer=true
Reproduce the public release from the bundled input:
python scripts/run_dst_gnn.py \
--csv data/inputs/cophenetic_distances_searcher_D_score_in_all_samples.csv \
--output-dir outputs/repro_check \
--device cpu \
--seed 0 \
--hidden-channels 32 \
--dropout 0.0 \
--lr 0.01 \
--weight-decay 5e-4 \
--epochs 400 \
--top-k 20 \
--run-explainer
Verify the rerun against the bundled formal release:
python scripts/verify_repro.py \
--expected-dir data/outputs/formal_release \
--actual-dir outputs/repro_check
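Floating-point results can differ in the last digits across platforms and library versions, so comparing numeric outputs with a tolerance is usually more robust than a byte-for-byte comparison. The following sketch shows one way such a check could look. It is not the logic of scripts/verify_repro.py, which this README does not describe: tables_match and its default tolerances are hypothetical.

```python
import csv
import io
import math


def tables_match(expected, actual, rel_tol=1e-6, abs_tol=1e-8):
    """Compare two CSV streams cell by cell.

    Cells that both parse as numbers are compared with a tolerance;
    all other cells must be identical strings.
    """
    expected_rows = list(csv.reader(expected))
    actual_rows = list(csv.reader(actual))
    if len(expected_rows) != len(actual_rows):
        return False
    for exp_row, act_row in zip(expected_rows, actual_rows):
        if len(exp_row) != len(act_row):
            return False
        for exp_cell, act_cell in zip(exp_row, act_row):
            try:
                same = math.isclose(
                    float(exp_cell),
                    float(act_cell),
                    rel_tol=rel_tol,
                    abs_tol=abs_tol,
                )
            except ValueError:
                # At least one cell is not numeric: fall back to text.
                same = exp_cell == act_cell
            if not same:
                return False
    return True


# In-memory toy tables; real use would open files from both directories.
reference = "source,target,score\nA,B,0.5\n"
rerun_close = "source,target,score\nA,B,0.5000000001\n"
rerun_far = "source,target,score\nA,B,0.6\n"
close_ok = tables_match(io.StringIO(reference), io.StringIO(rerun_close))
far_ok = tables_match(io.StringIO(reference), io.StringIO(rerun_far))
```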
The repository intentionally bundles the direct DST-GNN input and the formal output release, but not the original raw Xenium data. Manuscript conclusions should still be attributed to the associated paper.
Recovery Notes
The primary recovered sources were:
- Y:/long/publication_datasets/Vannan_2023_Lung_Fibrosis/notebook/GNN modelling.ipynb
- Y:/long/publication_datasets/Vannan_2023_Lung_Fibrosis/notebook/Expression Distance Similarity.ipynb
Mirrored copies were also located on the connected A100 server under /mnt/taobo.hu/long/publication_datasets/Vannan_2023_Lung_Fibrosis/.
See references/RECOVERY_NOTES.md for the full provenance summary.
Citation
Please cite the associated manuscript for biological findings and figure-level conclusions. The repository-level citation metadata is provided in CITATION.cff.
License
The currently published repository contents are distributed under a non-commercial license. See LICENSE.md for details.