[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb)

> **Or run from Hugging Face:** open → *File* → *Open notebook* → *URL* tab → paste
> `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb`

## Demo

A self-contained, CPU-only Colab notebook is provided at [`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix, ~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score distribution and per-subset means, and writes A3M files ready for AF2. Expected end-to-end runtime on a free Colab CPU instance: **~2 minutes**.

# SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction. This is the open-source workshop distribution of two subset methods from the SF-Cluster benchmark:

- **mosaic**: each subset mixes high / mid / low contrast-FI sequences.
- **gradient**: each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI) matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq) (HF model: `leuschj/FrustrAI-Seq`).

This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is designed to be a drop-in replacement for random / uniform MSA subsampling in [AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.

## Algorithm

Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a per-residue FI matrix `F ∈ ℝ^{N×L}`:

1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences.
2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`, `LV = ¬HV`.
3. **Contrast score** per sequence:
   ```
   contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
   ```
4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32): sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles; for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with `np.random.default_rng(seed=s)`.
5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32): split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and `s ∈ {0..2}` draw 32 sequences from that bin only with `np.random.default_rng(seed=10*b + s)`.

## Install

```bash
pip install -e .
```

Python ≥ 3.10. Dependencies: `numpy`, `scipy`.

## Inputs

You need two files per case:

1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters are preserved verbatim in output subsets; only match-state (uppercase) columns are scored.
2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is the number of sequences in the A3M and `L` is the number of match-state columns.

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see `https://github.com/leuschj/FrustrAI-Seq` (model card: `https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions. A reference usage pattern is documented in `examples/run_demo.sh`.

## CLI

```bash
sf-cluster build \
  --a3m path/to/filtered.a3m \
  --fi path/to/fi_matrix.npy \
  --method mosaic \
  --n-subsets 12 \
  --subset-size 32 \
  --seed 20260422 \
  --out subsets/kaib_mosaic/
```

Outputs:

```
subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv   # subset_id, pool_index, header, score
└── mosaic_meta.json          # provenance + score stats
```

## Library

```python
from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix)   # (N,) per-sequence
subsets = method_mosaic(score)          # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)
```

Each subset is a list of indices into `pool.headers` / `pool.sequences`.
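
For readers who want to follow the Algorithm steps without installing the package, the sketch below reimplements them in plain NumPy on synthetic data. It is an illustration, not the package's code. The function names `contrast_score`, `mosaic_sketch`, and `gradient_sketch` are hypothetical. It also assumes population variance (`ddof=0`), near-equal splits via `np.array_split`, a low/mid/high draw order within each mosaic subset, and a fallback to sampling with replacement when a bin is too small. The packaged `contrast_hvlv`, `method_mosaic`, and `method_gradient` may differ on any of these points.

```python
import numpy as np


def contrast_score(fi, pct=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    v = fi.var(axis=0)                 # per-column variance over sequences (ddof=0 assumed)
    hv = v >= np.percentile(v, pct)    # high-variance column mask; LV is its complement
    # Assumes both masks are non-empty (not the case if all variances are equal).
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)


def mosaic_sketch(score, n_subsets=12, counts=(11, 10, 11)):
    """Step 4: draw 11 low + 10 mid + 11 high per subset, seeded by subset id."""
    terciles = np.array_split(np.argsort(score), 3)   # low, mid, high
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [
            rng.choice(t, size=k, replace=len(t) < k)  # replacement only if bin too small
            for t, k in zip(terciles, counts)
        ]
        subsets.append(np.concatenate(picks).tolist())
    return subsets


def gradient_sketch(score, n_bins=4, per_bin=3, size=32):
    """Step 5: each subset is drawn from a single quartile, seed = 10*b + s."""
    bins = np.array_split(np.argsort(score), n_bins)
    subsets = []
    for b in range(n_bins):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            draw = rng.choice(bins[b], size=size, replace=len(bins[b]) < size)
            subsets.append(draw.tolist())
    return subsets


# Synthetic stand-in for a FrustrAI-Seq FI matrix: 200 sequences x 90 columns.
fi = np.random.default_rng(0).normal(size=(200, 90))
score = contrast_score(fi)
mosaic = mosaic_sketch(score)
gradient = gradient_sketch(score)
print(len(mosaic), len(mosaic[0]), len(gradient), len(gradient[0]))
```

Because every generator is seeded from the subset or bin index alone, calling either function twice on the same `score` returns identical index lists.
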
## Reproducibility

All RNG draws use `np.random.default_rng(seed=...)` with method-specific deterministic seeds (see Algorithm §4–§5). Re-running the same A3M + FI matrix yields byte-identical subset assignments. The CLI also records a provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the package version.

## LIMITATIONS

- **No frustration model included.** You must run FrustrAI-Seq separately to obtain the `(N_seq, L)` FI matrix. This package only handles the scoring + subset-construction stage.
- **No AF2 runner included.** The package emits A3M files; downstream inference (AF2 / ColabFold) is the user's responsibility.
- **Only `mosaic` and `gradient` arms are open-sourced here.** The other SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional feature pipelines and are intentionally excluded from this workshop release.
- **No re-sampling guarantee across subsets.** A sequence can appear in multiple subsets (gradient draws from a single quartile with replacement if the quartile is smaller than `subset_size`).
- **Empirical caveat (read this).** A controlled comparison shows that uniform subsampling performs equivalently on most Main-21 cases; see the paper for the boundary conditions under which contrast-FI stratification yields a measurable lift over random subsampling. Treat this package as a research baseline, not a turnkey accuracy improvement.

## Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).

## License

MIT. See `LICENSE`.