| [Open in Colab](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb) |
|
|
| > **Or run from Hugging Face:** open <https://colab.research.google.com/> → *File* → *Open notebook* → *URL* tab → paste |
| > `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb` |
|
|
| ## Demo |
|
|
| A self-contained, CPU-only Colab notebook is provided at |
| [`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the |
| package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix, |
| ~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score |
| distribution and per-subset means, and writes A3M files ready for AF2. Expected |
| end-to-end runtime on a free Colab CPU instance: **~2 minutes**. |
|
|
| # SF-Cluster (workshop OSS release) |
|
|
| Frustration-guided MSA subset builders for AlphaFold2 multi-conformer |
| prediction. This is the open-source workshop distribution of two subset |
| methods from the SF-Cluster benchmark: |
|
|
| - **mosaic**: each subset mixes high / mid / low contrast-FI sequences. |
| - **gradient**: each subset is homogeneous within a contrast-FI quartile. |
|
|
| The contrast score is computed from a per-residue Frustration Index (FI) |
| matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq) |
| (HF model: `leuschj/FrustrAI-Seq`). |
|
|
| This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is |
| designed to be a drop-in replacement for random / uniform MSA subsampling in |
| [AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines. |
|
|
| ## Algorithm |
|
|
| Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a |
| per-residue FI matrix `F ∈ ℝ^{N×L}`: |
|
|
| 1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences. |
| 2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`, |
| `LV = ¬HV`. |
| 3. **Contrast score** per sequence: |
| ``` |
| contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l} |
| ``` |
| 4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32): |
| sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles; |
| for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with |
| `np.random.default_rng(seed=s)`. |
| 5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32): |
| split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and |
| `s ∈ {0..2}` draw 32 sequences from that bin only with |
| `np.random.default_rng(seed=10*b + s)`. |
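
The five steps above can be sketched in plain NumPy. This is an illustrative reading of the description, not the package's implementation: the function names (`contrast_score`, `mosaic_subsets`, `gradient_subsets`) are hypothetical, the variance uses NumPy's default population form, the tercile and quartile splits use `np.array_split`, and drawing with replacement only when a bin is smaller than the request is an assumption for the mosaic case.

```python
import numpy as np

N_SUBSETS = 12
TARGET_SIZE = 32


def contrast_score(fi, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    col_var = fi.var(axis=0)  # v_l, variance over sequences (ddof=0, assumed)
    hv = col_var >= np.percentile(col_var, hv_percentile)
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)


def _draw(rng, bin_idx, size):
    # Assumption: sample with replacement only if the bin is too small.
    return rng.choice(bin_idx, size=size, replace=len(bin_idx) < size)


def mosaic_subsets(score, n_subsets=N_SUBSETS, n_high=11, n_low=11, n_mid=10):
    """Step 4: each subset mixes high, low, and mid contrast terciles."""
    low, mid, high = np.array_split(np.argsort(score), 3)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets


def gradient_subsets(score, n_bins=4, per_bin=3, size=TARGET_SIZE):
    """Step 5: each subset is drawn from a single contrast quartile."""
    subsets = []
    for b, bin_idx in enumerate(np.array_split(np.argsort(score), n_bins)):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, bin_idx, size).tolist())
    return subsets


# Synthetic stand-in for a real FI matrix: 200 sequences, 90 match columns.
fi = np.random.default_rng(0).normal(size=(200, 90))
score = contrast_score(fi)
mosaic = mosaic_subsets(score)
gradient = gradient_subsets(score)
print(len(mosaic), len(gradient), len(mosaic[0]), len(gradient[0]))
```

The order in which the high, low, and mid terciles are drawn from the generator affects the exact indices selected, so this sketch should not be expected to reproduce the package's subsets index for index.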
|
|
| ## Install |
|
|
| ```bash |
| pip install -e . |
| ``` |
|
|
| Python ≥ 3.10. Dependencies: `numpy`, `scipy`. |
|
|
| ## Inputs |
|
|
| You need two files per case: |
|
|
| 1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters |
| are preserved verbatim in output subsets; only match-state (uppercase) |
| columns are scored. |
| 2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is |
| the number of sequences in the A3M and `L` is the number of match-state |
| columns. |
|
|
| The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see |
| `https://github.com/leuschj/FrustrAI-Seq` (model card: |
| `https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions. |
| A reference usage pattern is documented in `examples/run_demo.sh`. |
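
Because the FI matrix must line up row-for-row and column-for-column with the A3M, it can help to check the expected shape before building subsets. The sketch below is a minimal standalone check, assuming a ColabFold-style A3M in which lines starting with `#` are comments and lowercase letters are insertions; `parse_a3m`, `match_length`, and `expected_fi_shape` are hypothetical helpers, not part of the package.

```python
import numpy as np


def parse_a3m(text):
    """Return (headers, sequences); sequences keep lowercase insertions."""
    headers, sequences = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and ColabFold-style comment lines
        if line.startswith(">"):
            headers.append(line)
            sequences.append("")
        elif sequences:
            sequences[-1] += line
    return headers, sequences


def match_length(seq):
    """Count match-state columns: uppercase letters and '-' gaps."""
    return sum(1 for c in seq if not c.islower())


def expected_fi_shape(sequences):
    """Shape (N_seq, L) an FI matrix should have for these sequences."""
    lengths = {match_length(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    return (len(sequences), lengths.pop())


demo = "#4\t1\n>query\nACDE\n>hit1\nAC-Ekk\n>hit2\nAaCDE\n"
headers, sequences = parse_a3m(demo)
fi = np.zeros((3, 4))  # stand-in for np.load("path/to/fi_matrix.npy")
if fi.shape != expected_fi_shape(sequences):
    raise ValueError(f"FI shape {fi.shape} != {expected_fi_shape(sequences)}")
print("FI matrix matches A3M:", fi.shape)
```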
|
|
| ## CLI |
|
|
| ```bash |
| sf-cluster build \ |
| --a3m path/to/filtered.a3m \ |
| --fi path/to/fi_matrix.npy \ |
| --method mosaic \ |
| --n-subsets 12 \ |
| --subset-size 32 \ |
| --seed 20260422 \ |
| --out subsets/kaib_mosaic/ |
| ``` |
|
|
| Outputs: |
| ``` |
| subsets/kaib_mosaic/ |
| ├── mosaic_subset_000.a3m |
| ├── mosaic_subset_001.a3m |
| ├── ... |
| ├── mosaic_subset_011.a3m |
| ├── mosaic_subset_index.tsv # subset_id, pool_index, header, score |
| └── mosaic_meta.json # provenance + score stats |
| ``` |
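
The index file can be used to recover subset membership without re-parsing every A3M. The following is a sketch only: it assumes the TSV is tab-separated and starts with a header row naming the four columns listed above (`subset_id`, `pool_index`, `header`, `score`), which is not confirmed here; `group_index` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def group_index(fh):
    """Map subset_id -> list of pool indices, in file order."""
    members = defaultdict(list)
    # Assumption: first row is a header with the column names used below.
    for row in csv.DictReader(fh, delimiter="\t"):
        members[row["subset_id"]].append(int(row["pool_index"]))
    return dict(members)


# In-memory stand-in for open("subsets/kaib_mosaic/mosaic_subset_index.tsv").
demo = io.StringIO(
    "subset_id\tpool_index\theader\tscore\n"
    "0\t17\t>seqA\t0.42\n"
    "0\t3\t>seqB\t-0.10\n"
    "1\t17\t>seqA\t0.42\n"
)
members = group_index(demo)
print(members)
```

If the real file has no header row, passing `fieldnames=[...]` to `csv.DictReader` would be the corresponding adjustment.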
|
|
| ## Library |
|
|
| ```python |
| from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient |
| |
| pool = pool_msa("filtered.a3m", "fi_matrix.npy") |
| score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence |
| subsets = method_mosaic(score) # list[list[int]] of 12 × 32 |
| # or |
| subsets = method_gradient(score) |
| ``` |
|
|
| Each subset is a list of indices into `pool.headers` / `pool.sequences`. |
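
Given index lists of this kind, writing a subset back to A3M is mostly bookkeeping. The helper below is a hypothetical sketch that works on plain lists of headers and sequences (standing in for `pool.headers` / `pool.sequences`); whether stored headers include the leading `>` and whether the query is prepended to each subset are assumptions, so both are handled explicitly.

```python
def format_subset_a3m(headers, sequences, indices, query_index=None):
    """Render selected pool entries as A3M text, insertions kept verbatim."""
    order = list(indices)
    if query_index is not None:
        # Assumption: downstream tools read the first record as the query.
        order = [query_index] + [i for i in order if i != query_index]
    lines = []
    for i in order:
        header = headers[i]
        lines.append(header if header.startswith(">") else ">" + header)
        lines.append(sequences[i])
    return "\n".join(lines) + "\n"


headers = [">query", "hit1", ">hit2"]
sequences = ["ACDE", "AC-Ekk", "AaCDE"]
text = format_subset_a3m(headers, sequences, [2, 1], query_index=0)
print(text, end="")
```

Writing the result is then a single `Path(out).write_text(text)` call per subset.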
|
|
| ## Reproducibility |
|
|
| All RNG draws use `np.random.default_rng(seed=...)` with method-specific |
| deterministic seeds (see Algorithm §4–§5). Re-running the same A3M + FI |
| matrix yields byte-identical subset assignments. The CLI also records a |
| provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the |
| package version. |
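
The determinism claim is easy to spot-check. The sketch below does not call the package; it only illustrates that seeded `np.random.default_rng` draws repeat exactly within one environment, and shows one way to fingerprint a set of index lists for comparison across runs. `draw` and `fingerprint` are hypothetical helpers.

```python
import hashlib
import json

import numpy as np


def draw(seed, pool_size=100, size=32):
    """One seeded draw of `size` distinct indices from range(pool_size)."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=size, replace=False).tolist()


def fingerprint(subsets):
    """SHA-256 digest of a list of index lists, stable for identical input."""
    payload = json.dumps(subsets, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


run_a = [draw(s) for s in range(12)]
run_b = [draw(s) for s in range(12)]
print(fingerprint(run_a) == fingerprint(run_b))
```

NumPy's compatibility policy allows `Generator` streams to change between releases, so comparing fingerprints across machines is most meaningful when the NumPy version is pinned.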
|
|
| ## Limitations |
|
|
| - **No frustration model included.** You must run FrustrAI-Seq separately to |
| obtain the `(N_seq, L)` FI matrix. This package only handles the |
| scoring + subset-construction stage. |
| - **No AF2 runner included.** The package emits A3M files; downstream |
| inference (AF2 / ColabFold) is the user's responsibility. |
| - **Only `mosaic` and `gradient` arms are open-sourced here.** The other |
| SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional |
| feature pipelines and are intentionally excluded from this workshop |
| release. |
| - **Subsets are not guaranteed to be disjoint.** A sequence can appear in |
| multiple subsets, and gradient draws from a single quartile with |
| replacement if the quartile is smaller than `subset_size`. |
| - **Empirical caveat (read this).** A controlled comparison shows that uniform |
| subsampling performs equivalently on most Main-21 cases; see the paper for |
| the boundary conditions under which contrast-FI stratification yields a |
| measurable lift over random subsampling. Treat this package as a research |
| baseline, not a turnkey accuracy improvement. |
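
Since subsets may share sequences, it can be useful to quantify the overlap before interpreting downstream predictions as independent samples. This is a small standalone sketch over lists of pool indices; `membership_counts` and `jaccard` are hypothetical helpers, not package functions.

```python
from collections import Counter


def membership_counts(subsets):
    """Count how many subsets each pool index appears in (once per subset)."""
    counts = Counter()
    for subset in subsets:
        counts.update(set(subset))
    return counts


def jaccard(a, b):
    """Jaccard similarity of two index lists, ignoring duplicates."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


demo_subsets = [[0, 1, 2, 2], [2, 3], [0, 3]]
counts = membership_counts(demo_subsets)
shared = sorted(i for i, n in counts.items() if n > 1)
print(shared, jaccard(demo_subsets[0], demo_subsets[1]))
```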
|
|
| ## Citation |
|
|
| If you use this code, please cite the SF-Cluster paper (forthcoming) and |
| [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq). |
|
|
| ## License |
|
|
| MIT. See `LICENSE`. |
|
|