[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb)
> **Or run from Hugging Face:** open <https://colab.research.google.com/> → *File* → *Open notebook* → *URL* tab → paste
> `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb`
## Demo
A self-contained, CPU-only Colab notebook is provided at
[`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the
package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix,
~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score
distribution and per-subset means, and writes A3M files ready for AF2. Expected
end-to-end runtime on a free Colab CPU instance: **~2 minutes**.
# SF-Cluster (workshop OSS release)
Frustration-guided MSA subset builders for AlphaFold2 multi-conformer
prediction. This is the open-source workshop distribution of two subset
methods from the SF-Cluster benchmark:
- **mosaic**: each subset mixes high / mid / low contrast-FI sequences.
- **gradient**: each subset is homogeneous within a contrast-FI quartile.
The contrast score is computed from a per-residue Frustration Index (FI)
matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq)
(HF model: `leuschj/FrustrAI-Seq`).
This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is
designed to be a drop-in replacement for random / uniform MSA subsampling in
[AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.
## Algorithm
Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a
per-residue FI matrix `F ∈ ℝ^{N×L}`:
1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences.
2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`,
   `LV = ¬HV`.
3. **Contrast score** per sequence:
```
contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
```
4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32):
sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles;
for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with
`np.random.default_rng(seed=s)`.
5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32):
split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and
`s ∈ {0..2}` draw 32 sequences from that bin only with
`np.random.default_rng(seed=10*b + s)`.
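
The steps above can be sketched in plain NumPy. This is an illustrative reading of the description, not the package's implementation: the function names are hypothetical, the variance uses NumPy's default population form (`ddof=0`), terciles and quartiles are formed with `np.array_split`, and the order in which the three mosaic strata are drawn is assumed. Falling back to sampling with replacement when a stratum is too small is documented only for `gradient`; applying it to `mosaic` is a further assumption.

```python
import numpy as np

def contrast_score(fi, pct=80):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    fi = np.asarray(fi, dtype=float)
    v = fi.var(axis=0)                    # column variance over sequences
    hv = v >= np.percentile(v, pct)       # high-variance column mask
    # Note: if every column has the same variance, ~hv is empty and this is NaN.
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)

def mosaic_subsets(score, n_subsets=12, counts=(11, 10, 11)):
    """Step 4: draw (low, mid, high) counts from terciles of the sorted pool."""
    terciles = np.array_split(np.argsort(score), 3)   # low, mid, high
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [rng.choice(t, size=k, replace=len(t) < k)
                 for t, k in zip(terciles, counts)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets

def gradient_subsets(score, n_bins=4, per_bin=3, size=32):
    """Step 5: each subset is drawn from a single quartile of the sorted pool."""
    bins = np.array_split(np.argsort(score), n_bins)
    subsets = []
    for b, members in enumerate(bins):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            pick = rng.choice(members, size=size, replace=len(members) < size)
            subsets.append(pick.tolist())
    return subsets

# Synthetic FI matrix purely for demonstration (200 sequences, 50 columns).
fi = np.random.default_rng(0).normal(size=(200, 50))
score = contrast_score(fi)
mosaic = mosaic_subsets(score)
gradient = gradient_subsets(score)
print(len(mosaic), len(gradient), len(mosaic[0]))
```

Each returned subset is a list of row indices into the FI matrix, which mirrors the index-based output described in the Library section.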
## Install
```bash
pip install -e .
```
Python ≥ 3.10. Dependencies: `numpy`, `scipy`.
## Inputs
You need two files per case:
1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters
are preserved verbatim in output subsets; only match-state (uppercase)
columns are scored.
2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is
the number of sequences in the A3M and `L` is the number of match-state
columns.
The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see
`https://github.com/leuschj/FrustrAI-Seq` (model card:
`https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions.
A reference usage pattern is documented in `examples/run_demo.sh`.
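
Before building subsets it can help to confirm that the two inputs agree in shape. The sketch below is a hypothetical pre-flight check, not part of the package: `read_a3m`, `match_length`, and `check_inputs` are illustrative names. It assumes the usual A3M convention that lowercase letters are insertions while uppercase letters and `-` gaps occupy match-state columns, and it skips `#`-prefixed lines such as the one ColabFold writes at the top of its A3M files.

```python
import numpy as np

def read_a3m(text):
    """Parse A3M text into (headers, sequences); a sequence may span several lines."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # blank or comment/metadata line
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            seqs[-1] += line
    return headers, seqs

def match_length(seq):
    """Count match-state columns: every character that is not a lowercase insertion."""
    return sum(1 for c in seq if not c.islower())

def check_inputs(seqs, fi):
    """Raise ValueError unless fi has shape (N_seq, L) for this alignment."""
    lengths = {match_length(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    expected = (len(seqs), lengths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} != expected {expected}")
    return expected

# Tiny in-memory example; a real run would read the .a3m and np.load the .npy.
demo_text = ">query\nMKVLA\n>hit1\nMKaV-A\n>hit2\nM-VLA\n"
demo_headers, demo_seqs = read_a3m(demo_text)
demo_shape = check_inputs(demo_seqs, np.zeros((3, 5)))
```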
## CLI
```bash
sf-cluster build \
--a3m path/to/filtered.a3m \
--fi path/to/fi_matrix.npy \
--method mosaic \
--n-subsets 12 \
--subset-size 32 \
--seed 20260422 \
--out subsets/kaib_mosaic/
```
Outputs:
```
subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv # subset_id, pool_index, header, score
└── mosaic_meta.json # provenance + score stats
```
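
The index file can be regrouped by subset with the standard library. This is a sketch under assumptions: it supposes the TSV is tab-delimited and starts with a header row naming the four columns listed above, which this README does not confirm. `group_index` is a hypothetical helper, and `subset_id` is kept as a string because its exact format is not specified.

```python
import csv
import io
from collections import defaultdict

def group_index(handle):
    """Map subset_id -> list of (pool_index, score) rows from an index TSV."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["subset_id"]].append((int(row["pool_index"]), float(row["score"])))
    return dict(groups)

# In-memory stand-in for the file; the rows are made up for illustration.
sample = io.StringIO(
    "subset_id\tpool_index\theader\tscore\n"
    "0\t5\tseqA\t0.25\n"
    "0\t9\tseqB\t-0.5\n"
    "1\t5\tseqA\t0.25\n"
)
grouped = group_index(sample)
```

With a real run, pass an open file handle instead of the `io.StringIO` object.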
## Library
```python
from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient
pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence
subsets = method_mosaic(score) # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)
```
Each subset is a list of indices into `pool.headers` / `pool.sequences`.
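
Turning one index list back into A3M text is a short loop. The helper below is a hypothetical sketch that takes plain lists standing in for `pool.headers` and `pool.sequences`; it is not the package's writer. Whether the package places the query sequence first in each subset is not stated here, so the sketch simply emits records in the order of the indices it is given.

```python
def format_a3m(headers, sequences, indices):
    """Return A3M text for the selected pool indices, in the given order."""
    lines = []
    for i in indices:
        lines.append(">" + headers[i])
        lines.append(sequences[i])    # lowercase insertions are kept verbatim
    return "\n".join(lines) + "\n"

# Made-up three-sequence pool for illustration.
example_text = format_a3m(["q", "a", "b"], ["MKV", "MkKV", "M-V"], [2, 0])
```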
## Reproducibility
All RNG draws use `np.random.default_rng(seed=...)` with method-specific
deterministic seeds (see Algorithm, steps 4 and 5). Re-running the same A3M + FI
matrix yields byte-identical subset assignments. The CLI also records a
provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the
package version.
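
The determinism claim rests on how seeded NumPy generators behave, which can be checked in isolation. The snippet is a minimal illustration with a hypothetical `draw` helper, not code from the package. One caveat, based on NumPy's own compatibility policy rather than on this README: `Generator` output streams are not guaranteed to stay identical across NumPy releases, so pinning the NumPy version is a sensible precaution when comparing runs across environments.

```python
import numpy as np

def draw(seed, pool_size=100, k=32):
    """Draw k distinct indices from range(pool_size) with a freshly seeded generator."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=k, replace=False).tolist()

first = draw(3)
second = draw(3)    # same seed, fresh generator
other = draw(4)     # different seed
print(first == second, first == other)
```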
## Limitations
- **No frustration model included.** You must run FrustrAI-Seq separately to
obtain the `(N_seq, L)` FI matrix. This package only handles the
scoring + subset-construction stage.
- **No AF2 runner included.** The package emits A3M files; downstream
inference (AF2 / ColabFold) is the user's responsibility.
- **Only `mosaic` and `gradient` arms are open-sourced here.** The other
SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional
feature pipelines and are intentionally excluded from this workshop
release.
- **No disjointness guarantee across subsets.** A sequence can appear in
multiple subsets. In addition, gradient draws from a single quartile with
replacement if the quartile is smaller than `subset_size`, so a sequence
can then repeat within one subset.
- **Empirical caveat (read this).** A controlled comparison shows that uniform
subsampling performs equivalently on most Main-21 cases; see the paper for
boundary conditions under which contrast-FI stratification yields a
measurable lift over random subsampling. Treat this package as a research
baseline, not a turnkey accuracy improvement.
## Citation
If you use this code, please cite the SF-Cluster paper (forthcoming) and
[FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).
## License
MIT. See `LICENSE`.