	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb)

	> Or run from Hugging Face: open <https://colab.research.google.com/> → File → Open notebook → URL tab → paste
	> `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb`

	## Demo

	A self-contained, CPU-only Colab notebook is provided at
	[`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the
	package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix,
	~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score
	distribution and per-subset means, and writes A3M files ready for AF2. Expected
	end-to-end runtime on a free Colab CPU instance: ~2 minutes.

	# SF-Cluster (workshop OSS release)

	Frustration-guided MSA subset builders for AlphaFold2 multi-conformer
	prediction. This is the open-source workshop distribution of two subset
	methods from the SF-Cluster benchmark:

	- mosaic — each subset mixes high / mid / low contrast-FI sequences.
	- gradient — each subset is homogeneous within a contrast-FI quartile.

	The contrast score is computed from a per-residue Frustration Index (FI)
	matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq)
	(HF model: `leuschj/FrustrAI-Seq`).

	This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is
	designed to be a drop-in replacement for random / uniform MSA subsampling in
	[AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.

	## Algorithm

	Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a
	per-residue FI matrix `F ∈ ℝ^{N×L}`:

	1. Column variance: `v_l = Var_i(F_{i,l})` over sequences.
	2. High-variance mask: `HV = {l : v_l ≥ percentile(v, 80)}`, roughly the top
	20% of columns by variance; `LV = ¬HV` is every remaining column.
	3. Contrast score per sequence:
	```
	contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
	```
	4. Mosaic (N_SUBSETS = 12, TARGET_SIZE = 32):
	sort the pool by `contrast_hvlv` and split it into low/mid/high terciles;
	for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` (32 in total) with
	`np.random.default_rng(seed=s)`.
	5. Gradient (N_SUBSETS = 12, TARGET_SIZE = 32):
	split the sorted pool into 4 quartile bins; for each bin `b ∈ {0..3}` and each
	replicate `s ∈ {0..2}` (4 × 3 = 12 subsets), draw 32 sequences from that bin only with
	`np.random.default_rng(seed=10*b + s)`.
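The five steps above can be sketched in plain NumPy. This is a minimal illustration, not the package's implementation: the function names (`sketch_contrast`, `sketch_mosaic`, `sketch_gradient`) are hypothetical, and several details are assumptions, namely population variance, NumPy's default percentile interpolation, `np.array_split` for the tercile/quartile boundaries, the high/low/mid draw order, and falling back to sampling with replacement when a stratum is smaller than the requested draw.

```python
import numpy as np


def sketch_contrast(fi, hv_percentile=80.0):
    """Steps 1-3: per-sequence contrast between high- and low-variance columns."""
    fi = np.asarray(fi, dtype=float)
    col_var = fi.var(axis=0)  # step 1: variance of each column over sequences
    hv = col_var >= np.percentile(col_var, hv_percentile)  # step 2: HV mask
    if hv.all():
        raise ValueError("every column is high-variance, so LV is empty")
    # step 3: mean FI over HV columns minus mean FI over LV columns
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)


def _draw(rng, members, k):
    # Assumption: sample with replacement only when the stratum is too small.
    return rng.choice(members, size=k, replace=len(members) < k)


def sketch_mosaic(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: each subset mixes high, low and mid terciles; seed = subset id."""
    order = np.argsort(score)  # ascending, so low-contrast sequences come first
    low, mid, high = np.array_split(order, 3)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        parts = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(parts).tolist())
    return subsets


def sketch_gradient(score, n_bins=4, per_bin=3, size=32):
    """Step 5: each subset comes from one quartile; seed = 10 * bin + replicate."""
    order = np.argsort(score)
    subsets = []
    for b, members in enumerate(np.array_split(order, n_bins)):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, members, size).tolist())
    return subsets


# Synthetic FI matrix standing in for real FrustrAI-Seq output.
fi = np.random.default_rng(0).normal(size=(400, 50))
score = sketch_contrast(fi)
mosaic = sketch_mosaic(score)
gradient = sketch_gradient(score)
```

With 400 synthetic sequences every stratum is larger than the draw size, so no subset here contains a repeated index.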

	## Install

	```bash
	pip install -e .
	```

	Python ≥ 3.10. Dependencies: `numpy`, `scipy`.

	## Inputs

	You need two files per case:

	1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters
	are preserved verbatim in output subsets; only match-state (uppercase)
	columns are scored.
	2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is
	the number of sequences in the A3M and `L` is the number of match-state
	columns.

	The FI matrix is produced by FrustrAI-Seq. We do not bundle weights — see
	`https://github.com/leuschj/FrustrAI-Seq` (model card:
	`https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions.
	A reference usage pattern is documented in `examples/run_demo.sh`.
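Before building subsets it can help to confirm that the two inputs agree on shape. The sketch below uses only the standard library; the helper names (`read_a3m`, `match_length`, `check_inputs`) are hypothetical and not part of this package. It assumes that lines starting with `#` are ColabFold header lines to skip and that every non-lowercase character, gaps included, occupies a match-state column.

```python
def read_a3m(text):
    """Return (headers, sequences) from A3M text; a sequence may span several lines."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and '#' header lines
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            seqs[-1] += line
    return headers, seqs


def match_length(seq):
    """Count match-state columns: everything except lowercase insertion letters."""
    return sum(1 for c in seq if not c.islower())


def check_inputs(a3m_text, fi_shape):
    """Raise ValueError unless fi_shape equals (N_seq, L) implied by the A3M."""
    _, seqs = read_a3m(a3m_text)
    lengths = {match_length(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    expected = (len(seqs), lengths.pop())
    if tuple(fi_shape) != expected:
        raise ValueError(f"FI matrix shape {tuple(fi_shape)} != expected {expected}")
    return expected


demo = ">query\nMKTAY\n>hit1\nMK-aaAY\n>hit2\nM-TAY\n"
shape = check_inputs(demo, (3, 5))  # hit1 has two insertions that are not counted
```

For real files, the second argument would be the `.shape` of the array loaded from the `.npy` file.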

	## CLI

	```bash
	sf-cluster build \
	--a3m path/to/filtered.a3m \
	--fi path/to/fi_matrix.npy \
	--method mosaic \
	--n-subsets 12 \
	--subset-size 32 \
	--seed 20260422 \
	--out subsets/kaib_mosaic/
	```

	Outputs:
	```
	subsets/kaib_mosaic/
	├── mosaic_subset_000.a3m
	├── mosaic_subset_001.a3m
	├── ...
	├── mosaic_subset_011.a3m
	├── mosaic_subset_index.tsv # subset_id, pool_index, header, score
	└── mosaic_meta.json # provenance + score stats
	```
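A quick sanity check on an output directory is to count the sequences in each subset file. The sketch below relies only on the file naming shown in the listing above; `count_sequences` and `summarize_subsets` are hypothetical helpers, and the temporary directory with two tiny files stands in for a real run.

```python
import tempfile
from pathlib import Path


def count_sequences(a3m_path):
    """Count FASTA-style header lines in one A3M file."""
    with open(a3m_path, encoding="utf-8") as fh:
        return sum(1 for line in fh if line.startswith(">"))


def summarize_subsets(out_dir, method="mosaic"):
    """Map each '<method>_subset_*.a3m' file name to its sequence count."""
    paths = sorted(Path(out_dir).glob(f"{method}_subset_*.a3m"))
    return {p.name: count_sequences(p) for p in paths}


# Demo on a throwaway directory; the index file is ignored by the .a3m pattern.
with tempfile.TemporaryDirectory() as tmp:
    for k in range(2):
        Path(tmp, f"mosaic_subset_{k:03d}.a3m").write_text(
            ">q\nMKT\n>hit\nM-T\n", encoding="utf-8"
        )
    Path(tmp, "mosaic_subset_index.tsv").write_text("placeholder\n", encoding="utf-8")
    demo_summary = summarize_subsets(tmp)
```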

	## Library

	```python
	from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

	pool = pool_msa("filtered.a3m", "fi_matrix.npy")
	score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence
	subsets = method_mosaic(score) # list[list[int]] of 12 × 32
	# or
	subsets = method_gradient(score)
	```

	Each subset is a list of indices into `pool.headers` / `pool.sequences`.
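Turning one index list back into A3M text is a short loop. In this sketch `headers` and `sequences` are plain lists standing in for `pool.headers` and `pool.sequences`; `subset_to_a3m` is a hypothetical helper, and whether the stored headers carry a leading `>` is not stated here, so both cases are handled. The sketch does not prepend the query sequence; add it yourself if your downstream tool expects it first.

```python
def subset_to_a3m(headers, sequences, indices):
    """Render the selected pool entries as A3M text, preserving insertion letters."""
    lines = []
    for i in indices:
        header = headers[i]
        lines.append(header if header.startswith(">") else ">" + header)
        lines.append(sequences[i])  # written verbatim, lowercase insertions included
    return "\n".join(lines) + "\n"


headers = ["query", "hit_a", "hit_b"]
sequences = ["MKT", "MaKT", "M-T"]
a3m_text = subset_to_a3m(headers, sequences, [2, 0])
```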

	## Reproducibility

	All RNG draws use `np.random.default_rng(seed=...)` with method-specific
	deterministic seeds (see Algorithm, steps 4 and 5). Re-running the same A3M + FI
	matrix yields byte-identical subset assignments. The CLI also records a
	provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the
	package version.
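One way to check determinism yourself is to hash the subset assignments from two runs and compare the digests. The sketch below is illustrative only: `draw` and `fingerprint` are hypothetical helpers, and the draw is a generic seeded sample rather than the package's own routine. NumPy does not promise that `Generator` output stays identical across NumPy releases, so fingerprints are most meaningful within one pinned environment.

```python
import hashlib
import json

import numpy as np


def draw(seed, pool_size=100, k=32):
    """Seeded sample of k distinct indices from range(pool_size)."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=k, replace=False).tolist()


def fingerprint(subsets):
    """SHA-256 digest of the subset assignments in a canonical JSON encoding."""
    payload = json.dumps(subsets, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


run_a = [draw(s) for s in range(12)]
run_b = [draw(s) for s in range(12)]
same = fingerprint(run_a) == fingerprint(run_b)  # identical seeds, identical draws
```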

	## Limitations

	- No frustration model included. You must run FrustrAI-Seq separately to
	obtain the `(N_seq, L)` FI matrix. This package only handles the
	scoring + subset-construction stage.
	- No AF2 runner included. The package emits A3M files; downstream
	inference (AF2 / ColabFold) is the user's responsibility.
	- Only `mosaic` and `gradient` arms are open-sourced here. The other
	SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional
	feature pipelines and are intentionally excluded from this workshop
	release.
	- No disjointness guarantee across subsets. A sequence can appear in
	multiple subsets, and gradient draws from a single quartile with replacement
	if the quartile is smaller than `subset_size`, so a sequence can also repeat
	within one subset.
	- Empirical caveat (read this). A controlled comparison shows that uniform
	subsampling performs equivalently on most Main-21 cases; see the paper for the
	boundary conditions under which contrast-FI stratification yields a
	measurable lift over random subsampling. Treat this package as a research
	baseline, not a turnkey accuracy improvement.
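The sequence-reuse limitation above can be quantified for any list of subsets. The sketch below is an assumption-laden illustration: `draw_from_bin` mimics the with-replacement fallback described for gradient, and `reuse_counts` is a hypothetical helper; neither is this package's API. A 20-member bin is used so that a 32-sequence draw is forced to repeat entries.

```python
from collections import Counter

import numpy as np


def draw_from_bin(members, size, seed):
    """Sample from one bin, with replacement only if the bin is too small."""
    rng = np.random.default_rng(seed=seed)
    replace = len(members) < size
    return rng.choice(members, size=size, replace=replace).tolist()


def reuse_counts(subsets):
    """Number of subsets each index appears in (repeats inside a subset count once)."""
    return Counter(i for subset in subsets for i in set(subset))


small_bin = np.arange(20)  # fewer members than the 32 requested per subset
subsets = [draw_from_bin(small_bin, 32, seed=s) for s in range(3)]
counts = reuse_counts(subsets)
duplicates_within = [len(s) - len(set(s)) for s in subsets]
```

Here each subset holds at least 12 repeated entries, because 32 draws cannot be distinct in a pool of 20. Inspecting `counts` and `duplicates_within` on real output shows how much effective diversity each subset carries.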

	## Citation

	If you use this code, please cite the SF-Cluster paper (forthcoming) and
	[FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).

	## License

	MIT. See `LICENSE`.