	# KaiB demo data — provenance

	This directory contains a concatenated demo asset for the SF-Cluster Colab
	notebook. It is derived from the SF-Cluster Phase II benchmark's KaiB
	`diverse_sf` arm and the FrustrAI-Seq per-residue Frustration Index (FI)
	outputs.

	## Files

	| File | Shape / size | Description |
	|------|--------------|-------------|
	| `KaiB_filtered.a3m` | 364 records, L=91 | Subset of the KaiB filtered MSA. Query (`>101`, UniProt Q79V61 residues 5–95) is row 0. Lowercase insertion-state letters preserved. |
	| `KaiB_fi_matrix.npy` | (364, 91) float32 | Per-residue FI matrix. Row `i` corresponds to record `i` in the A3M. |
	| `KaiB_seq_ids.txt` | 364 lines | One short sequence ID per line, in the same order as the A3M / FI matrix. |
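
Because the A3M keeps lowercase insertion-state letters, raw sequence strings can be longer than 91 characters even though every record spans 91 alignment columns. The following is a minimal sketch of how a reader might check that, assuming the usual A3M convention (lowercase letters are insertions and do not count as columns). The helper names `read_a3m`, `match_length`, and `short_id` are hypothetical, and treating the first whitespace-delimited header token as the short ID is an assumption, not something this directory documents.

```python
import io


def read_a3m(handle):
    """Yield (header, sequence) pairs from an A3M/FASTA-style text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def match_length(seq):
    """Count alignment columns: every character except lowercase insertions."""
    return sum(1 for ch in seq if not ch.islower())


def short_id(header):
    """ASSUMPTION: the short ID is the first whitespace-delimited header token."""
    return header.split()[0]


# Toy records (not real KaiB data): the second one carries two insertions.
toy = io.StringIO(">101\nMKT-A\n>seqB extra info\nMKtvT-A\n")
records = list(read_a3m(toy))
lengths = [match_length(seq) for _, seq in records]
ids = [short_id(header) for header, _ in records]
```

On the real files, one would expect `match_length` to return 91 for all 364 records and the resulting ID list to line up with `KaiB_seq_ids.txt`, if the header-token assumption holds.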

	## Source paths (private dev repo, read-only)

	- Filtered MSA:
	`/data1/hanqun/SF-Design/SF-Cluster/data/processed/msa/KaiB/KaiB/KaiB_KaiBTE_91aa_UniProt_Q79V61_5to95_2QKE_chainB.filtered.a3m`
	(depth 6821, L=91)
	- FI artifacts (one directory per subset):
	`/data1/hanqun/SF-Design/SF-Cluster/results/frustai_artifacts/KaiB/diverse_sf/KaiB/{000..011}/`
	Each directory contains `fi_matrix.npy` (shape (32, 91), float32),
	`metadata.json`, `fi_residual_matrix.npy`, and `entropy_matrix.npy`.
	- Source subset A3Ms (used to map FI rows → sequence IDs):
	`/data1/hanqun/SF-Design/SF-Cluster/results/baseline_p8/diverse_sf/KaiB/KaiB/screen/diversesf_KaiB_KaiB_seed{000..011}.a3m`.
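
The `{000..011}` notation in the paths above stands for twelve zero-padded indices. As an illustration only, the sketch below expands both patterns into per-subset path pairs. It assumes that FI directory `NNN` pairs with the A3M for `seedNNN`. The names `ROOT`, `N_SUBSETS`, and `subset_inputs` are hypothetical, and the paths point to a private machine, so the code only builds path objects and never touches the filesystem.

```python
from pathlib import PurePosixPath

# Path prefix copied from the list above (private dev repo, not publicly readable).
ROOT = PurePosixPath("/data1/hanqun/SF-Design/SF-Cluster/results")
N_SUBSETS = 12  # indices 000..011


def subset_inputs(index):
    """Return (fi_matrix_path, a3m_path) for one diverse_sf subset.

    ASSUMPTION: FI directory NNN corresponds to the A3M named seedNNN.
    """
    tag = f"{index:03d}"
    fi_path = ROOT / "frustai_artifacts/KaiB/diverse_sf/KaiB" / tag / "fi_matrix.npy"
    a3m_path = (
        ROOT
        / "baseline_p8/diverse_sf/KaiB/KaiB/screen"
        / f"diversesf_KaiB_KaiB_seed{tag}.a3m"
    )
    return fi_path, a3m_path


pairs = [subset_inputs(i) for i in range(N_SUBSETS)]
```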

	## Construction recipe

	1. For each of the 12 `diverse_sf` subsets, load `fi_matrix.npy` ((32, 91)
	float32) and the corresponding `diversesf_KaiB_KaiB_seed{NNN}.a3m`.
	2. Concatenate rows in subset-index order; track the parallel sequence-ID
	list from the A3M records.
	3. Deduplication policy: first occurrence wins. A sequence ID seen in an
	earlier subset is skipped (both in the FI matrix and the ID list). This
	reduces 12 × 32 = 384 raw rows to 364 unique rows.
	4. Extract the corresponding sequences (with their full headers and
	lowercase insertion states) from the filtered MSA, preserving the order
	established in step 3. The query (`>101`) is always row 0.
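
The first-occurrence-wins policy in steps 2 and 3 can be sketched as plain index bookkeeping. This is an illustrative reconstruction under stated assumptions, not the script that produced the asset. The function name `first_occurrence_rows` is hypothetical, and the toy IDs are made up. Each returned `(subset_index, row_index)` pair identifies one FI row to keep, so the same selection applies to both the ID list and the FI matrices.

```python
def first_occurrence_rows(id_lists):
    """Walk subsets in index order and keep each sequence ID only once.

    id_lists: one list of sequence IDs per subset, in A3M record order.
    Returns (kept_ids, picks); picks holds (subset_index, row_index) pairs
    identifying which FI rows survive deduplication.
    """
    seen = set()
    kept_ids, picks = [], []
    for subset_index, ids in enumerate(id_lists):
        for row_index, seq_id in enumerate(ids):
            if seq_id in seen:
                continue  # already seen earlier: skip both the ID and its FI row
            seen.add(seq_id)
            kept_ids.append(seq_id)
            picks.append((subset_index, row_index))
    return kept_ids, picks


# Toy input (not real data): three subsets of three IDs with some overlap.
toy = [["101", "a", "b"], ["101", "c", "a"], ["101", "d", "e"]]
kept_ids, picks = first_occurrence_rows(toy)
n_raw = sum(len(ids) for ids in toy)
```

In the toy case, 9 raw rows collapse to 6 unique rows. Applied to the 12 real subsets of 32 rows each, the same bookkeeping would be expected to select 364 of the 384 raw rows, which could then be stacked into the (364, 91) matrix.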

	All 364 unique IDs were found in the filtered MSA (0 missing).
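
A check like the one reported above could look roughly as follows. This is a sketch, not the original tooling. The function name `extract_in_order` is hypothetical, and mapping a full header to its short ID via the first whitespace-delimited token is an assumption. Records come back in the order of the deduplicated ID list, and any ID absent from the filtered MSA is collected rather than silently dropped.

```python
def extract_in_order(kept_ids, msa_records):
    """Pull (header, sequence) records from the MSA in kept_ids order.

    msa_records: iterable of (full_header, sequence) pairs, with lowercase
    insertion letters left untouched.
    Returns (found_records, missing_ids).
    """
    by_id = {}
    for header, seq in msa_records:
        # ASSUMPTION: the short ID is the first token of the full header.
        by_id.setdefault(header.split()[0], (header, seq))
    found = [by_id[seq_id] for seq_id in kept_ids if seq_id in by_id]
    missing = [seq_id for seq_id in kept_ids if seq_id not in by_id]
    return found, missing


# Toy MSA (not real data); "zz" is deliberately absent.
msa = [("101 query", "MKT"), ("x1 desc", "MKa-"), ("x2 desc", "M-T")]
found, missing = extract_in_order(["101", "x2", "zz"], msa)
```

For the demo asset, the equivalent of `missing` was empty.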

	## Models

	- FrustrAI-Seq weights: HF repo `leuschj/FrustrAI-Seq`, commit
	`ee5a01a29fde00630f4a1157f0e6cb8343ac434b`. Inference in fp16 with LoRA
	adapters merged.

	## License

	This demo asset is released under MIT alongside the SF-Cluster OSS package.
	The KaiB sequence (UniProt Q79V61, *Thermosynechococcus elongatus*) and its
	MSA neighbors are public-domain sequence records via UniRef100 / MGnify;
	no proprietary structures are included. FrustrAI-Seq outputs are derived
	features (floating-point FI values) and are released by the FrustrAI-Seq
	authors under their own license; see
	https://huggingface.co/leuschj/FrustrAI-Seq.