# KaiB demo data — provenance
This directory contains a concatenated demo asset for the SF-Cluster Colab
notebook. It is derived from the SF-Cluster Phase II benchmark's KaiB
`diverse_sf` arm and the FrustrAI-Seq per-residue Frustration Index (FI)
outputs.
## Files
| File | Shape / size | Description |
|-----------------------|------------------------|-------------|
| `KaiB_filtered.a3m` | 364 records, L=91 | Subset of the KaiB filtered MSA. Query (`>101`, UniProt Q79V61 residues 5–95) is row 0. Lowercase insertion-state letters preserved. |
| `KaiB_fi_matrix.npy` | (364, 91) float32 | Per-residue FI matrix. Row `i` corresponds to record `i` in the A3M. |
| `KaiB_seq_ids.txt` | 364 lines | One short sequence ID per line, in the same order as the A3M / FI matrix. |
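
The invariants in the table (equal row counts across the three files, and an alignment width equal to the FI matrix's second dimension) can be checked with a short script. This is a minimal sketch, not part of the SF-Cluster package: `read_a3m`, `match_length`, and `check_demo_assets` are hypothetical helper names, and the width check assumes the usual A3M convention that lowercase letters are insertions and do not count as alignment columns.

```python
from pathlib import Path

import numpy as np


def read_a3m(path):
    """Return (headers, sequences) from an A3M file, joining wrapped lines."""
    headers, seqs = [], []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif headers:
            seqs[-1] += line
    return headers, seqs


def match_length(seq):
    """Count match-state columns (assumed: everything except lowercase insertions)."""
    return sum(1 for c in seq if not c.islower())


def check_demo_assets(a3m_path, fi_path, ids_path):
    """Check that the A3M, FI matrix, and ID list agree in row count and width."""
    headers, seqs = read_a3m(a3m_path)
    fi = np.load(fi_path)
    ids = [ln for ln in Path(ids_path).read_text().splitlines() if ln.strip()]
    n_rows, n_cols = fi.shape
    if not (len(headers) == n_rows == len(ids)):
        raise ValueError("A3M records, FI rows, and ID lines differ in count")
    if any(match_length(s) != n_cols for s in seqs):
        raise ValueError("an A3M record does not match the FI matrix width")
    return fi.shape
```

If the files match the table above, `check_demo_assets("KaiB_filtered.a3m", "KaiB_fi_matrix.npy", "KaiB_seq_ids.txt")` should return `(364, 91)`.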
## Source paths (private dev repo, read-only)
- Filtered MSA:
`/data1/hanqun/SF-Design/SF-Cluster/data/processed/msa/KaiB/KaiB/KaiB_KaiBTE_91aa_UniProt_Q79V61_5to95_2QKE_chainB.filtered.a3m`
(depth 6821, L=91)
- FI artifacts (per-subset):
`/data1/hanqun/SF-Design/SF-Cluster/results/frustai_artifacts/KaiB/diverse_sf/KaiB/{000..011}/`
with files `fi_matrix.npy` ((32, 91) float32), `metadata.json`,
`fi_residual_matrix.npy`, `entropy_matrix.npy`.
- Source subset A3Ms (used to map FI rows → sequence IDs):
`/data1/hanqun/SF-Design/SF-Cluster/results/baseline_p8/diverse_sf/KaiB/KaiB/screen/diversesf_KaiB_KaiB_seed{000..011}.a3m`.
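
The `{000..011}` notation above is shell-style brace expansion over twelve zero-padded indices. A small sketch of how the per-subset file pairs could be enumerated follows; `subset_paths` is a hypothetical helper, and it assumes that FI directory `NNN` pairs with `seedNNN` in the A3M filename.

```python
from pathlib import Path


def subset_paths(fi_root, a3m_dir, n_subsets=12):
    """List (fi_matrix.npy, subset A3M) path pairs for subsets 000..011."""
    pairs = []
    for k in range(n_subsets):
        tag = f"{k:03d}"  # zero-padded to three digits, e.g. "007"
        fi_path = Path(fi_root) / tag / "fi_matrix.npy"
        # Assumption: directory index and seed index refer to the same subset.
        a3m_path = Path(a3m_dir) / f"diversesf_KaiB_KaiB_seed{tag}.a3m"
        pairs.append((fi_path, a3m_path))
    return pairs
```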
## Construction recipe
1. For each of the 12 `diverse_sf` subsets, load `fi_matrix.npy` ((32, 91)
float32) and the corresponding `diversesf_KaiB_KaiB_seed{NNN}.a3m`.
2. Concatenate rows in subset-index order; track the parallel sequence-ID
list from the A3M records.
3. **Deduplication policy**: first occurrence wins. A sequence ID seen in an
earlier subset is skipped (both in the FI matrix and the ID list). This
reduces 12 × 32 = 384 raw rows to 364 unique rows, dropping 20 duplicates.
4. Extract the corresponding sequences (with their full headers and
lowercase insertion states) from the filtered MSA, preserving the order
established in step 3. The query (`>101`) is always row 0.
All 364 unique IDs were found in the filtered MSA (0 missing).
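
Steps 1 to 3 of the recipe can be sketched as follows. This is an illustration of the first-occurrence-wins policy under stated assumptions, not the script that produced the asset: `concat_first_occurrence` is a hypothetical name, and the inputs are assumed to be already loaded (one FI block and one ID list per subset, in subset-index order).

```python
import numpy as np


def concat_first_occurrence(fi_blocks, id_lists):
    """Stack per-subset FI blocks, keeping only the first row seen for each ID."""
    seen = set()
    kept_rows, kept_ids = [], []
    for block, ids in zip(fi_blocks, id_lists):
        if len(ids) != block.shape[0]:
            raise ValueError("FI rows and A3M records disagree within a subset")
        for row, seq_id in zip(block, ids):
            if seq_id in seen:
                continue  # already kept from an earlier subset: skip row and ID
            seen.add(seq_id)
            kept_rows.append(row)
            kept_ids.append(seq_id)
    return np.stack(kept_rows).astype(np.float32), kept_ids


# Toy example: two subsets of two rows each, sharing the ID "a".
block0 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
block1 = np.array([[7, 8, 9], [10, 11, 12]], dtype=np.float32)
fi, ids = concat_first_occurrence([block0, block1], [["101", "a"], ["a", "b"]])
print(fi.shape, ids)  # (3, 3) ['101', 'a', 'b']
```

Applied to the 12 real subsets, the same loop would be expected to turn 384 input rows into the 364 rows described above, with the query staying at row 0 as long as it is the first record of the first subset.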
## Models
- **FrustrAI-Seq weights**: HF repo `leuschj/FrustrAI-Seq`, commit
`ee5a01a29fde00630f4a1157f0e6cb8343ac434b`. Inference in fp16 with LoRA
adapters merged.
## License
This demo asset is released under MIT alongside the SF-Cluster OSS package.
The KaiB sequence (UniProt Q79V61, *Thermosynechococcus elongatus*) and its
MSA neighbors are public-domain sequence records via UniRef100 / MGnify;
no proprietary structures are included. FrustrAI-Seq outputs are derived
features (floating-point FI values) and are released by the FrustrAI-Seq
authors under their own license; see
https://huggingface.co/leuschj/FrustrAI-Seq.