KaiB demo data — provenance
This directory contains a concatenated demo asset for the SF-Cluster Colab
notebook. It is derived from the SF-Cluster Phase II benchmark's KaiB
diverse_sf arm and the FrustrAI-Seq per-residue Frustration Index (FI)
outputs.
Files
| File | Shape / size | Description |
|---|---|---|
| `KaiB_filtered.a3m` | 364 records, L=91 | Subset of the KaiB filtered MSA. Query (`>101`, UniProt Q79V61 residues 5–95) is row 0. Lowercase insertion-state letters preserved. |
| `KaiB_fi_matrix.npy` | (364, 91) float32 | Per-residue FI matrix. Row i corresponds to record i in the A3M. |
| `KaiB_seq_ids.txt` | 364 lines | One short sequence ID per line, in the same order as the A3M / FI matrix. |
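The three files are meant to stay row-aligned, so a quick consistency check can be useful before running the notebook. The sketch below is illustrative rather than part of the SF-Cluster package: the helper names (`read_a3m`, `aligned_columns`, `check_demo_asset`) are hypothetical, and it assumes the usual A3M convention that lowercase letters are insertions which do not occupy an alignment column.

```python
from pathlib import Path

import numpy as np


def read_a3m(path):
    """Return (header, sequence) pairs; headers exclude the leading '>'."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_columns(seq):
    """Drop lowercase insertion-state letters; keep match columns and gaps."""
    return "".join(c for c in seq if not c.islower())


def check_demo_asset(root="."):
    """Load the three demo files and verify that their rows line up."""
    root = Path(root)
    records = read_a3m(root / "KaiB_filtered.a3m")
    fi = np.load(root / "KaiB_fi_matrix.npy")
    id_text = (root / "KaiB_seq_ids.txt").read_text()
    ids = [line.strip() for line in id_text.splitlines() if line.strip()]

    n_rows, n_cols = fi.shape
    if not (len(records) == n_rows == len(ids)):
        raise ValueError("A3M, FI matrix, and ID list disagree on row count")
    for header, seq in records:
        if len(aligned_columns(seq)) != n_cols:
            raise ValueError(f"aligned length != {n_cols} for record: {header}")
    return records, fi, ids
```

Calling `check_demo_asset()` from this directory should return a matrix whose shape matches the table above, (364, 91). The check deliberately does not compare A3M headers against `KaiB_seq_ids.txt`, because the exact mapping from header to short ID is not stated here.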
Source paths (private dev repo, read-only)
- Filtered MSA: `/data1/hanqun/SF-Design/SF-Cluster/data/processed/msa/KaiB/KaiB/KaiB_KaiBTE_91aa_UniProt_Q79V61_5to95_2QKE_chainB.filtered.a3m` (depth 6821, L=91)
- FI artifacts (per-subset): `/data1/hanqun/SF-Design/SF-Cluster/results/frustai_artifacts/KaiB/diverse_sf/KaiB/{000..011}/` with files `fi_matrix.npy` ((32, 91) float32), `metadata.json`, `fi_residual_matrix.npy`, `entropy_matrix.npy`.
- Source subset A3Ms (used to map FI rows → sequence IDs): `/data1/hanqun/SF-Design/SF-Cluster/results/baseline_p8/diverse_sf/KaiB/KaiB/screen/diversesf_KaiB_KaiB_seed{000..011}.a3m`.
Construction recipe
1. For each of the 12 `diverse_sf` subsets, load `fi_matrix.npy` ((32, 91) float32) and the corresponding `diversesf_KaiB_KaiB_seed{NNN}.a3m`.
2. Concatenate rows in subset-index order; track the parallel sequence-ID list from the A3M records.
3. Deduplication policy: first occurrence wins. A sequence ID seen in an earlier subset is skipped (both in the FI matrix and the ID list). This reduces 12 × 32 = 384 raw rows to 364 unique rows.
4. Extract the corresponding sequences (with their full headers and lowercase insertion states) from the filtered MSA, preserving the order established in step 3. The query (`>101`) is always row 0.
All 364 unique IDs were found in the filtered MSA (0 missing).
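The recipe above can be sketched in a few lines of Python. This is a minimal reconstruction under stated assumptions, not the script that produced the asset: the function names are hypothetical, the subsets are passed in as in-memory `(ids, fi_matrix)` pairs already in subset-index order, and the short ID is assumed to be the first whitespace-delimited token of each A3M header.

```python
import numpy as np


def concat_first_wins(subsets):
    """Concatenate (ids, fi) pairs, keeping only the first row seen per ID."""
    seen, keep_ids, keep_rows = set(), [], []
    for ids, fi in subsets:
        if len(ids) != fi.shape[0]:
            raise ValueError("ID list and FI matrix disagree on row count")
        for seq_id, row in zip(ids, fi):
            if seq_id in seen:
                continue  # duplicate from an earlier subset: skip ID and row
            seen.add(seq_id)
            keep_ids.append(seq_id)
            keep_rows.append(row)
    return keep_ids, np.stack(keep_rows).astype(np.float32)


def select_records(msa_records, keep_ids, short_id=lambda h: h.split()[0]):
    """Pull (header, seq) records from the full MSA in keep_ids order.

    short_id is an assumed header-to-ID mapping (first header token).
    """
    by_id = {}
    for header, seq in msa_records:
        by_id.setdefault(short_id(header), (header, seq))
    missing = [i for i in keep_ids if i not in by_id]
    if missing:
        raise KeyError(f"{len(missing)} IDs not found in the MSA")
    return [by_id[i] for i in keep_ids]
```

Because each kept row is appended together with its ID, the FI matrix and the ID list cannot drift out of alignment, and raising on missing IDs corresponds to the "0 missing" check reported above.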
Models
- FrustrAI-Seq weights: HF repo `leuschj/FrustrAI-Seq`, commit `ee5a01a29fde00630f4a1157f0e6cb8343ac434b`. Inference in fp16 with LoRA adapters merged.
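To reproduce against the same weights, the commit can be pinned when fetching the repository. The snippet below is a sketch that assumes the third-party `huggingface_hub` package is installed and that the repository is accessible from your account. The wrapper name `download_pinned_weights` is hypothetical, and the sketch only downloads files; it does not load the model or merge adapters.

```python
REPO_ID = "leuschj/FrustrAI-Seq"
REVISION = "ee5a01a29fde00630f4a1157f0e6cb8343ac434b"  # commit listed above


def download_pinned_weights(cache_dir=None):
    """Download a snapshot of the repo at the pinned commit; return its local path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, revision=REVISION, cache_dir=cache_dir)
```

Pinning `revision` to the full commit hash rather than a branch name keeps later updates to the repository from silently changing the FI values.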
License
This demo asset is released under the MIT license alongside the SF-Cluster OSS package. The KaiB sequence (UniProt Q79V61, Thermosynechococcus elongatus) and its MSA neighbors are public-domain sequence records via UniRef100 / MGnify; no proprietary structures are included. FrustrAI-Seq outputs are derived features (floating-point FI values) and are released by the FrustrAI-Seq authors under their own license; see https://huggingface.co/leuschj/FrustrAI-Seq.