# KaiB demo data: provenance

This directory contains a concatenated demo asset for the SF-Cluster Colab
notebook. It is derived from the SF-Cluster Phase II benchmark's KaiB
`diverse_sf` arm and the FrustrAI-Seq per-residue Frustration Index (FI)
outputs.

## Files

| File | Shape / size | Description |
|------|--------------|-------------|
| `KaiB_filtered.a3m` | 364 records, L=91 | Subset of the KaiB filtered MSA. Query (`>101`, UniProt Q79V61 residues 5–95) is row 0. Lowercase insertion-state letters preserved. |
| `KaiB_fi_matrix.npy` | (364, 91) float32 | Per-residue FI matrix. Row `i` corresponds to record `i` in the A3M. |
| `KaiB_seq_ids.txt` | 364 lines | One short sequence ID per line, in the same order as the A3M / FI matrix. |

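
The row correspondence described in the table can be checked before using the asset. The following is a minimal, dependency-free sketch rather than part of the SF-Cluster package: the function names are hypothetical, and it assumes that a short sequence ID is the first whitespace-delimited token of an A3M header, which this note does not state explicitly.

```python
from typing import List, Tuple


def read_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse A3M text into (header, sequence) pairs, keeping record order."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Lines before the first header (e.g. comment lines) are ignored.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def match_columns(seq: str) -> str:
    """Drop lowercase insertion-state letters, leaving match columns and gaps."""
    return "".join(c for c in seq if not c.islower())


def short_id(header: str) -> str:
    """ASSUMPTION: the short ID is the first whitespace-delimited header token."""
    return header.split()[0]


def check_alignment(a3m_text: str, ids_text: str, expected_len: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    records = read_a3m(a3m_text)
    ids = [ln.strip() for ln in ids_text.splitlines() if ln.strip()]
    problems: List[str] = []
    if len(records) != len(ids):
        problems.append(f"{len(records)} A3M records vs {len(ids)} IDs")
    for i, ((header, seq), sid) in enumerate(zip(records, ids)):
        if short_id(header) != sid:
            problems.append(f"row {i}: header ID {short_id(header)!r} != {sid!r}")
        n = len(match_columns(seq))
        if n != expected_len:
            problems.append(f"row {i}: {n} match columns, expected {expected_len}")
    return problems
```

For this asset, one would pass the contents of `KaiB_filtered.a3m` and `KaiB_seq_ids.txt` with `expected_len=91`. If NumPy is available, `np.load("KaiB_fi_matrix.npy").shape` should additionally equal `(364, 91)` according to the table above.
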
## Source paths (private dev repo, read-only)

- Filtered MSA:
  `/data1/hanqun/SF-Design/SF-Cluster/data/processed/msa/KaiB/KaiB/KaiB_KaiBTE_91aa_UniProt_Q79V61_5to95_2QKE_chainB.filtered.a3m`
  (depth 6821, L=91)
- FI artifacts (per-subset):
  `/data1/hanqun/SF-Design/SF-Cluster/results/frustai_artifacts/KaiB/diverse_sf/KaiB/{000..011}/`
  with files `fi_matrix.npy` ((32, 91) float32), `metadata.json`,
  `fi_residual_matrix.npy`, `entropy_matrix.npy`.
- Source subset A3Ms (used to map FI rows → sequence IDs):
  `/data1/hanqun/SF-Design/SF-Cluster/results/baseline_p8/diverse_sf/KaiB/KaiB/screen/diversesf_KaiB_KaiB_seed{000..011}.a3m`.

## Construction recipe

1. For each of the 12 `diverse_sf` subsets, load `fi_matrix.npy` ((32, 91)
   float32) and the corresponding `diversesf_KaiB_KaiB_seed{NNN}.a3m`.
2. Concatenate rows in subset-index order; track the parallel sequence-ID
   list from the A3M records.
3. **Deduplication policy**: first occurrence wins. A sequence ID seen in an
   earlier subset is skipped (both in the FI matrix and the ID list). This
   reduces 12 × 32 = 384 raw rows to 364 unique rows.
4. Extract the corresponding sequences (with their full headers and
   lowercase insertion states) from the filtered MSA, preserving the order
   established in step 3. The query (`>101`) is always row 0.

All 364 unique IDs were found in the filtered MSA (0 missing).

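
Steps 2 to 4 of the recipe can be sketched as follows. This is an illustration of the first-occurrence-wins policy, not the script that produced the asset: the function names are hypothetical, FI rows are plain Python lists so the example has no dependencies, and the short ID is again assumed to be the first whitespace-delimited token of an A3M header.

```python
from typing import Dict, List, Sequence, Tuple

Row = List[float]
Record = Tuple[str, str]  # (full header, sequence with insertion states)


def concat_first_wins(
    subsets: Sequence[Tuple[Sequence[str], Sequence[Row]]],
) -> Tuple[List[str], List[Row]]:
    """Concatenate (ids, rows) pairs in subset order, keeping the first
    occurrence of each sequence ID and skipping later repeats."""
    seen = set()
    kept_ids: List[str] = []
    kept_rows: List[Row] = []
    for ids, rows in subsets:
        if len(ids) != len(rows):
            raise ValueError("each subset needs one FI row per sequence ID")
        for sid, row in zip(ids, rows):
            if sid in seen:
                continue  # already emitted earlier: drop ID and FI row together
            seen.add(sid)
            kept_ids.append(sid)
            kept_rows.append(row)
    return kept_ids, kept_rows


def select_in_order(
    msa_records: Sequence[Record], kept_ids: Sequence[str]
) -> Tuple[List[Record], List[str]]:
    """Pull full records from the filtered MSA in the order of kept_ids.

    ASSUMPTION: the short ID is the first whitespace-delimited header token.
    Returns the selected records and the list of IDs that were not found.
    """
    by_id: Dict[str, Record] = {}
    for header, seq in msa_records:
        by_id.setdefault(header.split()[0], (header, seq))
    missing = [sid for sid in kept_ids if sid not in by_id]
    selected = [by_id[sid] for sid in kept_ids if sid in by_id]
    return selected, missing
```

Because IDs and FI rows are filtered in the same loop, row `i` of the result still describes ID `i`. If the query is the first record of subset 000, it stays at row 0 and every later copy of it is dropped, which matches the statement in step 4. With NumPy arrays, the kept rows could be stacked with `np.vstack` at the end.
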
## Models

- **FrustrAI-Seq weights**: HF repo `leuschj/FrustrAI-Seq`, commit
  `ee5a01a29fde00630f4a1157f0e6cb8343ac434b`. Inference in fp16 with LoRA
  adapters merged.

## License

This demo asset is released under MIT alongside the SF-Cluster OSS package.
The KaiB sequence (UniProt Q79V61, *Thermosynechococcus elongatus*) and its
MSA neighbors are public-domain sequence records via UniRef100 / MGnify;
no proprietary structures are included. FrustrAI-Seq outputs are derived
features (floating-point FI values) and are released by the FrustrAI-Seq
authors under their own license; see
https://huggingface.co/leuschj/FrustrAI-Seq.