KaiB demo data — provenance

This directory contains a concatenated demo asset for the SF-Cluster Colab notebook. It is derived from the SF-Cluster Phase II benchmark's KaiB diverse_sf arm and the FrustrAI-Seq per-residue Frustration Index (FI) outputs.

Files

| File | Shape / size | Description |
| --- | --- | --- |
| KaiB_filtered.a3m | 364 records, L=91 | Subset of the KaiB filtered MSA. Query (>101, UniProt Q79V61 residues 5–95) is row 0. Lowercase insertion-state letters preserved. |
| KaiB_fi_matrix.npy | (364, 91) float32 | Per-residue FI matrix. Row i corresponds to record i in the A3M. |
| KaiB_seq_ids.txt | 364 lines | One short sequence ID per line, in the same order as the A3M / FI matrix. |
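
The row alignment described above can be checked after download. The following is a minimal sketch, not part of the SF-Cluster package: the function names `read_a3m`, `match_length`, and `check_demo_files` are hypothetical, and it assumes the usual A3M convention that lowercase letters are insertions and do not occupy an aligned column.

```python
import numpy as np


def read_a3m(path):
    """Return a list of (header, sequence) pairs from an A3M/FASTA-style file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def match_length(seq):
    """Count aligned columns: every character except lowercase insertions."""
    return sum(1 for ch in seq if not ch.islower())


def check_demo_files(a3m_path, fi_path, ids_path):
    """Verify that the A3M, FI matrix, and ID list agree; return the FI shape."""
    records = read_a3m(a3m_path)
    fi = np.load(fi_path)
    with open(ids_path) as handle:
        ids = [line.strip() for line in handle if line.strip()]
    assert fi.ndim == 2, "FI matrix should be 2-D"
    assert len(records) == fi.shape[0] == len(ids), "row counts differ"
    for header, seq in records:
        # Assumption: each record has exactly one aligned column per FI column.
        assert match_length(seq) == fi.shape[1], f"bad aligned length: {header}"
    return fi.shape
```

Called as `check_demo_files("KaiB_filtered.a3m", "KaiB_fi_matrix.npy", "KaiB_seq_ids.txt")`, it should return `(364, 91)` if the files match the table above.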

Source paths (private dev repo, read-only)

  • Filtered MSA: /data1/hanqun/SF-Design/SF-Cluster/data/processed/msa/KaiB/KaiB/KaiB_KaiBTE_91aa_UniProt_Q79V61_5to95_2QKE_chainB.filtered.a3m (depth 6821, L=91)
  • FI artifacts (per-subset): /data1/hanqun/SF-Design/SF-Cluster/results/frustai_artifacts/KaiB/diverse_sf/KaiB/{000..011}/ with files fi_matrix.npy ((32, 91) float32), metadata.json, fi_residual_matrix.npy, entropy_matrix.npy.
  • Source subset A3Ms (used to map FI rows → sequence IDs): /data1/hanqun/SF-Design/SF-Cluster/results/baseline_p8/diverse_sf/KaiB/KaiB/screen/diversesf_KaiB_KaiB_seed{000..011}.a3m.

Construction recipe

  1. For each of the 12 diverse_sf subsets, load fi_matrix.npy ((32, 91) float32) and the corresponding diversesf_KaiB_KaiB_seed{NNN}.a3m.
  2. Concatenate rows in subset-index order; track the parallel sequence-ID list from the A3M records.
  3. Deduplication policy: first occurrence wins. A sequence ID seen in an earlier subset is skipped (both in the FI matrix and the ID list). This reduces 12 × 32 = 384 raw rows to 364 unique rows, dropping 20 duplicates.
  4. Extract the corresponding sequences (with their full headers and lowercase insertion states) from the filtered MSA, preserving the order established in step 3. The query (>101) is always row 0.
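
Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions rather than the script that built the asset: `concat_first_wins` is a hypothetical name, and the caller is assumed to have already loaded each subset's `fi_matrix.npy` and read its sequence IDs from the matching A3M.

```python
import numpy as np


def concat_first_wins(subsets):
    """Concatenate per-subset FI rows, keeping the first row seen for each ID.

    subsets: iterable of (fi_matrix, seq_ids) pairs in subset-index order,
             where fi_matrix has one row per entry in seq_ids.
    Returns (stacked float32 matrix, list of unique IDs in first-seen order).
    """
    seen = set()
    rows, ids = [], []
    for fi, seq_ids in subsets:
        if fi.shape[0] != len(seq_ids):
            raise ValueError("FI rows and sequence IDs are out of step")
        for row, seq_id in zip(fi, seq_ids):
            if seq_id in seen:
                continue  # an earlier occurrence already claimed this ID
            seen.add(seq_id)
            rows.append(row)
            ids.append(seq_id)
    return np.stack(rows).astype(np.float32), ids
```

Note that this sketch also drops an ID repeated within a single subset, which goes slightly beyond the policy as worded in step 3.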

All 364 unique IDs were found in the filtered MSA (0 missing).
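
The lookup in step 4, together with the missing-ID count, might look like the sketch below. `select_records` is a hypothetical helper, and it assumes that the short sequence ID is the first whitespace-delimited token of each A3M header; the actual header layout is not stated here.

```python
def select_records(msa_records, ordered_ids):
    """Pick MSA records by short ID, preserving the order of ordered_ids.

    msa_records: list of (header, sequence) pairs from the filtered MSA.
    Returns (selected records, IDs that were not found in the MSA).
    """
    by_id = {}
    for header, seq in msa_records:
        tokens = header.split()
        short_id = tokens[0] if tokens else ""  # assumption: ID is the first token
        by_id.setdefault(short_id, (header, seq))  # keep the first record per ID
    selected, missing = [], []
    for seq_id in ordered_ids:
        if seq_id in by_id:
            selected.append(by_id[seq_id])
        else:
            missing.append(seq_id)
    return selected, missing
```

An empty `missing` list corresponds to the "0 missing" result reported above.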

Models

  • FrustrAI-Seq weights: HF repo leuschj/FrustrAI-Seq, commit ee5a01a29fde00630f4a1157f0e6cb8343ac434b. Inference in fp16 with LoRA adapters merged.

License

This demo asset is released under the MIT license alongside the SF-Cluster OSS package. The KaiB sequence (UniProt Q79V61, Thermosynechococcus elongatus) and its MSA neighbors are public-domain sequence records via UniRef100 / MGnify; no proprietary structures are included. FrustrAI-Seq outputs are derived features (floating-point FI values) and are released by the FrustrAI-Seq authors under their own license; see https://huggingface.co/leuschj/FrustrAI-Seq.