| ---
|
| license: mit
|
| ---
|
|
|
| # ProtCompass Embeddings
|
|
|
| Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
|
|
|
| ## Dataset Structure
|
|
|
| ```
|
| embeddings/
|
| ├── secondary_structure/ # CB513 dataset (29 GB)
|
| ├── mutation_effect/ # ProteinGym DMS assays (4.5 GB)
|
| ├── contact_prediction/ # ProteinNet (2.9 GB)
|
| ├── stability/ # TAPE stability (1.6 GB)
|
| ├── ppi_site/ # PPI site prediction (1.4 GB)
|
| ├── fluorescence/ # GFP fluorescence (841 MB)
|
| ├── metal_binding/ # Metal binding sites (570 MB)
|
| ├── go_bp/ # GO Biological Process (214 MB)
|
| ├── go_mf/ # GO Molecular Function (68 MB)
|
| ├── remote_homology/ # SCOPe fold classification (20 MB)
|
| ├── ec_classification/ # Enzyme classification (18 MB)
|
| ├── membrane_soluble/ # Membrane/soluble (17 MB)
|
| └── subcellular_localization/ # Subcellular location (17 MB)
|
| ```
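To see which encoders are available for each task, the repository file listing can be grouped by path. The sketch below assumes the layout `embeddings/<task>/<encoder>/<file>`, which is inferred from the example paths in this README rather than guaranteed; `encoders_by_task` and `list_remote_encoders` are hypothetical helper names.

```python
from collections import defaultdict


def encoders_by_task(repo_files):
    """Group 'embeddings/<task>/<encoder>/<file>' paths into {task: sorted encoder names}."""
    grouped = defaultdict(set)
    for path in repo_files:
        parts = path.split("/")
        # Skip anything that does not match the assumed four-level layout.
        if len(parts) >= 4 and parts[0] == "embeddings":
            grouped[parts[1]].add(parts[2])
    return {task: sorted(names) for task, names in grouped.items()}


def list_remote_encoders(repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Fetch the file listing from the Hub (needs network access) and group it."""
    from huggingface_hub import list_repo_files

    return encoders_by_task(list_repo_files(repo_id, repo_type="dataset"))
```

Calling `list_remote_encoders()["mutation_effect"]` would then return the encoder directory names for that task, provided the layout assumption holds.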
|
|
|
| ## File Format
|
|
|
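| Each task directory contains one subdirectory per encoder (for example, `embeddings/mutation_effect/esm2/`). Each encoder directory contains: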
|
| - `train_embeddings.npy`: Training set embeddings (N × D)
|
| - `test_embeddings.npy`: Test set embeddings (M × D)
|
| - `train_labels.npy`: Training labels
|
| - `test_labels.npy`: Test labels
|
| - `train_ids.txt`: Protein IDs for training set
|
| - `test_ids.txt`: Protein IDs for test set
|
| - `meta.json`: Metadata (encoder name, dimensions, dataset info)
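The files listed above can be read together from a local copy of one encoder directory. This is a minimal sketch, not an official loader: `load_encoder_dir` is a hypothetical helper, and it assumes that the ID files hold one protein ID per line and that row `i` of the embeddings and labels belongs to the `i`-th ID. Tasks with a different label layout may need the consistency check adjusted.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir):
    """Load one encoder directory into a dict with 'meta', 'train', and 'test' entries."""
    encoder_dir = Path(encoder_dir)
    data = {"meta": json.loads((encoder_dir / "meta.json").read_text())}
    for split in ("train", "test"):
        embeddings = np.load(encoder_dir / f"{split}_embeddings.npy")
        labels = np.load(encoder_dir / f"{split}_labels.npy")
        # Assumption: the ID file holds one protein ID per line.
        lines = (encoder_dir / f"{split}_ids.txt").read_text().splitlines()
        ids = [line.strip() for line in lines if line.strip()]
        # Assumption: row i of the embeddings and labels belongs to ids[i].
        if not (len(ids) == embeddings.shape[0] == labels.shape[0]):
            raise ValueError(
                f"{split}: {len(ids)} ids, {embeddings.shape[0]} embeddings, "
                f"{labels.shape[0]} labels"
            )
        data[split] = {"embeddings": embeddings, "labels": labels, "ids": ids}
    return data
```

The length check fails early if the three files for a split are out of sync, which is easier to debug than a shape error inside a downstream model.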
|
|
|
| ## Usage
|
|
|
| ```python
|
| import numpy as np
|
| from huggingface_hub import hf_hub_download
|
| from sklearn.linear_model import Ridge
|
|
|
| REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
|
| ENCODER_DIR = "embeddings/mutation_effect/esm2"
|
|
|
| def load_array(name):
|
|     """Download one .npy file from the encoder directory and load it."""
|
|     path = hf_hub_download(
|
|         repo_id=REPO_ID,
|
|         filename=f"{ENCODER_DIR}/{name}",
|
|         repo_type="dataset"
|
|     )
|
|     return np.load(path)
|
|
|
| # Download embeddings and labels for a specific encoder
|
| train_emb = load_array("train_embeddings.npy")
|
| test_emb = load_array("test_embeddings.npy")
|
| train_labels = load_array("train_labels.npy")
|
| test_labels = load_array("test_labels.npy")
|
|
|
| # Fit a ridge regression probe on the frozen embeddings
|
| model = Ridge()
|
| model.fit(train_emb, train_labels)
|
| score = model.score(test_emb, test_labels)  # R^2 on the test set
|
| ```
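To fetch every file for one encoder in a single call rather than one file at a time, `snapshot_download` from `huggingface_hub` accepts glob patterns. The sketch below assumes the `embeddings/<task>/<encoder>/` layout used in the example paths in this README; `encoder_pattern` and `download_encoder` are hypothetical helper names.

```python
from pathlib import Path

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"


def encoder_pattern(task, encoder):
    """Glob pattern matching every file of one encoder under one task."""
    return f"embeddings/{task}/{encoder}/*"


def download_encoder(task, encoder, repo_id=REPO_ID):
    """Download one encoder directory (needs network access) and return its local path."""
    from huggingface_hub import snapshot_download

    snapshot_root = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=[encoder_pattern(task, encoder)],
    )
    return Path(snapshot_root) / "embeddings" / task / encoder
```

For example, `download_encoder("mutation_effect", "esm2")` downloads only that encoder's files, which avoids pulling the full 41 GB dataset.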
|
|
|
| ## Encoders Included
|
|
|
| ### Sequence Encoders (8)
|
| ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
|
|
|
| ### Structure Encoders (50+)
|
| Representative models: GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF
|
|
|
| ### Multimodal Encoders (5)
|
| Including SaProt, ESM-IF, FoldVision
|
|
|
| ### Baselines
|
| Random, Length, Torsion, One-hot, BLOSUM
|
|
|
| ## Dataset Statistics
|
|
|
| - **Total size**: 41 GB
|
| - **Total encoders**: 70+
|
| - **Total tasks**: 13
|
| - **Total proteins**: ~500K across all tasks
|
|
|
| ## Citation
|
|
|
| If you use these embeddings, please cite:
|
|
|
| ```bibtex
|
| @article{protcompass2026,
|
| title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
|
| author={Your Name et al.},
|
| journal={NeurIPS},
|
| year={2026}
|
| }
|
| ```
|
|
|
| ## License
|
|
|
| MIT License
|
|
|
| ## Contact
|
|
|
| For questions or issues, please open an issue on the repository.
|
|
|