---
license: mit
---

# ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.

## Dataset Structure

```
embeddings/
├── secondary_structure/        # CB513 dataset (29 GB)
├── mutation_effect/            # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/         # ProteinNet (2.9 GB)
├── stability/                  # TAPE stability (1.6 GB)
├── ppi_site/                   # PPI site prediction (1.4 GB)
├── fluorescence/               # GFP fluorescence (841 MB)
├── metal_binding/              # Metal binding sites (570 MB)
├── go_bp/                      # GO Biological Process (214 MB)
├── go_mf/                      # GO Molecular Function (68 MB)
├── remote_homology/            # SCOPe fold classification (20 MB)
├── ec_classification/          # Enzyme classification (18 MB)
├── membrane_soluble/           # Membrane/soluble (17 MB)
└── subcellular_localization/   # Subcellular location (17 MB)
```

## File Format

Each task directory holds one subdirectory per encoder (for example, `embeddings/mutation_effect/esm2/`). Each encoder directory contains:

- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)

## Usage

```python
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"


def load_array(name):
    """Download one .npy file from the encoder directory and load it."""
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)


# Download embeddings and labels for a specific encoder
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Use for downstream tasks: fit a ridge regression probe
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)  # R^2 on the test set
```

## Encoders Included

### Sequence Encoders (8)

ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh

### Structure Encoders (50+)

GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF

### Multimodal Encoders (5)

SaProt, ESM-IF, FoldVision

### Baselines

Random, Length, Torsion, One-hot, BLOSUM

## Dataset Statistics

- **Total size**: 41 GB
- **Total encoders**: 70+
- **Total tasks**: 13
- **Total proteins**: ~500K across all tasks

## Citation

If you use these embeddings, please cite:

```bibtex
@article{protcompass2026,
  title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}
```

## License

MIT License

## Contact

For questions or issues, please open an issue on the repository.
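
## Example: Reading a Downloaded Encoder Directory

As a companion to the File Format section, the sketch below shows one way to read all seven files of a single encoder directory that has already been downloaded to local disk. The function name `load_encoder_dir` and the `check_lengths` flag are hypothetical helpers, not part of this dataset. The sketch assumes that the `*_ids.txt` files hold one protein ID per line and that the `.npy` files load with plain `np.load`; neither is confirmed by this README. The length check also assumes one label row per protein, which may not hold for per-residue tasks, so it can be switched off.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir, check_lengths=True):
    """Read the files listed under File Format from a local encoder directory.

    Hypothetical helper: assumes one protein ID per line in *_ids.txt.
    """
    encoder_dir = Path(encoder_dir)
    data = {}
    for split in ("train", "test"):
        emb = np.load(encoder_dir / f"{split}_embeddings.npy")
        labels = np.load(encoder_dir / f"{split}_labels.npy")
        id_text = (encoder_dir / f"{split}_ids.txt").read_text()
        # Assumption: one ID per line; blank lines are ignored.
        ids = [line.strip() for line in id_text.splitlines() if line.strip()]
        # Assumption: embeddings, labels, and IDs share the same first axis.
        if check_lengths and not (len(emb) == len(labels) == len(ids)):
            raise ValueError(
                f"{split}: {len(emb)} embeddings, {len(labels)} labels, "
                f"{len(ids)} ids"
            )
        data[split] = {"embeddings": emb, "labels": labels, "ids": ids}
    with open(encoder_dir / "meta.json") as fh:
        data["meta"] = json.load(fh)
    return data
```

Calling `load_encoder_dir("embeddings/mutation_effect/esm2")` on a local copy would return a dictionary with `train`, `test`, and `meta` entries. Pass `check_lengths=False` if a task stores labels in a shape that does not align row by row with the embeddings.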