
ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.

Dataset Structure

embeddings/
├── secondary_structure/        # CB513 dataset (29 GB)
├── mutation_effect/            # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/         # ProteinNet (2.9 GB)
├── stability/                  # TAPE stability (1.6 GB)
├── ppi_site/                   # PPI site prediction (1.4 GB)
├── fluorescence/               # GFP fluorescence (841 MB)
├── metal_binding/              # Metal binding sites (570 MB)
├── go_bp/                      # GO Biological Process (214 MB)
├── go_mf/                      # GO Molecular Function (68 MB)
├── remote_homology/            # SCOPe fold classification (20 MB)
├── ec_classification/          # Enzyme classification (18 MB)
├── membrane_soluble/           # Membrane/soluble (17 MB)
└── subcellular_localization/   # Subcellular location (17 MB)
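
To see which encoders are available for each task, the repository file listing can be grouped by directory. The sketch below assumes every file sits at embeddings/<task>/<encoder>/<file>, which is what the path in the usage example suggests; the helper names are hypothetical and not part of this dataset.

```python
from collections import defaultdict

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"


def encoders_by_task(paths):
    """Group repository paths into {task: [encoder, ...]}.

    Assumes the layout embeddings/<task>/<encoder>/<file>; paths that do
    not match this shape (for example README.md) are ignored.
    """
    found = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        if len(parts) == 4 and parts[0] == "embeddings":
            found[parts[1]].add(parts[2])
    return {task: sorted(encoders) for task, encoders in sorted(found.items())}


def list_available_encoders():
    """Fetch the file listing from the Hub and group it by task."""
    # Imported lazily so encoders_by_task() works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return encoders_by_task(list_repo_files(REPO_ID, repo_type="dataset"))
```

Calling list_available_encoders() needs network access; encoders_by_task() can also be applied to any list of paths you already have.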

File Format

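Each task directory holds one subdirectory per encoder (for example, embeddings/mutation_effect/esm2/). Each encoder directory contains: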

  • train_embeddings.npy: Training set embeddings (N × D)
  • test_embeddings.npy: Test set embeddings (M × D)
  • train_labels.npy: Training labels
  • test_labels.npy: Test labels
  • train_ids.txt: Protein IDs for training set
  • test_ids.txt: Protein IDs for test set
  • meta.json: Metadata (encoder name, dimensions, dataset info)
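
The files listed above can be read from a local copy of an encoder directory as in the following sketch. It assumes one protein ID per line in the ID files and that row i of the embeddings and labels belongs to the i-th ID; that may not hold for per-residue tasks, so treat the length check as a sanity test and not a guarantee. The function names are hypothetical, and no particular meta.json keys are assumed.

```python
import json
from pathlib import Path

import numpy as np


def load_split(encoder_dir, split):
    """Load (embeddings, labels, ids) for one split: "train" or "test"."""
    encoder_dir = Path(encoder_dir)
    emb = np.load(encoder_dir / f"{split}_embeddings.npy")
    labels = np.load(encoder_dir / f"{split}_labels.npy")
    # One protein ID per line; skip blank lines such as a trailing newline.
    lines = (encoder_dir / f"{split}_ids.txt").read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    # Assumed invariant: the first axis of each array indexes proteins.
    if not (len(emb) == len(labels) == len(ids)):
        raise ValueError(
            f"{split}: {len(emb)} embeddings, {len(labels)} labels, {len(ids)} ids"
        )
    return emb, labels, ids


def load_meta(encoder_dir):
    """Return meta.json as a dict without assuming which keys it has."""
    return json.loads((Path(encoder_dir) / "meta.json").read_text())
```

For example, load_split("embeddings/mutation_effect/esm2", "train") returns the training arrays together with the matching ID list, provided that directory has been downloaded locally.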

Usage

import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"

def load_array(name):
    """Download one .npy file from the encoder directory and load it."""
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)

# Download the embeddings and labels for one encoder
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Fit a linear probe; score() returns R^2 on the test set
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
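
The usage example above downloads files one at a time. When every file of an encoder is needed, it may be more convenient to fetch the whole directory in one call. The sketch below uses snapshot_download from huggingface_hub with an allow_patterns filter; the helper names are hypothetical, and the directory layout is assumed from the path in the usage example.

```python
from pathlib import Path

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"


def encoder_pattern(task, encoder):
    """Glob pattern matching every file in one encoder directory."""
    return f"embeddings/{task}/{encoder}/*"


def download_encoder(task, encoder):
    """Download one encoder directory and return its local path."""
    # Imported lazily so encoder_pattern() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        # Only files matching the pattern are downloaded.
        allow_patterns=[encoder_pattern(task, encoder)],
    )
    return Path(root) / "embeddings" / task / encoder
```

For example, download_encoder("mutation_effect", "esm2") would fetch only that encoder's files and return the local folder that contains them, which avoids pulling the full 41 GB dataset.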

Encoders Included

Sequence Encoders (8)

ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh

Structure Encoders (50+)

GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF

Multimodal Encoders (5)

SaProt, ESM-IF, FoldVision

Baselines

Random, Length, Torsion, One-hot, BLOSUM

Dataset Statistics

  • Total size: 41 GB
  • Total encoders: 70+
  • Total tasks: 13
  • Total proteins: ~500K across all tasks

Citation

If you use these embeddings, please cite:

@article{protcompass2026,
  title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}

License

MIT License

Contact

For questions or issues, please open an issue on the repository.