---

license: mit
---


# ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.

## Dataset Structure

```
embeddings/
├── secondary_structure/        # CB513 dataset (29 GB)
├── mutation_effect/            # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/         # ProteinNet (2.9 GB)
├── stability/                  # TAPE stability (1.6 GB)
├── ppi_site/                   # PPI site prediction (1.4 GB)
├── fluorescence/               # GFP fluorescence (841 MB)
├── metal_binding/              # Metal binding sites (570 MB)
├── go_bp/                      # GO Biological Process (214 MB)
├── go_mf/                      # GO Molecular Function (68 MB)
├── remote_homology/            # SCOPe fold classification (20 MB)
├── ec_classification/          # Enzyme classification (18 MB)
├── membrane_soluble/           # Membrane/soluble (17 MB)
└── subcellular_localization/   # Subcellular location (17 MB)
```
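
To see which encoders exist for each task, the repository file list can be grouped by the `embeddings/<task>/<encoder>/` layout that the download paths in this README follow. The snippet below is a minimal sketch under that assumption; `group_encoders_by_task` is a hypothetical helper (not part of this dataset or of `huggingface_hub`), and the sample paths are illustrative only.

```python
from collections import defaultdict


def group_encoders_by_task(paths):
    """Map each task name to the sorted encoder directory names found under it.

    Assumes the layout embeddings/<task>/<encoder>/<file>; other paths are skipped.
    """
    encoders = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        if len(parts) == 4 and parts[0] == "embeddings":
            _, task, encoder, _ = parts
            encoders[task].add(encoder)
    return {task: sorted(names) for task, names in encoders.items()}


# Illustrative paths only; a real listing would come from the Hub.
sample_paths = [
    "README.md",
    "embeddings/mutation_effect/esm2/train_embeddings.npy",
    "embeddings/mutation_effect/esm2/meta.json",
    "embeddings/stability/esm2/train_embeddings.npy",
]
print(group_encoders_by_task(sample_paths))
```

With network access, the real file list can be obtained from `huggingface_hub.list_repo_files("Anonymoususer2223/ProtCompass_Embeddings", repo_type="dataset")` and passed to the helper in place of `sample_paths`.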

## File Format

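Each task directory holds one subdirectory per encoder (for example, `embeddings/mutation_effect/esm2/`). Each encoder directory contains: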
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
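
Once an encoder directory is available locally, a quick completeness check can catch partial downloads before any training starts. The following is a rough sketch using only the standard library; the helper names are hypothetical, the one-ID-per-line format of the `.txt` files is an assumption, and no particular keys in `meta.json` are assumed.

```python
import json
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "train_embeddings.npy",
    "test_embeddings.npy",
    "train_labels.npy",
    "test_labels.npy",
    "train_ids.txt",
    "test_ids.txt",
    "meta.json",
]


def missing_files(encoder_dir):
    """Return the expected file names that are absent from encoder_dir."""
    root = Path(encoder_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def read_ids(path):
    """Read protein IDs, assuming one ID per line (not confirmed by this card)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_meta(path):
    """Load meta.json as a plain dict; no particular keys are assumed."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For a complete directory, `missing_files(local_dir)` should return an empty list, and `len(read_ids(...))` can be compared against the first dimension (N or M) of the matching embeddings array.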

## Usage

```python
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"


def load_array(name):
    """Download one .npy file from the encoder directory and load it."""
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)


# Download the embeddings and labels for one encoder
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Use for downstream tasks (Ridge fits regression targets; score() returns R^2)
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
```
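
Fetching files one at a time becomes repetitive when every file for an encoder is needed. As one possible approach, the sketch below uses `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter to fetch a single encoder directory. The helper names `encoder_pattern` and `download_encoder` are hypothetical, and the task and encoder arguments are assumed to match directory names in the repository.

```python
from pathlib import Path

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"


def encoder_pattern(task, encoder):
    """Build the glob pattern matching every file in one encoder directory."""
    return f"embeddings/{task}/{encoder}/*"


def download_encoder(task, encoder):
    """Download one encoder directory and return its local path.

    Requires network access and the huggingface_hub package.
    """
    # Imported here so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_root = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=[encoder_pattern(task, encoder)],
    )
    return Path(snapshot_root) / "embeddings" / task / encoder
```

For example, `download_encoder("mutation_effect", "esm2")` would return a local directory from which the `.npy` files can be loaded with `np.load`. Restricting the download with a pattern avoids pulling the full 41 GB repository.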

## Encoders Included

### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh

### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF

### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision

### Baselines
Random, Length, Torsion, One-hot, BLOSUM

## Dataset Statistics

- **Total size**: 41 GB
- **Total encoders**: 70+
- **Total tasks**: 13
- **Total proteins**: ~500K across all tasks

## Citation

If you use these embeddings, please cite:

```bibtex
@article{protcompass2026,
  title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}
```

## License

MIT License

## Contact

For questions or issues, please open an issue on the repository.