---
license: mit
---
# ProtCompass Embeddings
Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
## Dataset Structure
```
embeddings/
├── secondary_structure/        # CB513 dataset (29 GB)
├── mutation_effect/            # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/         # ProteinNet (2.9 GB)
├── stability/                  # TAPE stability (1.6 GB)
├── ppi_site/                   # PPI site prediction (1.4 GB)
├── fluorescence/               # GFP fluorescence (841 MB)
├── metal_binding/              # Metal binding sites (570 MB)
├── go_bp/                      # GO Biological Process (214 MB)
├── go_mf/                      # GO Molecular Function (68 MB)
├── remote_homology/            # SCOPe fold classification (20 MB)
├── ec_classification/          # Enzyme classification (18 MB)
├── membrane_soluble/           # Membrane/soluble (17 MB)
└── subcellular_localization/   # Subcellular location (17 MB)
```
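Because a single task can be tens of gigabytes, it is often preferable to fetch one encoder at a time. The sketch below assumes the `embeddings/<task>/<encoder>/` layout implied by the file path in the Usage section. The helper names `encoder_pattern` and `download_encoder` are hypothetical and not part of this repository. `snapshot_download` with `allow_patterns` is the standard `huggingface_hub` call for partial downloads.

```python
from pathlib import PurePosixPath

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"


def encoder_pattern(task: str, encoder: str) -> str:
    """Build a glob pattern matching every file of one encoder within one task.

    Assumes the layout embeddings/<task>/<encoder>/<file>.
    """
    return str(PurePosixPath("embeddings") / task / encoder / "*")


def download_encoder(task: str, encoder: str) -> str:
    """Download a single encoder directory and return the local snapshot root."""
    # Imported lazily so encoder_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=[encoder_pattern(task, encoder)],
    )


# Show the pattern that would be requested for the encoder used in the Usage section.
print(encoder_pattern("mutation_effect", "esm2"))
```

Calling `download_encoder("mutation_effect", "esm2")` would then return a local directory whose `embeddings/mutation_effect/esm2/` subfolder holds the files for that encoder, provided the directory name matches the one in the repository.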
## File Format
Each encoder directory (`embeddings/<task>/<encoder>/`) contains:
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
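As a rough illustration of how these files fit together, the sketch below loads one split from a local copy of an encoder directory and checks that the pieces line up. It assumes one ID per line in the `*_ids.txt` files and that row `i` of the embeddings corresponds to label `i` and ID `i`. Neither is stated above, and the row alignment may not hold for per-residue tasks. The function names `load_split` and `load_meta` are hypothetical.

```python
import json
from pathlib import Path

import numpy as np


def load_split(encoder_dir, split):
    """Load embeddings, labels, and IDs for one split ("train" or "test")."""
    encoder_dir = Path(encoder_dir)
    emb = np.load(encoder_dir / f"{split}_embeddings.npy")
    labels = np.load(encoder_dir / f"{split}_labels.npy")
    # Assumption: one protein ID per line; blank lines are ignored.
    text = (encoder_dir / f"{split}_ids.txt").read_text()
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: row i of the embeddings matches label i and ID i.
    if not (len(emb) == len(labels) == len(ids)):
        raise ValueError(
            f"{split}: {len(emb)} embeddings, {len(labels)} labels, {len(ids)} IDs"
        )
    return emb, labels, ids


def load_meta(encoder_dir):
    """Read meta.json as a plain dictionary; its keys are not assumed here."""
    with open(Path(encoder_dir) / "meta.json") as f:
        return json.load(f)
```

For example, `emb, labels, ids = load_split("embeddings/mutation_effect/esm2", "train")` would return an `(N, D)` array together with `N` labels and `N` IDs, or raise a `ValueError` if the counts disagree.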
## Usage
```python
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"


def load_array(name):
    """Download one .npy file from the encoder directory and load it."""
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)


# Download embeddings and labels for a specific encoder
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Fit a linear probe for the downstream task
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)  # R^2 on the test set
```
## Encoders Included
### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF
### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision
### Baselines
Random, Length, Torsion, One-hot, BLOSUM
## Dataset Statistics
- **Total size**: 41 GB
- **Total encoders**: 70+
- **Total tasks**: 13
- **Total proteins**: ~500K across all tasks
## Citation
If you use these embeddings, please cite:
```bibtex
@article{protcompass2026,
title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
author={Your Name et al.},
journal={NeurIPS},
year={2026}
}
```
## License
MIT License
## Contact
For questions or issues, please open an issue on the repository.