---
license: mit
---
# ProtCompass Embeddings
Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
## Dataset Structure
```
embeddings/
├── secondary_structure/       # CB513 dataset (29 GB)
├── mutation_effect/           # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/        # ProteinNet (2.9 GB)
├── stability/                 # TAPE stability (1.6 GB)
├── ppi_site/                  # PPI site prediction (1.4 GB)
├── fluorescence/              # GFP fluorescence (841 MB)
├── metal_binding/             # Metal binding sites (570 MB)
├── go_bp/                     # GO Biological Process (214 MB)
├── go_mf/                     # GO Molecular Function (68 MB)
├── remote_homology/           # SCOPe fold classification (20 MB)
├── ec_classification/         # Enzyme classification (18 MB)
├── membrane_soluble/          # Membrane/soluble (17 MB)
└── subcellular_localization/  # Subcellular location (17 MB)
```
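To see which encoders are available for each task, the repository file listing can be grouped by directory. The sketch below assumes every file sits at `embeddings/<task>/<encoder>/<file>` (as the usage example path suggests); the helper names and the `gearnet` directory name in the sample paths are hypothetical.

```python
from collections import defaultdict


def group_by_task_and_encoder(paths):
    """Map each task name to a sorted list of encoder directory names.

    Assumes paths of the form 'embeddings/<task>/<encoder>/<file>';
    anything else (e.g. README.md) is skipped.
    """
    tasks = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        if len(parts) >= 4 and parts[0] == "embeddings":
            tasks[parts[1]].add(parts[2])
    return {task: sorted(encoders) for task, encoders in tasks.items()}


def list_remote_encoders(repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Fetch the repository file listing and group it (needs network access)."""
    from huggingface_hub import list_repo_files  # imported lazily

    return group_by_task_and_encoder(list_repo_files(repo_id, repo_type="dataset"))


# Offline illustration with made-up paths (encoder names are placeholders).
example_paths = [
    "embeddings/stability/esm2/train_embeddings.npy",
    "embeddings/stability/esm2/meta.json",
    "embeddings/stability/gearnet/train_embeddings.npy",
    "embeddings/go_mf/esm2/test_labels.npy",
    "README.md",
]
print(group_by_task_and_encoder(example_paths))
```

Calling `list_remote_encoders()` applies the same grouping to the real file listing.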
## File Format
Each encoder directory contains:
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
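
Once an encoder directory is available locally, the files listed above can be read together. This is a minimal sketch, not an official loader: the function names are hypothetical, it assumes one protein ID per line in the `*_ids.txt` files, and it assumes the `.npy` files hold plain numeric arrays. The keys inside `meta.json` are not documented here, so the sketch returns the parsed dictionary as-is.

```python
import json
from pathlib import Path

import numpy as np


def load_split(encoder_dir, split):
    """Load embeddings, labels, and IDs for one split ('train' or 'test')."""
    encoder_dir = Path(encoder_dir)
    emb = np.load(encoder_dir / f"{split}_embeddings.npy")
    labels = np.load(encoder_dir / f"{split}_labels.npy")
    # Assumption: one protein ID per line; blank lines are ignored.
    id_text = (encoder_dir / f"{split}_ids.txt").read_text()
    ids = [line.strip() for line in id_text.splitlines() if line.strip()]
    return emb, labels, ids


def load_meta(encoder_dir):
    """Return the parsed meta.json for an encoder directory."""
    with open(Path(encoder_dir) / "meta.json") as handle:
        return json.load(handle)


def rows_match(emb, labels, ids):
    """True if embeddings, labels, and IDs all have the same number of rows.

    This may legitimately be False for per-residue tasks, where the
    layout can differ from one row per protein.
    """
    return len(emb) == len(labels) == len(ids)
```

A quick `rows_match(*load_split(path, "train"))` check before training helps catch a partially downloaded directory.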
## Usage
```python
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"


def load_array(name):
    """Download one .npy file from the encoder directory and load it."""
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)


# Download specific encoder embeddings and the matching labels
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Use for downstream tasks: fit a ridge regressor on the embeddings
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)  # R^2 on the test set
```
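
To fetch every file for one encoder in a single call rather than file by file, `snapshot_download` with an `allow_patterns` filter is one option. The sketch below assumes the `embeddings/<task>/<encoder>/` layout shown earlier; `encoder_patterns` and `download_encoder` are hypothetical helper names, not part of this dataset or of `huggingface_hub`.

```python
from pathlib import Path


def encoder_patterns(task, encoder):
    """Glob patterns selecting all files of one encoder for one task."""
    return [f"embeddings/{task}/{encoder}/*"]


def download_encoder(task, encoder,
                     repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Download one encoder directory and return its local path.

    Needs network access; files land in the local Hugging Face cache.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    local_root = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=encoder_patterns(task, encoder),
    )
    return Path(local_root) / "embeddings" / task / encoder
```

For example, `download_encoder("mutation_effect", "esm2")` would return a directory containing the `.npy`, `.txt`, and `meta.json` files for that encoder, leaving the rest of the 41 GB dataset untouched.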
## Encoders Included
### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
### Structure Encoders (50+)
Including GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, and dMaSIF
### Multimodal Encoders (5)
Including SaProt, ESM-IF, and FoldVision
### Baselines
Random, Length, Torsion, One-hot, BLOSUM
## Dataset Statistics
- **Total size**: 41 GB
- **Total encoders**: 70+
- **Total tasks**: 13
- **Total proteins**: ~500K across all tasks
## Citation
If you use these embeddings, please cite:
```bibtex
@article{protcompass2026,
title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
author={Your Name et al.},
journal={NeurIPS},
year={2026}
}
```
## License
MIT License
## Contact
For questions or issues, please open an issue on the repository.