---
license: mit
---


# ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.

## Dataset Structure

```
embeddings/
├── secondary_structure/       # CB513 dataset (29 GB)
├── mutation_effect/           # ProteinGym DMS assays (4.5 GB)
├── contact_prediction/        # ProteinNet (2.9 GB)
├── stability/                 # TAPE stability (1.6 GB)
├── ppi_site/                  # PPI site prediction (1.4 GB)
├── fluorescence/              # GFP fluorescence (841 MB)
├── metal_binding/             # Metal binding sites (570 MB)
├── go_bp/                     # GO Biological Process (214 MB)
├── go_mf/                     # GO Molecular Function (68 MB)
├── remote_homology/           # SCOPe fold classification (20 MB)
├── ec_classification/         # Enzyme classification (18 MB)
├── membrane_soluble/          # Membrane/soluble (17 MB)
└── subcellular_localization/  # Subcellular location (17 MB)
```
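To see which encoders are available for each task, the repository file list can be grouped by directory. The following is a minimal sketch, assuming every file sits at `embeddings/<task>/<encoder>/<file>` (the layout implied by the path in the usage example); `group_by_task` and `list_encoders` are illustrative helper names, not part of this dataset.

```python
from collections import defaultdict


def group_by_task(paths):
    """Map each task name to a sorted list of encoder directories under it."""
    encoders = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Assumed layout: embeddings/<task>/<encoder>/<file>; skip anything else.
        if len(parts) == 4 and parts[0] == "embeddings":
            encoders[parts[1]].add(parts[2])
    return {task: sorted(names) for task, names in encoders.items()}


def list_encoders(repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Fetch the file list from the Hub and group it (requires network access)."""
    # Imported here so group_by_task can be used without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return group_by_task(list_repo_files(repo_id, repo_type="dataset"))
```

Calling `list_encoders()["mutation_effect"]` would then return the encoder directory names for that task, provided the layout assumption holds.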

## File Format

Each encoder directory contains:
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
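
The files listed above can be read together once an encoder directory is available locally. This is a sketch under stated assumptions: row `i` of the embeddings, label `i`, and line `i` of the ID file are assumed to describe the same protein, the ID file is assumed to hold one ID per line, and the arrays are assumed to be plain numeric arrays that `np.load` can read without pickling. The helpers `load_split` and `load_meta` are hypothetical names.

```python
import json
from pathlib import Path

import numpy as np


def load_split(encoder_dir, split):
    """Load embeddings, labels, and IDs for one split ("train" or "test")."""
    encoder_dir = Path(encoder_dir)
    emb = np.load(encoder_dir / f"{split}_embeddings.npy")
    labels = np.load(encoder_dir / f"{split}_labels.npy")
    text = (encoder_dir / f"{split}_ids.txt").read_text()
    # Assumption: one protein ID per line; blank lines are ignored.
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: all three files are aligned row by row.
    if not (len(emb) == len(labels) == len(ids)):
        raise ValueError(
            f"{split}: {len(emb)} embeddings, {len(labels)} labels, {len(ids)} IDs"
        )
    return emb, labels, ids


def load_meta(encoder_dir):
    """Return meta.json as a dict; its exact keys are not assumed here."""
    with open(Path(encoder_dir) / "meta.json") as fh:
        return json.load(fh)
```

The length check is a cheap guard against pairing embeddings with the wrong labels, which would otherwise fail silently in a downstream model.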

## Usage

```python
import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
ENCODER_DIR = "embeddings/mutation_effect/esm2"


def load_array(name):
    """Download one .npy file from the encoder directory and load it."""
    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=f"{ENCODER_DIR}/{name}",
        repo_type="dataset",
    )
    return np.load(path)


# Download embeddings and labels for a specific encoder
train_emb = load_array("train_embeddings.npy")
test_emb = load_array("test_embeddings.npy")
train_labels = load_array("train_labels.npy")
test_labels = load_array("test_labels.npy")

# Use for downstream tasks: fit a ridge regressor and report R^2 on the test set
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
```
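
When every file for an encoder is needed (embeddings, labels, IDs, and metadata), fetching the directory in one call may be more convenient than downloading files one by one. The sketch below uses `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter; the helper names `encoder_pattern` and `download_encoder` are illustrative, and the `embeddings/<task>/<encoder>/` layout is assumed from the path used above.

```python
def encoder_pattern(task, encoder):
    """Glob pattern matching every file in one encoder directory."""
    return f"embeddings/{task}/{encoder}/*"


def download_encoder(task, encoder,
                     repo_id="Anonymoususer2223/ProtCompass_Embeddings"):
    """Download one encoder directory and return its local path (needs network)."""
    # Imported here so encoder_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the local snapshot root; files keep their
    # repository-relative paths underneath it.
    root = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=[encoder_pattern(task, encoder)],
    )
    return f"{root}/embeddings/{task}/{encoder}"
```

For example, `download_encoder("mutation_effect", "esm2")` should return a local directory containing only that encoder's files, which keeps the download far smaller than the full dataset.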

## Encoders Included

### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh

### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF, and others

### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision, and others

### Baselines
Random, Length, Torsion, One-hot, BLOSUM

## Dataset Statistics

- **Total size**: 41 GB
- **Total encoders**: 70+
- **Total tasks**: 13
- **Total proteins**: ~500K across all tasks

## Citation

If you use these embeddings, please cite:

```bibtex
@article{protcompass2026,
  title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}
```

## License

MIT License

## Contact

For questions or issues, please open an issue on the repository.