---
base_model: westlake-repl/SaProt_35M_AF2
library_name: peft
license: mit
metrics:
- accuracy
accuracy: 0.68
---
Base model: westlake-repl/SaProt_35M_AF2
Task type: protein-level classification
Dataset: This model classifies proteins into the 6 major EC classes (EC1-EC6). EC7 was excluded because only 31 samples were available.
To address class imbalance, Label 4 (EC5) samples were duplicated twice and Label 5 (EC6) samples were duplicated once in the training set.
Training data were obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
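
The oversampling described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the script used to build the training set. The helper name `oversample` is hypothetical, and reading "duplicated N times" as "N extra copies of each sample" is an assumption.

```python
from collections import Counter

def oversample(samples, extra_copies):
    """Return a new list in which every sample whose label appears in
    `extra_copies` occurs that many additional times."""
    out = []
    for seq, label in samples:
        out.extend([(seq, label)] * (1 + extra_copies.get(label, 0)))
    return out

# Toy (sequence, label) pairs; the sequences are invented for illustration.
train = [("MKV", 0), ("MLS", 4), ("MTT", 5)]

# Assumption: Label 4 gets 2 extra copies, Label 5 gets 1 extra copy.
balanced = oversample(train, extra_copies={4: 2, 5: 1})
counts = Counter(label for _, label in balanced)
print(counts)
```

Duplicating only training samples keeps the validation and test sets at their natural class frequencies.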
Label mapping:
Label 0: Oxidoreductase (EC1)
Label 1: Transferase (EC2)
Label 2: Hydrolase (EC3)
Label 3: Lyase (EC4)
Label 4: Isomerase (EC5)
Label 5: Ligase (EC6)
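
For decoding predictions, the mapping above can be written as a small lookup table. The names `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative and are not taken from the model's code.

```python
# Class index -> enzyme class, copied from the label mapping above.
ID2LABEL = {
    0: "Oxidoreductase (EC1)",
    1: "Transferase (EC2)",
    2: "Hydrolase (EC3)",
    3: "Lyase (EC4)",
    4: "Isomerase (EC5)",
    5: "Ligase (EC6)",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def decode(pred_id: int) -> str:
    """Translate a predicted class index into its EC class name."""
    return ID2LABEL[pred_id]

print(decode(2))
```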
Training set distribution:
- Label 0: 1497 (28.5%)
- Label 2: 1217 (23.2%)
- Label 1: 1050 (20.0%)
- Label 3: 512 (9.7%)
- Label 4: 496 (9.4%)
- Label 5: 483 (9.2%)
Total: 5255 samples
Validation set distribution:
- Label 0: 187 (32.0%)
- Label 2: 152 (26.0%)
- Label 1: 131 (22.4%)
- Label 3: 64 (10.9%)
- Label 4: 31 (5.3%)
- Label 5: 20 (3.4%)
Total: 585 samples
Test set distribution:
- Label 0: 188 (31.8%)
- Label 2: 153 (25.9%)
- Label 1: 132 (22.3%)
- Label 3: 65 (11.0%)
- Label 4: 32 (5.4%)
- Label 5: 21 (3.6%)
Total: 591 samples
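
The percentages in these lists are each class count divided by the split total. A short sketch of that arithmetic, using the validation counts listed above (the function name `class_shares` is illustrative):

```python
# Validation set counts per label, copied from the list above.
val_counts = {0: 187, 1: 131, 2: 152, 3: 64, 4: 31, 5: 20}

def class_shares(counts):
    """Return each class's share of the split in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

val_total = sum(val_counts.values())
val_shares = class_shares(val_counts)
print(val_total, val_shares)
```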
Model input type: Amino acid sequence
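
SaProt's vocabulary pairs each residue with a structure token. When only an amino acid sequence is available, the usual SaProt convention is to fill the structure slot with the placeholder `#`. The sketch below assumes this model expects that encoding; the helper name `aa_to_saprot_tokens` is hypothetical and is not part of any library.

```python
def aa_to_saprot_tokens(aa_seq: str, mask_token: str = "#") -> str:
    """Append a structure placeholder to every residue, e.g. "MEV" -> "M#E#V#".

    Assumption: the tokenizer treats "<residue>#" as a residue with
    unknown (masked) structure.
    """
    return "".join(residue + mask_token for residue in aa_seq.upper())

print(aa_to_saprot_tokens("MEV"))
```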
Performance (on test set): 0.68 accuracy
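
For context, the reported accuracy can be compared with a majority-class baseline computed from the test set counts above. This is a back-of-the-envelope check, not an additional evaluation of the model.

```python
# Test set counts per label, copied from the test set distribution.
test_counts = {0: 188, 1: 132, 2: 153, 3: 65, 4: 32, 5: 21}

total = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = test_counts[majority_label] / total
reported_accuracy = 0.68  # value stated in this card

print(f"majority baseline: {majority_baseline:.3f}, reported: {reported_accuracy}")
```

Always predicting Label 0 would score about 0.32 on this split, so 0.68 is roughly twice the trivial baseline.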
LoRA config:
r: 8
lora_dropout: 0.1
lora_alpha: 16
target_modules: ["key", "value", "output.dense", "intermediate.dense", "query"]
modules_to_save: ["classifier"]
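
With the `peft` library named in the metadata, these settings could be expressed roughly as below. This is a sketch rather than the original training script: the keyword arguments mirror the values listed above, while the wrapper `build_lora_config` is a hypothetical helper that imports `peft` lazily so the settings can be inspected without it installed.

```python
# LoRA hyperparameters, copied from the config above.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["key", "value", "output.dense", "intermediate.dense", "query"],
    "modules_to_save": ["classifier"],
}

def build_lora_config():
    """Create a peft.LoraConfig from the settings (requires `pip install peft`)."""
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)

# LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)
```

Listing `classifier` under `modules_to_save` means the classification head is trained in full and stored with the adapter instead of being adapted through low-rank matrices.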
Training config:
- optimizer:
  - class: AdamW
  - betas: (0.9, 0.98)
  - weight_decay: 0.01
- learning rate: 0.0005
- epoch: 25
- batch size: 64
- precision: 16-mixed
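
A minimal sketch of how these settings translate to PyTorch, assuming the learning rate is passed directly to AdamW. The helper `build_optimizer` is hypothetical, and the step count assumes a single device, no gradient accumulation, and that the final partial batch is kept.

```python
import math

# Optimizer and schedule settings, copied from the training config above.
OPTIMIZER_KWARGS = {"lr": 0.0005, "betas": (0.9, 0.98), "weight_decay": 0.01}
MAX_EPOCHS = 25
BATCH_SIZE = 64
PRECISION = "16-mixed"  # mixed-precision string as used by PyTorch Lightning
TRAIN_SAMPLES = 5255    # training set total from this card

def build_optimizer(params):
    """Create the AdamW optimizer (requires `pip install torch`)."""
    import torch
    return torch.optim.AdamW(params, **OPTIMIZER_KWARGS)

# Rough training length under the assumptions stated above.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS
print(steps_per_epoch, total_steps)
```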