| --- |
| base_model: westlake-repl/SaProt_35M_AF2 |
| library_name: peft |
| license: mit |
| metrics: |
| - accuracy |
| accuracy: 0.68 |
| --- |
| |
| Base model: westlake-repl/SaProt_35M_AF2 |
|
|
| Task type: protein-level classification |
|
|
| Dataset: This model classifies proteins into the 6 major EC classes (EC1-EC6). EC7 (translocases) was excluded because only 31 samples were available.
| To address class imbalance, training examples for Label 4 (EC5) were duplicated 2 times and those for Label 5 (EC6) were duplicated 1 time.
| Training data was obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
|
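The duplication step described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the `oversample` helper, the toy sequences, and the reading of "duplicated n times" as "n extra copies" are all assumptions.

```python
from collections import Counter


def oversample(examples, extra_copies):
    """Return a new list in which every example whose label appears in
    `extra_copies` occurs 1 + extra_copies[label] times.

    `examples` is a list of (sequence, label) pairs.
    """
    out = []
    for seq, label in examples:
        out.extend([(seq, label)] * (1 + extra_copies.get(label, 0)))
    return out


# Toy training split with hypothetical sequences (not real data).
train = [("MKT", 0), ("GAV", 0), ("LLS", 4), ("PQR", 5)]

# Assumed interpretation: Label 4 gets 2 extra copies, Label 5 gets 1.
balanced = oversample(train, {4: 2, 5: 1})
counts = Counter(label for _, label in balanced)
print(dict(counts))
```

Only the training split is passed through the helper, so validation and test examples keep their natural class frequencies.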
|
| Label mapping: |
| Label 0: Oxidoreductase (EC1) |
| Label 1: Transferase (EC2) |
| Label 2: Hydrolase (EC3) |
| Label 3: Lyase (EC4) |
| Label 4: Isomerase (EC5) |
| Label 5: Ligase (EC6) |
|
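For readers who post-process raw model outputs, the label mapping above can be applied like this. The `decode` helper and the example logits are hypothetical; only the mapping itself comes from this card.

```python
# Label mapping copied from the model card.
ID2LABEL = {
    0: "Oxidoreductase (EC1)",
    1: "Transferase (EC2)",
    2: "Hydrolase (EC3)",
    3: "Lyase (EC4)",
    4: "Isomerase (EC5)",
    5: "Ligase (EC6)",
}


def decode(logits):
    """Map one row of 6 class logits to (label id, EC class name)."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    # Index of the largest logit (argmax) without needing NumPy.
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, ID2LABEL[best]


# Hypothetical logits for a single protein.
label_id, label_name = decode([0.1, 0.3, 2.5, -1.0, 0.0, 0.2])
print(label_id, label_name)
```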
|
| Training set distribution: |
| - Label 0: 1497 (28.5%) |
| - Label 2: 1217 (23.2%) |
| - Label 1: 1050 (20.0%)
| - Label 3: 512 (9.7%) |
| - Label 4: 496 (9.4%) |
| - Label 5: 483 (9.2%) |
| Total: 5255 samples |
|
|
| Validation set distribution: |
| - Label 0: 187 (32.0%) |
| - Label 2: 152 (26.0%) |
| - Label 1: 131 (22.4%) |
| - Label 3: 64 (10.9%) |
| - Label 4: 31 (5.3%) |
| - Label 5: 20 (3.4%) |
| Total: 585 samples |
|
|
| Test set distribution: |
| - Label 0: 188 (31.8%) |
| - Label 2: 153 (25.9%) |
| - Label 1: 132 (22.3%) |
| - Label 3: 65 (11.0%) |
| - Label 4: 32 (5.4%) |
| - Label 5: 21 (3.6%)
| Total: 591 samples |
|
|
| Model input type: Amino acid sequence |
|
|
| Performance (on test set): 0.68 accuracy |
|
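To put the reported accuracy in context, it can help to compare it with a majority-class baseline computed from the test set counts listed above. The sketch below only uses those counts; the baseline is an illustrative reference point, not something the author reports.

```python
# Test set counts per label, copied from the model card.
TEST_COUNTS = {0: 188, 1: 132, 2: 153, 3: 65, 4: 32, 5: 21}
REPORTED_ACCURACY = 0.68

total = sum(TEST_COUNTS.values())

# A classifier that always predicts the most frequent class.
majority_label, majority_count = max(TEST_COUNTS.items(), key=lambda kv: kv[1])
majority_baseline = majority_count / total

# A classifier that guesses uniformly among the 6 classes.
uniform_baseline = 1 / len(TEST_COUNTS)

print(f"test samples:       {total}")
print(f"majority baseline:  {majority_baseline:.3f} (always Label {majority_label})")
print(f"uniform baseline:   {uniform_baseline:.3f}")
print(f"reported accuracy:  {REPORTED_ACCURACY:.3f}")
```

Because the test set is imbalanced, plain accuracy is dominated by Labels 0 to 2; per-class metrics would be needed to judge performance on EC5 and EC6.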
|
| LoRA config: |
| r: 8 |
| lora_dropout: 0.1 |
| lora_alpha: 16 |
| target_modules: ["key", "value", "output.dense", "intermediate.dense", "query"] |
| modules_to_save: ["classifier"] |
| |
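The LoRA settings above could be expressed with the `peft` library roughly as follows. This is a sketch under assumptions: `task_type=TaskType.SEQ_CLS` is inferred from the protein-level classification task and is not stated in the card, and `build_lora_config` is a hypothetical helper name.

```python
# Values copied from the LoRA config in this card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["key", "value", "output.dense", "intermediate.dense", "query"],
    "modules_to_save": ["classifier"],
}

# LoRA scales its low-rank update by lora_alpha / r.
lora_scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build a peft LoraConfig from the values above.

    peft is imported lazily so the constants can be inspected without it.
    The task type is an assumption, not taken from the card.
    """
    from peft import LoraConfig, TaskType

    return LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_KWARGS)


print(f"LoRA scaling factor: {lora_scaling}")
```

Listing `classifier` under `modules_to_save` means the classification head is trained in full and stored with the adapter, rather than being adapted through low-rank matrices.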
| Training config: |
| optimizer: |
| class: AdamW |
| betas: (0.9, 0.98) |
| weight_decay: 0.01 |
| learning rate: 0.0005 |
| epoch: 25 |
| batch size: 64 |
| precision: 16-mixed |
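
As a rough guide to the size of this training run, the settings above imply the following number of optimizer steps. The sketch assumes the last partial batch is kept and that there is no gradient accumulation, neither of which the card states; `build_optimizer` is a hypothetical helper showing how the optimizer settings map onto `torch.optim.AdamW`.

```python
import math

# Values copied from the card.
TRAIN_SIZE = 5255
BATCH_SIZE = 64
EPOCHS = 25
LEARNING_RATE = 0.0005
BETAS = (0.9, 0.98)
WEIGHT_DECAY = 0.01

# Assumes the final partial batch is kept (no drop_last) and no accumulation.
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def build_optimizer(params):
    """Create an AdamW optimizer with the card's hyperparameters.

    torch is imported lazily so the arithmetic above runs without it.
    """
    import torch

    return torch.optim.AdamW(
        params, lr=LEARNING_RATE, betas=BETAS, weight_decay=WEIGHT_DECAY
    )


print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```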