---
language: en
tags:
- protein
- protbert
- masked-language-modeling
- bioinformatics
- sequence-prediction
datasets:
- custom
license: mit
library_name: transformers
pipeline_tag: fill-mask
---

# ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT optimized for unmasking protein sequences. Given a sequence with masked positions, it predicts the amino acid at each masked position from the surrounding context.

## Model Description

- **Base Model**: ProtBERT
- **Task**: Protein Sequence Unmasking
- **Training**: Fine-tuned on masked protein sequences
- **Use Case**: Predicting missing or masked amino acids in protein sequences
- **Optimal Use**: Best performance on E. coli sequences with known amino acids K, C, Y, H, S, M

For detailed information about the training methodology and approach, please refer to our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)

## Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example usage for E. coli sequence with known amino acids (K,C,Y,H,S,M).
# Assumption: like the base ProtBERT tokenizer, this one expects residues
# separated by spaces, so the sequence is written one token per residue.
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Logits have shape (batch, sequence_length, vocab_size)
predictions = outputs.logits
```

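The base ProtBERT tokenizer generally expects one residue per whitespace-separated token, so a compact string such as `MALN[MASK]KFGP` usually has to be spaced out before tokenization. The following is a minimal sketch of that preprocessing step. The helper name `to_spaced_sequence` is hypothetical, and the sketch assumes this fine-tuned model keeps the base tokenizer's convention.

```python
import re

# Hypothetical helper, not part of the model's published API.
# Matches either a literal [MASK] token or a single uppercase residue letter.
_TOKEN_PATTERN = re.compile(r"\[MASK\]|[A-Z]")


def to_spaced_sequence(sequence: str) -> str:
    """Insert spaces between residues, keeping each [MASK] as one token."""
    compact = "".join(sequence.split())  # drop any existing whitespace
    tokens = _TOKEN_PATTERN.findall(compact)
    # Fail loudly if anything other than residues or [MASK] was present
    if "".join(tokens) != compact:
        raise ValueError(f"Unexpected characters in sequence: {sequence!r}")
    return " ".join(tokens)


print(to_spaced_sequence("MALN[MASK]KFGP[MASK]LVRK"))
# M A L N [MASK] K F G P [MASK] L V R K
```

The returned string can be passed to the tokenizer in place of the raw sequence.
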
## Inference API

The model is optimized for:
- **Organism**: E. coli
- **Known Amino Acids**: K, C, Y, H, S, M
- **Task**: Predicting unknown amino acids in a sequence

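The setting described in this list, where only some residue types are known, can be reproduced from a fully known sequence by replacing every residue outside the known set with `[MASK]`, for example to build test inputs. The sketch below is illustrative only: the function name `mask_unknown_residues` is hypothetical, the default known set is taken from the list above, and the space-separated output assumes the base ProtBERT tokenizer convention. The paper remains the reference for the exact input format used in training.

```python
# Known residue types, taken from the model card's list
KNOWN_RESIDUES = frozenset("KCYHSM")


def mask_unknown_residues(sequence, known=KNOWN_RESIDUES, mask_token="[MASK]"):
    """Return a space-separated string where residues outside `known` become mask_token."""
    return " ".join(
        residue if residue in known else mask_token
        for residue in sequence.upper()
    )


print(mask_unknown_residues("KAYHSL"))
# K [MASK] Y H S [MASK]
```
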
Example API usage:
```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
# Example with known amino acids K,Y,H,S (space-separated, assuming the ProtBERT tokenizer convention)
sequence = "K [MASK] Y H S [MASK]"
results = unmasker(sequence)

# With more than one [MASK], the pipeline returns one list of candidates per mask
for position, candidates in enumerate(results, start=1):
    for result in candidates:
        print(f"Mask {position}: predicted amino acid: {result['token_str']}, Score: {result['score']:.3f}")
```

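To turn pipeline output into a completed sequence, each `[MASK]` can be replaced with its highest-scoring candidate. The sketch below assumes the output shape the `fill-mask` pipeline uses for inputs with several masks: one list of candidate dicts per mask, each with `token_str` and `score`. The helper name `fill_top_predictions` is hypothetical, and the sample scores are invented for illustration, not real model output.

```python
def fill_top_predictions(masked_sequence, results, mask_token="[MASK]"):
    """Replace each mask token, left to right, with its highest-scoring candidate."""
    # A single-mask input yields a flat list of dicts; normalize to a list of lists
    if results and isinstance(results[0], dict):
        results = [results]
    tokens = masked_sequence.split()
    if tokens.count(mask_token) != len(results):
        raise ValueError("Number of masks and number of result lists differ")
    per_mask = iter(results)
    filled = []
    for token in tokens:
        if token == mask_token:
            best = max(next(per_mask), key=lambda candidate: candidate["score"])
            filled.append(best["token_str"])
        else:
            filled.append(token)
    return "".join(filled)


# Illustrative stand-in for pipeline output (scores are invented)
fake_results = [
    [{"token_str": "A", "score": 0.61}, {"token_str": "L", "score": 0.20}],
    [{"token_str": "G", "score": 0.15}, {"token_str": "M", "score": 0.48}],
]
print(fill_top_predictions("K [MASK] Y H S [MASK]", fake_results))
# KAYHSM
```

Each mask is filled independently here, so the result is the per-position best guess and not necessarily the most likely joint completion.
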
## Limitations and Biases

- This model is specifically designed for protein sequence unmasking in E. coli
- Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
- The model may not perform optimally for:
  - Sequences from other organisms
  - Sequences without the specified known amino acids
  - Other protein-related tasks

## Training Details

The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)