---
language: en
library_name: transformers
pipeline_tag: text-generation
tags:
- t5
- molecule-to-protein
- smiles
- protein-generation
- binder
- ligand
license: apache-2.0
datasets:
- AI4PD/Mol2Pro-Binder-Dataset
---

# Mol2Pro-base

## Model description

- **Architecture:** T5-efficient-base https://huggingface.co/google/t5-efficient-base
- **Tokenization:** https://huggingface.co/AI4PD/Mol2Pro-tokenizer
- **Code:** https://github.com/AI4PDLab/Mol2Pro
- **Training data:** https://huggingface.co/datasets/AI4PD/Mol2Pro-Binder-Dataset
- **Paper:** https://doi.org/10.64898/2026.02.06.704305

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "AI4PD/Mol2Pro-base"
tokenizer_id = "AI4PD/Mol2Pro-tokenizer"

# Load tokenizers
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

# Load model
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
```
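
The snippet above only loads the model and the two tokenizers. A minimal generation sketch follows, assuming (from the tokenizer subfolder names) that `tokenizer_mol` encodes the SMILES input and `tokenizer_aa` decodes the generated protein. The helper names `generate_candidates` and `clean_sequence`, the sampling settings (`num_candidates`, `max_new_tokens`, `top_p`), and the whitespace clean-up are illustrative assumptions, not values taken from the paper; the code repository linked above remains the reference for the authors' actual pipeline.

```python
def clean_sequence(decoded: str) -> str:
    """Drop whitespace that a tokenizer may insert between residue tokens."""
    return "".join(decoded.split())


def generate_candidates(model, tokenizer_mol, tokenizer_aa, smiles,
                        num_candidates=4, max_new_tokens=256, top_p=0.95):
    """Sample candidate protein sequences for one SMILES string.

    All sampling defaults are illustrative assumptions, not tuned values.
    """
    # Encode the ligand with the SMILES tokenizer.
    enc = tokenizer_mol(smiles, return_tensors="pt")
    # Pass tensors explicitly so extra tokenizer outputs (e.g. token_type_ids)
    # are not forwarded to generate().
    output_ids = model.generate(
        input_ids=enc["input_ids"],
        attention_mask=enc.get("attention_mask"),
        do_sample=True,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_candidates,
    )
    # Decode with the amino-acid tokenizer and strip special tokens.
    decoded = tokenizer_aa.batch_decode(output_ids, skip_special_tokens=True)
    return [clean_sequence(text) for text in decoded]


# Example call, reusing the objects loaded in "How to use" (aspirin as ligand):
# candidates = generate_candidates(
#     model, tokenizer_mol, tokenizer_aa, "CC(=O)OC1=CC=CC=C1C(=O)O"
# )
# for seq in candidates:
#     print(seq)
```

Because the sketch samples (`do_sample=True`), each call returns a different set of candidates; call `torch.manual_seed` beforehand if reproducible output is needed.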

## Intended use

Research use only. The model generates candidate protein sequences conditioned on small-molecule inputs. Generated sequences are not guaranteed to bind or function and must be validated experimentally.

## Citation

If you find this work useful, please cite:

```bibtex
@article{VicenteSola2026Generalise,
  title   = {Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data},
  author  = {Vicente-Sola, Alex and Dornfeld, Lars and Coines, Joan and Ferruz, Noelia},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.02.06.704305},
}
```