---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: token-classification
---
[Paper](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[GitHub](https://github.com/omicsNLP/microbELP)
[License](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP – Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model enables automated extraction of microbiome names from unstructured text, facilitating microbiome-related text mining and literature curation.

We also provide a Named Entity Normalisation model on Hugging Face:

[omicsNLP/microbELP_NEN](https://huggingface.co/omicsNLP/microbELP_NEN)

---

## 🚀 Quick Start (Hugging Face)

You can directly load and run the model with the Hugging Face `transformers` pipeline:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("omicsNLP/microbELP_NER")
model = AutoModelForTokenClassification.from_pretrained("omicsNLP/microbELP_NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "The first microbiome I learned about is called Helicobacter pylori."
ner_results = nlp(example)

print(ner_results)
```

Output:

```
[
  {'entity': 'LABEL_0', 'score': 0.9954, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
  ...
  {'entity': 'LABEL_1', 'score': 0.9889, 'index': 11, 'word': 'he', 'start': 47, 'end': 49},
  {'entity': 'LABEL_2', 'score': 0.9710, 'index': 16, 'word': 'p', 'start': 60, 'end': 61},
  ...
]
```

where:
- LABEL_0 → Outside (O)
- LABEL_1 → Begin-microbiome (B-microbiome)
- LABEL_2 → Inside-microbiome (I-microbiome)
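
Because the labels are exposed as `LABEL_0`/`LABEL_1`/`LABEL_2` rather than `B-`/`I-`-prefixed names, the pipeline's built-in `aggregation_strategy` may not group sub-tokens into full mentions. A minimal manual merge might look like the sketch below; it assumes tokens arrive in document order with the `entity`/`start`/`end` fields shown in the example output, and the `tokens` list is hypothetical stand-in data, not real model output:

```python
# Minimal sketch: merge token-level BIO predictions into character-level spans.
def merge_bio_entities(text, token_results):
    entities, current = [], None
    for tok in token_results:
        if tok["entity"] == "LABEL_1":            # B-microbiome: open a new span
            if current is not None:
                entities.append(current)
            current = {"start": tok["start"], "end": tok["end"]}
        elif tok["entity"] == "LABEL_2" and current is not None:
            current["end"] = tok["end"]           # I-microbiome: extend the open span
        else:                                     # O (LABEL_0): close any open span
            if current is not None:
                entities.append(current)
                current = None
    if current is not None:
        entities.append(current)
    # Slice the original text so the surface form matches the character offsets.
    return [{"entity": text[e["start"]:e["end"]], **e} for e in entities]

text = "The first microbiome I learned about is called Helicobacter pylori."
# Hypothetical token predictions (only the fields used above are shown).
tokens = [
    {"entity": "LABEL_1", "start": 47, "end": 59},  # "Helicobacter"
    {"entity": "LABEL_2", "start": 60, "end": 66},  # "pylori"
]
print(merge_bio_entities(text, tokens))
# [{'entity': 'Helicobacter pylori', 'start': 47, 'end': 66}]
```

For production use, the `microbELP` package described below performs this aggregation for you.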

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:

```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

Installing into an isolated environment (e.g. a virtual environment) is recommended because of the package's dependencies.

Example usage:

```python
from microbELP import microbiome_DL_ner

input_text = "The first microbiome I learned about is called Helicobacter pylori."
print(microbiome_DL_ner(input_text))
```

Output:

```python
[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

You can also process a list of texts for batch inference:

```python
input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale."
]
print(microbiome_DL_ner(input_list))
```

Output:

```python
[
    [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
    [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]
```

Each element in the output corresponds to one input text, containing recognised microbiome entities and their text locations.
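
Since each `locations` entry gives a character `offset` and `length` into the input string, a mention can be recovered by direct slicing. A small sketch using the documented single-text example (the `result` list is hard-coded here for illustration rather than recomputed by the model):

```python
text = "The first microbiome I learned about is called Helicobacter pylori."
# Output shape as documented above, hard-coded for illustration.
result = [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]

for ent in result:
    loc = ent['locations']
    span = text[loc['offset']:loc['offset'] + loc['length']]
    assert span == ent['Entity']  # offsets index directly into the input string
    print(span)                   # Helicobacter pylori
```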

The function has one optional boolean parameter, `cpu` (default `False`), meaning inference runs on a GPU when one is available. To force CPU inference, call `microbiome_DL_ner(input_list, cpu=True)`.

---

## 📊 Model Details

The table below summarises the key properties of this model.

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Recognition (NER)         |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based token classification |
| **Framework**     | Hugging Face 🤗 Transformers           |
| **Optimised for** | CPU and GPU inference                  |

---

## 📖 Citation

If you find this repository useful, please consider giving it a like ❤️ and a citation 📝:

```bibtex
@article{Patel2025.08.29.671515,
  author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
  title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
  elocation-id = {2025.08.29.671515},
  year = {2025},
  doi = {10.1101/2025.08.29.671515},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
  eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
  journal = {bioRxiv}
}
```

---

## 🔗 Resources

The table below lists further resources associated with this model.

| Property           | Description |
| ------------------ | ----------- |
| **GitHub Project** | <img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/> |
| **Paper**          | [doi:10.1101/2025.08.29.671515](https://doi.org/10.1101/2025.08.29.671515) |
| **Data**           | [doi:10.5281/zenodo.17305411](https://doi.org/10.5281/zenodo.17305411) |
| **Codiet**         | [www.codiet.eu](https://www.codiet.eu) |

---

## ⚖️ License

This model and code are released under the MIT License.