---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: token-classification
---
[Paper](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[GitHub](https://github.com/omicsNLP/microbELP)
[License](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP – Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model enables automated extraction of microbiome names from unstructured text, facilitating microbiome-related text mining and literature curation.

We also provide a Named Entity Normalisation model on Hugging Face:

[omicsNLP/microbELP_NEN](https://huggingface.co/omicsNLP/microbELP_NEN)

---

## 🚀 Quick Start (Hugging Face)

You can directly load and run the model with the Hugging Face `transformers` pipeline:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("omicsNLP/microbELP_NER")
model = AutoModelForTokenClassification.from_pretrained("omicsNLP/microbELP_NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "The first microbiome I learned about is called Helicobacter pylori."
ner_results = nlp(example)

print(ner_results)
```

Output:

```
[
  {'entity': 'LABEL_0', 'score': 0.9954, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
  ...
  {'entity': 'LABEL_1', 'score': 0.9889, 'index': 11, 'word': 'he', 'start': 47, 'end': 49},
  {'entity': 'LABEL_2', 'score': 0.9710, 'index': 16, 'word': 'p', 'start': 60, 'end': 61},
  ...
]
```

where:
- LABEL_0 → Outside (O)
- LABEL_1 → Begin-microbiome (B-microbiome)
- LABEL_2 → Inside-microbiome (I-microbiome)
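
Because the labels are exposed as `LABEL_0`/`LABEL_1`/`LABEL_2` rather than `B-`/`I-`-prefixed names, the pipeline's built-in `aggregation_strategy` may not group sub-tokens into full mentions. A minimal manual merge might look like the sketch below; it assumes tokens arrive in document order with the `entity`/`start`/`end` fields shown in the example output, and the `tokens` list is hypothetical stand-in data, not real model output:

```python
# Minimal sketch: merge token-level BIO predictions into character-level spans.
def merge_bio_entities(text, token_results):
    entities, current = [], None
    for tok in token_results:
        if tok["entity"] == "LABEL_1":            # B-microbiome: open a new span
            if current is not None:
                entities.append(current)
            current = {"start": tok["start"], "end": tok["end"]}
        elif tok["entity"] == "LABEL_2" and current is not None:
            current["end"] = tok["end"]           # I-microbiome: extend the open span
        else:                                     # O (LABEL_0): close any open span
            if current is not None:
                entities.append(current)
                current = None
    if current is not None:
        entities.append(current)
    # Slice the original text so the surface form matches the character offsets.
    return [{"entity": text[e["start"]:e["end"]], **e} for e in entities]

text = "The first microbiome I learned about is called Helicobacter pylori."
# Hypothetical token predictions (only the fields used above are shown).
tokens = [
    {"entity": "LABEL_1", "start": 47, "end": 59},  # "Helicobacter"
    {"entity": "LABEL_2", "start": 60, "end": 66},  # "pylori"
]
print(merge_bio_entities(text, tokens))
# [{'entity': 'Helicobacter pylori', 'start': 47, 'end': 66}]
```

For production use, the `microbELP` package described below performs this aggregation for you.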

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:

```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

Installing into an isolated environment (e.g. a virtual environment) is recommended because of the package's dependencies.

Example usage:

```python
from microbELP import microbiome_DL_ner

input_text = "The first microbiome I learned about is called Helicobacter pylori."
print(microbiome_DL_ner(input_text))
```

Output:

```python
[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

You can also process a list of texts for batch inference:

```python
input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale."
]
print(microbiome_DL_ner(input_list))
```

Output:

```python
[
    [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
    [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]
```

Each element in the output corresponds to one input text, containing recognised microbiome entities and their text locations.
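
Since each `locations` entry gives a character `offset` and `length` into the input string, a mention can be recovered by direct slicing. A small sketch using the documented single-text example (the `result` list is hard-coded here for illustration rather than recomputed by the model):

```python
text = "The first microbiome I learned about is called Helicobacter pylori."
# Output shape as documented above, hard-coded for illustration.
result = [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]

for ent in result:
    loc = ent['locations']
    span = text[loc['offset']:loc['offset'] + loc['length']]
    assert span == ent['Entity']  # offsets index directly into the input string
    print(span)                   # Helicobacter pylori
```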

The function has one optional boolean parameter, `cpu` (default `False`), meaning inference runs on a GPU when one is available. To force CPU inference, call `microbiome_DL_ner(input_list, cpu=True)`.

---

## 📊 Model Details

The table below summarises the key properties of this model.

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Recognition (NER)         |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based token classification |
| **Framework**     | Hugging Face 🤗 Transformers           |
| **Optimised for** | CPU and GPU inference                  |

---

## 📖 Citation

If you find this repository useful, please consider giving it a like ❤️ and a citation 📝:

```bibtex
@article{Patel2025.08.29.671515,
  author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
  title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
  elocation-id = {2025.08.29.671515},
  year = {2025},
  doi = {10.1101/2025.08.29.671515},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
  eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
  journal = {bioRxiv}
}
```

---

## 🔗 Resources

The table below lists further resources associated with this model.

| Property           | Description |
| ------------------ | ----------- |
| **GitHub Project** | <img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/> |
| **Paper**          | [doi:10.1101/2025.08.29.671515](https://doi.org/10.1101/2025.08.29.671515) |
| **Data**           | [doi:10.5281/zenodo.17305411](https://doi.org/10.5281/zenodo.17305411) |
| **Codiet**         | [www.codiet.eu](https://www.codiet.eu) |

---

## ⚖️ License

This model and code are released under the MIT License.