| --- |
| license: apache-2.0 |
| language: |
| - en |
| base_model: |
| - aaditya/Llama3-OpenBioLLM-8B |
| --- |
| |
| # OpenBioLLM-Text2Graph-8B |
|
|
| This model is a biomedical annotation model designed to generate named entity annotations from unlabeled biomedical text. |
| It was introduced in the paper [GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition](https://arxiv.org/abs/2504.00676). |
|
|
| This model enables **high-throughput, cost-efficient synthetic biomedical NER data generation**, serving as the synthetic annotation backbone for [GLiNER-BioMed models](https://huggingface.co/collections/knowledgator/gliner-biomed-67ecf1b7cc62e673dbc8b57f). |
|
|
|
|
| ## Usage |
|
|
| To use the model with `transformer` package, see the example below: |
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| model_name = "Ihor/OpenBioLLM-Text2Graph-8B" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_name) |
| tokenizer.chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|end_of_text|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}" |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| model_name, |
| device_map="auto", |
| torch_dtype=torch.bfloat16 |
| ) |
| |
| |
| MESSAGES = [ |
| { |
| "role": "system", |
| "content": ( |
| "You are an advanced assistant trained to process biomedical text for Named Entity Recognition (NER) and Relation Extraction (RE). " |
| "Your task is to analyze user-provided text, identify all unique and contextually relevant entities, and infer directed relationships " |
| "between these entities based on the context. Ensure that all relations exist only between annotated entities. " |
| "Entities and relationships should be human-readable and natural, reflecting real-world concepts and connections. " |
| "Output the annotated data in JSON format, structured as follows:\n\n" |
| """{"entities": [{"id": 0, "text": "ner_string_0", "type": "ner_type_string_0"}, {"id": 1, "text": "ner_string_1", "type": "ner_type_string_1"}], "relations": [{"head": 0, "tail": 1, "type": "re_type_string_0"}]}""" |
| "\n\nEnsure that the output captures all significant entities and their directed relationships in a clear and concise manner." |
| ), |
| }, |
| { |
| "role": "user", |
| "content": ( |
| 'Here is a text input: "Subjects will receive a 100mL dose of IV saline every 6 hours for 24 hours. The first dose will be administered prior to anesthesia induction, approximately 30 minutes before skin incision. A total of 4 doses will be given." ' |
| "Analyze this text, select and classify the entities, and extract their relationships as per your instructions." |
| ), |
| }, |
| ] |
| |
| # Build prompt text |
| chat_prompt = tokenizer.apply_chat_template( |
| MESSAGES, tokenize=False, add_generation_prompt=True |
| ) |
| |
| # Tokenize |
| inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device) |
| |
| # Generate |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=3000, |
| do_sample=True, |
| eos_token_id=tokenizer.eos_token_id, |
| pad_token_id=tokenizer.eos_token_id, |
| return_dict_in_generate=True |
| ) |
| |
| # Decode ONLY the new tokens (skip the prompt tokens) |
| prompt_len = inputs["input_ids"].shape[-1] |
| generated_ids = outputs.sequences[0][prompt_len:] |
| response = tokenizer.decode(generated_ids, skip_special_tokens=True) |
| print(response) |
| ``` |
|
|
| To use the model with `vllm` package, please refer to the example below: |
| ```python |
| # !pip install vllm |
| |
| from vllm import LLM, SamplingParams |
| from transformers import AutoTokenizer |
| |
| MODEL_ID = "Ihor/OpenBioLLM-Text2Graph-8B" |
| |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True) |
| tokenizer.chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|end_of_text|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}" |
| |
| llm = LLM(model=MODEL_ID) |
| |
| sampling_params = SamplingParams( |
| max_tokens=3000, |
| n=1, |
| best_of=1, |
| presence_penalty=0.0, |
| frequency_penalty=0.0, |
| repetition_penalty=1.0, |
| temperature=0.0, |
| top_p=1.0, |
| top_k=-1, |
| min_p=0.0, |
| seed=42, |
| ) |
| |
| |
| MESSAGES = [ |
| { |
| "role": "system", |
| "content": ( |
| "You are an advanced assistant trained to process biomedical text for Named Entity Recognition (NER) and Relation Extraction (RE). " |
| "Your task is to analyze user-provided text, identify all unique and contextually relevant entities, and infer directed relationships " |
| "between these entities based on the context. Ensure that all relations exist only between annotated entities. " |
| "Entities and relationships should be human-readable and natural, reflecting real-world concepts and connections. " |
| "Output the annotated data in JSON format, structured as follows:\n\n" |
| """{"entities": [{"id": 0, "text": "ner_string_0", "type": "ner_type_string_0"}, {"id": 1, "text": "ner_string_1", "type": "ner_type_string_1"}], "relations": [{"head": 0, "tail": 1, "type": "re_type_string_0"}]}""" |
| "\n\nEnsure that the output captures all significant entities and their directed relationships in a clear and concise manner." |
| ), |
| }, |
| { |
| "role": "user", |
| "content": ( |
| 'Here is a text input: "Subjects will receive a 100mL dose of IV saline every 6 hours for 24 hours. The first dose will be administered prior to anesthesia induction, approximately 30 minutes before skin incision. A total of 4 doses will be given." ' |
| "Analyze this text, select and classify the entities, and extract their relationships as per your instructions." |
| ), |
| }, |
| ] |
| |
| chat_prompt = tokenizer.apply_chat_template( |
| MESSAGES, |
| tokenize=False, |
| add_generation_prompt=True, |
| add_special_tokens=False, |
| ) |
| |
| outputs = llm.generate([chat_prompt], sampling_params) |
| response_text = outputs[0].outputs[0].text |
| print(response_text) |
| ``` |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{yazdani2025glinerbiomedsuiteefficientmodels, |
| title={GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition}, |
| author={Anthony Yazdani and Ihor Stepanov and Douglas Teodoro}, |
| year={2025}, |
| eprint={2504.00676}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL}, |
| url={https://arxiv.org/abs/2504.00676}, |
| } |
| ``` |