|
|
| # Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`) |
|
|
| This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text. |
|
|
| --- |
|
|
| ## Features |
|
|
| - ✅ Preprocessed dataset with BIO-tagged tokens |
| - ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers |
| - ✅ Token-label alignment |
| - ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1) |
| - ✅ Inference pipeline to extract keyphrases |
| - ✅ CUDA-enabled for GPU acceleration |
|
|
| --- |
|
|
| ## Dataset |
|
|
| **Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) |
|
|
| - Fields: |
|   - `document`: List of tokenized words (already split) |
|   - `doc_bio_tags`: BIO-format labels for keyphrases |
| - Splits: |
|   - `train`: 1000 samples |
|   - `validation`: 500 samples |
|   - `test`: 500 samples |
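
To make the two fields concrete, here is a minimal sketch of how the parallel `document` and `doc_bio_tags` lists map back to keyphrases. It assumes the tags are the bare strings `B`, `I`, and `O`; the record below is hand-written for illustration and is not a real dataset sample, and `bio_to_phrases` is a hypothetical helper name.

```python
from typing import List


def bio_to_phrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect keyphrases from parallel token and BIO-tag lists."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new keyphrase starts here
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the open keyphrase
            current.append(token)
        else:                          # "O" (or a stray "I") closes any open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:                        # flush a phrase that runs to the end
        phrases.append(" ".join(current))
    return phrases


# Hand-written record in the same shape as the dataset fields (not a real sample).
example = {
    "document": ["Information-based", "semantics", "is", "a", "theory", "in",
                 "the", "philosophy", "of", "mind", "."],
    "doc_bio_tags": ["B", "I", "O", "O", "O", "O", "O", "B", "I", "I", "O"],
}

print(bio_to_phrases(example["document"], example["doc_bio_tags"]))
# ['Information-based semantics', 'philosophy of mind']
```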
|
|
| --- |
|
|
| ## Setup & Installation |
|
|
| ```bash |
| git clone https://github.com/your-username/keyphrase-bert-inspec |
| cd keyphrase-bert-inspec |
| |
| pip install -r requirements.txt |
| ``` |
|
|
| ### `requirements.txt` |
|
|
| ```text |
| datasets |
| transformers |
| torch |
| accelerate |
| evaluate |
| seqeval |
| ``` |
|
|
| --- |
|
|
| ## Training |
|
|
| ```python |
| from datasets import load_dataset |
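| from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer |
| from transformers import DataCollatorForTokenClassification  # builds the `data_collator` passed to Trainer |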
| ``` |
|
|
| The training script follows three steps: |
| |
| 1. Load and preprocess data with aligned BIO labels |
| 2. Fine-tune `bert-base-cased` on the dataset |
| 3. Evaluate and save model artifacts |
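
The label alignment in step 1 can be sketched without any dependencies. BERT's WordPiece tokenizer may split one word into several pieces, so word-level tags have to be expanded to piece-level labels. The sketch below assumes the common convention of labeling only the first piece of each word and masking the rest with `-100` (the index PyTorch's cross-entropy loss ignores by default). `align_labels` and `LABEL2ID` are hypothetical names, and the `word_ids` list is written by hand to mimic what a fast tokenizer's `word_ids()` returns.

```python
from typing import List, Optional

LABEL2ID = {"O": 0, "B": 1, "I": 2}   # assumed mapping; any consistent one works
IGNORE_INDEX = -100                    # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]], word_tags: List[str]) -> List[int]:
    """Expand word-level BIO tags to subword-level label ids."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first subword of a word carries the label
            aligned.append(LABEL2ID[word_tags[word_id]])
        else:                          # later subwords of the same word are masked
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hand-written word_ids for three words where the first word is split in two pieces.
word_ids = [None, 0, 0, 1, 2, None]
print(align_labels(word_ids, ["B", "I", "O"]))
# [-100, 1, -100, 2, 0, -100]
```

Labeling every subword instead of only the first is an equally valid choice; it changes how much each long word contributes to the loss.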
|
|
| ### Training Script Overview |
|
|
| ```python |
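| # Assumes `model`, `training_args`, `tokenized_datasets`, `tokenizer`, |
| # `data_collator`, and `compute_metrics` were defined earlier in the script. |
| trainer = Trainer( |
|     model=model, |
|     args=training_args, |
|     train_dataset=tokenized_datasets["train"], |
|     eval_dataset=tokenized_datasets["validation"], |
|     tokenizer=tokenizer,  # newer Transformers releases call this `processing_class` |
|     data_collator=data_collator, |
|     compute_metrics=compute_metrics, |
| ) |
| |
| trainer.train()  # fine-tune on the train split |
| trainer.save_model("keyphrase-bert-inspec")  # writes weights, config, and tokenizer files |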
| ``` |
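
The `compute_metrics` callback passed to `Trainer` is not shown in this README. Its usual first step is turning logits and `-100`-masked label ids into tag strings that `seqeval` can score. Below is a dependency-free sketch of that step only; `decode_batch` and `ID2LABEL` are hypothetical names, and the label order is an assumption that must match whatever mapping was used during preprocessing.

```python
from typing import List, Sequence, Tuple

ID2LABEL = {0: "O", 1: "B", 2: "I"}  # assumed order; must mirror the training label map
IGNORE_INDEX = -100                   # marks special tokens and non-initial subwords


def decode_batch(
    logits: Sequence[Sequence[Sequence[float]]],
    label_ids: Sequence[Sequence[int]],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Convert per-token scores and gold ids into parallel lists of tag strings."""
    predictions, references = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_pred, sent_ref = [], []
        for scores, label in zip(sent_logits, sent_labels):
            if label == IGNORE_INDEX:      # skip positions the loss also skipped
                continue
            best = max(range(len(scores)), key=lambda i: scores[i])  # argmax
            sent_pred.append(ID2LABEL[best])
            sent_ref.append(ID2LABEL[label])
        predictions.append(sent_pred)
        references.append(sent_ref)
    return predictions, references


# Toy batch: one sentence, three scored tokens plus one ignored position.
toy_logits = [[[0.1, 2.0, 0.3], [0.2, 0.1, 3.0], [5.0, 0.1, 0.2], [1.0, 1.0, 1.0]]]
toy_labels = [[1, 2, 0, -100]]
preds, refs = decode_batch(toy_logits, toy_labels)
print(preds)  # [['B', 'I', 'O']]
print(refs)   # [['B', 'I', 'O']]
```

In a real callback the two lists would then be handed to the `seqeval` metric, for example one loaded with `evaluate.load("seqeval")`.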
|
|
| --- |
|
|
| ## Evaluation Metrics |
|
|
| ```python |
| { |
| "precision": 0.84, |
| "recall": 0.81, |
| "f1": 0.825, |
| "accuracy": 0.88 |
| } |
| ``` |
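
Precision, recall, and F1 from `seqeval` are computed over whole keyphrase spans, not individual tokens: a predicted phrase counts as correct only if both of its boundaries match the reference. The following is a simplified, dependency-free illustration of that idea, not `seqeval`'s own implementation. `extract_spans` and `span_prf` are hypothetical names, and stray `I` tags without a preceding `B` are simply ignored here.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for each B/I run."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):   # sentinel closes a trailing span
        if start is not None and tag != "I":
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def span_prf(reference: List[List[str]], predicted: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged span-level precision, recall, and F1."""
    tp = n_pred = n_ref = 0
    for ref_tags, pred_tags in zip(reference, predicted):
        ref_spans, pred_spans = extract_spans(ref_tags), extract_spans(pred_tags)
        tp += len(ref_spans & pred_spans)    # exact boundary matches only
        n_pred += len(pred_spans)
        n_ref += len(ref_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = [["B", "I", "O", "B", "O"]]
predicted = [["B", "I", "O", "O", "B"]]
print(span_prf(reference, predicted))  # (0.5, 0.5, 0.5)
```

F1 is the harmonic mean of precision and recall, which is consistent with the figures reported above: 2 × 0.84 × 0.81 / (0.84 + 0.81) ≈ 0.825.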
|
|
| --- |
|
|
| ## Inference Example |
|
|
| ```python |
| from transformers import pipeline |
| |
| # Load the fine-tuned model and tokenizer from the local output directory. |
| ner_pipeline = pipeline( |
|     "ner", |
|     model="keyphrase-bert-inspec", |
|     tokenizer="keyphrase-bert-inspec", |
|     aggregation_strategy="simple",  # merge subword pieces into grouped spans |
| ) |
| |
| text = "Information-based semantics is a theory in the philosophy of mind." |
| results = ner_pipeline(text) |
| |
| print("Extracted Keyphrases:") |
| for r in results: |
|     print(f"- {r['word']} (score: {r['score']:.2f})") |
| ``` |
|
|
| ### Sample Output |
|
|
| ``` |
| Extracted Keyphrases: |
| - Information-based semantics (score: 0.94) |
| - philosophy of mind (score: 0.91) |
| ``` |
|
|
| --- |
|
|
| ## Model Artifacts |
|
|
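| After training, `trainer.save_model` writes the model and tokenizer to `keyphrase-bert-inspec/`. Recent Transformers releases save the weights as `model.safetensors` rather than `pytorch_model.bin`, so the exact file list depends on the installed version: |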
|
|
| ``` |
| keyphrase-bert-inspec/ |
| ├── config.json |
| ├── pytorch_model.bin |
| ├── tokenizer_config.json |
| └── vocab.txt |
| ``` |
|
|
| --- |
|
|
| ## Future Improvements |
|
|
| - Add postprocessing to group fragmented tokens |
| - Use a larger dataset (like `scientific_keyphrases`) |
| - Convert to a web app using Gradio or Streamlit |
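
The first item above could be approached by merging pipeline spans that touch or are separated by a single character in the input text. This is a rough sketch under that assumption, not a tested part of this repository. `merge_adjacent` and `fragment` are hypothetical helpers, the fragments and scores are invented for illustration, and only the `start`, `end`, and `score` keys of the pipeline output are used.

```python
from typing import Dict, List


def merge_adjacent(text: str, spans: List[Dict], max_gap: int = 1) -> List[Dict]:
    """Merge spans whose character gap is at most `max_gap`; keep the lowest score."""
    merged: List[Dict] = []
    for span in sorted(spans, key=lambda s: s["start"]):
        if merged and span["start"] - merged[-1]["end"] <= max_gap:
            last = merged[-1]
            last["end"] = max(last["end"], span["end"])
            last["score"] = min(last["score"], span["score"])  # conservative confidence
        else:
            merged.append({"start": span["start"], "end": span["end"], "score": span["score"]})
    for item in merged:
        item["word"] = text[item["start"]:item["end"]]  # recover the surface form
    return merged


text = "Information-based semantics is a theory in the philosophy of mind."


def fragment(piece: str, score: float) -> Dict:
    """Build a pipeline-like span dict for the first occurrence of `piece`."""
    start = text.index(piece)
    return {"start": start, "end": start + len(piece), "score": score}


# Invented fragments of the kind a model might emit for this sentence.
fragments = [
    fragment("Information", 0.95),
    fragment("-based", 0.90),
    fragment("semantics", 0.94),
    fragment("philosophy", 0.93),
    fragment("of mind", 0.91),
]

for phrase in merge_adjacent(text, fragments):
    print(f"- {phrase['word']} (score: {phrase['score']:.2f})")
```

A gap-based rule also joins two genuinely separate keyphrases that happen to sit next to each other, so `max_gap` is a trade-off and worth checking against the validation split.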
|
|
| --- |
|
|
| ## Author |
|
|
| **Your Name** |
| GitHub: [@your-username](https://github.com/your-username) |
| Contact: your.email@example.com |
|
|
| --- |
|
|
| ## License |
|
|
| Released under the MIT License. See the `LICENSE` file. |
|
|