	---
	language:
	- en
	- pt
	license: mit
	library_name: transformers
	tags:
	- biology
	- science
	- text-classification
	- nlp
	- biomedical
	- filter
	- roberta
	- medical
	metrics:
	- f1
	- accuracy
	- recall
	datasets:
	- Madras1/BioClass80k
	base_model: roberta-base
	widget:
	- text: The mitochondria is the powerhouse of the cell and generates ATP.
	  example_title: Biology Example 🧬
	- text: The stock market crashed today due to high inflation rates.
	  example_title: Finance Example 💰
	- text: CRISPR-Cas9 technology allows for precise gene editing.
	  example_title: Genetics Example 🔬
	pipeline_tag: text-classification
	---
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
	[![Task: Text Classification](https://img.shields.io/badge/Task-Text%20Classification-blueviolet.svg)](https://huggingface.co/tasks/text-classification)
	[![Language: Python](https://img.shields.io/badge/Language-Python-3776AB.svg?logo=python&logoColor=white)](https://www.python.org/)

	# RobertaBioClass 🧬

	RobertaBioClass is a fine-tuned RoBERTa model that distinguishes biological texts from texts on general topics. It is intended for filtering large datasets and prioritizes high recall so that relevant biological content is rarely missed.

	## Model Details

	- Model Architecture: RoBERTa Base
	- Task: Binary Text Classification
	- Language: English (Portuguese capability depends on the share of Portuguese text in the training data mix)
	- Author: Madras1

	## Performance Metrics 📊

	The model was evaluated on a held-out validation set of ~16k samples. It is optimized for high recall, which suits filtering pipelines where missing a biological text costs more than admitting a false positive.

	| Metric | Score | Description |
	| :--- | :--- | :--- |
	| Accuracy | 86.8% | Overall correctness |
	| F1-Score | 78.5% | Harmonic mean of precision and recall |
	| Recall (Bio) | 83.1% | Ability to find biological texts (Sensitivity) |
	| Precision (Bio) | 74.4% | Correctness when predicting "Bio" |
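As a quick consistency check, the F1-Score can be reproduced from the reported precision and recall. The sketch below uses only the rounded values from the table; the function name is illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values reported in the metrics table
precision = 0.744
recall = 0.831

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.1%}")  # F1 = 78.5%
```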

	## Label Mapping

	The model outputs the following labels:
	* `LABEL_0`: Non-Biology (General text, News, Finance, Sports, etc.)
	* `LABEL_1`: Biology (Genetics, Medicine, Anatomy, Ecology, etc.)
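Downstream code often reads better with descriptive names than with raw label IDs. A minimal sketch of such a translation step follows; the dictionary and helper are illustrative and not part of the model, and the score is made up.

```python
# Readable names for the two raw labels; the names follow the list above.
LABEL_NAMES = {"LABEL_0": "Non-Biology", "LABEL_1": "Biology"}


def readable(prediction: dict) -> dict:
    """Return a copy of a pipeline-style prediction with a readable label."""
    label = prediction["label"]
    return {**prediction, "label": LABEL_NAMES.get(label, label)}


sample = {"label": "LABEL_1", "score": 0.99}  # made-up score for illustration
print(readable(sample))  # {'label': 'Biology', 'score': 0.99}
```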

	## Training Data & Procedure

	### Data Overview
	The dataset consists of approximately 80,000 text samples aggregated from multiple sources.
	* Total Samples: ~79,700
	* Class Balance: The dataset was imbalanced, with ~71% belonging to the "Non-Bio" class and ~29% to the "Bio" class.
	* Preprocessing: Scripts were used to clean delimiter issues in CSVs, remove duplicates, and perform a stratified split for validation.
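The original preprocessing scripts are not shown here, so the following is only a sketch of the deduplication and stratified split steps on synthetic data. It uses a hand-rolled split in plain Python as a stand-in; a library routine such as scikit-learn's `train_test_split` with `stratify=` would be the more usual choice. The 20% validation fraction is an assumption inferred from ~16k validation samples out of ~79.7k.

```python
import random
from collections import defaultdict


def dedupe(rows):
    """Drop exact duplicate (text, label) rows, keeping first occurrences."""
    return list(dict.fromkeys(rows))


def stratified_split(rows, val_fraction=0.2, seed=42):
    """Split rows so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_val = round(len(label_rows) * val_fraction)
        val.extend(label_rows[:n_val])
        train.extend(label_rows[n_val:])
    return train, val


# Synthetic rows with the ~71/29 class balance described above
rows = [(f"general text {i}", 0) for i in range(71)]
rows += [(f"bio text {i}", 1) for i in range(29)]
rows.append(rows[0])  # one duplicate for dedupe() to remove

train, val = stratified_split(dedupe(rows))
print(len(train), len(val))  # 80 20
```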

	### Training Procedure
	To address the class imbalance without discarding valuable data through undersampling, we employed a custom Weighted Cross-Entropy Loss.
	* Class Weights: Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying a general text, which directly contributed to the high Recall score.
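To make the weighting concrete, the sketch below computes weights with the formula scikit-learn documents for its "balanced" mode, `n_samples / (n_classes * count)`, in plain Python. Whether the original training used the "balanced" mode is an assumption, and the label counts are synthetic stand-ins for the ~71/29 split.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


def weighted_nll(prob_of_true_class, true_label, weights):
    """Per-sample weighted cross-entropy term: -w[y] * log(p[y])."""
    return -weights[true_label] * math.log(prob_of_true_class)


labels = [0] * 71 + [1] * 29  # synthetic labels, 71% Non-Bio and 29% Bio
weights = balanced_class_weights(labels)
print({c: round(w, 3) for c, w in weights.items()})  # {0: 0.704, 1: 1.724}

# The same low confidence (p = 0.3) costs more on a Bio sample than on a Non-Bio one.
ratio = weighted_nll(0.3, 1, weights) / weighted_nll(0.3, 0, weights)
print(round(ratio, 2))  # 2.45
```

Under these assumptions a missed Biology sample is penalized about 2.45 times as heavily as a misclassified general text, which is the mechanism behind the recall-oriented behavior described above.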

	### Hyperparameters
	The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:
	* Optimizer: AdamW
	* Learning Rate: 2e-5
	* Batch Size: 16
	* Epochs: 2
	* Weight Decay: 0.01
	* Hardware: Trained on an NVIDIA T4 GPU
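For reference, the values above map onto `TrainingArguments` parameter names roughly as follows. This is a sketch, not the original training script: the output directory is hypothetical, and treating the batch size of 16 as a per-device value is an assumption based on the single-GPU setup.

```python
# Keyword arguments mirroring the hyperparameter list, keyed by
# Hugging Face TrainingArguments parameter names.
training_kwargs = {
    "output_dir": "robertabioclass-out",  # hypothetical path, not from the model card
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,    # assumption: 16 is the per-device batch size
    "num_train_epochs": 2,
    "weight_decay": 0.01,
}

# With transformers installed, these could be passed on as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
# The Trainer uses an AdamW variant by default, so no optimizer argument is set here.
print(training_kwargs)
```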

	## How to Use

	You can use this model directly with the Hugging Face `pipeline`:

	```python
	from transformers import pipeline

	# Load the pipeline
	classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

	# Test strings
	examples = [
	    "The mitochondria is the powerhouse of the cell.",
	    "The stock market crashed yesterday due to inflation.",
	]

	# Get predictions
	predictions = classifier(examples)
	print(predictions)
	# Output:
	# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
	#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
	```

	![Untitled](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/rnZHf_r3p1m4SSNkr8nKc.png)

	## Intended Use

	This model is ideal for:
	* Filtering biological data from Common Crawl or other web datasets.
	* Categorizing academic papers.
	* Tagging educational content.
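A filtering step along these lines could be sketched as follows. The function takes any callable that returns pipeline-style predictions, so the Hugging Face `pipeline` object can be passed in directly; the keyword-based `fake_classifier` is a stand-in used only so the example runs without downloading the model, and the batch size of 32 is an arbitrary choice.

```python
from typing import Callable, Iterable, Iterator, List


def _keep_bio(batch: List[str], classify: Callable[[List[str]], List[dict]]) -> Iterator[str]:
    """Yield the texts in a batch that were predicted as LABEL_1 (Biology)."""
    for text, pred in zip(batch, classify(batch)):
        if pred["label"] == "LABEL_1":
            yield text


def filter_biology(
    texts: Iterable[str],
    classify: Callable[[List[str]], List[dict]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Stream texts through the classifier in batches, keeping Biology only."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from _keep_bio(batch, classify)
            batch = []
    if batch:  # flush the final partial batch
        yield from _keep_bio(batch, classify)


def fake_classifier(batch: List[str]) -> List[dict]:
    """Stand-in for demonstration only; replace with the real pipeline."""
    return [
        {"label": "LABEL_1" if "gene" in text else "LABEL_0", "score": 1.0}
        for text in batch
    ]


docs = [
    "CRISPR-Cas9 technology allows for precise gene editing.",
    "The stock market crashed today due to high inflation rates.",
]
kept = list(filter_biology(docs, fake_classifier))
print(kept)  # ['CRISPR-Cas9 technology allows for precise gene editing.']
```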

	## Limitations

	Since the model prioritizes recall (~83%), it produces some false positives (precision ~74%). It may occasionally classify text from related scientific fields, such as chemistry or physics, as biology, depending on the context.
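Where false positives are costly, one option is to require a higher Biology score before accepting a text. The sketch below assumes the two class scores sum to 1, as with a softmax over two labels; the helper names and scores are made up, and the effect of any particular threshold on precision and recall is not reported here, so it would need to be measured on validation data.

```python
def bio_probability(prediction: dict) -> float:
    """Probability of LABEL_1, assuming the two class scores sum to 1."""
    score = prediction["score"]
    return score if prediction["label"] == "LABEL_1" else 1.0 - score


def is_biology(prediction: dict, threshold: float = 0.5) -> bool:
    """Accept a text as Biology only if its Bio probability reaches the threshold."""
    return bio_probability(prediction) >= threshold


# Made-up pipeline-style predictions for illustration
preds = [
    {"label": "LABEL_1", "score": 0.62},
    {"label": "LABEL_1", "score": 0.95},
    {"label": "LABEL_0", "score": 0.80},
]

print([is_biology(p) for p in preds])                 # [True, True, False]
print([is_biology(p, threshold=0.9) for p in preds])  # [False, True, False]
```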