---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- roberta
- medical
metrics:
- f1
- accuracy
- recall
datasets:
- Madras1/BioClass80k
base_model: roberta-base
widget:
- text: The mitochondria is the powerhouse of the cell and generates ATP.
  example_title: Biology Example 🧬
- text: The stock market crashed today due to high inflation rates.
  example_title: Finance Example 💰
- text: CRISPR-Cas9 technology allows for precise gene editing.
  example_title: Genetics Example 🔬
pipeline_tag: text-classification
---
# RobertaBioClass 🧬

**RobertaBioClass** is a fine-tuned RoBERTa model designed to distinguish biological texts from general-topic text. It is intended for filtering large datasets and prioritizes recall so that relevant biological content is not missed.
## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Text Classification
- **Language:** English. Portuguese is also listed in the model metadata, but coverage depends on the training data mix.
- **Author:** Madras1
## Performance Metrics 📊

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **high recall**, which suits filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.8%** | Overall correctness |
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (Sensitivity) |
| **Precision** | **74.4%** | Correctness when predicting "Bio" |
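
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall. The snippet below is a minimal sketch; the function name `f1_from` is illustrative and not part of the model's code.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the metrics table.
precision_bio = 0.744
recall_bio = 0.831

f1 = f1_from(precision_bio, recall_bio)
print(f"F1 = {f1:.3f}")  # F1 = 0.785
```

The result agrees with the 78.5% F1-Score reported in the table.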
## Label Mapping

The model outputs the following labels:

* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)
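
Because the raw labels are generic, downstream code may want readable names. The following is a small sketch of one way to translate them; `ID2NAME` and `to_readable` are hypothetical helper names, and the score in the example is made up.

```python
# Mapping taken from the label list in this section.
ID2NAME = {"LABEL_0": "Non-Biology", "LABEL_1": "Biology"}


def to_readable(prediction: dict) -> dict:
    """Return a copy of a pipeline-style prediction with a readable label."""
    return {
        "label": ID2NAME[prediction["label"]],
        "score": prediction["score"],
    }


# A prediction shaped like the pipeline output (illustrative score).
print(to_readable({"label": "LABEL_1", "score": 0.97}))
# {'label': 'Biology', 'score': 0.97}
```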
## Training Data & Procedure

### Data Overview

The dataset consists of approximately **80,000 text samples** aggregated from multiple sources.

* **Total Samples:** ~79,700
* **Class Balance:** The dataset was imbalanced, with ~71% belonging to the "Non-Bio" class and ~29% to the "Bio" class.
* **Preprocessing:** Scripts were used to clean delimiter issues in CSVs, remove duplicates, and perform a stratified split for validation.
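
The preprocessing scripts themselves are not included in this card. As a rough illustration of the deduplication and stratified split steps, here is a standard-library sketch. The function names, the exact-match deduplication rule, the 20% validation fraction, and the toy data are all assumptions, not the author's actual code.

```python
import random
from collections import defaultdict


def dedupe(rows):
    """Drop exact duplicate texts, keeping the first occurrence."""
    seen = set()
    unique = []
    for text, label in rows:
        key = text.strip()
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def stratified_split(rows, val_fraction=0.2, seed=42):
    """Split rows so each label keeps about the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_val = round(len(label_rows) * val_fraction)
        val.extend(label_rows[:n_val])
        train.extend(label_rows[n_val:])
    return train, val


# Toy data mirroring the ~71% / ~29% class balance (71 vs 29 rows).
rows = [(f"general text {i}", 0) for i in range(71)]
rows += [(f"biology text {i}", 1) for i in range(29)]
rows.append(("biology text 0", 1))  # an exact duplicate to be removed

train, val = stratified_split(dedupe(rows))
print(len(train), len(val))  # 80 20
```

Splitting per class keeps the minority "Bio" share roughly equal in the training and validation sets, which matters when the classes are imbalanced.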
### Training Procedure

To address the class imbalance without undersampling, which would discard valuable data, we employed a custom **Weighted Cross-Entropy Loss**.

* **Class Weights:** Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying a general text, which directly contributed to the high Recall score.
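
To make the weighting concrete, the sketch below computes class weights with the "balanced" heuristic, `n_samples / (n_classes * count)`, which is the formula scikit-learn's `compute_class_weight` applies in its `"balanced"` mode. Whether that exact mode was used for this model is an assumption, and the class counts are approximations derived from the ~79,700 total and the ~71% / ~29% split.

```python
import math


def balanced_class_weights(counts):
    """Weights n_samples / (n_classes * count), the 'balanced' heuristic."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -w[y] * log(p[y]), normalized by the sum of weights.

    This normalization mirrors PyTorch's CrossEntropyLoss when a `weight`
    tensor is given and the default mean reduction is used.
    """
    losses = [weights[y] * -math.log(p[y]) for p, y in zip(probs, targets)]
    return sum(losses) / sum(weights[y] for y in targets)


# Approximate counts (assumption): 71% and 29% of ~79,700 samples.
counts = {0: 56_587, 1: 23_113}
weights = balanced_class_weights(counts)
print({label: round(w, 2) for label, w in weights.items()})  # {0: 0.7, 1: 1.72}

# The same mistake (true class given probability 0.2) costs more on a Bio sample.
miss = -math.log(0.2)
print(round((weights[1] * miss) / (weights[0] * miss), 2))  # 2.45
```

Under these assumed counts, an equally confident error on a Biology sample contributes about 2.45 times as much to the loss as the same error on a Non-Biology sample.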
### Hyperparameters

The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:

* **Optimizer:** AdamW
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Epochs:** 2
* **Weight Decay:** 0.01
* **Hardware:** Trained on an NVIDIA T4 GPU
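
The configuration can be written as keyword arguments named after `transformers.TrainingArguments` fields. This is a sketch rather than the original training script: `output_dir` is a hypothetical path, treating the batch size as per-device is an assumption (reasonable for a single T4), and the step count relies on an estimated training-set size of ~79,700 minus the ~16k held-out samples.

```python
import math

# Keyword arguments named after transformers.TrainingArguments fields.
training_kwargs = {
    "output_dir": "./roberta-bioclass-out",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumes batch size is per device
    "num_train_epochs": 2,
    "weight_decay": 0.01,
}

# Rough step count under the assumed training-set size.
approx_train_samples = 79_700 - 16_000
steps_per_epoch = math.ceil(
    approx_train_samples / training_kwargs["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 3982 7964
```

With `transformers` installed, the dictionary can be passed as `TrainingArguments(**training_kwargs)`. AdamW is the `Trainer`'s default optimizer family, so it is not set explicitly here.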
## How to Use

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation."
]

# Get predictions
predictions = classifier(examples)
print(predictions)

# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
```
## Intended Use

This model is ideal for:

* Filtering biological data from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.
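
For the dataset-filtering use case, texts are usually processed in batches. The sketch below shows one possible shape for such a filter. `batched`, `filter_biology`, and the batch size are hypothetical, and `fake_classify` is a stand-in so the example runs without downloading the model; in practice `classify` would be the `pipeline` object, which returns one prediction dict per input text.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def filter_biology(texts, classify, batch_size=32):
    """Yield texts that `classify` labels as LABEL_1 (Biology)."""
    for batch in batched(texts, batch_size):
        for text, pred in zip(batch, classify(batch)):
            if pred["label"] == "LABEL_1":
                yield text


# Stand-in classifier with a toy keyword rule and a made-up score.
def fake_classify(batch):
    return [
        {"label": "LABEL_1" if "gene" in text else "LABEL_0", "score": 0.9}
        for text in batch
    ]


docs = ["CRISPR enables gene editing.", "Stocks fell on Monday."]
print(list(filter_biology(docs, fake_classify)))
# ['CRISPR enables gene editing.']
```

Writing the filter as a generator keeps memory use flat when streaming a large corpus.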
## Limitations

Because the model prioritizes recall (83%), it produces some false positives: at ~74% precision, roughly one in four texts flagged as Biology is not biological. It may also classify text from related scientific fields (such as Chemistry or Physics) as Biology, depending on the context.
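
If false positives are costly for a given application, one option is to require a higher Biology probability before accepting a text. The sketch below assumes the pipeline score is a softmax probability over the two classes, so the probability of the other class is `1 - score`. The helper names and threshold values are hypothetical and have not been tuned on this model.

```python
def bio_probability(prediction: dict) -> float:
    """Probability of the Biology class from a top-1 binary prediction."""
    score = prediction["score"]
    return score if prediction["label"] == "LABEL_1" else 1.0 - score


def is_biology(prediction: dict, threshold: float = 0.5) -> bool:
    """Accept a text as Biology only when its probability reaches the threshold."""
    return bio_probability(prediction) >= threshold


# Made-up prediction for illustration.
borderline = {"label": "LABEL_1", "score": 0.62}
print(is_biology(borderline))                 # True at the default 0.5
print(is_biology(borderline, threshold=0.8))  # False with a stricter cut-off
```

Raising the threshold trades recall for precision, so any cut-off should be checked against a labeled validation set before use.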