---
license: cc-by-sa-4.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- sports
datasets:
- Chrisneverdie/OnlySports_Dataset
base_model: Snowflake/snowflake-arctic-embed-xs
---

# Sports Text Classifier

## Overview

This Sports Text Classifier is a key component of the OnlySports Dataset creation pipeline. It is designed to identify and extract sports-related documents from a large corpus of web content.

## Model Architecture

- Base model: [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)
- Additional layer: Binary classification layer
- Training: 10 epochs with a learning rate of 3e-4

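As a rough illustration of what a binary classification layer on top of an embedding model computes, the sketch below applies a single linear layer followed by a sigmoid to one document embedding. It is not the released implementation: the single-logit sigmoid form, the function name `sports_probability`, and the embedding width of 384 are assumptions for illustration, and the weights are placeholders rather than trained values.

```python
import math
from typing import Sequence

EMBED_DIM = 384  # assumed embedding width; check the base model's config


def sports_probability(
    embedding: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Return P(sports) from one linear layer plus a sigmoid (assumed head)."""
    if len(embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    logit = sum(e * w for e, w in zip(embedding, weights)) + bias
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Placeholder inputs: an all-zero weight vector gives a logit of 0.
embedding = [0.1] * EMBED_DIM
untrained_weights = [0.0] * EMBED_DIM
print(sports_probability(embedding, untrained_weights))  # 0.5
```

With trained weights, a document would be labeled as sports when this probability exceeds a chosen threshold.
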
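## Performance

The classifier distinguishes sports from non-sports documents with high accuracy, as shown in the chart below:

![Classifier performance chart](https://cdn-uploads.huggingface.co/production/uploads/65bcfeefc1fa2ac4c33a5a02/h_YNBNzMB8j-iEJs2jX5R.png)
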
## Training Data

The classifier was trained on a labeled dataset of about 100k documents covering both sports and non-sports content:

- 64k samples from seven prestigious sports websites
- 36k non-sports text documents, classified as non-sports using GPT-3.5

## Usage

This classifier was built primarily for the creation of the OnlySports Dataset, presented in this [paper](https://arxiv.org/abs/2409.00286). It can also be applied to filter other large text corpora for sports-related content with high accuracy.

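A minimal sketch of corpus filtering, assuming the classifier is exposed as a function that maps a document to a sports probability. The names `filter_sports`, `score_fn`, and `toy_score` and the 0.5 threshold are hypothetical. `toy_score` is a keyword stand-in so the example runs without downloading the model; a real run would replace it with a call to the classifier.

```python
from typing import Callable, Iterable, Iterator


def filter_sports(
    documents: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cutoff, tune on held-out data
) -> Iterator[str]:
    """Yield only the documents whose sports score reaches the threshold."""
    for doc in documents:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Stand-in scorer for demonstration only, not the real classifier."""
    return 1.0 if "match" in text.lower() else 0.0


docs = ["The match ended 2-1.", "Stock prices fell."]
kept = list(filter_sports(docs, toy_score))
print(kept)  # ['The match ended 2-1.']
```

Because `filter_sports` is a generator, it can stream over a corpus that does not fit in memory.
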
## Integration

The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.

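The combination of URL keyword filtering and classification in a map/reduce layout might look like the sketch below. Everything specific here is an assumption for illustration: the keyword list, the function names, the rule that a record must pass both the URL filter and the classifier, and the in-process `reduce` standing in for a distributed framework.

```python
from functools import reduce
from typing import Callable, Iterable, List, Sequence, Tuple
from urllib.parse import urlparse

Record = Tuple[str, str]  # (url, text)

# Hypothetical keyword list; the keywords used for the real dataset may differ.
SPORTS_URL_KEYWORDS = ("sports", "nba", "football")


def url_matches(url: str, keywords: Sequence[str] = SPORTS_URL_KEYWORDS) -> bool:
    """Return True if the host or path of the URL contains any keyword."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return any(keyword in haystack for keyword in keywords)


def map_shard(shard: Iterable[Record], is_sports: Callable[[str], bool]) -> List[Record]:
    """Map step: keep records that pass the URL filter and the classifier."""
    return [(url, text) for url, text in shard if url_matches(url) and is_sports(text)]


def reduce_shards(left: List[Record], right: List[Record]) -> List[Record]:
    """Reduce step: concatenate the outputs of two shards."""
    return left + right


def run_pipeline(
    shards: Iterable[Iterable[Record]], is_sports: Callable[[str], bool]
) -> List[Record]:
    """Map every shard independently, then fold the results together."""
    mapped = (map_shard(shard, is_sports) for shard in shards)
    return reduce(reduce_shards, mapped, [])
```

Because each shard is mapped independently, the map step could be spread across workers; only the reduce step needs to see every shard's output.
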
## Related Projects

This classifier is part of the larger OnlySports collection, which includes:

- [OnlySports Dataset](https://huggingface.co/collections/Chrisneverdie/onlysports-66b3e5cf595eb81220cc27a6)
- [OnlySportsLM](https://huggingface.co/Chrisneverdie/OnlySportsLM_196M)

For more information, check our [paper](https://arxiv.org/abs/2409.00286).