	---
	license: mit
	language:
	- en
	library_name: transformers
	tags:
	- text-classification
	- naics
	- industry-classification
	- github
	- roberta
	datasets:
	- custom
	metrics:
	- f1
	- accuracy
	pipeline_tag: text-classification
	---

	# NAICS GitHub Repository Classifier

	A fine-tuned RoBERTa-large model that classifies GitHub repositories into 19 NAICS (North American Industry Classification System) industry sectors based on repository metadata.

	## Model Description

	This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.

	- Base model: `roberta-large` (355M parameters)
	- Task: Multi-class text classification (19 classes)
	- Language: English
	- Training Data: 6,588 labeled GitHub repositories

	## Intended Use

	- Classifying GitHub repositories by industry sector
	- Analyzing the open-source software ecosystem by industry
	- Research on technology adoption across industries

	## NAICS Classes

	| Label | NAICS Code | Industry Sector |
	|-------|------------|-----------------|
	| 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
	| 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
	| 2 | 22 | Utilities |
	| 3 | 23 | Construction |
	| 4 | 31-33 | Manufacturing |
	| 5 | 42 | Wholesale Trade |
	| 6 | 44-45 | Retail Trade |
	| 7 | 48-49 | Transportation and Warehousing |
	| 8 | 51 | Information |
	| 9 | 52 | Finance and Insurance |
	| 10 | 53 | Real Estate and Rental |
	| 11 | 54 | Professional, Scientific, Technical Services |
	| 12 | 56 | Administrative and Support Services |
	| 13 | 61 | Educational Services |
	| 14 | 62 | Health Care and Social Assistance |
	| 15 | 71 | Arts, Entertainment, and Recreation |
	| 16 | 72 | Accommodation and Food Services |
	| 17 | 81 | Other Services |
	| 18 | 92 | Public Administration |
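
The usage examples in this card show predictions as NAICS code strings such as `'52'`. A small lookup, copied from the table above, turns a code into a readable sector name. This is a convenience sketch: `NAICS_SECTORS` and `sector_name` are hypothetical names that do not ship with the model, and it assumes labels arrive as two-digit or range strings exactly as listed in the table.

```python
# Sector names copied from the NAICS Classes table, keyed by NAICS code string.
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, Oil and Gas Extraction",
    "22": "Utilities",
    "23": "Construction",
    "31-33": "Manufacturing",
    "42": "Wholesale Trade",
    "44-45": "Retail Trade",
    "48-49": "Transportation and Warehousing",
    "51": "Information",
    "52": "Finance and Insurance",
    "53": "Real Estate and Rental",
    "54": "Professional, Scientific, Technical Services",
    "56": "Administrative and Support Services",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
    "71": "Arts, Entertainment, and Recreation",
    "72": "Accommodation and Food Services",
    "81": "Other Services",
    "92": "Public Administration",
}


def sector_name(naics_code: str) -> str:
    """Return the sector name for a NAICS code string, or 'Unknown' if it is not in the table."""
    return NAICS_SECTORS.get(naics_code, "Unknown")


print(sector_name("52"))  # Finance and Insurance
```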

	## Usage

	### Quick Start

	```python
	from transformers import pipeline

	classifier = pipeline(
	    "text-classification",
	    model="alexanderquispe/naics-github-classifier",
	)

	text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
	result = classifier(text)
	print(result)
	# Example output: [{'label': '52', 'score': 0.95}]  (52 = Finance and Insurance)
	```

	### Full Example

	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	import torch

	model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
	tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")

	# Format input: fields joined by " | ", topics separated by semicolons
	text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."

	# Tokenize, truncating to the 512-token maximum sequence length
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

	# Inference only, so skip gradient tracking
	with torch.no_grad():
	    outputs = model(**inputs)
	predicted_class = torch.argmax(outputs.logits, dim=1).item()

	# Map the class index to its NAICS code
	id2label = model.config.id2label
	print(f"Predicted NAICS: {id2label[predicted_class]}")  # e.g. 62 (Health Care and Social Assistance)
	```

	## Input Format

	The model expects text in this format:

	```
	Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
	```

	| Field | Required | Description |
	|-------|----------|-------------|
	| Repository | Yes | Repository name |
	| Description | No | Short description |
	| Topics | No | Semicolon-separated tags |
	| README | No | README content (can be truncated) |
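
A small helper can assemble this string from repository metadata. The sketch below is an illustration, not part of the model's package: `build_input` is a hypothetical name, and it assumes that optional fields are left out entirely when they are missing (as the Quick Start example does for Topics) and that fields are joined with a plain pipe surrounded by spaces.

```python
from typing import Optional, Sequence


def build_input(
    repo_name: str,
    description: Optional[str] = None,
    topics: Optional[Sequence[str]] = None,
    readme: Optional[str] = None,
) -> str:
    """Join the available fields in the order Repository, Description, Topics, README."""
    parts = [f"Repository: {repo_name}"]
    if description:
        parts.append(f"Description: {description}")
    if topics:
        # Topics are semicolon-separated, as in the Full Example.
        parts.append("Topics: " + "; ".join(topics))
    if readme:
        parts.append(f"README: {readme}")
    return " | ".join(parts)


text = build_input(
    "bank-api",
    description="REST API for banking transactions",
    readme="A secure API for financial operations",
)
print(text)
```

Because the README comes last, tokenizing with `truncation=True` and the tokenizer's default right-side truncation trims the tail of the README first, so the name, description, and topics are kept whenever they fit within the 512-token limit.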

	## Training Details

	### Training Data

	- Source: GitHub repositories labeled with NAICS codes
	- Size: 6,588 examples
	- Classes: 19 NAICS sectors
	- Split: 70% train / 10% validation / 20% test
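
To give a feel for the split sizes, here is a minimal shuffle-and-cut sketch using only the standard library. The card does not say whether the split was stratified, which seed was used, or how fractional counts were rounded, so `split_indices`, the seed, and the floor rounding are all assumptions.

```python
import random


def split_indices(n: int, train: float = 0.7, val: float = 0.1, seed: int = 42):
    """Shuffle indices 0..n-1 and cut them into train / validation / test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility (seed is an assumption)
    n_train = int(n * train)          # floor of the train share
    n_val = int(n * val)              # floor of the validation share
    # Whatever remains goes to the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(6588)
print(len(train_idx), len(val_idx), len(test_idx))  # 4611 658 1319
```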

	### Training Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Base Model | `roberta-large` |
	| Batch Size | 32 |
	| Learning Rate | 2e-5 |
	| Epochs | 8 |
	| Max Sequence Length | 512 |
	| Optimizer | AdamW |
	| Weight Decay | 0.01 |
	| Early Stopping Patience | 5 |
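
As a rough worked example, these values bound the number of optimizer steps. The arithmetic below assumes the training split is the floor of 70% of the 6,588 examples, a single device, and no gradient accumulation; none of these is stated in the card, and early stopping can end training before the maximum is reached.

```python
import math

train_examples = int(6588 * 0.70)  # 4611, assuming floor rounding of the 70% split
batch_size = 32
max_epochs = 8

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
max_steps = steps_per_epoch * max_epochs

print(steps_per_epoch, max_steps)  # 145 1160
```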

	### Preprocessing

	Text preprocessing includes:
	- Removal of markdown badges and formatting
	- URL cleaning (keep domain names)
	- License header removal
	- Code block removal (keep language indicators)
	- Technology term normalization (js → javascript, py → python)
	- Whitespace normalization
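
The steps above can be approximated with a few regular expressions. This is a simplified sketch, not the authors' pipeline: the patterns, the order of the steps, the `TERM_MAP` contents beyond the two listed terms, and the license-line heuristic are all assumptions. The actual implementation lives in the training repository.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence in this example

# Only the two normalizations named in the list above; the real map is likely larger.
TERM_MAP = {"js": "javascript", "py": "python"}

CODE_BLOCK = re.compile(FENCE + r"([A-Za-z0-9_+-]*)\n.*?" + FENCE, re.DOTALL)
LINKED_BADGE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)")
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
LICENSE_LINE = re.compile(
    r"^[ \t]*(?:copyright\b|licensed under\b|spdx-license-identifier:).*$",
    re.IGNORECASE | re.MULTILINE,
)
URL = re.compile(r"https?://(?:www\.)?([^/\s)]+)[^\s)]*")
TERM = re.compile(r"(?<![\w.-])(js|py)\b", re.IGNORECASE)


def preprocess(text: str) -> str:
    text = CODE_BLOCK.sub(r" \1 ", text)   # drop code, keep the language indicator
    text = LINKED_BADGE.sub(" ", text)     # badges of the form [![alt](img)](link)
    text = IMAGE.sub(" ", text)            # bare images ![alt](img)
    text = LICENSE_LINE.sub(" ", text)     # heuristic: lines that look like license headers
    text = URL.sub(r"\1", text)            # keep only the domain name
    text = re.sub(r"[#*`>]+", " ", text)   # heading, emphasis, and quote markers
    text = TERM.sub(lambda m: TERM_MAP[m.group(1).lower()], text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "# My App\n"
    "[![Build](https://img.shields.io/badge/build-passing.svg)](https://ci.example.com/run)\n"
    "Copyright 2024 Example Corp\n"
    "Docs at https://www.example.com/docs/start and a js client.\n"
    + FENCE + "py\nprint('hi')\n" + FENCE + "\n"
)
cleaned = preprocess(sample)
print(cleaned)  # My App Docs at example.com and a javascript client. python
```

The lookbehind in `TERM` keeps file names such as `main.py` or `node.js` intact, so only standalone abbreviations are expanded.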

	## Limitations

	- Trained primarily on English repositories
	- May not generalize to non-software repositories
	- NAICS sector 55 (Management of Companies and Enterprises) is excluded because of limited training data
	- Performance may vary for repositories with minimal README content

	## Citation

	```bibtex
	@misc{naics-github-classifier,
	  author = {{GitHub, Inc.} and Xu, Kevin and Quispe, Alexander},
	  title = {NAICS GitHub Repository Classifier},
	  year = {2025},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
	}
	```

	## Repository

	Training code and data preparation: [github.com/alexanderquispe/naics-github-train](https://github.com/alexanderquispe/naics-github-train)