---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# URL Classifier

A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).

## Model Description

This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT), trained to distinguish between two classes:

- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)

## Training Details

### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`

### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |

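
The early-stopping row can be read as: stop once the monitored validation metric has failed to improve by more than 0.001 for 3 consecutive epochs. A minimal, framework-free sketch of that rule follows. The `EarlyStopper` class, the choice of validation loss as the monitored metric, and the absolute (rather than relative) threshold are assumptions for illustration, not details confirmed by the training code.

```python
class EarlyStopper:
    """Hypothetical helper: stop when validation loss stalls.

    Assumes lower is better and an absolute improvement threshold.
    """

    def __init__(self, patience: int = 3, threshold: float = 0.001) -> None:
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best - self.threshold:
            # Meaningful improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(losses: list[float], max_epochs: int = 20) -> int:
    """Return how many epochs would run for a given validation-loss curve."""
    stopper = EarlyStopper()
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(losses), max_epochs)
```

With this rule, a run whose loss plateaus after the second epoch stops three epochs later, well before the 20-epoch cap.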
### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation

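
One common way to obtain a balanced set is to downsample the larger class to the size of the smaller one. The sketch below illustrates that idea only: the `balance_by_downsampling` helper and the `(url, label)` tuple layout are hypothetical, and this card does not state which balancing or augmentation method was actually used.

```python
import random
from collections import defaultdict


def balance_by_downsampling(
    examples: list[tuple[str, int]], seed: int = 0
) -> list[tuple[str, int]]:
    """Downsample every class to the size of the smallest class.

    `examples` holds (url, label) pairs; label 0 = official_website and
    label 1 = platform, matching the label ids in this card.
    """
    if not examples:
        return []

    # Group examples by their label id.
    by_label: dict[int, list[tuple[str, int]]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced: list[tuple[str, int]] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data from the larger class; oversampling or class weights are alternatives when the dataset is small.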
## Performance

### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |

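
For reference, all six metrics in the table follow from the four confusion-matrix counts. The sketch below treats `platform` (LABEL_1) as the positive class, which is an assumption. The counts in the usage lines are made up to show the arithmetic and are not this model's test results.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, F1, FPR and FNR from counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Illustrative counts only (not the reported test set).
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the false negative rate is always `1 - recall`, so the recall and FNR rows of the table carry the same information.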
### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)

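
Because confidence varies between predictions (99.98% versus 75.96% in the examples), downstream code may want to handle low-confidence results separately. A minimal sketch, assuming a hypothetical `route_prediction` helper and an arbitrary 0.90 cutoff that should be tuned on your own data:

```python
def route_prediction(
    label: str, confidence: float, min_confidence: float = 0.90
) -> str:
    """Return the label if confident enough, else flag for manual review."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return label if confidence >= min_confidence else "needs_review"


# The two confidences quoted in the example predictions.
print(route_prediction("official_website", 0.9998))
print(route_prediction("platform", 0.7596))
```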
## Usage

### Direct Inference with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

# Run a forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```

### Using Hugging Face Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
print(result)  # a list with one dict containing "label" and "score"
```

### Pydantic Integration (Production-Ready)

```python
from transformers import pipeline
from pydantic import BaseModel, Field
from typing import Literal

# Load the pipeline once instead of on every call.
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")

LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate in the tokenizer: max_length counts tokens, not characters.
    result = classifier(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )
```

## Limitations and Bias
- **Max URL Length**: Model trained on 64-token sequences. Longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.)
- **Language**: Primarily trained on English-language domains
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names

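
Callers can flag these edge cases before or after classification so that affected predictions receive extra scrutiny. The following is a heuristic sketch using only the standard library. The `edge_case_flags` helper, the `COMMON_TLDS` allow-list, and the 12-character length cutoff are all hypothetical choices, not values taken from the model's training setup.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; extend it for your own traffic.
COMMON_TLDS = {"com", "org", "net", "io", "co"}


def edge_case_flags(url: str, min_length: int = 12) -> list[str]:
    """Return heuristic flags for inputs the card lists as edge cases."""
    host = urlsplit(url).hostname or ""
    if not host:
        # urlsplit finds no host when the scheme (e.g. https://) is missing.
        return ["unparseable_host"]

    flags: list[str] = []
    labels = host.split(".")
    # IDNs appear either as raw non-ASCII or as punycode ("xn--") labels.
    if not host.isascii() or any(label.startswith("xn--") for label in labels):
        flags.append("internationalized_domain")
    if labels[-1] not in COMMON_TLDS:
        flags.append("uncommon_tld")
    if len(url) < min_length:
        flags.append("very_short_url")
    return flags


for candidate in ["https://acmewidgets.com", "https://xn--mnchen-3ya.de", "http://a.io"]:
    print(candidate, edge_case_flags(candidate))
```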
## Intended Use

**Primary Use Cases:**
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

**Out of Scope:**
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering

## Model Card Authors
DiligentAI Team

## Citation
```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```

## License
MIT License

## Framework Versions
- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+

## Training Infrastructure
- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository

## Model Versioning
This model is automatically versioned and deployed via GitHub Actions. Each release includes:
- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)

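
A release pipeline of this kind can gate deployment on the thresholds listed under Test Set Metrics. The sketch below assumes `metrics.json` is a flat JSON object with the key names shown. That schema and the `check_release_metrics` function are hypothetical, since this card does not document the file's layout, and the example values are invented to trigger one failure.

```python
import json

# Thresholds from the Test Set Metrics table: (bound type, limit).
THRESHOLDS = {
    "accuracy": ("min", 0.80),
    "f1": ("min", 0.80),
    "precision": ("min", 0.80),
    "recall": ("min", 0.80),
    "false_positive_rate": ("max", 0.15),
    "false_negative_rate": ("max", 0.15),
}


def check_release_metrics(metrics: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures: list[str] = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures


# Assumed file contents, inlined so the sketch runs without the file.
example = json.loads(
    '{"accuracy": 0.95, "f1": 0.95, "precision": 0.95, "recall": 0.95,'
    ' "false_positive_rate": 0.05, "false_negative_rate": 0.2}'
)
print(check_release_metrics(example))
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code so the release step is skipped.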
## Contact
For issues, questions, or feedback:
- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI