Balance Tes Haters — Harassment Classifier

Binary classifier for French social media comments: harassment (1) vs benign (0).

Built for the Balance Tes Haters project, which collects and analyses cyberbullying reports from Instagram, TikTok, YouTube and Twitter.

Architecture

This is a two-component model:

| Component | Description |
|---|---|
| Encoder | Snowflake/snowflake-arctic-embed-l-v2.0 — 568M params, 1024-dim embeddings, loaded from HuggingFace at inference |
| Classifier | harassment_arctic_mlp.joblib — sklearn MLP (512→128, ReLU) trained on frozen Arctic embeddings, bundled in this repo (~7 MB) |

The encoder is not fine-tuned — only the MLP head was trained. This keeps the classifier small and the encoder swappable.
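A minimal sketch of what "only the MLP head was trained" looks like in practice. Random vectors stand in for the frozen 1024-dim Arctic embeddings, and the hyperparameters beyond the (512, 128) hidden layers are illustrative, not the checkpoint's actual training config.

```python
# Sketch: train only an MLP head on frozen encoder embeddings.
# Random vectors stand in for Arctic embeddings (1024-dim); the
# encoder itself is never updated.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 1024))    # frozen embeddings (stand-in)
y_train = rng.integers(0, 2, size=200)    # 1 = harassment, 0 = benign

# Same head shape as the bundled checkpoint: two hidden layers, ReLU.
clf = MLPClassifier(hidden_layer_sizes=(512, 128), activation="relu",
                    max_iter=50, random_state=0)
clf.fit(X_train, y_train)
# joblib.dump(clf, "harassment_arctic_mlp.joblib")  # small artifact on disk
```

Because the gradient never flows into the encoder, the trained artifact is only the MLP weights, which is why the checkpoint stays around 7 MB and the encoder can be swapped for another embedding model of the same dimension.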

Performance

Evaluated on a stratified held-out test set (15% of annotated French comments):

| Metric | Score |
|---|---|
| F1 | 0.6916 |
| Precision | 0.6852 |
| Recall | 0.6981 |
| Accuracy | 0.7130 |
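For reference, these are the standard scikit-learn metrics; the toy labels below just illustrate how each score is computed from test-set predictions.

```python
# Sketch: the four reported metrics, computed with scikit-learn
# on toy labels (3 true positives, 1 false positive, 1 false negative).
from sklearn.metrics import (f1_score, precision_score,
                             recall_score, accuracy_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # → 0.75  (flagged that are harassment)
print(recall_score(y_true, y_pred))     # → 0.75  (harassment that was caught)
print(f1_score(y_true, y_pred))         # → 0.75  (harmonic mean of the two)
print(accuracy_score(y_true, y_pred))   # → 0.75
```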

Comparison with other frozen-embedding approaches on the same test set:

| Model | Classifier | F1 |
|---|---|---|
| Arctic | MLP | 0.6916 |
| Arctic | LogReg | 0.6903 |
| Harrier (270M) | LightGBM | 0.6729 |
| jina-nano (239M) | LightGBM | 0.6573 |
| jina-small (677M) | MLP | 0.6195 |

Usage

from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
import joblib
import numpy as np

# Load components
clf = joblib.load(hf_hub_download(
    repo_id="DataForGood/balance-tes-haters-classifier",
    filename="harassment_arctic_mlp.joblib",
))
encoder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

def predict(text: str) -> int:
    """Returns 1 (harassment) or 0 (benign)."""
    X = encoder.encode([text], convert_to_numpy=True)
    return int(clf.predict(X)[0])

def predict_proba(text: str) -> float:
    """Returns harassment probability between 0 and 1."""
    X = encoder.encode([text], convert_to_numpy=True)
    return float(clf.predict_proba(X)[0, 1])

# Examples
predict("<Insert hateful french comment>")   # → 1
predict("super vidéo, continue comme ça")  # → 0
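For scoring many comments, batching the encode call is much faster than calling predict in a loop. This is a hedged extension of the snippet above; passing `encoder` and `clf` as arguments (rather than using the globals) is a stylistic choice for testability.

```python
# Sketch: batch scoring. One encode call for all texts, then one
# vectorized predict, instead of a per-comment round trip.
import numpy as np

def predict_batch(texts: list[str], encoder, clf) -> np.ndarray:
    """Returns an int array of 0/1 labels, one per comment."""
    X = encoder.encode(texts, convert_to_numpy=True)
    return clf.predict(X).astype(int)

# labels = predict_batch(comments, encoder, clf)
```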

Training Data

  • Real annotations: French social media comments manually annotated via the Balance Tes Haters platform, covering 11 harassment categories (injure, menaces, doxxing, incitation à la haine, etc.)
  • Split: 70% train / 15% val / 15% test (stratified)
  • The MLP was trained on the real split only (no synthetic augmentation for this checkpoint)
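The 70/15/15 stratified split can be produced with two passes of scikit-learn's `train_test_split`; this is a sketch with stand-in labels, not the project's actual preprocessing code.

```python
# Sketch: 70% train / 15% val / 15% test, stratified on the label.
from sklearn.model_selection import train_test_split

texts = [f"comment {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # stand-in binary labels

# First cut: 70% train vs 30% temp, preserving class balance.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)
# Second cut: split the remaining 30% evenly into val and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # → 70 15 15
```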

Categories detected

The model collapses all harassment categories into a single binary label:

  • 0 — Absence de cyberharcèlement
  • 1 — Any of: Cyberharcèlement, Injure, Diffamation, Menaces, Doxxing, Incitation au suicide, Incitation à la haine, Cyberharcèlement à caractère sexuel, and others
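The collapse to a binary label amounts to mapping every category except the benign one to 1. A hypothetical sketch (the real annotation pipeline may differ):

```python
# Hypothetical sketch: collapse multi-category annotations to binary.
# Category strings mirror the list above.
BENIGN = "Absence de cyberharcèlement"

def to_binary(category: str) -> int:
    """1 for any harassment category, 0 only for the benign class."""
    return 0 if category == BENIGN else 1

print(to_binary("Absence de cyberharcèlement"))  # → 0
print(to_binary("Doxxing"))                      # → 1
```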

Limitations

  • Trained exclusively on French comments — not suitable for other languages
  • Sarcasm and context-dependent harassment may be misclassified
  • A recall of ~0.70 means roughly 3 in 10 harassment comments are missed, and a precision of ~0.69 means roughly 3 in 10 flagged comments are actually benign
  • Should be used as a triage tool, not a final decision system — human review recommended for borderline cases
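The triage recommendation above can be implemented by thresholding `predict_proba` and routing only uncertain scores to a moderator. The 0.2/0.8 cutoffs below are illustrative, not tuned values.

```python
# Sketch: triage on the harassment probability. Only borderline
# scores go to human review; confident ones are auto-routed.
def triage(p_harassment: float) -> str:
    if p_harassment >= 0.8:
        return "flag"          # confident harassment
    if p_harassment <= 0.2:
        return "pass"          # confident benign
    return "human_review"      # borderline → moderator queue

print(triage(0.95))  # → flag
print(triage(0.05))  # → pass
print(triage(0.50))  # → human_review
```

In practice the cutoffs should be chosen from the validation set to balance moderator workload against the cost of missed harassment.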

Dependencies

pip install sentence-transformers scikit-learn huggingface_hub