---
license: apache-2.0
language:
- en
- ru
library_name: gigacheck
tags:
- text-classification
- ai-detection
- multilingual
- gigacheck
datasets:
- iitolstykh/LLMTrace_classification
base_model:
- mistralai/Mistral-7B-v0.3
---

# GigaCheck-Classifier-Multi

<div align="center">
  <img src="https://raw.githubusercontent.com/sweetdream779/LLMTrace-info/refs/heads/main/images/logo/GigaCheck-classifier-multi.PNG" width="40%"/>
</div>
<p align="center">
  <a href="https://sweetdream779.github.io/LLMTrace-info"> 🌐 LLMTrace Website </a> |
  <a href="http://arxiv.org/abs/2509.21269"> 📜 LLMTrace Paper on arXiv </a> |
  <a href="https://huggingface.co/datasets/iitolstykh/LLMTrace_classification"> 🤗 LLMTrace - Classification Dataset </a> |
  <a href="https://github.com/ai-forever/gigacheck"> GitHub </a>
</p>

## Model Card

### Model Description

This is the official `GigaCheck-Classifier-Multi` model from the `LLMTrace` project. It is a multilingual transformer-based model trained for **binary classification of text** as either `human`-written or `ai`-generated.

The model was trained jointly on the English and Russian portions of the `LLMTrace Classification` dataset. It is designed to be a robust baseline for detecting AI-generated content across multiple domains, text lengths, and prompt types.

For complete details on the training data, methodology, and evaluation, please refer to our [research paper on arXiv](http://arxiv.org/abs/2509.21269).

### Intended Use & Limitations

This model is intended for academic research, analysis of AI-generated content, and as a baseline for developing more advanced detection tools.

**Limitations:**
* The model's performance may degrade on text generated by LLMs released after its training date (September 2025).
* It is not infallible and can produce false positives (flagging human text as AI) as well as false negatives (missing AI-generated text).
* Performance may vary on domains or styles of text not well-represented in the training data.

## Evaluation

The model was evaluated on the test split of the `LLMTrace Classification` dataset, which was not seen during training. Performance metrics (in %) are reported below:

| Metric           | Value |
|------------------|-------|
| F1 Score (AI)    | 98.64 |
| F1 Score (Human) | 98.00 |
| Mean Accuracy    | 98.46 |
| TPR @ FPR=0.01   | 97.93 |

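The TPR @ FPR=0.01 metric can be reproduced from raw detector scores. Below is a minimal NumPy sketch; the `tpr_at_fpr` helper and its thresholding rule are illustrative and not part of the released gigacheck code. It assumes per-text scores for the `ai` class and labels where 1 = ai and 0 = human:

```python
import numpy as np

def tpr_at_fpr(scores, labels, fpr_target=0.01):
    """True-positive rate at a fixed false-positive rate: choose the score
    threshold that misclassifies at most fpr_target of human (negative)
    texts, then measure recall on AI (positive) texts at that threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Human-text scores, sorted from highest (most AI-like) to lowest.
    neg = np.sort(scores[labels == 0])[::-1]
    # Allow at most fpr_target of negatives to score above the threshold.
    k = int(np.floor(fpr_target * len(neg)))
    thr = neg[k] if k < len(neg) else neg[-1]
    # Fraction of AI texts scoring strictly above the threshold.
    return float(np.mean(scores[labels == 1] > thr))
```

For example, `tpr_at_fpr([0.9, 0.3, 0.7, 0.8], [1, 1, 0, 0])` sets the threshold at the top human score (0.8) and returns the share of AI texts above it.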
## Quick start

Requirements:
- Python 3.11
- [gigacheck](https://github.com/ai-forever/gigacheck)

```bash
pip install git+https://github.com/ai-forever/gigacheck
```

### Inference with `transformers` (`trust_remote_code=True`)

```python
from transformers import AutoModel
import torch

gigacheck_model = AutoModel.from_pretrained(
    "iitolstykh/GigaCheck-Classifier-Multi",
    trust_remote_code=True,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16
)

text = """To be, or not to be, that is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them.
"""

# Newlines are replaced with spaces before the text is passed to the model.
output = gigacheck_model([text.replace("\n", " ")])

# Map predicted class ids to their labels (e.g. "human" / "ai").
print([gigacheck_model.config.id2label[int(c_id)] for c_id in output.pred_label_ids])
```

### Inference with gigacheck

```python
import torch
from transformers import AutoConfig
from gigacheck.inference.src.mistral_detector import MistralDetector

model_name = "iitolstykh/GigaCheck-Classifier-Multi"

# Build the detector from the model config, then load the pretrained weights.
config = AutoConfig.from_pretrained(model_name)
model = MistralDetector(
    max_seq_len=config.max_length,
    with_detr=config.with_detr,
    id2label=config.id2label,
    device="cpu" if not torch.cuda.is_available() else "cuda:0",
).from_pretrained(model_name)

text = """To be, or not to be, that is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them.
"""

# Newlines are replaced with spaces before the text is passed to the detector.
output = model.predict(text.replace("\n", " "))
print(output)
```

## Citation

If you use this model in your research, please cite our papers:

```bibtex
@article{tolstykh2025llmtrace,
  title={{LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text}},
  author={Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Maksim Kuprashevich},
  journal={arXiv preprint arXiv:2509.21269},
  year={2025}
}
@article{tolstykh2024gigacheck,
  title={{GigaCheck: Detecting LLM-generated Content}},
  author={Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Aleksandr Gordeev and Vladimir Dokholyan and Maksim Kuprashevich},
  journal={arXiv preprint arXiv:2410.23728},
  year={2024}
}
```