Kirill

Add VKR CodeBERT detector model

aa7f273 13 days ago

	---
	language:
	- code
	license: mit
	base_model: microsoft/codebert-base
	tags:
	- code
	- python
	- ai-generated-code-detection
	- codebert
	- binary-classification
	pipeline_tag: text-classification
	datasets:
	- AggressiveBag/VKR_Dataset
	metrics:
	- accuracy
	- precision
	- recall
	- f1
	---

	# VKR Model

	Binary classifier for detecting whether Python code is human-written or AI-generated.

	The model was fine-tuned from `microsoft/codebert-base` on the `AggressiveBag/VKR_Dataset` dataset.

	## Labels

	- `0`: human-written Python code
	- `1`: AI-generated Python code
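The label ids above can be captured in a pair of lookup tables, for example to decode predictions in downstream code. This is a minimal sketch: the string names `"human"` and `"ai"` and the helper `decode` are illustrative assumptions, since the card itself only defines the integer ids.

```python
# Label ids as documented in this card; the string names are illustrative.
ID2LABEL = {0: "human", 1: "ai"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(class_id: int) -> str:
    """Map a predicted class id to a readable label name."""
    if class_id not in ID2LABEL:
        raise ValueError(f"unknown class id: {class_id}")
    return ID2LABEL[class_id]


print(decode(1))
```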

	## Training Setup

	- Base model: `microsoft/codebert-base`
	- Maximum sequence length: 512
	- Epochs: 3
	- Batch size: 8
	- Learning rate: `2e-5`
	- Weight decay: `0.01`
	- Warmup ratio: `0.06`
	- Seed: 42
	- Encoder frozen: yes
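To make the schedule concrete, the sketch below turns the listed hyperparameters into optimizer step counts. Several things here are assumptions rather than facts from this card: the dataset size `num_train_examples` is a hypothetical placeholder, one optimizer step per batch is assumed (no gradient accumulation), and warmup steps are rounded up. The `TrainSetup` and `schedule` names are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    # Values copied from the "Training Setup" list.
    max_length: int = 512
    epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    warmup_ratio: float = 0.06
    seed: int = 42
    freeze_encoder: bool = True


def schedule(setup: TrainSetup, num_train_examples: int) -> tuple[int, int]:
    """Return (total_steps, warmup_steps), assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_train_examples / setup.batch_size)
    total_steps = steps_per_epoch * setup.epochs
    warmup_steps = math.ceil(total_steps * setup.warmup_ratio)
    return total_steps, warmup_steps


# Hypothetical dataset size, for illustration only.
total, warmup = schedule(TrainSetup(), num_train_examples=10_000)
print(f"total steps: {total}, warmup steps: {warmup}")
```

With the encoder frozen, these steps would update only the parameters outside the CodeBERT encoder, which is typically the classification head.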

	## Validation Metrics

	| Metric | Value |
	|---|---:|
	| Loss | 0.4192 |
	| Accuracy | 0.7836 |
	| Precision, AI class | 0.7142 |
	| Recall, AI class | 0.9456 |
	| F1, AI class | 0.8138 |
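As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported AI-class F1 can be recomputed from the other two rows. The helper name `f1_score` below is illustrative and does not refer to any particular library function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# AI-class values from the validation metrics table.
precision_ai = 0.7142
recall_ai = 0.9456

f1_ai = f1_score(precision_ai, recall_ai)
print(round(f1_ai, 4))
```

The recomputed value agrees with the reported 0.8138 to four decimal places. Recall being well above precision suggests the classifier leans toward predicting the AI class.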

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_id = "AggressiveBag/VKR_Model"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

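	code = "print('hello world')"
	# Inputs longer than 512 tokens are truncated to the training sequence length.
	inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

	# Inference only: skip gradient tracking.
	with torch.no_grad():
	    logits = model(**inputs).logits
	    probs = torch.softmax(logits, dim=-1)[0]

	# Index 0 is the human-written class, index 1 the AI-generated class.
	print({"human": float(probs[0]), "ai": float(probs[1])})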
	```
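The usage snippet returns class probabilities rather than a decision, so a caller still has to choose a cut-off. The following is a minimal sketch of that step in plain Python, without `torch`. The function names, the example logits, and the default threshold of 0.5 are assumptions for illustration, not part of the model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits: list[float], threshold: float = 0.5) -> str:
    """Return "ai" when P(class 1) reaches the threshold, else "human"."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    p_ai = softmax(logits)[1]
    return "ai" if p_ai >= threshold else "human"


# Hypothetical logits for one sample, ordered [human, ai].
example_logits = [0.2, 1.1]
print(decide(example_logits))                 # default cut-off
print(decide(example_logits, threshold=0.9))  # stricter cut-off
```

Raising the threshold should reduce false positives on human-written code at the cost of missing more AI-generated samples. A suitable value would need to be tuned on held-out data.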

	## Intended Use

	This model is intended for research and educational experiments on AI-generated Python code detection. It should not be used as the sole evidence for high-stakes decisions, because the detector produces both false positives and false negatives: with an AI-class validation precision of 0.7142, roughly 29% of the validation samples flagged as AI-generated were human-written.

	## Dataset

	The training data is based on human solutions from APPS and locally generated AI solutions. See `AggressiveBag/VKR_Dataset` for dataset details.