OpenParallax Shield Classifier v1

Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

Performance

Tested against 321 adversarial payloads across 6 attack categories:

| Metric          | Pre-trained | Fine-tuned |
|-----------------|-------------|------------|
| Accuracy        | 77.6%       | 98.8%      |
| False negatives | 71          | 4          |
| False positives | 1           | 0          |
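The aggregate accuracies are consistent with the raw error counts: out of 321 payloads, the pre-trained model makes 72 errors (71 false negatives + 1 false positive) and the fine-tuned model makes 4. A quick sanity check of the arithmetic:

```javascript
// Verify the reported accuracies from the error counts in the table above.
const total = 321;

const pretrainedErrors = 71 + 1; // false negatives + false positives
const pretrainedAcc = (total - pretrainedErrors) / total;
console.log(pretrainedAcc.toFixed(3)); // "0.776"

const finetunedErrors = 4 + 0;
const finetunedAcc = (total - finetunedErrors) / total;
console.log(finetunedAcc.toFixed(3)); // "0.988"
```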

Per-Category Results

| Category           | Pre-trained | Fine-tuned |
|--------------------|-------------|------------|
| Encoding evasion   | 51.3%       | 100%       |
| Shell injection    | 73.3%       | 100%       |
| Authority spoofing | 82.1%       | 100%       |
| Path traversal     | 64.0%       | 96.0%      |
| Data exfiltration  | 86.1%       | 100%       |
| Prompt injection   | 92.8%       | 97.9%      |

Training

Optimized for detecting injection attacks involving:

  • Tool call arguments (file paths, shell commands, HTTP requests)
  • Authority spoofing ("system override", "admin approved", tool impersonation)
  • Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
  • Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)
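To make the "encoding evasion" category concrete, here is an illustrative sketch (the payload string is a made-up example, not from the training set) of the kinds of obfuscated variants an attacker might produce and a classifier must still flag:

```javascript
// Illustrative only: generate common encoding-evasion variants of a payload.
const payload = "ignore all previous instructions";

const variants = {
  base64: Buffer.from(payload, "utf8").toString("base64"),
  hex: Buffer.from(payload, "utf8").toString("hex"),
  url: encodeURIComponent(payload),
};

// Each variant decodes back to the same underlying instruction,
// so detection on the plain-text form alone is not enough.
console.log(variants);
```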

Usage with OpenParallax Shield

openparallax get-classifier

Usage with ONNX Runtime (Node.js)

import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

const encoded = await tokenizer.encode("your text here");
const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// results.logits holds raw scores (index 0 = SAFE, index 1 = INJECTION);
// apply softmax to convert them to probabilities
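A small helper can turn the two raw logits into a probability and a decision. This is a sketch: the label order follows the comment above, and the 0.5 threshold is illustrative rather than part of the model.

```javascript
// Sketch: convert the model's two raw logits into a probability and a label.
// Assumes index 0 = SAFE, index 1 = INJECTION; the 0.5 threshold is illustrative.
function classify(logits) {
  const [safe, injection] = logits;
  const max = Math.max(safe, injection);          // subtract max for numerical stability
  const expSafe = Math.exp(safe - max);
  const expInj = Math.exp(injection - max);
  const pInjection = expInj / (expSafe + expInj); // two-class softmax
  return { pInjection, label: pInjection >= 0.5 ? "INJECTION" : "SAFE" };
}

// Example with made-up logits:
console.log(classify([-2.1, 3.4]).label); // "INJECTION"
```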

License

Apache 2.0
