OpenParallax Shield Classifier v1

Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.

Performance

Tested against 321 adversarial payloads across 6 attack categories:

| Metric          | Pre-trained | Fine-tuned |
|-----------------|-------------|------------|
| Accuracy        | 77.6%       | 98.8%      |
| False negatives | 71          | 4          |
| False positives | 1           | 0          |
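The aggregate accuracies are consistent with the raw error counts: out of 321 payloads, the pre-trained model makes 72 errors (71 false negatives + 1 false positive) and the fine-tuned model makes 4. A quick sanity check of the arithmetic:

```javascript
// Verify the reported accuracies from the error counts in the table above.
const total = 321;

const pretrainedErrors = 71 + 1; // false negatives + false positives
const pretrainedAcc = (total - pretrainedErrors) / total;
console.log(pretrainedAcc.toFixed(3)); // "0.776"

const finetunedErrors = 4 + 0;
const finetunedAcc = (total - finetunedErrors) / total;
console.log(finetunedAcc.toFixed(3)); // "0.988"
```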

Per-Category Results

| Category           | Pre-trained | Fine-tuned |
|--------------------|-------------|------------|
| Encoding evasion   | 51.3%       | 100%       |
| Shell injection    | 73.3%       | 100%       |
| Authority spoofing | 82.1%       | 100%       |
| Path traversal     | 64.0%       | 96.0%      |
| Data exfiltration  | 86.1%       | 100%       |
| Prompt injection   | 92.8%       | 97.9%      |

Training

Optimized for detecting injection attacks involving:

  • Tool call arguments (file paths, shell commands, HTTP requests)
  • Authority spoofing ("system override", "admin approved", tool impersonation)
  • Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
  • Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)
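To make the "encoding evasion" category concrete, here is an illustrative sketch (the payload string is a made-up example, not from the training set) of the kinds of obfuscated variants an attacker might produce and a classifier must still flag:

```javascript
// Illustrative only: generate common encoding-evasion variants of a payload.
const payload = "ignore all previous instructions";

const variants = {
  base64: Buffer.from(payload, "utf8").toString("base64"),
  hex: Buffer.from(payload, "utf8").toString("hex"),
  url: encodeURIComponent(payload),
};

// Each variant decodes back to the same underlying instruction,
// so detection on the plain-text form alone is not enough.
console.log(variants);
```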

Usage with OpenParallax Shield

openparallax get-classifier

Usage with ONNX Runtime (Node.js)

import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";

const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");

const encoded = await tokenizer.encode("your text here");
const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);

const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// results.logits holds raw scores (index 0 = SAFE, index 1 = INJECTION);
// apply softmax to convert them to probabilities
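A small helper can turn the two raw logits into a probability and a decision. This is a sketch: the label order follows the comment above, and the 0.5 threshold is illustrative rather than part of the model.

```javascript
// Sketch: convert the model's two raw logits into a probability and a label.
// Assumes index 0 = SAFE, index 1 = INJECTION; the 0.5 threshold is illustrative.
function classify(logits) {
  const [safe, injection] = logits;
  const max = Math.max(safe, injection);          // subtract max for numerical stability
  const expSafe = Math.exp(safe - max);
  const expInj = Math.exp(injection - max);
  const pInjection = expInj / (expSafe + expInj); // two-class softmax
  return { pInjection, label: pInjection >= 0.5 ? "INJECTION" : "SAFE" };
}

// Example with made-up logits:
console.log(classify([-2.1, 3.4]).label); // "INJECTION"
```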

License

Apache 2.0
