OpenParallax Shield Classifier v1
Fine-tuned DeBERTa-v3-base for prompt injection detection in AI agent tool calls.
Performance
Tested against 321 adversarial payloads across 6 attack categories:
| Metric | Pre-trained | Fine-tuned |
|---|---|---|
| Accuracy | 77.6% | 98.8% |
| False negatives | 71 | 4 |
| False positives | 1 | 0 |
Per-Category Results
| Category | Pre-trained | Fine-tuned |
|---|---|---|
| Encoding evasion | 51.3% | 100% |
| Shell injection | 73.3% | 100% |
| Authority spoofing | 82.1% | 100% |
| Path traversal | 64.0% | 96.0% |
| Data exfiltration | 86.1% | 100% |
| Prompt injection | 92.8% | 97.9% |
Training
- Base model: ProtectAI/deberta-v3-base-prompt-injection-v2
- Training data: 6,787 samples (red-team payloads + agent-specific benign actions + NeurAlchemy dataset)
- Epochs: 3
- Hardware: Google Colab T4 GPU
Optimized for detecting injections in:
- Tool call arguments (file paths, shell commands, HTTP requests)
- Authority spoofing ("system override", "admin approved", tool impersonation)
- Encoding evasion (base64, hex, URL encoding, Unicode homoglyphs, bidirectional text)
- Multilingual injection (Spanish, Chinese, Russian, Arabic, Japanese, Korean, and more)
Usage with OpenParallax Shield
openparallax get-classifier
Usage with ONNX Runtime (Node.js)
import * as ort from "onnxruntime-node";
import { Tokenizer } from "tokenizers";
const session = await ort.InferenceSession.create("model.onnx");
const tokenizer = Tokenizer.fromFile("tokenizer.json");
const encoded = await tokenizer.encode("your text here");
const inputIds = new ort.Tensor("int64", BigInt64Array.from(encoded.getIds().map(BigInt)), [1, encoded.getIds().length]);
const attentionMask = new ort.Tensor("int64", BigInt64Array.from(encoded.getAttentionMask().map(BigInt)), [1, encoded.getAttentionMask().length]);
const results = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
// logits[0] = SAFE probability, logits[1] = INJECTION probability
License
Apache 2.0