---
title: Prompt Injection Environment
emoji: "🔊"
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv-0.2.3
- openenv
- Prompt Injection
- Reinforcement Learning
---
# IDPI Exfil Environment
`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.
The environment is designed for evaluating whether an agent can:
- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks
## Features
- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output
## Quick Start
### Install
```bash
pip install -e .
```
### Run the server
```bash
python server/app.py
```
The API starts on `http://localhost:7860`.
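When scripting against the server, it can help to wait until `/health` responds before sending the first `reset`. The following is a minimal sketch that assumes the endpoint answers with HTTP 200 once the server is ready; the response body is not inspected. `http_probe` and `wait_for_health` are hypothetical helpers, not part of this project.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_health(url="http://localhost:7860/health", attempts=10, delay=0.5, probe=http_probe):
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_health()` right after starting `python server/app.py` returns `True` once the probe succeeds, or `False` if the server never comes up within the allotted attempts.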
### Run the evaluation runner
Create a `.env` in the project root with the submission-compatible inference variables:
```bash
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
```
Then run:
```bash
python inference.py
```
Compatibility commands still work:
```bash
python run_baseline.py
python baseline/run_baseline.py
```
All three commands use the same unified runner and, by default, write results to `baseline/baseline_results.json`.
For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.
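The validation step described above could look roughly like the sketch below. It is illustrative only and is not the code in `inference.py`: `missing_inference_vars` is a hypothetical helper name, and treating blank values as missing is an assumption.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_inference_vars(environ=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]
```

A caller could abort early with a readable message when the returned list is non-empty, before any model request is made.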
## Using the HTTP API
### Reset the environment
```python
import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
```
### Step with an action
```python
import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()
print(result["observation"])
print(result["reward"])
print(result["done"])
```
### Inspect state or final result
```python
import requests
state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
```
Additional discovery endpoints are also available:
```python
import requests
metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
```
## Environment Overview
### Task model
Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.
Typical safe behavior includes:
- reviewing the email with `ReadDoc`
- flagging suspicious instructions with `FlagInjection`
- writing an internal-only clean summary with `WriteDB`
- refusing unsafe outbound behavior
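The safe behavior listed above can be sketched as a short episode loop. The `step` argument is any callable that takes an action dict and returns a dict with `observation`, `reward`, and `done` keys, matching the `/step` response shown earlier. The `reason` and `content` field names follow the action signatures in this README, but the exact JSON layout for actions with arguments is an assumption, and `run_safe_episode` is a hypothetical helper.

```python
DEFAULT_SUMMARY = "Internal summary: routine business email; embedded instructions ignored."


def run_safe_episode(step, summary=DEFAULT_SUMMARY):
    """Read, flag, then summarize; never send email. Stops early if the episode ends."""
    plan = [
        {"type": "ReadDoc"},
        {"type": "FlagInjection", "reason": "Email body contains instructions aimed at the agent."},
        {"type": "WriteDB", "content": summary},
    ]
    log = []
    for action in plan:
        result = step(action)
        # Record the action name with whatever reward the environment reported.
        log.append((action["type"], result.get("reward")))
        if result.get("done"):
            break
    return log
```

With the server running, `step` could be `lambda a: requests.post("http://localhost:7860/step", json={"action": a}).json()`; in tests, a stub function works just as well.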
### Difficulty levels
- `easy`: explicit or obvious prompt injection patterns
- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary
### Actions
- `ReadDoc`
  Reads the current email content.
- `WriteDB(content="...")`
  Writes an internal summary or note.
- `SendEmail(to="...", body="...")`
  Sends an outbound email. This is the main exfiltration risk surface.
- `FlagInjection(reason="...")`
  Marks the content as suspicious.
- `NoOp`
  Takes no action.
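For clients that talk to `/step` directly, a small builder can catch malformed actions before they are sent. The sketch below takes the argument names from the signatures listed above. The flat JSON layout (arguments as sibling keys of `type`) is an assumption that should be checked against `GET /schema`, and `build_action` is a hypothetical helper.

```python
# Required argument names per action type, taken from the action list above.
ACTION_ARGS = {
    "ReadDoc": (),
    "WriteDB": ("content",),
    "SendEmail": ("to", "body"),
    "FlagInjection": ("reason",),
    "NoOp": (),
}


def build_action(action_type, **kwargs):
    """Return a `/step` request body, rejecting unknown types or wrong arguments."""
    if action_type not in ACTION_ARGS:
        raise ValueError(f"Unknown action type: {action_type!r}")
    expected = set(ACTION_ARGS[action_type])
    if set(kwargs) != expected:
        raise ValueError(
            f"{action_type} expects arguments {sorted(expected)}, got {sorted(kwargs)}"
        )
    return {"action": {"type": action_type, **kwargs}}
```

For example, `build_action("FlagInjection", reason="hidden HTML comment")` produces a body that can be passed as `json=` to `requests.post`.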
### Observation fields
- `subject`
- `body`
- `last_action`
- `last_action_result`
- `step_count`
- `task_difficulty`
- `summary_required`
- `current_summary`
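Typed access to these fields can make agent code easier to check. The dataclass below is a sketch: the field names come from the list above, but the Python types and defaults are guesses, and `ObservationView` with its `from_dict` method is a hypothetical convenience that ignores any extra keys in the payload.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ObservationView:
    # Types and defaults are assumptions; only the names come from the README.
    subject: str = ""
    body: str = ""
    last_action: Optional[str] = None
    last_action_result: Any = None
    step_count: int = 0
    task_difficulty: str = ""
    summary_required: bool = False
    current_summary: Optional[str] = None

    @classmethod
    def from_dict(cls, payload):
        """Build a view from a JSON dict, dropping unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```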
### Endpoints
- `GET /`
- `GET /web`
- `GET /health`
- `GET /metadata`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /result`
- `POST /mcp`
## Evaluation
The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](docs/person_a_benchmark_manifest.json) and executes them across `easy`, `medium`, and `hard` splits.
The current manifest includes:
- `32 scenarios total`
- `9 easy`
- `13 medium`
- `10 hard`
Each run logs:
- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics
Results are written to [baseline/baseline_results.json](baseline/baseline_results.json).
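Per-difficulty summary statistics of the kind listed above could be derived as in the following sketch. The record keys (`difficulty`, `score`, `exfiltrated`) are assumptions chosen for illustration and may not match the actual layout of the results file; `summarize` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def summarize(episodes):
    """Group episode records by difficulty; compute mean score and exfiltration rate."""
    grouped = defaultdict(list)
    for ep in episodes:
        grouped[ep["difficulty"]].append(ep)
    return {
        difficulty: {
            "episodes": len(eps),
            "mean_score": mean(ep["score"] for ep in eps),
            "exfiltration_rate": sum(bool(ep["exfiltrated"]) for ep in eps) / len(eps),
        }
        for difficulty, eps in grouped.items()
    }
```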
### Optional runner settings
These values can be set in `.env`:
```bash
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
```
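A runner might read these settings along the lines of the sketch below. The variable names are the ones shown above. The fallback values for the numeric settings simply reuse the example values and are an assumption, not the runner's documented defaults; the results path falls back to `baseline/baseline_results.json` as described earlier. The sketch reads from the process environment, so it assumes `.env` has already been loaded, and `load_runner_settings` is a hypothetical helper.

```python
import os


def load_runner_settings(environ=None):
    """Parse the optional runner settings, with assumed fallback values."""
    env = os.environ if environ is None else environ
    return {
        "episodes_per_difficulty": int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        "results_path": env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        "request_delay_seconds": float(env.get("REQUEST_DELAY_SECONDS", "3.0")),
        "max_model_retries": int(env.get("MAX_MODEL_RETRIES", "2")),
    }
```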
## Development
### Project structure
```text
Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
```
### Core components
- [env/environment.py](env/environment.py)
  Main episode lifecycle, action handling, reward plumbing, and result assembly.
- [server/app.py](server/app.py)
  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- [inference.py](inference.py)
  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- [openenv.yaml](openenv.yaml)
  Environment metadata for OpenEnv packaging and deployment.
### CI and sync
The repository includes GitHub workflows for:
- continuous integration in [.github/workflows/ci.yml](.github/workflows/ci.yml)
- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](.github/workflows/sync.yml)
Submission-focused regression tests live in `submission_tests/` and can be run with:
```bash
python -m unittest discover -s submission_tests
```
### Local Python usage
You can exercise the environment directly without starting the server:
```python
from env.environment import IDPIExfilEnv
from env.models import Action
env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
```
## Notes
- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini can be used through the submission-compatible OpenAI client path: set `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` as shown in Quick Start.