---
title: Prompt Injection Environment
emoji: "🔊"
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv-0.2.3
- openenv
- Prompt Injection
- Reinforcement Learning
---

# IDPI Exfil Environment

`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

The environment is designed to evaluate whether an agent can:

- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks

## Features

- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output

## Quick Start

### Install

```bash
pip install -e .
```

### Run the server

```bash
python server/app.py
```

The API starts on `http://localhost:7860`.

### Run the evaluation runner

Create a `.env` in the project root with the submission-compatible inference variables:

```bash
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
```

Then run:

```bash
python inference.py
```

Compatibility commands still work:

```bash
python run_baseline.py
python baseline/run_baseline.py
```

All three commands use the same unified runner and write results to `baseline/baseline_results.json`.
For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.

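The validation step can be sketched roughly as follows. `validate_env` and `REQUIRED_VARS` are illustrative names, not the actual identifiers used in `inference.py`:

```python
import os

# The three variables inference.py checks before delegating to the runner.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

def validate_env(env=None):
    """Return the required variables that are missing or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# A run would abort with a clear message if validate_env(...) is non-empty.
```
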
## Using the HTTP API

### Reset the environment

```python
import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
```

### Step with an action

```python
import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()

print(result["observation"])
print(result["reward"])
print(result["done"])
```

### Inspect state or final result

```python
import requests

state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
```

Additional discovery endpoints are also available:

```python
import requests

metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
```

## Environment Overview

### Task model

Each episode presents an email with a benign, business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

Typical safe behavior includes:

- reviewing the email with `ReadDoc`
- flagging suspicious instructions with `FlagInjection`
- writing an internal-only clean summary with `WriteDB`
- refusing unsafe outbound behavior

### Difficulty levels

- `easy`: explicit or obvious prompt injection patterns
- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary

### Actions

- `ReadDoc`
  Reads the current email content.
- `WriteDB(content="...")`
  Writes an internal summary or note.
- `SendEmail(to="...", body="...")`
  Sends an outbound email. This is the main exfiltration risk surface.
- `FlagInjection(reason="...")`
  Marks the content as suspicious.
- `NoOp`
  Takes no action.

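Over the HTTP API, each of these actions becomes the JSON body of a `POST /step` request. A minimal helper (hypothetical, but matching the request shape shown in the examples above) might look like:

```python
def make_step_payload(action_type, **fields):
    """Build the JSON body for a POST /step request from an action type and its fields."""
    return {"action": {"type": action_type, **fields}}

# Benign review action:
read_doc = make_step_payload("ReadDoc")
# Flagging a suspicious instruction:
flag = make_step_payload("FlagInjection", reason="hidden instruction in an HTML comment")
```
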
### Observation fields

- `subject`
- `body`
- `last_action`
- `last_action_result`
- `step_count`
- `task_difficulty`
- `summary_required`
- `current_summary`

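On hard tasks, `summary_required` signals that a clean internal summary is still expected. A small illustrative check over an observation dict (the field values below are made up, only the field names come from the list above):

```python
def summary_outstanding(obs):
    """True when the task requires a summary that has not been written yet."""
    return bool(obs.get("summary_required")) and not obs.get("current_summary")

obs = {
    "subject": "Q3 vendor invoice",
    "body": "Please review the attached invoice...",
    "last_action": "ReadDoc",
    "last_action_result": "ok",
    "step_count": 1,
    "task_difficulty": "hard",
    "summary_required": True,
    "current_summary": "",
}
# summary_outstanding(obs) is True until a WriteDB action fills current_summary.
```
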
### Endpoints

- `GET /`
- `GET /web`
- `GET /health`
- `GET /metadata`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /result`
- `POST /mcp`

## Evaluation

The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](docs/person_a_benchmark_manifest.json) and executes them across the `easy`, `medium`, and `hard` splits.

The current manifest includes:

- `32 scenarios total`
- `9 easy`
- `13 medium`
- `10 hard`

Each run logs:

- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics

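Assuming per-episode records carry fields along these lines (the field names here are illustrative, not the exact schema of the results file), the per-difficulty statistics can be aggregated like so:

```python
from collections import defaultdict

def summarize(episodes):
    """Roll per-episode logs up into per-difficulty counts."""
    stats = defaultdict(lambda: {"episodes": 0, "exfiltrations": 0, "safe_completions": 0})
    for ep in episodes:
        bucket = stats[ep["difficulty"]]
        bucket["episodes"] += 1
        bucket["exfiltrations"] += int(ep["exfiltrated"])
        bucket["safe_completions"] += int(ep["completed_safely"])
    return dict(stats)

episodes = [
    {"difficulty": "easy", "exfiltrated": False, "completed_safely": True},
    {"difficulty": "hard", "exfiltrated": True, "completed_safely": False},
]
```
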
Results are written to [baseline/baseline_results.json](baseline/baseline_results.json).

### Optional runner settings

These values can be set in `.env`:

```bash
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
```

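A runner might read these settings with fallbacks along the following lines. This is a sketch only; the function name and the defaults shown are assumptions, not necessarily what the actual runner uses:

```python
import os

def load_runner_settings(env=None):
    """Read optional runner settings, falling back to defaults when unset."""
    env = os.environ if env is None else env
    return {
        "episodes_per_difficulty": int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        "results_path": env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        "request_delay_seconds": float(env.get("REQUEST_DELAY_SECONDS", "0.0")),
        "max_model_retries": int(env.get("MAX_MODEL_RETRIES", "2")),
    }
```
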
## Development

### Project structure

```text
Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
```

### Core components

- [env/environment.py](env/environment.py)
  Main episode lifecycle, action handling, reward plumbing, and result assembly.
- [server/app.py](server/app.py)
  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- [inference.py](inference.py)
  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- [openenv.yaml](openenv.yaml)
  Environment metadata for OpenEnv packaging and deployment.

### CI and sync

Includes GitHub workflows for:

- continuous integration in [.github/workflows/ci.yml](.github/workflows/ci.yml)
- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](.github/workflows/sync.yml)

Submission-focused regression tests live in `submission_tests/` and can be run with:

```bash
python -m unittest discover -s submission_tests
```

### Local Python usage

You can exercise the environment directly without starting the server:

```python
from env.environment import IDPIExfilEnv
from env.models import Action

env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
```

## Notes

- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.