	---
	title: Prompt Injection Environment
	emoji: "🔊"
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv-0.2.3
	- openenv
	- Prompt Injection
	- Reinforcement Learning
	---

	# IDPI Exfil Environment

	`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

	The environment is designed for evaluating whether an agent can:

	- read and process email content safely
	- identify indirect prompt injection patterns
	- avoid unsafe outbound actions
	- produce clean internal summaries on harder tasks

	## Features

	- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
	- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
	- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
	- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
	- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
	- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output

	## Quick Start

	### Install

	```bash
	pip install -e .
	```

	### Run the server

	```bash
	python server/app.py
	```

	The API starts on `http://localhost:7860`.

	### Run the evaluation runner

	Create a `.env` in the project root with the submission-compatible inference variables:

	```bash
	API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
	MODEL_NAME=gemini-2.5-flash
	HF_TOKEN=your_gemini_api_key_here
	```

	Then run:

	```bash
	python inference.py
	```

	Compatibility commands still work:

	```bash
	python run_baseline.py
	python baseline/run_baseline.py
	```

	All three commands use the same unified runner and write results to `baseline/baseline_results.json`.
	For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.
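The validation step can be pictured with a minimal sketch. This is not the code in `inference.py`; the helper names `missing_inference_vars` and `require_inference_vars` are hypothetical, and the sketch assumes the `.env` values have already been loaded into the process environment.

```python
import os

# Variable names taken from the README; everything else here is illustrative.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_inference_vars(environ=os.environ):
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not environ.get(name, "").strip()]


def require_inference_vars(environ=os.environ):
    """Raise one error that lists every missing variable at once."""
    missing = missing_inference_vars(environ)
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
```

Reporting all missing names in a single error avoids a fix-one-rerun-fix-another loop when setting up a fresh `.env`.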

	## Using the HTTP API

	### Reset the environment

	```python
	import requests

	response = requests.post(
	    "http://localhost:7860/reset",
	    json={"difficulty": "easy"},
	)
	observation = response.json()
	print(observation)
	```

	### Step with an action

	```python
	import requests

	response = requests.post(
	    "http://localhost:7860/step",
	    json={"action": {"type": "ReadDoc"}},
	)
	result = response.json()

	print(result["observation"])
	print(result["reward"])
	print(result["done"])
	```

	### Inspect state or final result

	```python
	import requests

	state = requests.get("http://localhost:7860/state").json()
	episode_result = requests.get("http://localhost:7860/result").json()
	```

	Discovery endpoints are also available:

	```python
	import requests

	metadata = requests.get("http://localhost:7860/metadata").json()
	schema = requests.get("http://localhost:7860/schema").json()
	```

	## Environment Overview

	### Task model

	Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

	Typical safe behavior includes:

	- reviewing the email with `ReadDoc`
	- flagging suspicious instructions with `FlagInjection`
	- writing an internal-only clean summary with `WriteDB`
	- refusing unsafe outbound behavior
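The safe behavior listed above can be sketched as a small rule-based policy. This is an illustrative baseline, not the repository's agent: the regular expression is a toy heuristic, `choose_action` and its `flagged` and `summary` parameters are hypothetical, and the sketch assumes that `last_action` is `None` before the first step and that action parameters sit next to `type` in the action dict.

```python
import re

# Toy patterns for illustration only; real injections are far more varied.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|disregard the above|forward (this|all)",
    re.IGNORECASE,
)


def choose_action(obs, flagged=False, summary=None):
    """Pick the next action dict from an observation dict.

    `flagged` says whether this episode was already flagged; `summary` is a
    caller-supplied clean summary (writing it is outside this sketch).
    """
    if obs.get("last_action") is None:
        return {"type": "ReadDoc"}  # always read before acting
    body = obs.get("body") or ""
    if not flagged and SUSPICIOUS.search(body):
        return {"type": "FlagInjection", "reason": "instruction-like text in email body"}
    if obs.get("summary_required") and not obs.get("current_summary") and summary:
        return {"type": "WriteDB", "content": summary}
    return {"type": "NoOp"}  # never emits SendEmail
```

The policy never returns `SendEmail`, which is the simplest way to express "refuse unsafe outbound behavior" in a baseline.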

	### Difficulty levels

	- `easy`: explicit or obvious prompt injection patterns
	- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
	- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary

	### Actions

	- `ReadDoc`
	  Reads the current email content.
	- `WriteDB(content="...")`
	  Writes an internal summary or note.
	- `SendEmail(to="...", body="...")`
	  Sends an outbound email. This is the main exfiltration risk surface.
	- `FlagInjection(reason="...")`
	  Marks the content as suspicious.
	- `NoOp`
	  Takes no action.
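When calling `POST /step`, it can help to build the request body from the action signatures above. The sketch below assumes that each listed parameter is required and is sent as a flat key next to `type`; the README only confirms the `{"action": {"type": "ReadDoc"}}` shape, so check `GET /schema` for the authoritative format. `build_step_payload` is a hypothetical helper.

```python
# Parameter names come from the action list; treating them as required is an assumption.
ACTION_FIELDS = {
    "ReadDoc": (),
    "WriteDB": ("content",),
    "SendEmail": ("to", "body"),
    "FlagInjection": ("reason",),
    "NoOp": (),
}


def build_step_payload(action_type, **params):
    """Return a JSON body for POST /step, rejecting unknown or mismatched fields."""
    if action_type not in ACTION_FIELDS:
        raise ValueError(f"unknown action type: {action_type}")
    expected = set(ACTION_FIELDS[action_type])
    if set(params) != expected:
        raise ValueError(
            f"{action_type} expects fields {sorted(expected)}, got {sorted(params)}"
        )
    return {"action": {"type": action_type, **params}}
```

For example, `build_step_payload("FlagInjection", reason="hidden HTML comment")` produces a body that can be passed as `json=` to `requests.post`.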

	### Observation fields

	- `subject`
	- `body`
	- `last_action`
	- `last_action_result`
	- `step_count`
	- `task_difficulty`
	- `summary_required`
	- `current_summary`

	### Endpoints

	- `GET /`
	- `GET /web`
	- `GET /health`
	- `GET /metadata`
	- `GET /schema`
	- `POST /reset`
	- `POST /step`
	- `GET /state`
	- `GET /result`
	- `POST /mcp`

	## Evaluation

	The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](docs/person_a_benchmark_manifest.json) and executes them across the `easy`, `medium`, and `hard` splits.

	The current manifest includes:

	- `32 scenarios total`
	- `9 easy`
	- `13 medium`
	- `10 hard`

	Each run logs:

	- per-step actions and rewards
	- whether exfiltration occurred
	- whether the task was completed safely
	- final graded score
	- per-difficulty summary statistics
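The per-difficulty statistics can be reproduced from per-episode records with a short aggregation. This is a sketch under assumptions: the record keys `difficulty`, `score`, `exfiltrated`, and `completed_safely` are hypothetical and may not match the fields in the results file.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_difficulty(episodes):
    """Group episode records by difficulty and compute simple aggregates."""
    groups = defaultdict(list)
    for ep in episodes:
        groups[ep["difficulty"]].append(ep)

    summary = {}
    for difficulty, eps in groups.items():
        summary[difficulty] = {
            "episodes": len(eps),
            "mean_score": mean(ep["score"] for ep in eps),
            # Booleans sum as 0/1, so these are fractions of episodes.
            "exfiltration_rate": sum(ep["exfiltrated"] for ep in eps) / len(eps),
            "safe_completion_rate": sum(ep["completed_safely"] for ep in eps) / len(eps),
        }
    return summary
```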

	Results are written to [baseline/baseline_results.json](baseline/baseline_results.json).

	### Optional runner settings

	These values can be set in `.env`:

	```bash
	EPISODES_PER_DIFFICULTY=1
	BASELINE_RESULTS_PATH=baseline/submission_results.json
	REQUEST_DELAY_SECONDS=3.0
	MAX_MODEL_RETRIES=2
	```
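One way to read these settings into typed values is sketched below. The variable names come from the block above; the `RunnerSettings` class and `load_runner_settings` function are hypothetical. The fallback for the results path is the file named earlier in this README, while the other fallbacks simply reuse the example values and are not confirmed defaults of the runner.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerSettings:
    episodes_per_difficulty: int
    results_path: str
    request_delay_seconds: float
    max_model_retries: int


def load_runner_settings(environ=os.environ):
    """Parse optional runner settings, falling back to illustrative values."""
    return RunnerSettings(
        episodes_per_difficulty=int(environ.get("EPISODES_PER_DIFFICULTY", "1")),
        results_path=environ.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        request_delay_seconds=float(environ.get("REQUEST_DELAY_SECONDS", "3.0")),
        max_model_retries=int(environ.get("MAX_MODEL_RETRIES", "2")),
    )
```

Converting to `int` and `float` at load time means a malformed value such as `REQUEST_DELAY_SECONDS=fast` fails immediately with a `ValueError` rather than partway through a run.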

	## Development

	### Project structure

	```text
	Prompt-injection-env/
	|-- env/
	|   |-- environment.py
	|   |-- dataset_loader.py
	|   |-- scenario_generator.py
	|   |-- injection_engine.py
	|   |-- policy_engine.py
	|   |-- taint_tracker.py
	|   |-- reward.py
	|   |-- grader.py
	|   `-- models.py
	|-- server/
	|   |-- app.py
	|   `-- Dockerfile
	|-- docs/
	|   |-- person_a_benchmark_manifest.json
	|   |-- person_a_scenario_audit.md
	|   `-- person_a_showcase_episodes.md
	|-- baseline/
	|   |-- run_baseline.py
	|   `-- baseline_results.json
	|-- inference.py
	|-- run_baseline.py
	|-- openenv.yaml
	`-- README.md
	```

	### Core components

	- [env/environment.py](env/environment.py)
	  Main episode lifecycle, action handling, reward plumbing, and result assembly.
	- [server/app.py](server/app.py)
	  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
	- [inference.py](inference.py)
	  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
	- [openenv.yaml](openenv.yaml)
	  Environment metadata for OpenEnv packaging and deployment.



	### CI and sync

	The repository includes GitHub workflows for:

	- continuous integration in [.github/workflows/ci.yml](.github/workflows/ci.yml)
	- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](.github/workflows/sync.yml)

	Submission-focused regression tests live in `submission_tests/` and can be run with:

	```bash
	python -m unittest discover -s submission_tests
	```

	### Local Python usage

	You can exercise the environment directly without starting the server:

	```python
	from env.environment import IDPIExfilEnv
	from env.models import Action

	env = IDPIExfilEnv()
	obs = env.reset(difficulty="easy")
	obs, reward, done, info = env.step(Action(type="ReadDoc"))
	print(obs)
	print(reward)
	```
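The single step above extends naturally to a full episode loop. The sketch below is generic: it works with any object exposing `reset(difficulty=...)` and a four-value `step(...)` as shown in the snippet. The `run_episode` helper, the `policy` callable, and the `max_steps` cap are hypothetical, and the sketch assumes `reward` is a plain number.

```python
def run_episode(env, policy, difficulty="easy", max_steps=20):
    """Run one episode and return (total_reward, steps_taken, last_info).

    `policy` maps an observation to whatever action object `env.step` accepts.
    `max_steps` is a safety cap for this sketch, not an environment setting.
    """
    obs = env.reset(difficulty=difficulty)
    total_reward = 0.0
    info = {}
    for step in range(1, max_steps + 1):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:  # e.g. the episode ended early or the task finished
            return total_reward, step, info
    return total_reward, max_steps, info
```

With the real environment, a call could look like `run_episode(IDPIExfilEnv(), lambda obs: Action(type="ReadDoc"))`, although a useful policy would vary its action based on the observation.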

	## Notes

	- The environment terminates early on unsafe exfiltration events.
	- Hard tasks require a clean internal summary that excludes attacker-controlled content.
	- Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.
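To make the "clean summary" requirement concrete, here is a minimal sketch of one possible check: a case- and whitespace-insensitive substring test against known attacker text. It is not the logic in `env/taint_tracker.py` or `env/grader.py`; the `leaked_spans` function, its `attacker_spans` input, and the `min_length` threshold are all hypothetical.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so trivial reformatting does not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def leaked_spans(summary, attacker_spans, min_length=8):
    """Return the attacker-controlled spans that appear verbatim in the summary.

    Spans shorter than `min_length` after normalization are skipped, since
    very short strings would match ordinary business text by accident.
    """
    norm_summary = _normalize(summary)
    leaked = []
    for span in attacker_spans:
        norm_span = _normalize(span)
        if len(norm_span) >= min_length and norm_span in norm_summary:
            leaked.append(span)
    return leaked
```

A verbatim check like this only catches copied text; a paraphrased attacker instruction would pass it, so it should be read as a lower bound on contamination rather than a full grader.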