Ghostexec: A tiny chief-of-staff simulator for agents
Demo (<2 min): [Ghostexec on YouTube](https://youtu.be/g4IFZMEzfO8)
The messy day, in one paragraph
Picture a morning where everything arrives at once: a board member’s email, a double-booked calendar, a message from home about dinner, a report due at noon, and a teammate who is already annoyed. You cannot “solve” that with a single summary. You sequence decisions—who gets a reply, what gets rescheduled, what gets delegated—and each move changes how stressed you are, how people feel, and whether real work moves forward.
Ghostexec is a small, trainable environment that captures that feeling. It is built on OpenEnv so agents, researchers, and judges can talk to it through a standard HTTP and WebSocket API, run it on a public Hugging Face Space, and plug it into a real RL or preference-optimization loop.
What Ghostexec actually is
Ghostexec is an OpenEnv-compatible “AI chief of staff” simulator:
- Inbox, calendar, contacts, tasks, and stakeholder moods live in JSON scenarios under `scenarios/` — the story is data, not hardcoded prose in Python.
- Each step, the policy sees a plain-text briefing (the same kind of wall-of-text a human assistant might scan), not a raw dump of the whole world object.
- The agent returns one structured action per step — for example `reply_email`, `reschedule_meeting`, `complete_task`, or `delegate_task` — with fields validated against a schema (see the payload sketch just after this list).
- Invalid actions do not crash the server. They return a controlled signal so learning (or evaluation) can continue.
- `do_nothing` is penalised, so “freeze and hope it goes away” is not a free winning strategy when fires are burning.
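For a concrete sense of what one structured action looks like, here is a hypothetical payload; the field names (`action_type`, `email_id`, `body`) are illustrative assumptions, so check the schema on `/docs` for the authoritative shape:

```python
# Hypothetical reply_email action in the shape described above.
# Field names are illustrative; the Space's /docs schema is authoritative.
action = {
    "action_type": "reply_email",
    "email_id": "email_0042",  # must be an id that actually appears in the briefing
    "body": "Thanks for flagging this. I'll move our 1:1 and send the report by noon.",
}

# A bad id here should not crash the server: the environment answers with a
# controlled "invalid action" signal and the episode keeps going.
```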
Under the hood, the simulation advances time, moods, and conflicts, and optional drift events in the JSON can reshuffle the situation mid-episode so the agent is tested on adaptation, not memorizing the first screen.
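To make the drift idea concrete, a scenario file could carry a mid-episode event next to the initial inbox and calendar. This is a sketch under assumed keys, not the repo’s actual schema:

```python
# Sketch of a scenarios/*.json entry with a drift event (all keys are assumptions).
scenario = {
    "inbox": [{"id": "email_0042", "from": "board_member", "urgency": "high"}],
    "calendar": [{"id": "mtg_07", "title": "1:1 with Sam", "conflict": True}],
    "drift_events": [
        {
            "at_step": 4,  # fires partway through the episode
            "kind": "meeting_moved",
            "payload": {"meeting_id": "mtg_07", "new_slot": "15:00"},
        }
    ],
}
```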
Why OpenEnv?
OpenEnv gives us a shared contract: reset, step, schema, health, WebSocket sessions, and tooling to validate and ship environments. Our manifest is in openenv.yaml (environment name ghostexec, three tasks with graders, FastAPI app entrypoint). That keeps the submission inspectable and reproducible — judges can open the Space, read the repo, and run tests locally with the same entrypoints we use in Docker.
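As a rough illustration of that contract, a smoke test can stay in plain HTTP. The endpoint paths and payload shape below are assumptions inferred from the description above, and the host assumes the local quick-start server described further down:

```python
import requests

BASE = "http://localhost:8000"  # or the public Space host, once deployed

# Health check, then reset for the first briefing, then one step.
print(requests.get(f"{BASE}/health", timeout=10).json())
obs = requests.post(f"{BASE}/reset", json={}, timeout=30).json()
result = requests.post(
    f"{BASE}/step",
    json={"action": {"action_type": "do_nothing"}},  # illustrative action shape
    timeout=30,
).json()
print(result.get("reward"), result.get("done"))
```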
Rewards: teach the model, certify the run
We use two layers of feedback, on purpose:
- Dense step reward (in `server/reward.py`) blends conflict, relationship, and task progress with fixed weights 0.35 / 0.35 / 0.30, plus bounded shaping and explicit handling of invalid or idle steps. That signal is what you want when training with modern RL or GRPO-style methods.
- Trajectory graders (in `graders.py`, wired in `openenv.yaml`) produce bounded scores for three tasks — easy, medium, and hard scenarios — so hackathon certification stays in a well-defined range.
Together: the model can learn from rich per-step feedback, while organizers can score full trajectories against clear tasks.
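As a minimal sketch of how such a blended step reward could be computed (component scores, penalty values, and clipping bounds here are hypothetical; the real logic lives in `server/reward.py`):

```python
# Illustrative blend with the fixed weights quoted above
# (0.35 conflict, 0.35 relationship, 0.30 task progress).
# Penalty and clipping values are made up for the example.
def step_reward(conflict: float, relationship: float, task_progress: float,
                invalid: bool = False, idle: bool = False) -> float:
    if invalid:
        return -0.5  # controlled penalty instead of a crash
    reward = 0.35 * conflict + 0.35 * relationship + 0.30 * task_progress
    if idle:
        reward -= 0.2  # do_nothing costs something while fires are burning
    return max(-1.0, min(1.0, reward))  # keep the shaped signal bounded
```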
Try it in sixty seconds
Short demo video: https://youtu.be/g4IFZMEzfO8
Live Space (public): https://huggingface.co/spaces/modelbuilderhq/ghostexec
- Open `/docs` on the Space for the interactive API, or `/web` for the OpenEnv playground.
- Full README (formatted, with tables and deep links): https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md
Local quick start (from repo root):
```bash
uv sync
uv run server --port 8000
```
Then use GhostexecEnv from client.py (WebSocket session) for multi-step episodes, or raw HTTP if you only need a smoke test. The README’s “Quick start” section has a copy-paste Python snippet.
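A multi-step episode with the WebSocket client might look roughly like this. The constructor argument and the `reset` / `step` / `close` method names on `GhostexecEnv` are assumptions, so treat it as a sketch and defer to `client.py` and the README snippet:

```python
from client import GhostexecEnv  # assumed import path, per the README

def my_policy(briefing: str) -> dict:
    # Placeholder policy for the sketch: always send one hard-coded reply.
    return {"action_type": "reply_email", "email_id": "email_0042", "body": "On it."}

env = GhostexecEnv("ws://localhost:8000")  # assumed constructor signature
obs = env.reset()                          # obs: the plain-text briefing

done, episode_return = False, 0.0
while not done:
    obs, reward, done, info = env.step(my_policy(obs))  # assumed return shape
    episode_return += reward

env.close()
print("episode return:", episode_return)
```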
Source: mirror the repo on GitHub if you prefer; the canonical hackathon artifact is the Space plus the repo layout described in the README.
If you only have two minutes on video
Published walkthrough: youtu.be/g4IFZMEzfO8
A tight arc that works for non-technical viewers:
- Show the briefing — scroll through the same text the model sees. Say: “This is not a quiz; it’s a shift at work.”
- One good action — e.g. a thoughtful `reply_email` with a real `email_id` from the text; show reward or mood metadata if the UI exposes it.
- One bad action — a wrong id or `do_nothing` while something urgent waits; show that the world does not crash, but the score hurts.
- One sentence on training — “We can optimize policies against this API with GRPO / TRL-style loops; graders score whole episodes for the hackathon.”
End on: Ghostexec is the busy day, compressed — so models can practice being calm, fast, and fair before anyone trusts them near a real calendar.
Theme fit (hackathon)
Ghostexec aligns naturally with personalized, high-stakes tasks: executive triage, delegation, and tradeoffs between people, calendar, and deadlines. Diverse scenarios/*.json and optional curriculum / perturbation hooks (see README) make it easy to stress-test policies without rewriting core engine code.
Where to read more
| Resource | Link |
|---|---|
| Demo video (<2 min) | [YouTube](https://youtu.be/g4IFZMEzfO8) |
| Full project README (judging sections, layout, commands) | [README on the Hub](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md) |
| Innovation-only deep dive | environment-innovation/README.md in the repo |
| OpenEnv upstream | github.com/meta-pytorch/OpenEnv |
Closing
Ghostexec is not trying to replace human assistants. It is trying to give models and researchers a credible, stressful, and kind miniature office: text that reads like work, actions that look like tools, and scores that admit tradeoffs. If that sounds useful, spin up the Space, break something on purpose, and watch the environment keep running — that resilience is part of the point.
— Ghostexec / OpenEnv submission