ghostexec / BLOG.md
modelbuilderhq's picture
Upload folder using huggingface_hub
60fe7cd verified
# Ghostexec: A tiny chief-of-staff simulator for agents
**Demo (<2 min):** [Ghostexec on YouTube](https://youtu.be/g4IFZMEzfO8)
---
## The messy day, in one paragraph
Picture a morning where everything arrives at once: a board member’s email, a double-booked calendar, a message from home about dinner, a report due at noon, and a teammate who is already annoyed. You cannot “solve” that with a single summary. You **sequence** decisions—who gets a reply, what gets rescheduled, what gets delegated—and each move changes how stressed you are, how people feel, and whether real work moves forward.
**Ghostexec** is a small, trainable environment that captures that feeling. It is built on **[OpenEnv](https://github.com/meta-pytorch/OpenEnv)** so agents, researchers, and judges can talk to it through a **standard HTTP and WebSocket API**, run it on a **public Hugging Face Space**, and plug it into a **real RL or preference-optimization loop**.
---
## What Ghostexec actually is
Ghostexec is an **OpenEnv-compatible “AI chief of staff” simulator**:
- **Inbox, calendar, contacts, tasks**, and **stakeholder moods** live in **JSON scenarios** under `scenarios/` — the story is data, not hardcoded prose in Python.
- Each step, the policy sees a **plain-text briefing** (the same kind of wall-of-text a human assistant might scan), not a raw dump of the whole world object.
- The agent returns **one structured action per step** — for example `reply_email`, `reschedule_meeting`, `complete_task`, or `delegate_task` — with fields validated against a schema.
- **Invalid actions do not crash the server.** They return a controlled signal so learning (or evaluation) can continue.
- **`do_nothing` is penalised** so “freeze and hope it goes away” is not a free winning strategy when fires are burning.
Under the hood, the simulation advances **time**, **moods**, and **conflicts**, and optional **drift events** in the JSON can reshuffle the situation mid-episode so the agent is tested on **adaptation**, not memorizing the first screen.
---
## Why OpenEnv?
[OpenEnv](https://github.com/meta-pytorch/OpenEnv) gives us a **shared contract**: reset, step, schema, health, WebSocket sessions, and tooling to **validate** and **ship** environments. Our manifest is in **`openenv.yaml`** (environment name `ghostexec`, three tasks with graders, FastAPI app entrypoint). That keeps the submission **inspectable** and **reproducible** — judges can open the Space, read the repo, and run tests locally with the same entrypoints we use in Docker.
---
## Rewards: teach the model, certify the run
We use **two layers** of feedback, on purpose:
1. **Dense step reward** (in `server/reward.py`) blends **conflict**, **relationship**, and **task** progress with fixed weights **0.35 / 0.35 / 0.30**, plus bounded shaping and explicit handling of invalid or idle steps. That signal is what you want when **training** with modern RL or GRPO-style methods.
2. **Trajectory graders** (in `graders.py`, wired in `openenv.yaml`) produce **bounded** scores for three **tasks** — easy, medium, and hard scenarios — so hackathon **certification** stays in a well-defined range.
Together: the model can **learn** from rich per-step feedback, while organizers can **score** full trajectories against clear tasks.
---
## Try it in sixty seconds
**Short demo video:** [https://youtu.be/g4IFZMEzfO8](https://youtu.be/g4IFZMEzfO8)
**Live Space (public):** [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec)
- Open **`/docs`** on the Space for the interactive API, or **`/web`** for the OpenEnv playground.
- Full README (formatted, with tables and deep links):
[https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md)
**Local quick start** (from repo root):
```bash
uv sync
uv run server --port 8000
```
Then use **`GhostexecEnv`** from `client.py` (WebSocket session) for **multi-step episodes**, or raw HTTP if you only need a smoke test. The README’s “Quick start” section has a copy-paste Python snippet.
**Source:** mirror on GitHub as you prefer; the canonical hackathon artifact is the **Space + repo** layout described in the README.
---
## If you only have two minutes on video
Published walkthrough: [**youtu.be/g4IFZMEzfO8**](https://youtu.be/g4IFZMEzfO8)
A tight arc that works for non-technical viewers:
1. **Show the briefing** — scroll through the same text the model sees. Say: “This is not a quiz; it’s a shift at work.”
2. **One good action** — e.g. a thoughtful `reply_email` with a real `email_id` from the text; show reward or mood metadata if the UI exposes it.
3. **One bad action** — wrong id or `do_nothing` while something urgent waits; show that the world **does not crash**, but the score **hurts**.
4. **One sentence on training** — “We can optimize policies against this API with GRPO / TRL-style loops; graders score whole episodes for the hackathon.”
End on: **Ghostexec is the busy day, compressed — so models can practice being calm, fast, and fair before anyone trusts them near a real calendar.**
---
## Theme fit (hackathon)
Ghostexec aligns naturally with **personalized, high-stakes tasks**: executive triage, delegation, and tradeoffs between **people**, **calendar**, and **deadlines**. Diverse **`scenarios/*.json`** and optional curriculum / perturbation hooks (see README) make it easy to **stress-test** policies without rewriting core engine code.
---
## Where to read more
| Resource | Link |
|----------|------|
| Demo video (<2 min) | [YouTube](https://youtu.be/g4IFZMEzfO8) |
| Full project README (judging sections, layout, commands) | [README on the Hub](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md) |
| Innovation-only deep dive | [`environment-innovation/README.md`](environment-innovation/README.md) in the repo |
| OpenEnv upstream | [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv) |
---
## Closing
Ghostexec is not trying to replace human assistants. It is trying to give **models and researchers** a **credible, stressful, and kind** miniature office: text that reads like work, actions that look like tools, and scores that admit **tradeoffs**. If that sounds useful, spin up the Space, break something on purpose, and watch the environment **keep running** — that resilience is part of the point.
*— Ghostexec / OpenEnv submission*