Ghostexec: A tiny chief-of-staff simulator for agents
Demo (<2 min): [Ghostexec on YouTube](https://youtu.be/g4IFZMEzfO8)
The messy day, in one paragraph
Picture a morning where everything arrives at once: a board member’s email, a double-booked calendar, a message from home about dinner, a report due at noon, and a teammate who is already annoyed. You cannot “solve” that with a single summary. You sequence decisions—who gets a reply, what gets rescheduled, what gets delegated—and each move changes how stressed you are, how people feel, and whether real work moves forward.
Ghostexec is a small, trainable environment that captures that feeling. It is built on OpenEnv so agents, researchers, and judges can talk to it through a standard HTTP and WebSocket API, run it on a public Hugging Face Space, and plug it into a real RL or preference-optimization loop.
What Ghostexec actually is
Ghostexec is an OpenEnv-compatible “AI chief of staff” simulator:
- Inbox, calendar, contacts, tasks, and stakeholder moods live in JSON scenarios under `scenarios/` — the story is data, not hardcoded prose in Python.
- Each step, the policy sees a plain-text briefing (the same kind of wall-of-text a human assistant might scan), not a raw dump of the whole world object.
- The agent returns one structured action per step — for example `reply_email`, `reschedule_meeting`, `complete_task`, or `delegate_task` — with fields validated against a schema (see the payload sketch just after this list).
- Invalid actions do not crash the server. They return a controlled signal so learning (or evaluation) can continue.
- `do_nothing` is penalised, so “freeze and hope it goes away” is not a free winning strategy when fires are burning.
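For a concrete sense of what one structured action looks like, here is a hypothetical payload; the field names (`action_type`, `email_id`, `body`) are illustrative assumptions, so check the schema on `/docs` for the authoritative shape:

```python
# Hypothetical reply_email action in the shape described above.
# Field names are illustrative; the Space's /docs schema is authoritative.
action = {
    "action_type": "reply_email",
    "email_id": "email_0042",  # must be an id that actually appears in the briefing
    "body": "Thanks for flagging this. I'll move our 1:1 and send the report by noon.",
}

# A bad id here should not crash the server: the environment answers with a
# controlled "invalid action" signal and the episode keeps going.
```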
Under the hood, the simulation advances time, moods, and conflicts, and optional drift events in the JSON can reshuffle the situation mid-episode so the agent is tested on adaptation, not memorizing the first screen.
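To make the drift idea concrete, a scenario file could carry a mid-episode event next to the initial inbox and calendar. This is a sketch under assumed keys, not the repo’s actual schema:

```python
# Sketch of a scenarios/*.json entry with a drift event (all keys are assumptions).
scenario = {
    "inbox": [{"id": "email_0042", "from": "board_member", "urgency": "high"}],
    "calendar": [{"id": "mtg_07", "title": "1:1 with Sam", "conflict": True}],
    "drift_events": [
        {
            "at_step": 4,  # fires partway through the episode
            "kind": "meeting_moved",
            "payload": {"meeting_id": "mtg_07", "new_slot": "15:00"},
        }
    ],
}
```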
Why OpenEnv?
OpenEnv gives us a shared contract: reset, step, schema, health, WebSocket sessions, and tooling to validate and ship environments. Our manifest is in openenv.yaml (environment name ghostexec, three tasks with graders, FastAPI app entrypoint). That keeps the submission inspectable and reproducible — judges can open the Space, read the repo, and run tests locally with the same entrypoints we use in Docker.
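As a rough illustration of that contract, a smoke test can stay in plain HTTP. The endpoint paths and payload shape below are assumptions inferred from the description above, and the host assumes the local quick-start server described further down:

```python
import requests

BASE = "http://localhost:8000"  # or the public Space host, once deployed

# Health check, then reset for the first briefing, then one step.
print(requests.get(f"{BASE}/health", timeout=10).json())
obs = requests.post(f"{BASE}/reset", json={}, timeout=30).json()
result = requests.post(
    f"{BASE}/step",
    json={"action": {"action_type": "do_nothing"}},  # illustrative action shape
    timeout=30,
).json()
print(result.get("reward"), result.get("done"))
```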
Rewards: teach the model, certify the run
We use two layers of feedback, on purpose:
- Dense step reward (in `server/reward.py`) blends conflict, relationship, and task progress with fixed weights 0.35 / 0.35 / 0.30, plus bounded shaping and explicit handling of invalid or idle steps. That signal is what you want when training with modern RL or GRPO-style methods.
- Trajectory graders (in `graders.py`, wired in `openenv.yaml`) produce bounded scores for three tasks — easy, medium, and hard scenarios — so hackathon certification stays in a well-defined range.
Together: the model can learn from rich per-step feedback, while organizers can score full trajectories against clear tasks.
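As a minimal sketch of how such a blended step reward could be computed (component scores, penalty values, and clipping bounds here are hypothetical; the real logic lives in `server/reward.py`):

```python
# Illustrative blend with the fixed weights quoted above
# (0.35 conflict, 0.35 relationship, 0.30 task progress).
# Penalty and clipping values are made up for the example.
def step_reward(conflict: float, relationship: float, task_progress: float,
                invalid: bool = False, idle: bool = False) -> float:
    if invalid:
        return -0.5  # controlled penalty instead of a crash
    reward = 0.35 * conflict + 0.35 * relationship + 0.30 * task_progress
    if idle:
        reward -= 0.2  # do_nothing costs something while fires are burning
    return max(-1.0, min(1.0, reward))  # keep the shaped signal bounded
```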
Try it in sixty seconds
Short demo video: https://youtu.be/g4IFZMEzfO8
Live Space (public): https://huggingface.co/spaces/modelbuilderhq/ghostexec
- Open `/docs` on the Space for the interactive API, or `/web` for the OpenEnv playground.
- Full README (formatted, with tables and deep links): https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md
Local quick start (from repo root):
```bash
uv sync
uv run server --port 8000
```
Then use GhostexecEnv from client.py (WebSocket session) for multi-step episodes, or raw HTTP if you only need a smoke test. The README’s “Quick start” section has a copy-paste Python snippet.
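A multi-step episode with the WebSocket client might look roughly like this. The constructor argument and the `reset` / `step` / `close` method names on `GhostexecEnv` are assumptions, so treat it as a sketch and defer to `client.py` and the README snippet:

```python
from client import GhostexecEnv  # assumed import path, per the README

def my_policy(briefing: str) -> dict:
    # Placeholder policy for the sketch: always send one hard-coded reply.
    return {"action_type": "reply_email", "email_id": "email_0042", "body": "On it."}

env = GhostexecEnv("ws://localhost:8000")  # assumed constructor signature
obs = env.reset()                          # obs: the plain-text briefing

done, episode_return = False, 0.0
while not done:
    obs, reward, done, info = env.step(my_policy(obs))  # assumed return shape
    episode_return += reward

env.close()
print("episode return:", episode_return)
```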
Source: mirror the repo on GitHub if you prefer; the canonical hackathon artifact is the Space plus the repo layout described in the README.
If you only have two minutes on video
Published walkthrough: youtu.be/g4IFZMEzfO8
A tight arc that works for non-technical viewers:
- Show the briefing — scroll through the same text the model sees. Say: “This is not a quiz; it’s a shift at work.”
- One good action — e.g. a thoughtful `reply_email` with a real `email_id` from the text; show reward or mood metadata if the UI exposes it.
- One bad action — a wrong id or `do_nothing` while something urgent waits; show that the world does not crash, but the score hurts.
- One sentence on training — “We can optimize policies against this API with GRPO / TRL-style loops; graders score whole episodes for the hackathon.”
End on: Ghostexec is the busy day, compressed — so models can practice being calm, fast, and fair before anyone trusts them near a real calendar.
Theme fit (hackathon)
Ghostexec aligns naturally with personalized, high-stakes tasks: executive triage, delegation, and tradeoffs between people, calendar, and deadlines. Diverse scenarios/*.json and optional curriculum / perturbation hooks (see README) make it easy to stress-test policies without rewriting core engine code.
Where to read more
| Resource | Link |
|---|---|
| Demo video (<2 min) | [YouTube](https://youtu.be/g4IFZMEzfO8) |
| Full project README (judging sections, layout, commands) | [README on the Hub](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md) |
| Innovation-only deep dive | environment-innovation/README.md in the repo |
| OpenEnv upstream | github.com/meta-pytorch/OpenEnv |
Closing
Ghostexec is not trying to replace human assistants. It is trying to give models and researchers a credible, stressful, and kind miniature office: text that reads like work, actions that look like tools, and scores that admit tradeoffs. If that sounds useful, spin up the Space, break something on purpose, and watch the environment keep running — that resilience is part of the point.
— Ghostexec / OpenEnv submission