
QueueOps OpenEnv Implementation Roadmap

This roadmap is the execution reference for building the real-world queueing environment in this repository.

Constraints locked in:

  • Keep existing directory structure unchanged.
  • Treat cloud_queue_env/ as the project root.
  • Use HF token provider flow in inference.py.
  • Follow OpenEnv compliance strictly: typed models, step()/reset()/state(), valid openenv.yaml.
  • Provide deterministic graders with partial scoring in [0, 1].
  • Deliver at least 3 tasks (more optional).

V1 - Hackathon-Ready Submission

Goal: submit a valid, real-world OpenEnv benchmark with 3 deterministic graded tasks and reproducible inference outputs.

Phase 1 - Core Simulator Foundation

Sub-goals:

  1. Replace echo logic with queue-operations simulation core.
  2. Add deterministic RNG with explicit seed handling.
  3. Implement proper episode boundaries (horizon, terminal conditions).
  4. Keep strict OpenEnv contract for reset(), step(), and state.

Definition of done:

  • The environment no longer behaves as a dummy echo.
  • Same seed + same action trace => identical trajectory.
  • Episode always terminates predictably.
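
The determinism and termination criteria above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `TinyQueueSim`, `QueueState`, the Bernoulli arrival/service model, and the parameter defaults are hypothetical and are not the repository's actual classes or the OpenEnv base types. The point is only that a private, seeded RNG plus a fixed horizon gives reproducible trajectories and a guaranteed terminal step.

```python
import random
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical typed state; field names are illustrative only."""
    step_count: int = 0
    queue_length: int = 0
    done: bool = False


class TinyQueueSim:
    """Toy single-queue simulator (not the repository's real environment)."""

    def __init__(self, horizon: int = 50, arrival_prob: float = 0.6,
                 service_prob: float = 0.5):
        self.horizon = horizon
        self.arrival_prob = arrival_prob
        self.service_prob = service_prob
        self._rng = random.Random(0)
        self._state = QueueState()

    def reset(self, seed: int = 0) -> QueueState:
        # Use a private RNG so global random state cannot leak in.
        self._rng = random.Random(seed)
        self._state = QueueState()
        return self._state

    def step(self, admit: bool) -> QueueState:
        s = self._state
        if s.done:
            raise RuntimeError("episode finished; call reset()")
        # Draw both numbers on every step so the RNG stream
        # does not depend on which action was taken.
        arrival = self._rng.random() < self.arrival_prob
        served = self._rng.random() < self.service_prob
        if arrival and admit:
            s.queue_length += 1
        if served and s.queue_length > 0:
            s.queue_length -= 1
        s.step_count += 1
        s.done = s.step_count >= self.horizon  # hard episode boundary
        return s

    @property
    def state(self) -> QueueState:
        return self._state


def rollout(seed: int, actions: list) -> list:
    """Replay an action trace and record the queue length after each step."""
    env = TinyQueueSim(horizon=len(actions))
    env.reset(seed=seed)
    return [env.step(a).queue_length for a in actions]
```

Calling `rollout(7, trace)` twice with the same `trace` should return identical lists, which is the "same seed + same action trace" check in its simplest form.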

Phase 2 - Task System (Easy/Medium/Hard)

Sub-goals:

  1. Add task selection (task_id) and per-task config.
  2. Implement Task A (single queue, admission control).
  3. Implement Task B (multi-server, priority routing).
  4. Implement Task C (two-stage queue network, dynamic scaling/cost).

Definition of done:

  • All 3 tasks run end-to-end from reset() to terminal state.
  • Difficulty progression is visible from A -> B -> C.
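
One way to wire `task_id` to per-task configuration is a small immutable registry, sketched below. The `TaskConfig` fields, the `task_a`/`task_b`/`task_c` identifiers, and every numeric value are placeholders chosen for illustration; only the structural facts (single queue for A, multiple servers for B, two stages for C) come from the sub-goals above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical per-task settings; real fields belong to the master spec."""
    task_id: str
    difficulty: str
    num_servers: int
    num_stages: int
    horizon: int


# Placeholder values: only the A/B/C structure mirrors the roadmap.
TASKS = {
    "task_a": TaskConfig("task_a", "easy", num_servers=1, num_stages=1, horizon=100),
    "task_b": TaskConfig("task_b", "medium", num_servers=3, num_stages=1, horizon=150),
    "task_c": TaskConfig("task_c", "hard", num_servers=3, num_stages=2, horizon=200),
}


def get_task(task_id: str) -> TaskConfig:
    """Look up a task, failing loudly on unknown ids."""
    try:
        return TASKS[task_id]
    except KeyError:
        known = ", ".join(sorted(TASKS))
        raise ValueError(f"unknown task_id {task_id!r}; expected one of: {known}") from None
```

Freezing the dataclass keeps a config from being mutated mid-episode, which would otherwise be an easy way to break reproducibility.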

Phase 3 - Deterministic Graders + Partial Scoring

Sub-goals:

  1. Implement per-task grader formulas from master spec.
  2. Keep each grader output bounded in [0, 1].
  3. Handle invalid/NaN/infinite values safely and deterministically.
  4. Aggregate final benchmark score as mean of task scores.

Definition of done:

  • Repeated runs on same seeds produce same grader outputs.
  • Partial scoring is meaningful (not binary pass/fail only).
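
The bounding, sanitizing, and aggregation steps above can be sketched as follows. The per-task formulas live in the master spec and are not reproduced here; `safe_score` and `benchmark_score` are hypothetical helper names, and mapping every non-finite or non-numeric value to 0.0 is an assumption about what "safely" should mean.

```python
import math


def safe_score(raw) -> float:
    """Clamp a raw grader value into [0, 1].

    Assumption: NaN, +/-inf, and non-numeric inputs all score 0.0.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    if not math.isfinite(value):
        return 0.0
    return min(1.0, max(0.0, value))


def benchmark_score(task_scores: dict) -> float:
    """Mean of sanitized task scores; an empty input scores 0.0."""
    if not task_scores:
        return 0.0
    # Sort keys so the summation order is fixed across runs.
    cleaned = [safe_score(task_scores[k]) for k in sorted(task_scores)]
    return sum(cleaned) / len(cleaned)
```

Sanitizing inside the aggregate as well as in each grader means a single bad task score cannot push the final benchmark score outside [0, 1].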

Phase 4 - Reward Shaping and Safety Penalties

Sub-goals:

  1. Add dense reward components: wait, throughput, SLA, cost, fairness, safety.
  2. Add penalties for invalid actions and exploit patterns.
  3. Bound reward scale across tasks.
  4. Expose reward components in info for debugging.

Definition of done:

  • Reward signal is distributed across the trajectory, not delivered only at the end.
  • Unsafe or degenerate behavior is penalized.
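
A minimal sketch of the shaping scheme above: a weighted sum of named components, a flat penalty for invalid actions, clipping to a common bound, and the per-component terms returned for `info`. The weights, the penalty value, and the `reward_components` / `reward_unclipped` keys are all assumptions for illustration, not values from the spec.

```python
# Hypothetical weights; the real values belong to the master spec.
DEFAULT_WEIGHTS = {
    "wait": 0.3,
    "throughput": 0.3,
    "sla": 0.2,
    "cost": 0.1,
    "fairness": 0.05,
    "safety": 0.05,
}
INVALID_ACTION_PENALTY = -0.5  # hypothetical magnitude


def shaped_reward(components: dict, weights: dict = DEFAULT_WEIGHTS,
                  invalid_action: bool = False, bound: float = 1.0):
    """Return (clipped_reward, info) for one step.

    `components` maps a component name to its signed, already-normalized
    value; names without a weight contribute zero.
    """
    terms = {name: weights.get(name, 0.0) * float(value)
             for name, value in components.items()}
    if invalid_action:
        terms["invalid_action"] = INVALID_ACTION_PENALTY
    total = sum(terms.values())
    reward = max(-bound, min(bound, total))  # same scale for every task
    info = {"reward_components": terms, "reward_unclipped": total}
    return reward, info
```

Keeping the unclipped total in `info` makes it easier to spot a component that routinely saturates the bound during debugging.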

Phase 5 - Inference Protocol Compliance

Sub-goals:

  1. Update inference.py to run all required tasks with fixed seeds.
  2. Keep using the OpenAI client while authenticating through the HF token flow.
  3. Emit strict [START], [STEP], [END] line format.
  4. Print per-task and final aggregate scores.

Definition of done:

  • Script executes benchmark sweep reproducibly.
  • Output format matches hackathon requirements.

Phase 6 - Packaging, Validation, Documentation

Sub-goals:

  1. Validate openenv.yaml metadata and app wiring.
  2. Confirm Docker build/run success.
  3. Update README with task definitions, action/observation spaces, reward/grader equations, baseline results.
  4. Verify deployment readiness for HF Space.

Definition of done:

  • OpenEnv validation passes.
  • Container starts and serves correctly.
  • README is submission-ready.

V1 Submission Gate

All must be true:

  1. 3 tasks implemented and deterministic.
  2. Graders return valid partial scores in [0, 1].
  3. Inference script reports reproducible benchmark outputs.
  4. OpenEnv spec compliance confirmed.
  5. Docker and README requirements satisfied.

V2 - Quality and Robustness Upgrade

Goal: improve benchmark reliability, score stability, and anti-exploit behavior after initial submission.

Phase 1 - Determinism Hardening

Sub-goals:

  1. Split RNG streams (arrivals/service/abandonment/shocks).
  2. Add trace replay support for debugging.
  3. Extend info with deterministic audit fields.
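
Splitting the RNG into named streams could look like the sketch below, which uses only the standard library. Deriving each child seed from a SHA-256 digest of the master seed and the stream name is one possible scheme and an assumption here, as are the `make_streams` helper and the `STREAMS` tuple (named after the four sources listed above).

```python
import hashlib
import random

STREAMS = ("arrivals", "service", "abandonment", "shocks")


def make_streams(master_seed: int) -> dict:
    """Build one independent random.Random per named stream.

    A cryptographic digest is used instead of hash() because Python
    randomizes str hashes per process, which would break reproducibility.
    """
    streams = {}
    for name in STREAMS:
        digest = hashlib.sha256(f"{master_seed}:{name}".encode()).digest()
        child_seed = int.from_bytes(digest[:8], "big")
        streams[name] = random.Random(child_seed)
    return streams
```

With separate streams, drawing extra service times (for example after adding a server) leaves the arrival sequence untouched, so two policies can be compared against the same demand.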

Phase 2 - Difficulty Calibration

Sub-goals:

  1. Tune parameters for cleaner A/B/C separation.
  2. Improve level interpolation behavior.
  3. Add stronger guards against reject-all or noop exploitation.

Phase 3 - Reporting and Confidence

Sub-goals:

  1. Add standardized per-seed report table.
  2. Add mean/std summaries over seed sets.
  3. Flag unstable metrics and grader edge cases.
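
A per-seed report with mean/std summaries might be assembled as in this sketch. The table layout, the `summarize` and `format_report` names, and the `UNSTABLE_STD` threshold are assumptions; the sample standard deviation is used, with 0.0 reported when only one seed is available.

```python
import statistics

UNSTABLE_STD = 0.1  # hypothetical threshold for flagging a metric


def summarize(scores_by_seed: dict) -> dict:
    """Mean and sample std of scores over a seed set."""
    if not scores_by_seed:
        raise ValueError("at least one seed is required")
    values = [float(scores_by_seed[s]) for s in sorted(scores_by_seed)]
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n": len(values), "mean": mean, "std": std,
            "unstable": std > UNSTABLE_STD}


def format_report(task_id: str, scores_by_seed: dict) -> str:
    """Render a plain-text table: one row per seed, then a summary line."""
    summary = summarize(scores_by_seed)
    lines = [f"task={task_id}", "seed   score"]
    for seed in sorted(scores_by_seed):
        lines.append(f"{seed:<6} {float(scores_by_seed[seed]):.3f}")
    lines.append(
        f"mean={summary['mean']:.3f} std={summary['std']:.3f} "
        f"unstable={summary['unstable']}"
    )
    return "\n".join(lines)
```

Sorting seeds before formatting keeps the report byte-identical across runs, so reports can be diffed directly.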

V2 Exit Criteria

  1. Lower run-to-run variance on fixed seed sets.
  2. Clearer task difficulty progression.
  3. Better fairness and exploit resistance.

V3 - Extended Benchmark Pack (Optional)

Goal: increase novelty and long-term benchmark value with optional extra tasks.

Phase 1 - Task D (Non-stationary Load)

Sub-goals:

  1. Add shift-based and bursty arrivals.
  2. Grade robustness under changing demand.
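
As a rough illustration of shift-based and bursty demand, the sketch below models a per-step arrival probability with a single level shift and periodic bursts. The Bernoulli arrival model and every parameter (`base`, `shift_at`, `shifted`, `burst_every`, `burst_len`, `burst_boost`) are hypothetical; a real Task D may use a different process entirely.

```python
import random


def arrival_rate(t: int, base: float = 0.3, shift_at: int = 50,
                 shifted: float = 0.6, burst_every: int = 20,
                 burst_len: int = 3, burst_boost: float = 0.3) -> float:
    """Per-step arrival probability with one level shift and periodic bursts."""
    rate = shifted if t >= shift_at else base
    if t % burst_every < burst_len:  # first burst_len steps of each window
        rate += burst_boost
    return min(rate, 1.0)


def sample_arrivals(seed: int, horizon: int) -> list:
    """Draw a 0/1 arrival indicator per step from a seeded private RNG."""
    rng = random.Random(seed)
    return [1 if rng.random() < arrival_rate(t) else 0 for t in range(horizon)]
```

Because the rate schedule is a pure function of the step index, the non-stationarity itself stays deterministic and only the draws depend on the seed.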

Phase 2 - Task E (Partial Observability)

Sub-goals:

  1. Add delayed/noisy metrics.
  2. Grade safe decisions under uncertainty.
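
Delayed and noisy metrics can be prototyped with a small wrapper like the one below. `DelayedNoisyMetric` is a hypothetical helper, Gaussian noise is an assumed noise model, and padding the initial buffer with 0.0 is an arbitrary choice for what the agent sees before any real reading has aged enough.

```python
import random
from collections import deque


class DelayedNoisyMetric:
    """Report a metric `delay` steps late with additive Gaussian noise."""

    def __init__(self, delay: int, noise_std: float, seed: int):
        if delay < 0:
            raise ValueError("delay must be non-negative")
        # Assumption: readings before the first real value appear as 0.0.
        self._buffer = deque([0.0] * delay)
        self._noise_std = noise_std
        self._rng = random.Random(seed)

    def observe(self, true_value: float) -> float:
        """Push the current true value and return the stale, noisy reading."""
        self._buffer.append(float(true_value))
        stale = self._buffer.popleft()
        return stale + self._rng.gauss(0.0, self._noise_std)
```

Setting `noise_std` to 0.0 isolates the effect of delay alone, which is useful when calibrating how much each source of uncertainty hurts a policy.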

Phase 3 - Public Benchmark Packaging

Sub-goals:

  1. Publish official seed suites.
  2. Add benchmark profiles: quick / standard / full.
  3. Provide reference baseline outputs.

V3 Exit Criteria

  1. 4-5 total tasks available.
  2. Broader real-world coverage.
  3. Stronger benchmark differentiation.

Execution Order

Recommended order:

  1. Complete V1 fully and submit.
  2. Continue with V2 for quality hardening.
  3. Do V3 only if timeline allows.

Immediate next implementation step:

  • Start V1 Phase 1 (models + simulator core + deterministic state transitions).