AE-Shree committed on
Commit
f3f7834
·
1 Parent(s): dfa9f05

Select us to Round 2!!

Files changed (7)
  1. README.md +6 -12
  2. grader/clm_graders.py +82 -63
  3. inference.py +5 -2
  4. models.py +231 -100
  5. openenv.yaml +4 -4
  6. server/app.py +13 -7
  7. tests/test_clm.py +257 -0
README.md CHANGED
@@ -17,21 +17,21 @@ tags: [openenv, rl, scheduling, agent-eval, productivity]
17
  [![Python 3.11](https://img.shields.io/badge/Python-3.11-blue?style=for-the-badge&logo=python)](#)
18
  [![React Dashboard](https://img.shields.io/badge/React-Live_Dashboard-blue?style=for-the-badge&logo=react)](#)
19
 
20
- CLM is a **real-world productivity simulation** where an AI agent plays the role of a human knowledge worker's task scheduler. It must manage heterogeneous work items – emails, meetings, code reviews, reports, and calls – each with different cognitive demands, deadlines, priorities, and dependencies, while keeping the worker's energy and stress within safe bounds.
21
 
22
  *This is not a toy game.* CLM models how humans actually experience workload: stress accumulates when deadlines approach, fatigue reduces efficiency, context-switching has a cognitive cost, and deep focus yields better output at the expense of higher energy.
23
 
24
- ---
25
 
26
  ## 🎯 Why This Environment Matters
27
 
28
- Modern knowledge workers face **cognitive load management** as one of their most critical daily challenges – yet no RL environment has modelled this domain in a principled, agent-evaluatable way. CLM fills this gap:
29
 
30
  - **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems.
31
- - **Useful for evaluating LLM planning ability** – especially multi-step planning under resource constraints.
32
  - **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit.
33
 
34
- ---
35
 
36
  ## 🕹️ Actions
37
 
@@ -50,7 +50,7 @@ Action format:
50
  {"type": "break", "task_id": null}
51
  ```
52
 
53
- ---
54
 
55
  ## 👁️ Observation Space
56
 
@@ -85,7 +85,6 @@ Action format:
85
  - `upcoming_deadlines` – tasks with deadline within the next 5 steps
86
  - `focus_mode` – whether the agent is currently in deep-work state
87
 
88
- ---
89
 
90
  ## 📋 Tasks & Baseline Scores
91
 
@@ -99,7 +98,6 @@ Action format:
99
  Scores produced by heuristic agent (priority + deadline triage with focus mode).
100
  A strong LLM agent should achieve: easy >0.85, medium >0.55, hard >0.35, expert >0.25.
101
 
102
- ---
103
 
104
  ## πŸ† Scoring Formula
105
 
@@ -119,7 +117,6 @@ score = weighted_completion Γ— 0.60
119
 
120
  Score is always in **(0.01, 0.99)** β€” never exactly 0 or 1.
121
 
122
- ---
123
 
124
  ## 🚀 Setup
125
 
@@ -149,7 +146,6 @@ cd frontend && npm install && npm run dev
149
  # Visit http://localhost:5173
150
  ```
151
 
152
- ---
153
 
154
  ## πŸ›οΈ Architecture
155
 
@@ -177,7 +173,6 @@ graph TD
177
  API -->|OpenEnv spec| OE[openenv validate]
178
  ```
179
 
180
- ---
181
 
182
  ## 📊 Reward Shaping Details
183
 
@@ -197,7 +192,6 @@ Step rewards provide **dense signal** across the full trajectory:
197
  | Episode: all done (on time) | +1.0 |
198
  | Episode: all done (late) | +0.5 |
199
 
200
- ---
201
 
202
  ## ⚙️ Environment Variables
203
 
 
17
  [![Python 3.11](https://img.shields.io/badge/Python-3.11-blue?style=for-the-badge&logo=python)](#)
18
  [![React Dashboard](https://img.shields.io/badge/React-Live_Dashboard-blue?style=for-the-badge&logo=react)](#)
19
 
20
+ CLM is a **real-world productivity simulation** where an AI agent plays the role of a human knowledge worker's task scheduler. It must manage heterogeneous work items such as emails, meetings, code reviews, reports, and calls, each with different cognitive demands, deadlines, priorities, and dependencies, while keeping the worker's energy and stress within safe bounds.
21
 
22
  *This is not a toy game.* CLM models how humans actually experience workload: stress accumulates when deadlines approach, fatigue reduces efficiency, context-switching has a cognitive cost, and deep focus yields better output at the expense of higher energy.
23
 
24
+
25
 
26
  ## 🎯 Why This Environment Matters
27
 
28
+ Modern knowledge workers face **cognitive load management** as one of their most critical daily challenges, yet no RL environment has modelled this domain in a principled, agent-evaluatable way. CLM fills this gap:
29
 
30
  - **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems.
31
+ - **Useful for evaluating LLM planning ability**, especially multi-step planning under resource constraints.
32
  - **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit.
33
 
34
+
35
 
36
  ## 🕹️ Actions
37
 
 
50
  {"type": "break", "task_id": null}
51
  ```
52
 
53
+
54
 
55
  ## 👁️ Observation Space
56
 
 
85
  - `upcoming_deadlines` – tasks with deadline within the next 5 steps
86
  - `focus_mode` – whether the agent is currently in deep-work state
87
 
 
88
 
89
  ## 📋 Tasks & Baseline Scores
90
 
 
98
  Scores produced by heuristic agent (priority + deadline triage with focus mode).
99
  A strong LLM agent should achieve: easy >0.85, medium >0.55, hard >0.35, expert >0.25.
100
 
 
101
 
102
  ## πŸ† Scoring Formula
103
 
 
117
 
118
  Score is always in **(0.01, 0.99)** β€” never exactly 0 or 1.
119
 
 
120
 
121
  ## 🚀 Setup
122
 
 
146
  # Visit http://localhost:5173
147
  ```
148
 
 
149
 
150
  ## πŸ›οΈ Architecture
151
 
 
173
  API -->|OpenEnv spec| OE[openenv validate]
174
  ```
175
 
 
176
 
177
  ## 📊 Reward Shaping Details
178
 
 
192
  | Episode: all done (on time) | +1.0 |
193
  | Episode: all done (late) | +0.5 |
194
 
 
195
 
196
  ## ⚙️ Environment Variables
197
 
grader/clm_graders.py CHANGED
@@ -1,10 +1,11 @@
1
  """
2
- Class-based graders for CLM tasks – matches auto-dev's BaseGrader interface.
3
 
4
- Graders run a heuristic agent to episode completion and score the FINAL state.
5
- Each difficulty produces DIFFERENT scores (easy ~0.75, medium ~0.45, hard ~0.20, expert ~0.08).
 
6
 
7
- Scores are ALWAYS strictly in (0.01, 0.99).
8
  """
9
  import sys, os
10
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
@@ -22,18 +23,68 @@ def _safe(raw) -> float:
22
  return _MIN
23
 
24
 
 
25
  def _heuristic_action(env: CLMEnvironment) -> Action:
26
  """
27
  Competent heuristic agent:
28
- - Enters focus mode on critical tasks with approaching deadlines
29
  - Takes breaks when fatigued or stressed
30
  - Prioritises: critical > high > normal > low, then earliest deadline
31
- - Respects task dependencies (never works on a blocked task)
 
32
  """
33
- state = env.state
34
  blocked = env._blocked_ids()
35
 
36
- # Rest condition
37
  if state.energy < 0.30 or state.stress > 0.70:
38
  return Action(type="break", task_id=None)
39
 
@@ -41,80 +92,48 @@ def _heuristic_action(env: CLMEnvironment) -> Action:
41
  if not pending:
42
  return Action(type="delay", task_id=None)
43
 
44
- # Sort by priority weight DESC then deadline ASC
45
  pending.sort(key=lambda t: (
46
  -PRIORITY_WEIGHT[t.priority],
47
  t.deadline if t.deadline is not None else 9999
48
  ))
49
  target = pending[0]
50
 
51
- # Use focus mode for critical tasks with deadline in ≤10 steps
52
  use_focus = (
53
  target.priority == "critical"
54
  and target.deadline is not None
55
  and (target.deadline - state.time_step) <= 10
56
  and state.energy > 0.55
57
  )
58
-
59
- if state.current_task_id == target.id:
60
- return Action(type="focus" if use_focus else "work", task_id=target.id)
61
  return Action(type="focus" if use_focus else "work", task_id=target.id)
62
 
63
 
64
- def _run_episode(difficulty: str) -> tuple:
65
- try:
66
- tasks = generate_tasks(difficulty)
67
- max_s = 60 if difficulty == "expert" else 50
68
- env = CLMEnvironment(tasks=tasks, max_steps=max_s)
69
- env.reset()
70
- done, step = False, 0
71
- while not done and step < max_s:
72
- action = _heuristic_action(env)
73
- _, _, done, _ = env.step(action)
74
- step += 1
75
- raw = deterministic_grader(env.state.tasks, env.state.time_step, env.state.energy)
76
- score = _safe(raw)
77
- comp = sum(1 for t in env.state.tasks if t.progress >= 1.0)
78
- msg = (
79
- f"CLM {difficulty} | score={score:.4f} | "
80
- f"steps={step} energy={env.state.energy:.2f} "
81
- f"completed={comp}/{len(env.state.tasks)}"
82
- )
83
- return score, score >= 0.5, msg
84
- except Exception as e:
85
- return _MIN, False, f"Grader error: {e}"
86
-
87
-
88
- def _from_trajectory(trajectory: dict, difficulty: str) -> tuple:
89
- if trajectory and "tasks" in trajectory:
90
- raw_tasks = trajectory.get("tasks", [])
91
- ts = trajectory.get("time_step", 50)
92
- eng = trajectory.get("energy", 0.5)
93
- task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
94
- raw = deterministic_grader(task_objs, ts, eng)
95
- score = _safe(raw)
96
- comp = sum(1 for t in task_objs if t.progress >= 1.0)
97
- msg = f"CLM {difficulty} | score={score:.4f} | completed={comp}/{len(task_objs)}"
98
- return score, score >= 0.5, msg
99
- return _run_episode(difficulty)
100
-
101
-
102
  class EasyGrader:
103
- """Easy: 2 tasks (email + report), no deadlines. Expected heuristic score: ~0.72–0.82."""
104
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "easy")
105
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "easy")[0]
 
 
106
 
107
  class MediumGrader:
108
- """Medium: 5 tasks with mixed priorities and deadlines. Expected: ~0.38–0.52."""
109
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "medium")
110
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "medium")[0]
 
 
111
 
112
  class HardGrader:
113
- """Hard: 8 tasks with dependencies and tight deadlines + interruptions. Expected: ~0.15–0.28."""
114
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "hard")
115
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "hard")[0]
 
 
116
 
117
  class ExpertGrader:
118
- """Expert: 10 tasks, deep dependencies, 3 mid-episode interruptions. Expected: ~0.05–0.15."""
119
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "expert")
120
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "expert")[0]
 
 
 
1
  """
2
+ Class-based graders for CLM tasks.
3
 
4
+ FIX 1: _from_trajectory no longer falls back to running a heuristic episode
5
+ when the trajectory is empty or missing. It returns 0.01 immediately.
6
+ The grader MUST score the actual agent, not a proxy.
7
 
8
+ Graders produce scores strictly in (0.01, 0.99).
9
  """
10
  import sys, os
11
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
23
  return _MIN
24
 
25
 
26
+ def _from_trajectory(trajectory: dict, difficulty: str) -> tuple:
27
+ """
28
+ Score a completed agent trajectory.
29
+
30
+ FIX 1: If trajectory is empty or has no tasks, return 0.01 immediately.
31
+ We must never rerun a heuristic episode here – that would score the
32
+ heuristic agent, not the LLM agent under evaluation.
33
+ """
34
+ if not trajectory or not trajectory.get("tasks"):
35
+ return _MIN, False, f"CLM {difficulty} | score=0.0100 | empty trajectory"
36
+
37
+ raw_tasks = trajectory["tasks"]
38
+ ts = trajectory.get("time_step", 50)
39
+ eng = trajectory.get("energy", 0.5)
40
+ task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
41
+ raw = deterministic_grader(task_objs, ts, eng)
42
+ score = _safe(raw)
43
+ comp = sum(1 for t in task_objs if t.progress >= 1.0)
44
+ msg = f"CLM {difficulty} | score={score:.4f} | completed={comp}/{len(task_objs)}"
45
+ return score, score >= 0.5, msg
46
+
47
+
48
+ def _run_heuristic_baseline(difficulty: str) -> tuple:
49
+ """
50
+ Run a heuristic agent to produce a BASELINE reference score only.
51
+ This is used for reporting / README baseline numbers – NEVER for
52
+ grading an LLM agent's actual trajectory.
53
+ """
54
+ try:
55
+ tasks = generate_tasks(difficulty, seed=42) # fixed seed for reproducibility
56
+ max_s = 60 if difficulty == "expert" else 50
57
+ env = CLMEnvironment(tasks=tasks, max_steps=max_s, seed=42)
58
+ env.reset()
59
+ done, step = False, 0
60
+ while not done and step < max_s:
61
+ action = _heuristic_action(env)
62
+ _, _, done, _ = env.step(action)
63
+ step += 1
64
+ raw = deterministic_grader(env.state.tasks, env.state.time_step, env.state.energy)
65
+ score = _safe(raw)
66
+ comp = sum(1 for t in env.state.tasks if t.progress >= 1.0)
67
+ msg = (
68
+ f"CLM {difficulty} baseline | score={score:.4f} | "
69
+ f"steps={step} energy={env.state.energy:.2f} "
70
+ f"completed={comp}/{len(env.state.tasks)}"
71
+ )
72
+ return score, score >= 0.5, msg
73
+ except Exception as e:
74
+ return _MIN, False, f"Baseline error: {e}"
75
+
76
+
77
  def _heuristic_action(env: CLMEnvironment) -> Action:
78
  """
79
  Competent heuristic agent:
 
80
  - Takes breaks when fatigued or stressed
81
  - Prioritises: critical > high > normal > low, then earliest deadline
82
+ - Respects task dependencies
83
+ - Uses focus mode on critical tasks near their deadline
84
  """
85
+ state = env.state
86
  blocked = env._blocked_ids()
87
 
 
88
  if state.energy < 0.30 or state.stress > 0.70:
89
  return Action(type="break", task_id=None)
90
 
 
92
  if not pending:
93
  return Action(type="delay", task_id=None)
94
 
 
95
  pending.sort(key=lambda t: (
96
  -PRIORITY_WEIGHT[t.priority],
97
  t.deadline if t.deadline is not None else 9999
98
  ))
99
  target = pending[0]
100
 
 
101
  use_focus = (
102
  target.priority == "critical"
103
  and target.deadline is not None
104
  and (target.deadline - state.time_step) <= 10
105
  and state.energy > 0.55
106
  )
 
 
 
107
  return Action(type="focus" if use_focus else "work", task_id=target.id)
108
 
109
 
110
+ # ==========================================
111
+ # PUBLIC GRADER CLASSES
112
+ # ==========================================
 
 
113
  class EasyGrader:
114
+ """Easy: 2 tasks (email + report), no deadlines. Expected score: ~0.72–0.82."""
115
+ def grade(self, trajectory=None, *a, **kw):
116
+ return _from_trajectory(trajectory or {}, "easy")
117
+ def __call__(self, trajectory=None, *a, **kw):
118
+ return _from_trajectory(trajectory or {}, "easy")[0]
119
 
120
  class MediumGrader:
121
+ """Medium: 5 tasks, mixed priorities and deadlines. Expected: ~0.38–0.52."""
122
+ def grade(self, trajectory=None, *a, **kw):
123
+ return _from_trajectory(trajectory or {}, "medium")
124
+ def __call__(self, trajectory=None, *a, **kw):
125
+ return _from_trajectory(trajectory or {}, "medium")[0]
126
 
127
  class HardGrader:
128
+ """Hard: 8 tasks, dependencies, tight deadlines, stochastic interruptions. Expected: ~0.15–0.28."""
129
+ def grade(self, trajectory=None, *a, **kw):
130
+ return _from_trajectory(trajectory or {}, "hard")
131
+ def __call__(self, trajectory=None, *a, **kw):
132
+ return _from_trajectory(trajectory or {}, "hard")[0]
133
 
134
  class ExpertGrader:
135
+ """Expert: 10 tasks, deep dependencies, 3 stochastic interruptions. Expected: ~0.05–0.15."""
136
+ def grade(self, trajectory=None, *a, **kw):
137
+ return _from_trajectory(trajectory or {}, "expert")
138
+ def __call__(self, trajectory=None, *a, **kw):
139
+ return _from_trajectory(trajectory or {}, "expert")[0]
inference.py CHANGED
@@ -100,7 +100,9 @@ def heuristic_fallback(obs: dict) -> Dict:
100
  blocked = set(vs.get("blocked_tasks", []))
101
  tasks = [t for t in obs.get("tasks", [])
102
  if t.get("progress", 0.0) < 1.0 and t["id"] not in blocked]
103
- if vs.get("energy_level", 1.0) < 0.35 or vs.get("stress_warning", False):
 
 
104
  return {"type": "break", "task_id": None}
105
  if tasks:
106
  # Sort: critical > high > normal > low, then nearest deadline
@@ -108,7 +110,8 @@ def heuristic_fallback(obs: dict) -> Dict:
108
  tasks.sort(key=lambda t: (pmap.get(t.get("priority", "normal"), 2),
109
  t.get("deadline") or 9999))
110
  t = tasks[0]
111
- atype = "focus" if t.get("priority") == "critical" and vs.get("energy_level", 1.0) > 0.55 else "work"
 
112
  return {"type": atype, "task_id": t["id"]}
113
  return {"type": "delay", "task_id": None}
114
 
 
100
  blocked = set(vs.get("blocked_tasks", []))
101
  tasks = [t for t in obs.get("tasks", [])
102
  if t.get("progress", 0.0) < 1.0 and t["id"] not in blocked]
103
+ # FIX 6: observation is now partially observable – use categorical labels
104
+ fatigue = vs.get("fatigue_level", "low")
105
+ if fatigue == "high" or vs.get("stress_warning", False):
106
  return {"type": "break", "task_id": None}
107
  if tasks:
108
  # Sort: critical > high > normal > low, then nearest deadline
 
110
  tasks.sort(key=lambda t: (pmap.get(t.get("priority", "normal"), 2),
111
  t.get("deadline") or 9999))
112
  t = tasks[0]
113
+ fatigue_ok = vs.get("fatigue_level", "low") != "high"
114
+ atype = "focus" if t.get("priority") == "critical" and fatigue_ok else "work"
115
  return {"type": atype, "task_id": t["id"]}
116
  return {"type": "delay", "task_id": None}
117
 
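A small, hypothetical illustration of the updated fallback (not part of this commit). The hunk does not show how `vs` is derived from `obs`, so the `visible_state` key below is an assumption; the point is only that the decision now keys off the categorical `fatigue_level` label rather than a raw `energy_level` float:

```python
# Hypothetical observation payload; the "visible_state" key is assumed,
# since the hunk above does not show how `vs` is built from `obs`.
from inference import heuristic_fallback

obs = {
    "tasks": [
        {"id": "m1", "priority": "critical", "progress": 0.4, "deadline": 12},
        {"id": "m3", "priority": "normal", "progress": 0.0, "deadline": 28},
    ],
    "visible_state": {
        "fatigue_level": "medium",   # categorical label (FIX 6), not a float
        "stress_warning": False,
        "blocked_tasks": [],
    },
}

# With medium fatigue and a critical task pending, the fallback should pick focus on m1.
print(heuristic_fallback(obs))  # expected: {"type": "focus", "task_id": "m1"}
```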
models.py CHANGED
@@ -1,8 +1,9 @@
1
  from pydantic import BaseModel, Field
2
  from typing import List, Optional, Literal, Tuple, Dict, Any
 
3
 
4
  # ==========================================
5
- # TASK TYPES – makes this clearly real-world
6
  # ==========================================
7
  TaskType = Literal["email", "meeting", "code_review", "report", "call"]
8
  Priority = Literal["critical", "high", "normal", "low"]
@@ -11,6 +12,9 @@ PRIORITY_WEIGHT = {"critical": 1.5, "high": 1.2, "normal": 1.0, "low": 0.7}
11
  TASK_ENERGY_COST = {"email": 0.08, "meeting": 0.18, "code_review": 0.20, "report": 0.14, "call": 0.11}
12
  TASK_PROGRESS_RATE = {"email": 0.35, "meeting": 0.30, "code_review": 0.20, "report": 0.22, "call": 0.28}
13
 
 
 
 
14
  # ==========================================
15
  # OPENENV SCHEMAS
16
  # ==========================================
@@ -21,17 +25,22 @@ class Task(BaseModel):
21
  priority: Priority = "normal"
22
  progress: float = 0.0
23
  deadline: Optional[int] = None
24
- depends_on: Optional[str] = None # must complete parent task first
25
- is_interrupted: bool = False # injected mid-episode
26
 
27
  class VisibleState(BaseModel):
 
 
 
 
 
28
  fatigue_level: str # "low" | "medium" | "high"
 
29
  stress_warning: bool
30
- energy_level: float = 1.0
31
- stress_level: float = 0.0
32
  focus_mode: bool = False
33
- upcoming_deadlines: List[str] = [] # task ids with deadline ≤ 5 steps away
34
- blocked_tasks: List[str] = [] # task ids blocked by unfinished dependencies
 
35
 
36
  class Observation(BaseModel):
37
  tasks: List[Task]
@@ -40,72 +49,165 @@ class Observation(BaseModel):
40
 
41
  class Action(BaseModel):
42
  type: Literal["work", "break", "switch", "delay", "focus"]
43
- # work – normal work on task_id
44
- # break – rest; recover energy + reduce stress
45
- # switch – change active task (small context-switch cost)
46
- # delay – do nothing; slight stress relief
47
- # focus – deep-work mode: 2× progress, 2× energy cost
48
  task_id: Optional[str] = None
49
 
50
  class EnvState(BaseModel):
51
- energy: float = 1.0
52
- stress: float = 0.0
53
- fatigue: float = 0.0
54
- time_step: int = 0
55
- current_task_id: Optional[str] = None
56
- tasks: List[Task] = []
57
- focus_mode: bool = False
58
- interruption_count: int = 0
59
- milestone_rewards: Dict[str, float] = {}
 
 
 
60
 
61
 
62
  # ==========================================
63
- # TASK GENERATION
 
 
64
  # ==========================================
65
- def generate_tasks(level: str) -> list[Task]:
 
66
  if level == "easy":
67
- # 2 simple tasks, no deadlines – learn basics
68
  return [
69
- Task(id="e1", difficulty="easy", task_type="email", priority="normal", deadline=None),
70
- Task(id="e2", difficulty="easy", task_type="report", priority="normal", deadline=None),
 
 
 
 
 
 
71
  ]
72
 
73
  elif level == "medium":
74
- # 5 mixed tasks with deadlines and priorities
75
  return [
76
- Task(id="m1", difficulty="medium", task_type="email", priority="critical", deadline=14),
77
- Task(id="m2", difficulty="medium", task_type="meeting", priority="high", deadline=20),
78
- Task(id="m3", difficulty="medium", task_type="code_review", priority="normal", deadline=28),
79
- Task(id="m4", difficulty="medium", task_type="report", priority="high", deadline=35),
80
- Task(id="m5", difficulty="medium", task_type="call", priority="low", deadline=45),
 
 
81
  ]
82
 
83
  elif level == "hard":
84
- # 8 tasks with task dependencies + 2 mid-episode interruptions
85
  return [
86
- Task(id="h1", difficulty="hard", task_type="email", priority="critical", deadline=12),
87
- Task(id="h2", difficulty="hard", task_type="code_review", priority="high", deadline=16),
88
- Task(id="h3", difficulty="hard", task_type="meeting", priority="critical", deadline=20, depends_on="h1"),
89
- Task(id="h4", difficulty="hard", task_type="report", priority="high", deadline=24),
90
- Task(id="h5", difficulty="hard", task_type="call", priority="normal", deadline=28, depends_on="h2"),
91
- Task(id="h6", difficulty="hard", task_type="email", priority="high", deadline=32),
92
- Task(id="h7", difficulty="hard", task_type="code_review", priority="critical", deadline=38, depends_on="h4"),
93
- Task(id="h8", difficulty="hard", task_type="report", priority="normal", deadline=46),
 
94
  ]
95
 
96
  elif level == "expert":
97
- # 10 tasks, deep dependencies, 3 mid-episode interruptions
98
  return [
99
- Task(id="x1", difficulty="expert", task_type="email", priority="critical", deadline=8),
100
- Task(id="x2", difficulty="expert", task_type="code_review", priority="high", deadline=12),
101
- Task(id="x3", difficulty="expert", task_type="meeting", priority="critical", deadline=14, depends_on="x1"),
102
- Task(id="x4", difficulty="expert", task_type="report", priority="high", deadline=18, depends_on="x2"),
103
- Task(id="x5", difficulty="expert", task_type="call", priority="normal", deadline=22, depends_on="x3"),
104
- Task(id="x6", difficulty="expert", task_type="code_review", priority="critical", deadline=24),
105
- Task(id="x7", difficulty="expert", task_type="email", priority="high", deadline=28, depends_on="x4"),
106
- Task(id="x8", difficulty="expert", task_type="report", priority="normal", deadline=33, depends_on="x6"),
107
- Task(id="x9", difficulty="expert", task_type="meeting", priority="critical", deadline=36, depends_on="x5"),
108
- Task(id="x10", difficulty="expert", task_type="call", priority="high", deadline=44),
 
109
  ]
110
 
111
  return []
@@ -126,8 +228,19 @@ def _inject_interruption(state: EnvState, step: int) -> None:
126
  # GRADER
127
  # ==========================================
128
  def grader(trajectory: dict) -> float:
129
- """OpenEnv single-argument grader."""
130
- raw_tasks = trajectory.get("tasks", [])
 
131
  ts = trajectory.get("time_step", 50)
132
  eng = trajectory.get("energy", 0.5)
133
  task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
@@ -136,41 +249,35 @@ def grader(trajectory: dict) -> float:
136
 
137
  def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float) -> float:
138
  """
139
- Additive grader producing strictly different scores per difficulty:
140
- easy ≈ 0.70–0.80 (completes all tasks, no deadlines)
141
- medium ≈ 0.38–0.55 (completes 2–3/5 with deadlines)
142
- hard ≈ 0.18–0.30 (completes 2–3/10 with dependencies)
143
- expert ≈ 0.06–0.15 (completes 1–2/13 with interruptions)
144
-
145
- Score formula (additive – no harsh subtractive penalties):
146
- weighted_completion × 0.60 (primary driver)
147
- + deadline_adherence × 0.22 (fraction of tasks meeting deadline)
148
- + energy_efficiency × 0.10 (reward for not burning out)
149
- + dependency_bonus × 0.05 (rewarded correct sequencing)
150
- + interruption_bonus × 0.03 (handled urgent tasks)
151
-
152
- Always returns value in (0.01, 0.99).
153
  """
154
  if not tasks:
155
  return 0.01
156
 
157
  total_weight = sum(PRIORITY_WEIGHT[t.priority] for t in tasks)
158
 
159
- # ── Weighted completion (partial progress counts) ──────────────────────────
160
  wc = sum(t.progress * PRIORITY_WEIGHT[t.priority] for t in tasks) / max(total_weight, 0.01)
161
 
162
- # ── Deadline adherence (fraction of COMPLETABLE tasks that met deadline) ───
163
- completable = [t for t in tasks if t.deadline is not None]
164
- met_deadline = sum(
165
  1 for t in completable
166
  if t.progress >= 1.0 and time_step <= t.deadline
167
  )
168
  da = (met_deadline / len(completable)) if completable else 1.0
169
 
170
- # ── Energy efficiency ─────────────────────────────────────────────────────
171
  ee = max(0.0, (final_energy - 0.10) * 0.13)
172
 
173
- # ── Dependency ordering bonus ──────────────────────────────────────────────
174
  dep_bonus = 0.0
175
  for t in tasks:
176
  if t.depends_on and t.progress >= 1.0:
@@ -179,11 +286,11 @@ def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float)
179
  dep_bonus += 0.015
180
  dep_bonus = min(0.05, dep_bonus)
181
 
182
- # ── Interruption handling bonus ────────────────────────────────────────────
183
  interrupted = [t for t in tasks if t.is_interrupted]
184
  int_bonus = 0.0
185
  if interrupted:
186
- handled = sum(1 for t in interrupted if t.progress >= 1.0)
187
  int_bonus = min(0.03, (handled / len(interrupted)) * 0.03)
188
 
189
  raw = wc * 0.60 + da * 0.22 + ee + dep_bonus + int_bonus
@@ -191,22 +298,41 @@ def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float)
191
 
192
 
193
  # ==========================================
194
- # OPENENV ENVIRONMENT
 
 
 
195
  # ==========================================
196
- class CLMEnvironment:
197
- _INTERRUPT_STEPS = {
198
- "hard": [15, 32],
199
- "expert": [7, 18, 32],
200
- }
 
201
 
202
- def __init__(self, tasks: list[Task], max_steps: int = 50):
 
 
203
  self.max_steps = max_steps
204
  self.initial_tasks = tasks
205
  self.difficulty = tasks[0].difficulty if tasks else "easy"
206
- self.state = EnvState(tasks=[t.model_copy() for t in tasks])
 
 
 
 
 
 
 
207
 
208
  def reset(self) -> Observation:
209
- self.state = EnvState(tasks=[t.model_copy() for t in self.initial_tasks])
 
 
 
 
 
 
210
  return self._get_observation()
211
 
212
  def _blocked_ids(self) -> set[str]:
@@ -221,12 +347,16 @@ class CLMEnvironment:
221
 
222
  def _get_observation(self) -> Observation:
223
  e = self.state.energy
224
- fl = "high" if e < 0.30 else ("medium" if e < 0.60 else "low")
 
 
 
 
 
225
  vs = VisibleState(
226
- fatigue_level=fl,
227
- stress_warning=self.state.stress > 0.65,
228
- energy_level=round(e, 3),
229
- stress_level=round(self.state.stress, 3),
230
  focus_mode=self.state.focus_mode,
231
  upcoming_deadlines=self._upcoming_ids(),
232
  blocked_tasks=list(self._blocked_ids()),
@@ -237,23 +367,25 @@ class CLMEnvironment:
237
  reward = 0.0
238
  blocked = self._blocked_ids()
239
 
240
- # ── Inject interruptions ───────────────────────────────────────────────
241
- int_steps = self._INTERRUPT_STEPS.get(self.difficulty, [])
242
- if (self.state.time_step in int_steps
243
- and self.state.interruption_count < len(int_steps)):
244
  _inject_interruption(self.state, self.state.time_step)
 
 
245
  reward -= 0.05
246
 
247
- # ── Action processing ──────────────────────────────────────────────────
248
  if action.type in ("work", "focus"):
249
  is_focus = (action.type == "focus")
250
 
251
  if action.task_id:
252
  if action.task_id in blocked:
253
- reward -= 0.15 # tried to work on blocked task
254
  else:
255
  if self.state.current_task_id and self.state.current_task_id != action.task_id:
256
- reward -= 0.07 # context-switch cost
257
  self.state.current_task_id = action.task_id
258
  self.state.focus_mode = is_focus
259
 
@@ -272,7 +404,6 @@ class CLMEnvironment:
272
 
273
  reward += 0.10 * (task.progress - old_p) * pw
274
 
275
- # Milestone rewards
276
  for ms, bonus in [(0.25, 0.04), (0.50, 0.07), (0.75, 0.09), (1.00, 0.18)]:
277
  key = f"{task.id}@{ms}"
278
  if task.progress >= ms and key not in self.state.milestone_rewards:
@@ -298,7 +429,7 @@ class CLMEnvironment:
298
 
299
  self.state.time_step += 1
300
 
301
- # ── Stress dynamics ────────────────────────────────────────────────────
302
  for t in (tt for tt in self.state.tasks if tt.progress < 1.0):
303
  if t.deadline:
304
  ttd = t.deadline - self.state.time_step
@@ -308,7 +439,7 @@ class CLMEnvironment:
308
  elif ttd < 0:
309
  self.state.stress = min(1.0, self.state.stress + 0.12 * pw)
310
 
311
- # ── Episode termination ────────────────────────────────────────────────
312
  all_done = all(t.progress >= 1.0 for t in self.state.tasks)
313
  burnout = self.state.energy < 0.07
314
  timeout = self.state.time_step >= self.max_steps
 
1
  from pydantic import BaseModel, Field
2
  from typing import List, Optional, Literal, Tuple, Dict, Any
3
+ import random
4
 
5
  # ==========================================
6
+ # TASK TYPES
7
  # ==========================================
8
  TaskType = Literal["email", "meeting", "code_review", "report", "call"]
9
  Priority = Literal["critical", "high", "normal", "low"]
 
12
  TASK_ENERGY_COST = {"email": 0.08, "meeting": 0.18, "code_review": 0.20, "report": 0.14, "call": 0.11}
13
  TASK_PROGRESS_RATE = {"email": 0.35, "meeting": 0.30, "code_review": 0.20, "report": 0.22, "call": 0.28}
14
 
15
+ ALL_TASK_TYPES: list[TaskType] = ["email", "meeting", "code_review", "report", "call"]
16
+ ALL_PRIORITIES: list[Priority] = ["critical", "high", "normal", "low"]
17
+
18
  # ==========================================
19
  # OPENENV SCHEMAS
20
  # ==========================================
 
25
  priority: Priority = "normal"
26
  progress: float = 0.0
27
  deadline: Optional[int] = None
28
+ depends_on: Optional[str] = None
29
+ is_interrupted: bool = False
30
 
31
  class VisibleState(BaseModel):
32
+ """
33
+ FIX 6 – Partial observability: agent sees only categorical labels,
34
+ not raw float values for energy/stress. This rewards agents that
35
+ reason from context rather than reading exact numbers.
36
+ """
37
  fatigue_level: str # "low" | "medium" | "high"
38
+ stress_level: str # "calm" | "elevated" | "critical"
39
  stress_warning: bool
 
 
40
  focus_mode: bool = False
41
+ upcoming_deadlines: List[str] = []
42
+ blocked_tasks: List[str] = []
43
+ # energy_level and stress_level floats removed – use fatigue_level / stress_level instead
44
 
45
  class Observation(BaseModel):
46
  tasks: List[Task]
 
49
 
50
  class Action(BaseModel):
51
  type: Literal["work", "break", "switch", "delay", "focus"]
 
 
 
 
 
52
  task_id: Optional[str] = None
53
 
54
  class EnvState(BaseModel):
55
+ energy: float = 1.0
56
+ stress: float = 0.0
57
+ fatigue: float = 0.0
58
+ time_step: int = 0
59
+ current_task_id: Optional[str] = None
60
+ tasks: List[Task] = []
61
+ focus_mode: bool = False
62
+ interruption_count: int = 0
63
+ milestone_rewards: Dict[str, float] = {}
64
+ # FIX 3 – stochastic interrupt tracking
65
+ next_interrupt_eligible: int = 999
66
+ interrupt_budget: int = 0
67
 
68
 
69
  # ==========================================
70
+ # FIX 2 – PROCEDURAL TASK GENERATION
71
+ # Seed-based so episodes are reproducible on request but vary by default.
72
+ # Deadlines jitter +-3 steps; task types and secondary priorities randomised.
73
  # ==========================================
74
+ def generate_tasks(level: str, seed: Optional[int] = None) -> list[Task]:
75
+ """
76
+ Generate tasks for the given difficulty level.
77
+ Pass seed=None for a random seed (default for live play),
78
+ or an explicit int for reproducible evaluation runs.
79
+ """
80
+ rng = random.Random(seed)
81
+
82
+ def _jitter(base: int, lo: int = -3, hi: int = 3) -> int:
83
+ return max(1, base + rng.randint(lo, hi))
84
+
85
+ def _p(pool: list) -> str:
86
+ return rng.choice(pool)
87
+
88
  if level == "easy":
 
89
  return [
90
+ Task(id="e1", difficulty="easy",
91
+ task_type=_p(["email", "report"]),
92
+ priority=_p(["normal", "high"]),
93
+ deadline=None),
94
+ Task(id="e2", difficulty="easy",
95
+ task_type=_p(["report", "code_review"]),
96
+ priority=_p(["normal", "low"]),
97
+ deadline=None),
98
  ]
99
 
100
  elif level == "medium":
 
101
  return [
102
+ Task(id="m1", difficulty="medium",
103
+ task_type=_p(["email", "call"]),
104
+ priority="critical",
105
+ deadline=_jitter(14)),
106
+ Task(id="m2", difficulty="medium",
107
+ task_type=_p(["meeting", "code_review"]),
108
+ priority=_p(["high", "normal"]),
109
+ deadline=_jitter(20)),
110
+ Task(id="m3", difficulty="medium",
111
+ task_type=_p(["code_review", "report"]),
112
+ priority=_p(["normal", "high"]),
113
+ deadline=_jitter(28)),
114
+ Task(id="m4", difficulty="medium",
115
+ task_type=_p(["report", "meeting"]),
116
+ priority=_p(["high", "normal"]),
117
+ deadline=_jitter(35)),
118
+ Task(id="m5", difficulty="medium",
119
+ task_type=_p(["call", "email"]),
120
+ priority=_p(["low", "normal"]),
121
+ deadline=_jitter(45)),
122
  ]
123
 
124
  elif level == "hard":
 
125
  return [
126
+ Task(id="h1", difficulty="hard",
127
+ task_type=_p(["email", "call"]),
128
+ priority="critical",
129
+ deadline=_jitter(12)),
130
+ Task(id="h2", difficulty="hard",
131
+ task_type=_p(["code_review", "report"]),
132
+ priority=_p(["high", "normal"]),
133
+ deadline=_jitter(16)),
134
+ Task(id="h3", difficulty="hard",
135
+ task_type=_p(["meeting", "call"]),
136
+ priority="critical",
137
+ deadline=_jitter(20),
138
+ depends_on="h1"),
139
+ Task(id="h4", difficulty="hard",
140
+ task_type=_p(["report", "code_review"]),
141
+ priority=_p(["high", "normal"]),
142
+ deadline=_jitter(24)),
143
+ Task(id="h5", difficulty="hard",
144
+ task_type=_p(["call", "meeting"]),
145
+ priority=_p(["normal", "high"]),
146
+ deadline=_jitter(28),
147
+ depends_on="h2"),
148
+ Task(id="h6", difficulty="hard",
149
+ task_type=_p(["email", "report"]),
150
+ priority=_p(["high", "normal"]),
151
+ deadline=_jitter(32)),
152
+ Task(id="h7", difficulty="hard",
153
+ task_type=_p(["code_review", "meeting"]),
154
+ priority="critical",
155
+ deadline=_jitter(38),
156
+ depends_on="h4"),
157
+ Task(id="h8", difficulty="hard",
158
+ task_type=_p(["report", "email"]),
159
+ priority=_p(["normal", "low"]),
160
+ deadline=_jitter(46)),
161
  ]
162
 
163
  elif level == "expert":
 
164
  return [
165
+ Task(id="x1", difficulty="expert",
166
+ task_type=_p(["email", "call"]),
167
+ priority="critical",
168
+ deadline=_jitter(8)),
169
+ Task(id="x2", difficulty="expert",
170
+ task_type=_p(["code_review", "report"]),
171
+ priority=_p(["high", "critical"]),
172
+ deadline=_jitter(12)),
173
+ Task(id="x3", difficulty="expert",
174
+ task_type=_p(["meeting", "call"]),
175
+ priority="critical",
176
+ deadline=_jitter(14),
177
+ depends_on="x1"),
178
+ Task(id="x4", difficulty="expert",
179
+ task_type=_p(["report", "code_review"]),
180
+ priority=_p(["high", "normal"]),
181
+ deadline=_jitter(18),
182
+ depends_on="x2"),
183
+ Task(id="x5", difficulty="expert",
184
+ task_type=_p(["call", "meeting"]),
185
+ priority=_p(["normal", "high"]),
186
+ deadline=_jitter(22),
187
+ depends_on="x3"),
188
+ Task(id="x6", difficulty="expert",
189
+ task_type=_p(["code_review", "email"]),
190
+ priority="critical",
191
+ deadline=_jitter(24)),
192
+ Task(id="x7", difficulty="expert",
193
+ task_type=_p(["email", "report"]),
194
+ priority=_p(["high", "normal"]),
195
+ deadline=_jitter(28),
196
+ depends_on="x4"),
197
+ Task(id="x8", difficulty="expert",
198
+ task_type=_p(["report", "call"]),
199
+ priority=_p(["normal", "high"]),
200
+ deadline=_jitter(33),
201
+ depends_on="x6"),
202
+ Task(id="x9", difficulty="expert",
203
+ task_type=_p(["meeting", "code_review"]),
204
+ priority="critical",
205
+ deadline=_jitter(36),
206
+ depends_on="x5"),
207
+ Task(id="x10", difficulty="expert",
208
+ task_type=_p(["call", "email"]),
209
+ priority=_p(["high", "normal"]),
210
+ deadline=_jitter(44)),
211
  ]
212
 
213
  return []
 
228
  # GRADER
229
  # ==========================================
230
  def grader(trajectory: dict) -> float:
231
+ """
232
+ OpenEnv single-argument grader.
233
+
234
+ FIX 1: If trajectory is empty or missing tasks, return 0.01 immediately.
235
+ The grader MUST score the actual agent trajectory β€” it must never silently
236
+ fall back to re-running a heuristic episode. Doing so would let the
237
+ environment grade itself rather than the agent under evaluation.
238
+ """
239
+ if not trajectory or not trajectory.get("tasks"):
240
+ # Empty trajectory = agent produced no useful state → minimum score
241
+ return 0.01
242
+
243
+ raw_tasks = trajectory["tasks"]
244
  ts = trajectory.get("time_step", 50)
245
  eng = trajectory.get("energy", 0.5)
246
  task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
 
249
 
250
  def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float) -> float:
251
  """
252
+ Scores the ACTUAL final task state. Always returns a value in (0.01, 0.99).
253
+
254
+ Formula:
255
+ weighted_completion x 0.60
256
+ deadline_adherence x 0.22
257
+ energy_efficiency x 0.10
258
+ dependency_bonus x 0.05
259
+ interruption_bonus x 0.03
 
 
 
 
 
 
260
  """
261
  if not tasks:
262
  return 0.01
263
 
264
  total_weight = sum(PRIORITY_WEIGHT[t.priority] for t in tasks)
265
 
266
+ # Weighted completion (partial progress counts)
267
  wc = sum(t.progress * PRIORITY_WEIGHT[t.priority] for t in tasks) / max(total_weight, 0.01)
268
 
269
+ # Deadline adherence
270
+ completable = [t for t in tasks if t.deadline is not None]
271
+ met_deadline = sum(
272
  1 for t in completable
273
  if t.progress >= 1.0 and time_step <= t.deadline
274
  )
275
  da = (met_deadline / len(completable)) if completable else 1.0
276
 
277
+ # Energy efficiency
278
  ee = max(0.0, (final_energy - 0.10) * 0.13)
279
 
280
+ # Dependency ordering bonus
281
  dep_bonus = 0.0
282
  for t in tasks:
283
  if t.depends_on and t.progress >= 1.0:
 
286
  dep_bonus += 0.015
287
  dep_bonus = min(0.05, dep_bonus)
288
 
289
+ # Interruption handling bonus
290
  interrupted = [t for t in tasks if t.is_interrupted]
291
  int_bonus = 0.0
292
  if interrupted:
293
+ handled = sum(1 for t in interrupted if t.progress >= 1.0)
294
  int_bonus = min(0.03, (handled / len(interrupted)) * 0.03)
295
 
296
  raw = wc * 0.60 + da * 0.22 + ee + dep_bonus + int_bonus
 
298
 
299
 
300
  # ==========================================
301
+ # FIX 3 – STOCHASTIC INTERRUPTION CONFIG
302
+ # Interruptions fire with a per-step probability once an eligibility
303
+ # window opens, with a cooldown to prevent back-to-back fires.
304
+ # budget = max number of interrupts for the difficulty level.
305
  # ==========================================
306
+ _INTERRUPT_CONFIG = {
307
+ # prob_per_step eligible_from cooldown_steps budget
308
+ "hard": (0.18, 10, 8, 2),
309
+ "expert": (0.22, 6, 7, 3),
310
+ }
311
+
312
 
313
+ class CLMEnvironment:
314
+ def __init__(self, tasks: list[Task], max_steps: int = 50,
315
+ seed: Optional[int] = None):
316
  self.max_steps = max_steps
317
  self.initial_tasks = tasks
318
  self.difficulty = tasks[0].difficulty if tasks else "easy"
319
+ self._rng = random.Random(seed)
320
+ cfg = _INTERRUPT_CONFIG.get(self.difficulty, (0.0, 999, 999, 0))
321
+ self._interrupt_prob, eligible_from, self._cooldown, budget = cfg
322
+ self.state = EnvState(
323
+ tasks=[t.model_copy() for t in tasks],
324
+ next_interrupt_eligible=eligible_from,
325
+ interrupt_budget=budget,
326
+ )
327
 
328
  def reset(self) -> Observation:
329
+ cfg = _INTERRUPT_CONFIG.get(self.difficulty, (0.0, 999, 999, 0))
330
+ _, eligible_from, _, budget = cfg
331
+ self.state = EnvState(
332
+ tasks=[t.model_copy() for t in self.initial_tasks],
333
+ next_interrupt_eligible=eligible_from,
334
+ interrupt_budget=budget,
335
+ )
336
  return self._get_observation()
337
 
338
  def _blocked_ids(self) -> set[str]:
 
347
 
348
  def _get_observation(self) -> Observation:
349
  e = self.state.energy
350
+ s = self.state.stress
351
+
352
+ # FIX 6: Categorical labels only – no raw floats exposed to agent
353
+ fatigue_label = "high" if e < 0.30 else ("medium" if e < 0.60 else "low")
354
+ stress_label = "critical" if s > 0.75 else ("elevated" if s > 0.45 else "calm")
355
+
356
  vs = VisibleState(
357
+ fatigue_level=fatigue_label,
358
+ stress_level=stress_label,
359
+ stress_warning=s > 0.65,
 
360
  focus_mode=self.state.focus_mode,
361
  upcoming_deadlines=self._upcoming_ids(),
362
  blocked_tasks=list(self._blocked_ids()),
 
367
  reward = 0.0
368
  blocked = self._blocked_ids()
369
 
370
+ # FIX 3: Stochastic interruption – probabilistic, not fixed-step
371
+ if (self.state.interrupt_budget > 0
372
+ and self.state.time_step >= self.state.next_interrupt_eligible
373
+ and self._rng.random() < self._interrupt_prob):
374
  _inject_interruption(self.state, self.state.time_step)
375
+ self.state.interrupt_budget -= 1
376
+ self.state.next_interrupt_eligible = self.state.time_step + self._cooldown
377
  reward -= 0.05
378
 
379
+ # Action processing
380
  if action.type in ("work", "focus"):
381
  is_focus = (action.type == "focus")
382
 
383
  if action.task_id:
384
  if action.task_id in blocked:
385
+ reward -= 0.15
386
  else:
387
  if self.state.current_task_id and self.state.current_task_id != action.task_id:
388
+ reward -= 0.07
389
  self.state.current_task_id = action.task_id
390
  self.state.focus_mode = is_focus
391
 
 
404
 
405
  reward += 0.10 * (task.progress - old_p) * pw
406
 
 
407
  for ms, bonus in [(0.25, 0.04), (0.50, 0.07), (0.75, 0.09), (1.00, 0.18)]:
408
  key = f"{task.id}@{ms}"
409
  if task.progress >= ms and key not in self.state.milestone_rewards:
 
429
 
430
  self.state.time_step += 1
431
 
432
+ # Stress dynamics
433
  for t in (tt for tt in self.state.tasks if tt.progress < 1.0):
434
  if t.deadline:
435
  ttd = t.deadline - self.state.time_step
 
439
  elif ttd < 0:
440
  self.state.stress = min(1.0, self.state.stress + 0.12 * pw)
441
 
442
+ # Episode termination
443
  all_done = all(t.progress >= 1.0 for t in self.state.tasks)
444
  burnout = self.state.energy < 0.07
445
  timeout = self.state.time_step >= self.max_steps
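Taken together, the seeded `generate_tasks(level, seed=...)` (FIX 2) and the seeded `CLMEnvironment` (FIX 3) make episodes reproducible on demand. A rough end-to-end sketch, illustrative only, mirroring the loop used in the new tests:

```python
# Illustrative episode loop; the (obs, reward, done, info) step signature
# matches the calls in tests/test_clm.py below.
from models import Action, CLMEnvironment, generate_tasks, deterministic_grader

tasks = generate_tasks("hard", seed=7)                    # FIX 2: seeded, jittered layout
env = CLMEnvironment(tasks=tasks, max_steps=50, seed=7)   # FIX 3: seeded interruptions
obs = env.reset()

done = False
while not done:
    # Naive policy for illustration: always grind the first task.
    # A real agent would read obs.visible_state and plan around deadlines.
    obs, reward, done, info = env.step(Action(type="work", task_id=tasks[0].id))

print(deterministic_grader(env.state.tasks, env.state.time_step, env.state.energy))
```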
openenv.yaml CHANGED
@@ -48,10 +48,10 @@ observation_space:
48
  - depends_on: task_id or null
49
  - is_interrupted: bool
50
  visible_state:
51
- - fatigue_level: "low | medium | high"
52
- - stress_warning: bool
53
- - energy_level: float [0.0, 1.0]
54
- - stress_level: float [0.0, 1.0]
55
  - focus_mode: bool
56
  - upcoming_deadlines: list[task_id]
57
  - blocked_tasks: list[task_id]
 
48
  - depends_on: task_id or null
49
  - is_interrupted: bool
50
  visible_state:
51
+ # Partial observability: energy/stress are categorical labels, not raw floats.
52
+ - fatigue_level: "low | medium | high" # energy bands: >0.6 | 0.3-0.6 | <0.3
53
+ - stress_level: "calm | elevated | critical" # stress bands: <0.45 | 0.45-0.75 | >0.75
54
+ - stress_warning: bool # true when stress > 0.65
55
  - focus_mode: bool
56
  - upcoming_deadlines: list[task_id]
57
  - blocked_tasks: list[task_id]
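The band thresholds documented in these comments match `_get_observation` in models.py above; a tiny sketch restating them, purely to make the categorical mapping concrete:

```python
# Same thresholds as models.py _get_observation (FIX 6); shown here only to
# illustrate the bands documented in openenv.yaml.
def fatigue_label(energy: float) -> str:
    return "high" if energy < 0.30 else ("medium" if energy < 0.60 else "low")

def stress_label(stress: float) -> str:
    return "critical" if stress > 0.75 else ("elevated" if stress > 0.45 else "calm")

assert fatigue_label(0.9) == "low" and fatigue_label(0.45) == "medium" and fatigue_label(0.2) == "high"
assert stress_label(0.2) == "calm" and stress_label(0.6) == "elevated" and stress_label(0.8) == "critical"
```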
server/app.py CHANGED
@@ -1,15 +1,21 @@
1
- import uvicorn
 
 
 
 
 
 
 
 
2
  import sys
3
  import os
4
 
5
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
6
 
7
- from backend.main import app # app is now importable as server.app:app
8
-
9
-
10
- def main():
11
- uvicorn.run(app, host="0.0.0.0", port=7860)
12
 
 
13
 
14
  if __name__ == "__main__":
15
- main()
 
 
1
+ """
2
+ server/app.py – single entry point for CLM OpenEnv server.
3
+
4
+ Imports the FastAPI app built in backend/main.py and exposes it for:
5
+ - Dockerfile: uvicorn server.app:app --host 0.0.0.0 --port 7860
6
+ - openenv.yaml: app: server.app:app
7
+
8
+ All route logic lives in backend/main.py. This file is intentionally thin.
9
+ """
10
  import sys
11
  import os
12
 
13
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
14
 
15
+ from backend.main import app # single source of truth for the FastAPI app
 
 
 
 
16
 
17
+ __all__ = ["app"]
18
 
19
  if __name__ == "__main__":
20
+ import uvicorn
21
+ uvicorn.run(app, host="0.0.0.0", port=7860)
tests/test_clm.py ADDED
@@ -0,0 +1,257 @@
 
1
+ """
2
+ tests/test_clm.py – unit tests for the Cognitive Load Manager environment.
3
+
4
+ Run with: pytest tests/test_clm.py -v
5
+ """
6
+ import sys, os, pytest
7
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
8
+
9
+ from models import (
10
+ Action, Task, EnvState, CLMEnvironment,
11
+ generate_tasks, deterministic_grader, grader,
12
+ PRIORITY_WEIGHT,
13
+ )
14
+ from grader.clm_graders import (
15
+ EasyGrader, MediumGrader, HardGrader, ExpertGrader, _from_trajectory,
16
+ )
17
+
18
+
19
+ # ─────────────────────────────────────────────────────────────────────────────
20
+ # FIX 2 – Procedural generation
21
+ # ─────────────────────────────────────────────────────────────────────────────
22
+ class TestProceduralGeneration:
23
+ def test_seed_produces_same_tasks(self):
24
+ a = generate_tasks("medium", seed=7)
25
+ b = generate_tasks("medium", seed=7)
26
+ assert [t.model_dump() for t in a] == [t.model_dump() for t in b]
27
+
28
+ def test_different_seeds_differ(self):
29
+ results = set()
30
+ for s in range(20):
31
+ tasks = generate_tasks("medium", seed=s)
32
+ results.add(tuple(t.deadline for t in tasks))
33
+ assert len(results) > 1, "All seeds produced identical deadlines"
34
+
35
+ def test_task_counts(self):
36
+ assert len(generate_tasks("easy")) == 2
37
+ assert len(generate_tasks("medium")) == 5
38
+ assert len(generate_tasks("hard")) == 8
39
+ assert len(generate_tasks("expert")) == 10
40
+
41
+ def test_deadlines_positive_and_bounded(self):
42
+ """Jitter can reorder adjacent deadlines, but all must be positive and sane."""
43
+ base_deadlines = {"medium": [14, 20, 28, 35, 45], "hard": [12, 16, 20, 24, 28, 32, 38, 46]}
44
+ for level, bases in base_deadlines.items():
45
+ for seed in range(20):
46
+ tasks = generate_tasks(level, seed=seed)
47
+ for t in tasks:
48
+ if t.deadline is not None:
49
+ assert t.deadline >= 1, f"Deadline must be >= 1, got {t.deadline}"
50
+ # Should be within ±5 of the nearest base (generous bound)
51
+ nearest = min(bases, key=lambda b: abs(b - t.deadline))
52
+ assert abs(t.deadline - nearest) <= 5, \
53
+ f"Deadline {t.deadline} too far from base {nearest}"
54
+
55
+
56
+ # ─────────────────────────────────────────────────────────────────────────────
57
+ # FIX 1 – Grader trajectory bug
58
+ # ─────────────────────────────────────────────────────────────────────────────
59
+ class TestGraderTrajectoryBug:
60
+ def test_empty_trajectory_returns_min(self):
61
+ assert grader({}) == 0.01
62
+
63
+ def test_missing_tasks_returns_min(self):
64
+ assert grader({"time_step": 50, "energy": 0.8}) == 0.01
65
+
66
+ def test_empty_tasks_list_returns_min(self):
67
+ assert grader({"tasks": [], "time_step": 50, "energy": 0.8}) == 0.01
68
+
69
+ def test_grader_class_empty_trajectory(self):
70
+ for cls in [EasyGrader, MediumGrader, HardGrader, ExpertGrader]:
71
+ score = cls()(trajectory={})
72
+ assert score == 0.01, f"{cls.__name__} returned {score} for empty trajectory"
73
+
74
+ def test_from_trajectory_empty(self):
75
+ score, success, msg = _from_trajectory({}, "easy")
76
+ assert score == 0.01
77
+ assert success is False
78
+ assert "empty trajectory" in msg
79
+
80
+ def test_real_trajectory_scores_above_min(self):
81
+ """A trajectory with completed tasks should score > 0.01."""
82
+ tasks = generate_tasks("easy", seed=1)
83
+ for t in tasks:
84
+ t.progress = 1.0
85
+ traj = {"tasks": [t.model_dump() for t in tasks], "time_step": 20, "energy": 0.7}
86
+ assert grader(traj) > 0.01
87
+
88
+
89
+ # ─────────────────────────────────────────────────────────────────────────────
90
+ # Environment basics
91
+ # ─────────────────────────────────────────────────────────────────────────────
92
+ class TestReset:
93
+ def test_reset_produces_clean_state(self):
94
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0), max_steps=50)
95
+ obs = env.reset()
96
+ assert env.state.energy == 1.0
97
+ assert env.state.stress == 0.0
98
+ assert env.state.time_step == 0
99
+ assert all(t.progress == 0.0 for t in env.state.tasks)
100
+
101
+ def test_reset_after_episode_clears_state(self):
102
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0), max_steps=50)
103
+ env.reset()
104
+ for _ in range(10):
105
+ env.step(Action(type="work", task_id="e1"))
106
+ env.reset()
107
+ assert env.state.time_step == 0
108
+ assert env.state.energy == 1.0
109
+
110
+
111
+ # ─────────────────────────────────────────────────────────────────────────────
112
+ # Blocked-task penalty (Fix 3 indirectly – env mechanics)
113
+ # ─────────────────────────────────────────────────────────────────────────────
114
+ class TestBlockedTaskPenalty:
115
+ def test_working_on_blocked_task_gives_penalty(self):
116
+ tasks = generate_tasks("hard", seed=0)
117
+ env = CLMEnvironment(tasks=tasks, max_steps=50)
118
+ env.reset()
119
+
120
+ # h3 depends on h1 – h1 not done yet, so h3 is blocked
121
+ blocked = env._blocked_ids()
122
+ assert "h3" in blocked, "h3 should be blocked at episode start"
123
+
124
+ _, reward, _, _ = env.step(Action(type="work", task_id="h3"))
125
+ assert reward <= -0.15, f"Expected penalty for blocked task, got {reward}"
126
+
127
+
128
+ # ─────────────────────────────────────────────────────────────────────────────
129
+ # FIX 3 – Stochastic interruptions
130
+ # ─────────────────────────────────────────────────────────────────────────────
131
+ class TestStochasticInterruptions:
132
+ def test_hard_eventually_interrupts(self):
133
+ """Over many seeds, at least one hard episode should fire an interruption."""
134
+ fired = False
135
+ for seed in range(50):
136
+ tasks = generate_tasks("hard", seed=seed)
137
+ env = CLMEnvironment(tasks=tasks, max_steps=50, seed=seed)
138
+ env.reset()
139
+ done = False
140
+ while not done:
141
+ _, _, done, _ = env.step(Action(type="work", task_id=tasks[0].id))
142
+ if env.state.interruption_count > 0:
143
+ fired = True
144
+ break
145
+ assert fired, "Expected at least one interruption across 50 hard seeds"
146
+
147
+ def test_interruptions_respect_budget(self):
148
+ """Hard episodes should never exceed budget=2 interruptions."""
149
+ for seed in range(30):
150
+ tasks = generate_tasks("hard", seed=seed)
151
+ env = CLMEnvironment(tasks=tasks, max_steps=50, seed=seed)
152
+ env.reset()
153
+ done = False
154
+ while not done:
155
+ _, _, done, _ = env.step(Action(type="work", task_id=tasks[0].id))
156
+ assert env.state.interruption_count <= 2, \
157
+ f"Seed {seed}: got {env.state.interruption_count} interruptions, max is 2"
158
+
159
+ def test_no_interruptions_on_easy(self):
160
+ for seed in range(10):
161
+ tasks = generate_tasks("easy", seed=seed)
162
+ env = CLMEnvironment(tasks=tasks, max_steps=50, seed=seed)
163
+ env.reset()
164
+ done = False
165
+ while not done:
166
+ _, _, done, _ = env.step(Action(type="break"))
167
+ assert env.state.interruption_count == 0
168
+
169
+
170
+ # ─────────────────────────────────────────────────────────────────────────────
171
+ # Burnout terminates episode
172
+ # ─────────────────────────────────────────────────────────────────────────────
173
+ class TestBurnout:
174
+ def test_burnout_terminates_episode(self):
175
+ tasks = generate_tasks("easy", seed=0)
176
+ env = CLMEnvironment(tasks=tasks, max_steps=200)
177
+ env.reset()
178
+ env.state.energy = 0.08 # just above burnout threshold
179
+ done = False
180
+ for _ in range(5):
181
+ _, _, done, info = env.step(Action(type="work", task_id="e1"))
182
+ if done:
183
+ break
184
+ assert done, "Episode should terminate on burnout"
185
+
186
+ def test_burnout_applies_penalty(self):
187
+ tasks = generate_tasks("easy", seed=0)
188
+ env = CLMEnvironment(tasks=tasks, max_steps=200)
189
+ env.reset()
190
+ env.state.energy = 0.08
191
+ rewards = []
192
+ done = False
193
+ for _ in range(5):
194
+ _, r, done, _ = env.step(Action(type="work", task_id="e1"))
195
+ rewards.append(r)
196
+ if done:
197
+ break
198
+ assert any(r <= -0.5 for r in rewards), "Burnout should produce a large negative reward"
199
+
200
+
201
+ # ─────────────────────────────────────────────────────────────────────────────
202
+ # Grader score bounds
203
+ # ─────────────────────────────────────────────────────────────────────────────
204
+ class TestGraderBounds:
205
+ def test_grader_always_in_bounds(self):
206
+ for level in ["easy", "medium", "hard", "expert"]:
207
+ for seed in range(10):
208
+ tasks = generate_tasks(level, seed=seed)
209
+ for frac in [0.0, 0.3, 0.7, 1.0]:
210
+ for t in tasks:
211
+ t.progress = frac
212
+ score = deterministic_grader(tasks, time_step=30, final_energy=0.5)
213
+ assert 0.01 <= score <= 0.99, \
214
+ f"Score {score} out of bounds for {level} seed={seed} progress={frac}"
215
+
216
+ def test_grader_higher_completion_scores_higher(self):
217
+ tasks_low = generate_tasks("medium", seed=1)
218
+ tasks_high = generate_tasks("medium", seed=1)
219
+ for t in tasks_low: t.progress = 0.0
220
+ for t in tasks_high: t.progress = 1.0
221
+ assert deterministic_grader(tasks_high, 30, 0.7) > \
222
+ deterministic_grader(tasks_low, 30, 0.7)
223
+
224
+
225
+ # ─────────────────────────────────────────────────────────────────────────────
226
+ # FIX 6 – Partial observability
227
+ # ─────────────────────────────────────────────────────────────────────────────
228
+ class TestPartialObservability:
229
+ def test_observation_has_no_raw_floats(self):
230
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0))
231
+ obs = env.reset()
232
+ vs = obs.visible_state
233
+ # energy_level and stress float must NOT appear in visible state
234
+ assert not hasattr(vs, "energy_level"), "energy_level float should not be in observation"
235
+ assert isinstance(vs.fatigue_level, str)
236
+ assert isinstance(vs.stress_level, str)
237
+
238
+ def test_fatigue_levels_are_valid(self):
239
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0))
240
+ env.reset()
241
+ env.state.energy = 0.1 # should be "high" fatigue
242
+ obs = env._get_observation()
243
+ assert obs.visible_state.fatigue_level == "high"
244
+ env.state.energy = 0.5 # "medium"
245
+ assert env._get_observation().visible_state.fatigue_level == "medium"
246
+ env.state.energy = 0.9 # "low"
247
+ assert env._get_observation().visible_state.fatigue_level == "low"
248
+
249
+ def test_stress_levels_are_valid(self):
250
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0))
251
+ env.reset()
252
+ env.state.stress = 0.8
253
+ assert env._get_observation().visible_state.stress_level == "critical"
254
+ env.state.stress = 0.5
255
+ assert env._get_observation().visible_state.stress_level == "elevated"
256
+ env.state.stress = 0.1
257
+ assert env._get_observation().visible_state.stress_level == "calm"