AE-Shree committed on
Commit
f3f7834
·
1 Parent(s): dfa9f05

Select us to Round 2!!

Files changed (7)
  1. README.md +6 -12
  2. grader/clm_graders.py +82 -63
  3. inference.py +5 -2
  4. models.py +231 -100
  5. openenv.yaml +4 -4
  6. server/app.py +13 -7
  7. tests/test_clm.py +257 -0
README.md CHANGED
@@ -17,21 +17,21 @@ tags: [openenv, rl, scheduling, agent-eval, productivity]
17
  [![Python 3.11](https://img.shields.io/badge/Python-3.11-blue?style=for-the-badge&logo=python)](#)
18
  [![React Dashboard](https://img.shields.io/badge/React-Live_Dashboard-blue?style=for-the-badge&logo=react)](#)
19
 
20
- CLM is a **real-world productivity simulation** where an AI agent plays the role of a human knowledge worker's task scheduler. It must manage heterogeneous work items – emails, meetings, code reviews, reports, and calls – each with different cognitive demands, deadlines, priorities, and dependencies, while keeping the worker's energy and stress within safe bounds.
21
 
22
  *This is not a toy game.* CLM models how humans actually experience workload: stress accumulates when deadlines approach, fatigue reduces efficiency, context-switching has a cognitive cost, and deep focus yields better output at the expense of higher energy.
23
 
24
- ---
25
 
26
  ## 🎯 Why This Environment Matters
27
 
28
- Modern knowledge workers face **cognitive load management** as one of their most critical daily challenges – yet no RL environment has modelled this domain in a principled, agent-evaluatable way. CLM fills this gap:
29
 
30
  - **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems.
31
- - **Useful for evaluating LLM planning ability** – especially multi-step planning under resource constraints.
32
  - **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit.
33
 
34
- ---
35
 
36
  ## 🕹️ Actions
37
 
@@ -50,7 +50,7 @@ Action format:
50
  {"type": "break", "task_id": null}
51
  ```
52
 
53
- ---
54
 
55
  ## 👁️ Observation Space
56
 
@@ -85,7 +85,6 @@ Action format:
85
  - `upcoming_deadlines` – tasks with deadline within the next 5 steps
86
  - `focus_mode` – whether the agent is currently in deep-work state
87
 
88
- ---
89
 
90
  ## 📋 Tasks & Baseline Scores
91
 
@@ -99,7 +98,6 @@ Action format:
99
  Scores produced by heuristic agent (priority + deadline triage with focus mode).
100
  A strong LLM agent should achieve: easy >0.85, medium >0.55, hard >0.35, expert >0.25.
101
 
102
- ---
103
 
104
  ## πŸ† Scoring Formula
105
 
@@ -119,7 +117,6 @@ score = weighted_completion Γ— 0.60
119
 
120
  Score is always in **(0.01, 0.99)** β€” never exactly 0 or 1.
121
 
122
- ---
123
 
124
  ## 🚀 Setup
125
 
@@ -149,7 +146,6 @@ cd frontend && npm install && npm run dev
149
  # Visit http://localhost:5173
150
  ```
151
 
152
- ---
153
 
154
  ## πŸ›οΈ Architecture
155
 
@@ -177,7 +173,6 @@ graph TD
177
  API -->|OpenEnv spec| OE[openenv validate]
178
  ```
179
 
180
- ---
181
 
182
  ## 📊 Reward Shaping Details
183
 
@@ -197,7 +192,6 @@ Step rewards provide **dense signal** across the full trajectory:
197
  | Episode: all done (on time) | +1.0 |
198
  | Episode: all done (late) | +0.5 |
199
 
200
- ---
201
 
202
  ## ⚙️ Environment Variables
203
 
 
17
  [![Python 3.11](https://img.shields.io/badge/Python-3.11-blue?style=for-the-badge&logo=python)](#)
18
  [![React Dashboard](https://img.shields.io/badge/React-Live_Dashboard-blue?style=for-the-badge&logo=react)](#)
19
 
20
+ CLM is a **real-world productivity simulation** where an AI agent plays the role of a human knowledge worker's task scheduler. It must manage heterogeneous work items such as emails, meetings, code reviews, reports, and calls, each with different cognitive demands, deadlines, priorities, and dependencies, while keeping the worker's energy and stress within safe bounds.
21
 
22
  *This is not a toy game.* CLM models how humans actually experience workload: stress accumulates when deadlines approach, fatigue reduces efficiency, context-switching has a cognitive cost, and deep focus yields better output at the expense of higher energy.
23
 
24
+
25
 
26
  ## 🎯 Why This Environment Matters
27
 
28
+ Modern knowledge workers face **cognitive load management** as one of their most critical daily challenges, yet no RL environment has modelled this domain in a principled, agent-evaluatable way. CLM fills this gap:
29
 
30
  - **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems.
31
+ - **Useful for evaluating LLM planning ability**, especially multi-step planning under resource constraints.
32
  - **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit.
33
 
34
+
35
 
36
  ## 🕹️ Actions
37
 
 
50
  {"type": "break", "task_id": null}
51
  ```
52
 
53
+
54
 
55
  ## 👁️ Observation Space
56
 
 
85
  - `upcoming_deadlines` – tasks with deadline within the next 5 steps
86
  - `focus_mode` – whether the agent is currently in deep-work state
87
 
 
88
 
89
  ## 📋 Tasks & Baseline Scores
90
 
 
98
  Scores produced by heuristic agent (priority + deadline triage with focus mode).
99
  A strong LLM agent should achieve: easy >0.85, medium >0.55, hard >0.35, expert >0.25.
100
 
 
101
 
102
  ## πŸ† Scoring Formula
103
 
 
117
 
118
  Score is always in **(0.01, 0.99)** β€” never exactly 0 or 1.
119
 
 
120
 
121
  ## 🚀 Setup
122
 
 
146
  # Visit http://localhost:5173
147
  ```
148
 
 
149
 
150
  ## πŸ›οΈ Architecture
151
 
 
173
  API -->|OpenEnv spec| OE[openenv validate]
174
  ```
175
 
 
176
 
177
  ## 📊 Reward Shaping Details
178
 
 
192
  | Episode: all done (on time) | +1.0 |
193
  | Episode: all done (late) | +0.5 |
194
 
 
195
 
196
  ## ⚙️ Environment Variables
197
 
grader/clm_graders.py CHANGED
@@ -1,10 +1,11 @@
1
  """
2
- Class-based graders for CLM tasks – matches auto-dev's BaseGrader interface.
3
 
4
- Graders run a heuristic agent to episode completion and score the FINAL state.
5
- Each difficulty produces DIFFERENT scores (easy ~0.75, medium ~0.45, hard ~0.20, expert ~0.08).
 
6
 
7
- Scores are ALWAYS strictly in (0.01, 0.99).
8
  """
9
  import sys, os
10
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
@@ -22,18 +23,68 @@ def _safe(raw) -> float:
22
  return _MIN
23
 
24
 
 
25
  def _heuristic_action(env: CLMEnvironment) -> Action:
26
  """
27
  Competent heuristic agent:
28
- - Enters focus mode on critical tasks with approaching deadlines
29
  - Takes breaks when fatigued or stressed
30
  - Prioritises: critical > high > normal > low, then earliest deadline
31
- - Respects task dependencies (never works on a blocked task)
 
32
  """
33
- state = env.state
34
  blocked = env._blocked_ids()
35
 
36
- # Rest condition
37
  if state.energy < 0.30 or state.stress > 0.70:
38
  return Action(type="break", task_id=None)
39
 
@@ -41,80 +92,48 @@ def _heuristic_action(env: CLMEnvironment) -> Action:
41
  if not pending:
42
  return Action(type="delay", task_id=None)
43
 
44
- # Sort by priority weight DESC then deadline ASC
45
  pending.sort(key=lambda t: (
46
  -PRIORITY_WEIGHT[t.priority],
47
  t.deadline if t.deadline is not None else 9999
48
  ))
49
  target = pending[0]
50
 
51
- # Use focus mode for critical tasks with deadline in ≤10 steps
52
  use_focus = (
53
  target.priority == "critical"
54
  and target.deadline is not None
55
  and (target.deadline - state.time_step) <= 10
56
  and state.energy > 0.55
57
  )
58
-
59
- if state.current_task_id == target.id:
60
- return Action(type="focus" if use_focus else "work", task_id=target.id)
61
  return Action(type="focus" if use_focus else "work", task_id=target.id)
62
 
63
 
64
- def _run_episode(difficulty: str) -> tuple:
65
- try:
66
- tasks = generate_tasks(difficulty)
67
- max_s = 60 if difficulty == "expert" else 50
68
- env = CLMEnvironment(tasks=tasks, max_steps=max_s)
69
- env.reset()
70
- done, step = False, 0
71
- while not done and step < max_s:
72
- action = _heuristic_action(env)
73
- _, _, done, _ = env.step(action)
74
- step += 1
75
- raw = deterministic_grader(env.state.tasks, env.state.time_step, env.state.energy)
76
- score = _safe(raw)
77
- comp = sum(1 for t in env.state.tasks if t.progress >= 1.0)
78
- msg = (
79
- f"CLM {difficulty} | score={score:.4f} | "
80
- f"steps={step} energy={env.state.energy:.2f} "
81
- f"completed={comp}/{len(env.state.tasks)}"
82
- )
83
- return score, score >= 0.5, msg
84
- except Exception as e:
85
- return _MIN, False, f"Grader error: {e}"
86
-
87
-
88
- def _from_trajectory(trajectory: dict, difficulty: str) -> tuple:
89
- if trajectory and "tasks" in trajectory:
90
- raw_tasks = trajectory.get("tasks", [])
91
- ts = trajectory.get("time_step", 50)
92
- eng = trajectory.get("energy", 0.5)
93
- task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
94
- raw = deterministic_grader(task_objs, ts, eng)
95
- score = _safe(raw)
96
- comp = sum(1 for t in task_objs if t.progress >= 1.0)
97
- msg = f"CLM {difficulty} | score={score:.4f} | completed={comp}/{len(task_objs)}"
98
- return score, score >= 0.5, msg
99
- return _run_episode(difficulty)
100
-
101
-
102
  class EasyGrader:
103
- """Easy: 2 tasks (email + report), no deadlines. Expected heuristic score: ~0.72–0.82."""
104
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "easy")
105
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "easy")[0]
 
 
106
 
107
  class MediumGrader:
108
- """Medium: 5 tasks with mixed priorities and deadlines. Expected: ~0.38–0.52."""
109
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "medium")
110
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "medium")[0]
 
 
111
 
112
  class HardGrader:
113
- """Hard: 8 tasks with dependencies and tight deadlines + interruptions. Expected: ~0.15–0.28."""
114
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "hard")
115
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "hard")[0]
 
 
116
 
117
  class ExpertGrader:
118
- """Expert: 10 tasks, deep dependencies, 3 mid-episode interruptions. Expected: ~0.05–0.15."""
119
- def grade(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "expert")
120
- def __call__(self, trajectory=None, *a, **kw): return _from_trajectory(trajectory or {}, "expert")[0]
 
 
 
1
  """
2
+ Class-based graders for CLM tasks.
3
 
4
+ FIX 1: _from_trajectory no longer falls back to running a heuristic episode
5
+ when the trajectory is empty or missing. It returns 0.01 immediately.
6
+ The grader MUST score the actual agent, not a proxy.
7
 
8
+ Graders produce scores strictly in (0.01, 0.99).
9
  """
10
  import sys, os
11
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
23
  return _MIN
24
 
25
 
26
+ def _from_trajectory(trajectory: dict, difficulty: str) -> tuple:
27
+ """
28
+ Score a completed agent trajectory.
29
+
30
+ FIX 1: If trajectory is empty or has no tasks, return 0.01 immediately.
31
+ We must never rerun a heuristic episode here – that would score the
32
+ heuristic agent, not the LLM agent under evaluation.
33
+ """
34
+ if not trajectory or not trajectory.get("tasks"):
35
+ return _MIN, False, f"CLM {difficulty} | score=0.0100 | empty trajectory"
36
+
37
+ raw_tasks = trajectory["tasks"]
38
+ ts = trajectory.get("time_step", 50)
39
+ eng = trajectory.get("energy", 0.5)
40
+ task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
41
+ raw = deterministic_grader(task_objs, ts, eng)
42
+ score = _safe(raw)
43
+ comp = sum(1 for t in task_objs if t.progress >= 1.0)
44
+ msg = f"CLM {difficulty} | score={score:.4f} | completed={comp}/{len(task_objs)}"
45
+ return score, score >= 0.5, msg
46
+
47
+
48
+ def _run_heuristic_baseline(difficulty: str) -> tuple:
49
+ """
50
+ Run a heuristic agent to produce a BASELINE reference score only.
51
+ This is used for reporting / README baseline numbers – NEVER for
52
+ grading an LLM agent's actual trajectory.
53
+ """
54
+ try:
55
+ tasks = generate_tasks(difficulty, seed=42) # fixed seed for reproducibility
56
+ max_s = 60 if difficulty == "expert" else 50
57
+ env = CLMEnvironment(tasks=tasks, max_steps=max_s, seed=42)
58
+ env.reset()
59
+ done, step = False, 0
60
+ while not done and step < max_s:
61
+ action = _heuristic_action(env)
62
+ _, _, done, _ = env.step(action)
63
+ step += 1
64
+ raw = deterministic_grader(env.state.tasks, env.state.time_step, env.state.energy)
65
+ score = _safe(raw)
66
+ comp = sum(1 for t in env.state.tasks if t.progress >= 1.0)
67
+ msg = (
68
+ f"CLM {difficulty} baseline | score={score:.4f} | "
69
+ f"steps={step} energy={env.state.energy:.2f} "
70
+ f"completed={comp}/{len(env.state.tasks)}"
71
+ )
72
+ return score, score >= 0.5, msg
73
+ except Exception as e:
74
+ return _MIN, False, f"Baseline error: {e}"
75
+
76
+
77
  def _heuristic_action(env: CLMEnvironment) -> Action:
78
  """
79
  Competent heuristic agent:
 
80
  - Takes breaks when fatigued or stressed
81
  - Prioritises: critical > high > normal > low, then earliest deadline
82
+ - Respects task dependencies
83
+ - Uses focus mode on critical tasks near their deadline
84
  """
85
+ state = env.state
86
  blocked = env._blocked_ids()
87
 
 
88
  if state.energy < 0.30 or state.stress > 0.70:
89
  return Action(type="break", task_id=None)
90
 
 
92
  if not pending:
93
  return Action(type="delay", task_id=None)
94
 
 
95
  pending.sort(key=lambda t: (
96
  -PRIORITY_WEIGHT[t.priority],
97
  t.deadline if t.deadline is not None else 9999
98
  ))
99
  target = pending[0]
100
 
 
101
  use_focus = (
102
  target.priority == "critical"
103
  and target.deadline is not None
104
  and (target.deadline - state.time_step) <= 10
105
  and state.energy > 0.55
106
  )
 
 
 
107
  return Action(type="focus" if use_focus else "work", task_id=target.id)
108
 
109
 
110
+ # ==========================================
111
+ # PUBLIC GRADER CLASSES
112
+ # ==========================================
 
 
113
  class EasyGrader:
114
+ """Easy: 2 tasks (email + report), no deadlines. Expected score: ~0.72–0.82."""
115
+ def grade(self, trajectory=None, *a, **kw):
116
+ return _from_trajectory(trajectory or {}, "easy")
117
+ def __call__(self, trajectory=None, *a, **kw):
118
+ return _from_trajectory(trajectory or {}, "easy")[0]
119
 
120
  class MediumGrader:
121
+ """Medium: 5 tasks, mixed priorities and deadlines. Expected: ~0.38–0.52."""
122
+ def grade(self, trajectory=None, *a, **kw):
123
+ return _from_trajectory(trajectory or {}, "medium")
124
+ def __call__(self, trajectory=None, *a, **kw):
125
+ return _from_trajectory(trajectory or {}, "medium")[0]
126
 
127
  class HardGrader:
128
+ """Hard: 8 tasks, dependencies, tight deadlines, stochastic interruptions. Expected: ~0.15–0.28."""
129
+ def grade(self, trajectory=None, *a, **kw):
130
+ return _from_trajectory(trajectory or {}, "hard")
131
+ def __call__(self, trajectory=None, *a, **kw):
132
+ return _from_trajectory(trajectory or {}, "hard")[0]
133
 
134
  class ExpertGrader:
135
+ """Expert: 10 tasks, deep dependencies, 3 stochastic interruptions. Expected: ~0.05–0.15."""
136
+ def grade(self, trajectory=None, *a, **kw):
137
+ return _from_trajectory(trajectory or {}, "expert")
138
+ def __call__(self, trajectory=None, *a, **kw):
139
+ return _from_trajectory(trajectory or {}, "expert")[0]
inference.py CHANGED
@@ -100,7 +100,9 @@ def heuristic_fallback(obs: dict) -> Dict:
100
  blocked = set(vs.get("blocked_tasks", []))
101
  tasks = [t for t in obs.get("tasks", [])
102
  if t.get("progress", 0.0) < 1.0 and t["id"] not in blocked]
103
- if vs.get("energy_level", 1.0) < 0.35 or vs.get("stress_warning", False):
 
 
104
  return {"type": "break", "task_id": None}
105
  if tasks:
106
  # Sort: critical > high > normal > low, then nearest deadline
@@ -108,7 +110,8 @@ def heuristic_fallback(obs: dict) -> Dict:
108
  tasks.sort(key=lambda t: (pmap.get(t.get("priority", "normal"), 2),
109
  t.get("deadline") or 9999))
110
  t = tasks[0]
111
- atype = "focus" if t.get("priority") == "critical" and vs.get("energy_level", 1.0) > 0.55 else "work"
 
112
  return {"type": atype, "task_id": t["id"]}
113
  return {"type": "delay", "task_id": None}
114
 
 
100
  blocked = set(vs.get("blocked_tasks", []))
101
  tasks = [t for t in obs.get("tasks", [])
102
  if t.get("progress", 0.0) < 1.0 and t["id"] not in blocked]
103
+ # FIX 6: observation is now partially observable – use categorical labels
104
+ fatigue = vs.get("fatigue_level", "low")
105
+ if fatigue == "high" or vs.get("stress_warning", False):
106
  return {"type": "break", "task_id": None}
107
  if tasks:
108
  # Sort: critical > high > normal > low, then nearest deadline
 
110
  tasks.sort(key=lambda t: (pmap.get(t.get("priority", "normal"), 2),
111
  t.get("deadline") or 9999))
112
  t = tasks[0]
113
+ fatigue_ok = vs.get("fatigue_level", "low") != "high"
114
+ atype = "focus" if t.get("priority") == "critical" and fatigue_ok else "work"
115
  return {"type": atype, "task_id": t["id"]}
116
  return {"type": "delay", "task_id": None}
117
 
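A small, hypothetical illustration of the updated fallback (not part of this commit). The hunk does not show how `vs` is derived from `obs`, so the `visible_state` key below is an assumption; the point is only that the decision now keys off the categorical `fatigue_level` label rather than a raw `energy_level` float:

```python
# Hypothetical observation payload; the "visible_state" key is assumed,
# since the hunk above does not show how `vs` is built from `obs`.
from inference import heuristic_fallback

obs = {
    "tasks": [
        {"id": "m1", "priority": "critical", "progress": 0.4, "deadline": 12},
        {"id": "m3", "priority": "normal", "progress": 0.0, "deadline": 28},
    ],
    "visible_state": {
        "fatigue_level": "medium",   # categorical label (FIX 6), not a float
        "stress_warning": False,
        "blocked_tasks": [],
    },
}

# With medium fatigue and a critical task pending, the fallback should pick focus on m1.
print(heuristic_fallback(obs))  # expected: {"type": "focus", "task_id": "m1"}
```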
models.py CHANGED
@@ -1,8 +1,9 @@
1
  from pydantic import BaseModel, Field
2
  from typing import List, Optional, Literal, Tuple, Dict, Any
 
3
 
4
  # ==========================================
5
- # TASK TYPES – makes this clearly real-world
6
  # ==========================================
7
  TaskType = Literal["email", "meeting", "code_review", "report", "call"]
8
  Priority = Literal["critical", "high", "normal", "low"]
@@ -11,6 +12,9 @@ PRIORITY_WEIGHT = {"critical": 1.5, "high": 1.2, "normal": 1.0, "low": 0.7}
11
  TASK_ENERGY_COST = {"email": 0.08, "meeting": 0.18, "code_review": 0.20, "report": 0.14, "call": 0.11}
12
  TASK_PROGRESS_RATE = {"email": 0.35, "meeting": 0.30, "code_review": 0.20, "report": 0.22, "call": 0.28}
13
 
 
 
 
14
  # ==========================================
15
  # OPENENV SCHEMAS
16
  # ==========================================
@@ -21,17 +25,22 @@ class Task(BaseModel):
21
  priority: Priority = "normal"
22
  progress: float = 0.0
23
  deadline: Optional[int] = None
24
- depends_on: Optional[str] = None # must complete parent task first
25
- is_interrupted: bool = False # injected mid-episode
26
 
27
  class VisibleState(BaseModel):
 
 
 
 
 
28
  fatigue_level: str # "low" | "medium" | "high"
 
29
  stress_warning: bool
30
- energy_level: float = 1.0
31
- stress_level: float = 0.0
32
  focus_mode: bool = False
33
- upcoming_deadlines: List[str] = [] # task ids with deadline ≤ 5 steps away
34
- blocked_tasks: List[str] = [] # task ids blocked by unfinished dependencies
 
35
 
36
  class Observation(BaseModel):
37
  tasks: List[Task]
@@ -40,72 +49,165 @@ class Observation(BaseModel):
40
 
41
  class Action(BaseModel):
42
  type: Literal["work", "break", "switch", "delay", "focus"]
43
- # work – normal work on task_id
44
- # break – rest; recover energy + reduce stress
45
- # switch – change active task (small context-switch cost)
46
- # delay – do nothing; slight stress relief
47
- # focus – deep-work mode: 2× progress, 2× energy cost
48
  task_id: Optional[str] = None
49
 
50
  class EnvState(BaseModel):
51
- energy: float = 1.0
52
- stress: float = 0.0
53
- fatigue: float = 0.0
54
- time_step: int = 0
55
- current_task_id: Optional[str] = None
56
- tasks: List[Task] = []
57
- focus_mode: bool = False
58
- interruption_count: int = 0
59
- milestone_rewards: Dict[str, float] = {}
 
 
 
60
 
61
 
62
  # ==========================================
63
- # TASK GENERATION
 
 
64
  # ==========================================
65
- def generate_tasks(level: str) -> list[Task]:
 
66
  if level == "easy":
67
- # 2 simple tasks, no deadlines – learn basics
68
  return [
69
- Task(id="e1", difficulty="easy", task_type="email", priority="normal", deadline=None),
70
- Task(id="e2", difficulty="easy", task_type="report", priority="normal", deadline=None),
 
 
 
 
 
 
71
  ]
72
 
73
  elif level == "medium":
74
- # 5 mixed tasks with deadlines and priorities
75
  return [
76
- Task(id="m1", difficulty="medium", task_type="email", priority="critical", deadline=14),
77
- Task(id="m2", difficulty="medium", task_type="meeting", priority="high", deadline=20),
78
- Task(id="m3", difficulty="medium", task_type="code_review", priority="normal", deadline=28),
79
- Task(id="m4", difficulty="medium", task_type="report", priority="high", deadline=35),
80
- Task(id="m5", difficulty="medium", task_type="call", priority="low", deadline=45),
 
 
81
  ]
82
 
83
  elif level == "hard":
84
- # 8 tasks with task dependencies + 2 mid-episode interruptions
85
  return [
86
- Task(id="h1", difficulty="hard", task_type="email", priority="critical", deadline=12),
87
- Task(id="h2", difficulty="hard", task_type="code_review", priority="high", deadline=16),
88
- Task(id="h3", difficulty="hard", task_type="meeting", priority="critical", deadline=20, depends_on="h1"),
89
- Task(id="h4", difficulty="hard", task_type="report", priority="high", deadline=24),
90
- Task(id="h5", difficulty="hard", task_type="call", priority="normal", deadline=28, depends_on="h2"),
91
- Task(id="h6", difficulty="hard", task_type="email", priority="high", deadline=32),
92
- Task(id="h7", difficulty="hard", task_type="code_review", priority="critical", deadline=38, depends_on="h4"),
93
- Task(id="h8", difficulty="hard", task_type="report", priority="normal", deadline=46),
 
94
  ]
95
 
96
  elif level == "expert":
97
- # 10 tasks, deep dependencies, 3 mid-episode interruptions
98
  return [
99
- Task(id="x1", difficulty="expert", task_type="email", priority="critical", deadline=8),
100
- Task(id="x2", difficulty="expert", task_type="code_review", priority="high", deadline=12),
101
- Task(id="x3", difficulty="expert", task_type="meeting", priority="critical", deadline=14, depends_on="x1"),
102
- Task(id="x4", difficulty="expert", task_type="report", priority="high", deadline=18, depends_on="x2"),
103
- Task(id="x5", difficulty="expert", task_type="call", priority="normal", deadline=22, depends_on="x3"),
104
- Task(id="x6", difficulty="expert", task_type="code_review", priority="critical", deadline=24),
105
- Task(id="x7", difficulty="expert", task_type="email", priority="high", deadline=28, depends_on="x4"),
106
- Task(id="x8", difficulty="expert", task_type="report", priority="normal", deadline=33, depends_on="x6"),
107
- Task(id="x9", difficulty="expert", task_type="meeting", priority="critical", deadline=36, depends_on="x5"),
108
- Task(id="x10", difficulty="expert", task_type="call", priority="high", deadline=44),
 
109
  ]
110
 
111
  return []
@@ -126,8 +228,19 @@ def _inject_interruption(state: EnvState, step: int) -> None:
126
  # GRADER
127
  # ==========================================
128
  def grader(trajectory: dict) -> float:
129
- """OpenEnv single-argument grader."""
130
- raw_tasks = trajectory.get("tasks", [])
 
131
  ts = trajectory.get("time_step", 50)
132
  eng = trajectory.get("energy", 0.5)
133
  task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
@@ -136,41 +249,35 @@ def grader(trajectory: dict) -> float:
136
 
137
  def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float) -> float:
138
  """
139
- Additive grader producing strictly different scores per difficulty:
140
- easy ≈ 0.70–0.80 (completes all tasks, no deadlines)
141
- medium ≈ 0.38–0.55 (completes 2–3/5 with deadlines)
142
- hard ≈ 0.18–0.30 (completes 2–3/10 with dependencies)
143
- expert ≈ 0.06–0.15 (completes 1–2/13 with interruptions)
144
-
145
- Score formula (additive – no harsh subtractive penalties):
146
- weighted_completion × 0.60 (primary driver)
147
- + deadline_adherence × 0.22 (fraction of tasks meeting deadline)
148
- + energy_efficiency × 0.10 (reward for not burning out)
149
- + dependency_bonus × 0.05 (rewarded correct sequencing)
150
- + interruption_bonus × 0.03 (handled urgent tasks)
151
-
152
- Always returns value in (0.01, 0.99).
153
  """
154
  if not tasks:
155
  return 0.01
156
 
157
  total_weight = sum(PRIORITY_WEIGHT[t.priority] for t in tasks)
158
 
159
- # ── Weighted completion (partial progress counts) ──────────────────────────
160
  wc = sum(t.progress * PRIORITY_WEIGHT[t.priority] for t in tasks) / max(total_weight, 0.01)
161
 
162
- # ── Deadline adherence (fraction of COMPLETABLE tasks that met deadline) ───
163
- completable = [t for t in tasks if t.deadline is not None]
164
- met_deadline = sum(
165
  1 for t in completable
166
  if t.progress >= 1.0 and time_step <= t.deadline
167
  )
168
  da = (met_deadline / len(completable)) if completable else 1.0
169
 
170
- # ── Energy efficiency ─────────────────────────────────────────────────────
171
  ee = max(0.0, (final_energy - 0.10) * 0.13)
172
 
173
- # ── Dependency ordering bonus ──────────────────────────────────────────────
174
  dep_bonus = 0.0
175
  for t in tasks:
176
  if t.depends_on and t.progress >= 1.0:
@@ -179,11 +286,11 @@ def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float)
179
  dep_bonus += 0.015
180
  dep_bonus = min(0.05, dep_bonus)
181
 
182
- # ── Interruption handling bonus ────────────────────────────────────────────
183
  interrupted = [t for t in tasks if t.is_interrupted]
184
  int_bonus = 0.0
185
  if interrupted:
186
- handled = sum(1 for t in interrupted if t.progress >= 1.0)
187
  int_bonus = min(0.03, (handled / len(interrupted)) * 0.03)
188
 
189
  raw = wc * 0.60 + da * 0.22 + ee + dep_bonus + int_bonus
@@ -191,22 +298,41 @@ def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float)
191
 
192
 
193
  # ==========================================
194
- # OPENENV ENVIRONMENT
 
 
 
195
  # ==========================================
196
- class CLMEnvironment:
197
- _INTERRUPT_STEPS = {
198
- "hard": [15, 32],
199
- "expert": [7, 18, 32],
200
- }
 
201
 
202
- def __init__(self, tasks: list[Task], max_steps: int = 50):
 
 
203
  self.max_steps = max_steps
204
  self.initial_tasks = tasks
205
  self.difficulty = tasks[0].difficulty if tasks else "easy"
206
- self.state = EnvState(tasks=[t.model_copy() for t in tasks])
 
 
 
 
 
 
 
207
 
208
  def reset(self) -> Observation:
209
- self.state = EnvState(tasks=[t.model_copy() for t in self.initial_tasks])
 
 
 
 
 
 
210
  return self._get_observation()
211
 
212
  def _blocked_ids(self) -> set[str]:
@@ -221,12 +347,16 @@ class CLMEnvironment:
221
 
222
  def _get_observation(self) -> Observation:
223
  e = self.state.energy
224
- fl = "high" if e < 0.30 else ("medium" if e < 0.60 else "low")
 
 
 
 
 
225
  vs = VisibleState(
226
- fatigue_level=fl,
227
- stress_warning=self.state.stress > 0.65,
228
- energy_level=round(e, 3),
229
- stress_level=round(self.state.stress, 3),
230
  focus_mode=self.state.focus_mode,
231
  upcoming_deadlines=self._upcoming_ids(),
232
  blocked_tasks=list(self._blocked_ids()),
@@ -237,23 +367,25 @@ class CLMEnvironment:
237
  reward = 0.0
238
  blocked = self._blocked_ids()
239
 
240
- # ── Inject interruptions ───────────────────────────────────────────────
241
- int_steps = self._INTERRUPT_STEPS.get(self.difficulty, [])
242
- if (self.state.time_step in int_steps
243
- and self.state.interruption_count < len(int_steps)):
244
  _inject_interruption(self.state, self.state.time_step)
 
 
245
  reward -= 0.05
246
 
247
- # ── Action processing ──────────────────────────────────────────────────
248
  if action.type in ("work", "focus"):
249
  is_focus = (action.type == "focus")
250
 
251
  if action.task_id:
252
  if action.task_id in blocked:
253
- reward -= 0.15 # tried to work on blocked task
254
  else:
255
  if self.state.current_task_id and self.state.current_task_id != action.task_id:
256
- reward -= 0.07 # context-switch cost
257
  self.state.current_task_id = action.task_id
258
  self.state.focus_mode = is_focus
259
 
@@ -272,7 +404,6 @@ class CLMEnvironment:
272
 
273
  reward += 0.10 * (task.progress - old_p) * pw
274
 
275
- # Milestone rewards
276
  for ms, bonus in [(0.25, 0.04), (0.50, 0.07), (0.75, 0.09), (1.00, 0.18)]:
277
  key = f"{task.id}@{ms}"
278
  if task.progress >= ms and key not in self.state.milestone_rewards:
@@ -298,7 +429,7 @@ class CLMEnvironment:
298
 
299
  self.state.time_step += 1
300
 
301
- # ── Stress dynamics ────────────────────────────────────────────────────
302
  for t in (tt for tt in self.state.tasks if tt.progress < 1.0):
303
  if t.deadline:
304
  ttd = t.deadline - self.state.time_step
@@ -308,7 +439,7 @@ class CLMEnvironment:
308
  elif ttd < 0:
309
  self.state.stress = min(1.0, self.state.stress + 0.12 * pw)
310
 
311
- # ── Episode termination ────────────────────────────────────────────────
312
  all_done = all(t.progress >= 1.0 for t in self.state.tasks)
313
  burnout = self.state.energy < 0.07
314
  timeout = self.state.time_step >= self.max_steps
 
1
  from pydantic import BaseModel, Field
2
  from typing import List, Optional, Literal, Tuple, Dict, Any
3
+ import random
4
 
5
  # ==========================================
6
+ # TASK TYPES
7
  # ==========================================
8
  TaskType = Literal["email", "meeting", "code_review", "report", "call"]
9
  Priority = Literal["critical", "high", "normal", "low"]
 
12
  TASK_ENERGY_COST = {"email": 0.08, "meeting": 0.18, "code_review": 0.20, "report": 0.14, "call": 0.11}
13
  TASK_PROGRESS_RATE = {"email": 0.35, "meeting": 0.30, "code_review": 0.20, "report": 0.22, "call": 0.28}
14
 
15
+ ALL_TASK_TYPES: list[TaskType] = ["email", "meeting", "code_review", "report", "call"]
16
+ ALL_PRIORITIES: list[Priority] = ["critical", "high", "normal", "low"]
17
+
18
  # ==========================================
19
  # OPENENV SCHEMAS
20
  # ==========================================
 
25
  priority: Priority = "normal"
26
  progress: float = 0.0
27
  deadline: Optional[int] = None
28
+ depends_on: Optional[str] = None
29
+ is_interrupted: bool = False
30
 
31
  class VisibleState(BaseModel):
32
+ """
33
+ FIX 6 – Partial observability: agent sees only categorical labels,
34
+ not raw float values for energy/stress. This rewards agents that
35
+ reason from context rather than reading exact numbers.
36
+ """
37
  fatigue_level: str # "low" | "medium" | "high"
38
+ stress_level: str # "calm" | "elevated" | "critical"
39
  stress_warning: bool
 
 
40
  focus_mode: bool = False
41
+ upcoming_deadlines: List[str] = []
42
+ blocked_tasks: List[str] = []
43
+ # energy_level and stress_level floats removed – use fatigue_level / stress_level instead
44
 
45
  class Observation(BaseModel):
46
  tasks: List[Task]
 
49
 
50
  class Action(BaseModel):
51
  type: Literal["work", "break", "switch", "delay", "focus"]
 
 
 
 
 
52
  task_id: Optional[str] = None
53
 
54
  class EnvState(BaseModel):
55
+ energy: float = 1.0
56
+ stress: float = 0.0
57
+ fatigue: float = 0.0
58
+ time_step: int = 0
59
+ current_task_id: Optional[str] = None
60
+ tasks: List[Task] = []
61
+ focus_mode: bool = False
62
+ interruption_count: int = 0
63
+ milestone_rewards: Dict[str, float] = {}
64
+ # FIX 3 – stochastic interrupt tracking
65
+ next_interrupt_eligible: int = 999
66
+ interrupt_budget: int = 0
67
 
68
 
69
  # ==========================================
70
+ # FIX 2 – PROCEDURAL TASK GENERATION
71
+ # Seed-based so episodes are reproducible on request but vary by default.
72
+ # Deadlines jitter +-3 steps; task types and secondary priorities randomised.
73
  # ==========================================
74
+ def generate_tasks(level: str, seed: Optional[int] = None) -> list[Task]:
75
+ """
76
+ Generate tasks for the given difficulty level.
77
+ Pass seed=None for a random seed (default for live play),
78
+ or an explicit int for reproducible evaluation runs.
79
+ """
80
+ rng = random.Random(seed)
81
+
82
+ def _jitter(base: int, lo: int = -3, hi: int = 3) -> int:
83
+ return max(1, base + rng.randint(lo, hi))
84
+
85
+ def _p(pool: list) -> str:
86
+ return rng.choice(pool)
87
+
88
  if level == "easy":
 
89
  return [
90
+ Task(id="e1", difficulty="easy",
91
+ task_type=_p(["email", "report"]),
92
+ priority=_p(["normal", "high"]),
93
+ deadline=None),
94
+ Task(id="e2", difficulty="easy",
95
+ task_type=_p(["report", "code_review"]),
96
+ priority=_p(["normal", "low"]),
97
+ deadline=None),
98
  ]
99
 
100
  elif level == "medium":
 
101
  return [
102
+ Task(id="m1", difficulty="medium",
103
+ task_type=_p(["email", "call"]),
104
+ priority="critical",
105
+ deadline=_jitter(14)),
106
+ Task(id="m2", difficulty="medium",
107
+ task_type=_p(["meeting", "code_review"]),
108
+ priority=_p(["high", "normal"]),
109
+ deadline=_jitter(20)),
110
+ Task(id="m3", difficulty="medium",
111
+ task_type=_p(["code_review", "report"]),
112
+ priority=_p(["normal", "high"]),
113
+ deadline=_jitter(28)),
114
+ Task(id="m4", difficulty="medium",
115
+ task_type=_p(["report", "meeting"]),
116
+ priority=_p(["high", "normal"]),
117
+ deadline=_jitter(35)),
118
+ Task(id="m5", difficulty="medium",
119
+ task_type=_p(["call", "email"]),
120
+ priority=_p(["low", "normal"]),
121
+ deadline=_jitter(45)),
122
  ]
123
 
124
  elif level == "hard":
 
125
  return [
126
+ Task(id="h1", difficulty="hard",
127
+ task_type=_p(["email", "call"]),
128
+ priority="critical",
129
+ deadline=_jitter(12)),
130
+ Task(id="h2", difficulty="hard",
131
+ task_type=_p(["code_review", "report"]),
132
+ priority=_p(["high", "normal"]),
133
+ deadline=_jitter(16)),
134
+ Task(id="h3", difficulty="hard",
135
+ task_type=_p(["meeting", "call"]),
136
+ priority="critical",
137
+ deadline=_jitter(20),
138
+ depends_on="h1"),
139
+ Task(id="h4", difficulty="hard",
140
+ task_type=_p(["report", "code_review"]),
141
+ priority=_p(["high", "normal"]),
142
+ deadline=_jitter(24)),
143
+ Task(id="h5", difficulty="hard",
144
+ task_type=_p(["call", "meeting"]),
145
+ priority=_p(["normal", "high"]),
146
+ deadline=_jitter(28),
147
+ depends_on="h2"),
148
+ Task(id="h6", difficulty="hard",
149
+ task_type=_p(["email", "report"]),
150
+ priority=_p(["high", "normal"]),
151
+ deadline=_jitter(32)),
152
+ Task(id="h7", difficulty="hard",
153
+ task_type=_p(["code_review", "meeting"]),
154
+ priority="critical",
155
+ deadline=_jitter(38),
156
+ depends_on="h4"),
157
+ Task(id="h8", difficulty="hard",
158
+ task_type=_p(["report", "email"]),
159
+ priority=_p(["normal", "low"]),
160
+ deadline=_jitter(46)),
161
  ]
162
 
163
  elif level == "expert":
 
164
  return [
165
+ Task(id="x1", difficulty="expert",
166
+ task_type=_p(["email", "call"]),
167
+ priority="critical",
168
+ deadline=_jitter(8)),
169
+ Task(id="x2", difficulty="expert",
170
+ task_type=_p(["code_review", "report"]),
171
+ priority=_p(["high", "critical"]),
172
+ deadline=_jitter(12)),
173
+ Task(id="x3", difficulty="expert",
174
+ task_type=_p(["meeting", "call"]),
175
+ priority="critical",
176
+ deadline=_jitter(14),
177
+ depends_on="x1"),
178
+ Task(id="x4", difficulty="expert",
179
+ task_type=_p(["report", "code_review"]),
180
+ priority=_p(["high", "normal"]),
181
+ deadline=_jitter(18),
182
+ depends_on="x2"),
183
+ Task(id="x5", difficulty="expert",
184
+ task_type=_p(["call", "meeting"]),
185
+ priority=_p(["normal", "high"]),
186
+ deadline=_jitter(22),
187
+ depends_on="x3"),
188
+ Task(id="x6", difficulty="expert",
189
+ task_type=_p(["code_review", "email"]),
190
+ priority="critical",
191
+ deadline=_jitter(24)),
192
+ Task(id="x7", difficulty="expert",
193
+ task_type=_p(["email", "report"]),
194
+ priority=_p(["high", "normal"]),
195
+ deadline=_jitter(28),
196
+ depends_on="x4"),
197
+ Task(id="x8", difficulty="expert",
198
+ task_type=_p(["report", "call"]),
199
+ priority=_p(["normal", "high"]),
200
+ deadline=_jitter(33),
201
+ depends_on="x6"),
202
+ Task(id="x9", difficulty="expert",
203
+ task_type=_p(["meeting", "code_review"]),
204
+ priority="critical",
205
+ deadline=_jitter(36),
206
+ depends_on="x5"),
207
+ Task(id="x10", difficulty="expert",
208
+ task_type=_p(["call", "email"]),
209
+ priority=_p(["high", "normal"]),
210
+ deadline=_jitter(44)),
211
  ]
212
 
213
  return []
 
228
  # GRADER
229
  # ==========================================
230
  def grader(trajectory: dict) -> float:
231
+ """
232
+ OpenEnv single-argument grader.
233
+
234
+ FIX 1: If trajectory is empty or missing tasks, return 0.01 immediately.
235
+ The grader MUST score the actual agent trajectory β€” it must never silently
236
+ fall back to re-running a heuristic episode. Doing so would let the
237
+ environment grade itself rather than the agent under evaluation.
238
+ """
239
+ if not trajectory or not trajectory.get("tasks"):
240
+ # Empty trajectory = agent produced no useful state → minimum score
241
+ return 0.01
242
+
243
+ raw_tasks = trajectory["tasks"]
244
  ts = trajectory.get("time_step", 50)
245
  eng = trajectory.get("energy", 0.5)
246
  task_objs = [Task(**t) if isinstance(t, dict) else t for t in raw_tasks]
 
249
 
250
  def deterministic_grader(tasks: list[Task], time_step: int, final_energy: float) -> float:
251
  """
252
+ Scores the ACTUAL final task state. Always returns a value in (0.01, 0.99).
253
+
254
+ Formula:
255
+ weighted_completion x 0.60
256
+ deadline_adherence x 0.22
257
+ energy_efficiency x 0.10
258
+ dependency_bonus x 0.05
259
+ interruption_bonus x 0.03
 
 
 
 
 
 
260
  """
261
  if not tasks:
262
  return 0.01
263
 
264
  total_weight = sum(PRIORITY_WEIGHT[t.priority] for t in tasks)
265
 
266
+ # Weighted completion (partial progress counts)
267
  wc = sum(t.progress * PRIORITY_WEIGHT[t.priority] for t in tasks) / max(total_weight, 0.01)
268
 
269
+ # Deadline adherence
270
+ completable = [t for t in tasks if t.deadline is not None]
271
+ met_deadline = sum(
272
  1 for t in completable
273
  if t.progress >= 1.0 and time_step <= t.deadline
274
  )
275
  da = (met_deadline / len(completable)) if completable else 1.0
276
 
277
+ # Energy efficiency
278
  ee = max(0.0, (final_energy - 0.10) * 0.13)
279
 
280
+ # Dependency ordering bonus
281
  dep_bonus = 0.0
282
  for t in tasks:
283
  if t.depends_on and t.progress >= 1.0:
 
286
  dep_bonus += 0.015
287
  dep_bonus = min(0.05, dep_bonus)
288
 
289
+ # Interruption handling bonus
290
  interrupted = [t for t in tasks if t.is_interrupted]
291
  int_bonus = 0.0
292
  if interrupted:
293
+ handled = sum(1 for t in interrupted if t.progress >= 1.0)
294
  int_bonus = min(0.03, (handled / len(interrupted)) * 0.03)
295
 
296
  raw = wc * 0.60 + da * 0.22 + ee + dep_bonus + int_bonus
 
298
 
299
 
300
  # ==========================================
301
+ # FIX 3 – STOCHASTIC INTERRUPTION CONFIG
302
+ # Interruptions fire with a per-step probability once an eligibility
303
+ # window opens, with a cooldown to prevent back-to-back fires.
304
+ # budget = max number of interrupts for the difficulty level.
305
  # ==========================================
306
+ _INTERRUPT_CONFIG = {
307
+ # prob_per_step eligible_from cooldown_steps budget
308
+ "hard": (0.18, 10, 8, 2),
309
+ "expert": (0.22, 6, 7, 3),
310
+ }
311
+
312
 
313
+ class CLMEnvironment:
314
+ def __init__(self, tasks: list[Task], max_steps: int = 50,
315
+ seed: Optional[int] = None):
316
  self.max_steps = max_steps
317
  self.initial_tasks = tasks
318
  self.difficulty = tasks[0].difficulty if tasks else "easy"
319
+ self._rng = random.Random(seed)
320
+ cfg = _INTERRUPT_CONFIG.get(self.difficulty, (0.0, 999, 999, 0))
321
+ self._interrupt_prob, eligible_from, self._cooldown, budget = cfg
322
+ self.state = EnvState(
323
+ tasks=[t.model_copy() for t in tasks],
324
+ next_interrupt_eligible=eligible_from,
325
+ interrupt_budget=budget,
326
+ )
327
 
328
  def reset(self) -> Observation:
329
+ cfg = _INTERRUPT_CONFIG.get(self.difficulty, (0.0, 999, 999, 0))
330
+ _, eligible_from, _, budget = cfg
331
+ self.state = EnvState(
332
+ tasks=[t.model_copy() for t in self.initial_tasks],
333
+ next_interrupt_eligible=eligible_from,
334
+ interrupt_budget=budget,
335
+ )
336
  return self._get_observation()
337
 
338
  def _blocked_ids(self) -> set[str]:
 
347
 
348
  def _get_observation(self) -> Observation:
349
  e = self.state.energy
350
+ s = self.state.stress
351
+
352
+ # FIX 6: Categorical labels only – no raw floats exposed to agent
353
+ fatigue_label = "high" if e < 0.30 else ("medium" if e < 0.60 else "low")
354
+ stress_label = "critical" if s > 0.75 else ("elevated" if s > 0.45 else "calm")
355
+
356
  vs = VisibleState(
357
+ fatigue_level=fatigue_label,
358
+ stress_level=stress_label,
359
+ stress_warning=s > 0.65,
 
360
  focus_mode=self.state.focus_mode,
361
  upcoming_deadlines=self._upcoming_ids(),
362
  blocked_tasks=list(self._blocked_ids()),
 
367
  reward = 0.0
368
  blocked = self._blocked_ids()
369
 
370
+ # FIX 3: Stochastic interruption – probabilistic, not fixed-step
371
+ if (self.state.interrupt_budget > 0
372
+ and self.state.time_step >= self.state.next_interrupt_eligible
373
+ and self._rng.random() < self._interrupt_prob):
374
  _inject_interruption(self.state, self.state.time_step)
375
+ self.state.interrupt_budget -= 1
376
+ self.state.next_interrupt_eligible = self.state.time_step + self._cooldown
377
  reward -= 0.05
378
 
379
+ # Action processing
380
  if action.type in ("work", "focus"):
381
  is_focus = (action.type == "focus")
382
 
383
  if action.task_id:
384
  if action.task_id in blocked:
385
+ reward -= 0.15
386
  else:
387
  if self.state.current_task_id and self.state.current_task_id != action.task_id:
388
+ reward -= 0.07
389
  self.state.current_task_id = action.task_id
390
  self.state.focus_mode = is_focus
391
 
 
404
 
405
  reward += 0.10 * (task.progress - old_p) * pw
406
 
 
407
  for ms, bonus in [(0.25, 0.04), (0.50, 0.07), (0.75, 0.09), (1.00, 0.18)]:
408
  key = f"{task.id}@{ms}"
409
  if task.progress >= ms and key not in self.state.milestone_rewards:
 
429
 
430
  self.state.time_step += 1
431
 
432
+ # Stress dynamics
433
  for t in (tt for tt in self.state.tasks if tt.progress < 1.0):
434
  if t.deadline:
435
  ttd = t.deadline - self.state.time_step
 
439
  elif ttd < 0:
440
  self.state.stress = min(1.0, self.state.stress + 0.12 * pw)
441
 
442
+ # Episode termination
443
  all_done = all(t.progress >= 1.0 for t in self.state.tasks)
444
  burnout = self.state.energy < 0.07
445
  timeout = self.state.time_step >= self.max_steps
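Taken together, the seeded `generate_tasks(level, seed=...)` (FIX 2) and the seeded `CLMEnvironment` (FIX 3) make episodes reproducible on demand. A rough end-to-end sketch, illustrative only, mirroring the loop used in the new tests:

```python
# Illustrative episode loop; the (obs, reward, done, info) step signature
# matches the calls in tests/test_clm.py below.
from models import Action, CLMEnvironment, generate_tasks, deterministic_grader

tasks = generate_tasks("hard", seed=7)                    # FIX 2: seeded, jittered layout
env = CLMEnvironment(tasks=tasks, max_steps=50, seed=7)   # FIX 3: seeded interruptions
obs = env.reset()

done = False
while not done:
    # Naive policy for illustration: always grind the first task.
    # A real agent would read obs.visible_state and plan around deadlines.
    obs, reward, done, info = env.step(Action(type="work", task_id=tasks[0].id))

print(deterministic_grader(env.state.tasks, env.state.time_step, env.state.energy))
```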
openenv.yaml CHANGED
@@ -48,10 +48,10 @@ observation_space:
48
  - depends_on: task_id or null
49
  - is_interrupted: bool
50
  visible_state:
51
- - fatigue_level: "low | medium | high"
52
- - stress_warning: bool
53
- - energy_level: float [0.0, 1.0]
54
- - stress_level: float [0.0, 1.0]
55
  - focus_mode: bool
56
  - upcoming_deadlines: list[task_id]
57
  - blocked_tasks: list[task_id]
 
48
  - depends_on: task_id or null
49
  - is_interrupted: bool
50
  visible_state:
51
+ # Partial observability: energy/stress are categorical labels, not raw floats.
52
+ - fatigue_level: "low | medium | high" # energy bands: >0.6 | 0.3-0.6 | <0.3
53
+ - stress_level: "calm | elevated | critical" # stress bands: <0.45 | 0.45-0.75 | >0.75
54
+ - stress_warning: bool # true when stress > 0.65
55
  - focus_mode: bool
56
  - upcoming_deadlines: list[task_id]
57
  - blocked_tasks: list[task_id]
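The band thresholds documented in these comments match `_get_observation` in models.py above; a tiny sketch restating them, purely to make the categorical mapping concrete:

```python
# Same thresholds as models.py _get_observation (FIX 6); shown here only to
# illustrate the bands documented in openenv.yaml.
def fatigue_label(energy: float) -> str:
    return "high" if energy < 0.30 else ("medium" if energy < 0.60 else "low")

def stress_label(stress: float) -> str:
    return "critical" if stress > 0.75 else ("elevated" if stress > 0.45 else "calm")

assert fatigue_label(0.9) == "low" and fatigue_label(0.45) == "medium" and fatigue_label(0.2) == "high"
assert stress_label(0.2) == "calm" and stress_label(0.6) == "elevated" and stress_label(0.8) == "critical"
```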
server/app.py CHANGED
@@ -1,15 +1,21 @@
1
- import uvicorn
 
 
 
 
 
 
 
 
2
  import sys
3
  import os
4
 
5
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
6
 
7
- from backend.main import app # app is now importable as server.app:app
8
-
9
-
10
- def main():
11
- uvicorn.run(app, host="0.0.0.0", port=7860)
12
 
 
13
 
14
  if __name__ == "__main__":
15
- main()
 
 
1
+ """
2
+ server/app.py – single entry point for CLM OpenEnv server.
3
+
4
+ Imports the FastAPI app built in backend/main.py and exposes it for:
5
+ - Dockerfile: uvicorn server.app:app --host 0.0.0.0 --port 7860
6
+ - openenv.yaml: app: server.app:app
7
+
8
+ All route logic lives in backend/main.py. This file is intentionally thin.
9
+ """
10
  import sys
11
  import os
12
 
13
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
14
 
15
+ from backend.main import app # single source of truth for the FastAPI app
 
 
 
 
16
 
17
+ __all__ = ["app"]
18
 
19
  if __name__ == "__main__":
20
+ import uvicorn
21
+ uvicorn.run(app, host="0.0.0.0", port=7860)
tests/test_clm.py ADDED
@@ -0,0 +1,257 @@
 
1
+ """
2
+ tests/test_clm.py – unit tests for the Cognitive Load Manager environment.
3
+
4
+ Run with: pytest tests/test_clm.py -v
5
+ """
6
+ import sys, os, pytest
7
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
8
+
9
+ from models import (
10
+ Action, Task, EnvState, CLMEnvironment,
11
+ generate_tasks, deterministic_grader, grader,
12
+ PRIORITY_WEIGHT,
13
+ )
14
+ from grader.clm_graders import (
15
+ EasyGrader, MediumGrader, HardGrader, ExpertGrader, _from_trajectory,
16
+ )
17
+
18
+
19
+ # ─────────────────────────────────────────────────────────────────────────────
20
+ # FIX 2 – Procedural generation
21
+ # ─────────────────────────────────────────────────────────────────────────────
22
+ class TestProceduralGeneration:
23
+ def test_seed_produces_same_tasks(self):
24
+ a = generate_tasks("medium", seed=7)
25
+ b = generate_tasks("medium", seed=7)
26
+ assert [t.model_dump() for t in a] == [t.model_dump() for t in b]
27
+
28
+ def test_different_seeds_differ(self):
29
+ results = set()
30
+ for s in range(20):
31
+ tasks = generate_tasks("medium", seed=s)
32
+ results.add(tuple(t.deadline for t in tasks))
33
+ assert len(results) > 1, "All seeds produced identical deadlines"
34
+
35
+ def test_task_counts(self):
36
+ assert len(generate_tasks("easy")) == 2
37
+ assert len(generate_tasks("medium")) == 5
38
+ assert len(generate_tasks("hard")) == 8
39
+ assert len(generate_tasks("expert")) == 10
40
+
41
+ def test_deadlines_positive_and_bounded(self):
42
+ """Jitter can reorder adjacent deadlines, but all must be positive and sane."""
43
+ base_deadlines = {"medium": [14, 20, 28, 35, 45], "hard": [12, 16, 20, 24, 28, 32, 38, 46]}
44
+ for level, bases in base_deadlines.items():
45
+ for seed in range(20):
46
+ tasks = generate_tasks(level, seed=seed)
47
+ for t in tasks:
48
+ if t.deadline is not None:
49
+ assert t.deadline >= 1, f"Deadline must be >= 1, got {t.deadline}"
50
+ # Should be within ±5 of the nearest base (generous bound)
51
+ nearest = min(bases, key=lambda b: abs(b - t.deadline))
52
+ assert abs(t.deadline - nearest) <= 5, \
53
+ f"Deadline {t.deadline} too far from base {nearest}"
54
+
55
+
56
+ # ─────────────────────────────────────────────────────────────────────────────
57
+ # FIX 1 – Grader trajectory bug
58
+ # ─────────────────────────────────────────────────────────────────────────────
59
+ class TestGraderTrajectoryBug:
60
+ def test_empty_trajectory_returns_min(self):
61
+ assert grader({}) == 0.01
62
+
63
+ def test_missing_tasks_returns_min(self):
64
+ assert grader({"time_step": 50, "energy": 0.8}) == 0.01
65
+
66
+ def test_empty_tasks_list_returns_min(self):
67
+ assert grader({"tasks": [], "time_step": 50, "energy": 0.8}) == 0.01
68
+
69
+ def test_grader_class_empty_trajectory(self):
70
+ for cls in [EasyGrader, MediumGrader, HardGrader, ExpertGrader]:
71
+ score = cls()(trajectory={})
72
+ assert score == 0.01, f"{cls.__name__} returned {score} for empty trajectory"
73
+
74
+ def test_from_trajectory_empty(self):
75
+ score, success, msg = _from_trajectory({}, "easy")
76
+ assert score == 0.01
77
+ assert success is False
78
+ assert "empty trajectory" in msg
79
+
80
+ def test_real_trajectory_scores_above_min(self):
81
+ """A trajectory with completed tasks should score > 0.01."""
82
+ tasks = generate_tasks("easy", seed=1)
83
+ for t in tasks:
84
+ t.progress = 1.0
85
+ traj = {"tasks": [t.model_dump() for t in tasks], "time_step": 20, "energy": 0.7}
86
+ assert grader(traj) > 0.01
87
+
88
+
89
+ # ─────────────────────────────────────────────────────────────────────────────
90
+ # Environment basics
91
+ # ─────────────────────────────────────────────────────────────────────────────
92
+ class TestReset:
93
+ def test_reset_produces_clean_state(self):
94
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0), max_steps=50)
95
+ obs = env.reset()
96
+ assert env.state.energy == 1.0
97
+ assert env.state.stress == 0.0
98
+ assert env.state.time_step == 0
99
+ assert all(t.progress == 0.0 for t in env.state.tasks)
100
+
101
+ def test_reset_after_episode_clears_state(self):
102
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0), max_steps=50)
103
+ env.reset()
104
+ for _ in range(10):
105
+ env.step(Action(type="work", task_id="e1"))
106
+ env.reset()
107
+ assert env.state.time_step == 0
108
+ assert env.state.energy == 1.0
109
+
110
+
111
+ # ─────────────────────────────────────────────────────────────────────────────
112
+ # Blocked-task penalty (Fix 3 indirectly – env mechanics)
113
+ # ─────────────────────────────────────────────────────────────────────────────
114
+ class TestBlockedTaskPenalty:
115
+ def test_working_on_blocked_task_gives_penalty(self):
116
+ tasks = generate_tasks("hard", seed=0)
117
+ env = CLMEnvironment(tasks=tasks, max_steps=50)
118
+ env.reset()
119
+
120
+ # h3 depends on h1 – h1 not done yet, so h3 is blocked
121
+ blocked = env._blocked_ids()
122
+ assert "h3" in blocked, "h3 should be blocked at episode start"
123
+
124
+ _, reward, _, _ = env.step(Action(type="work", task_id="h3"))
125
+ assert reward <= -0.15, f"Expected penalty for blocked task, got {reward}"
126
+
127
+
128
+ # ─────────────────────────────────────────────────────────────────────────────
129
+ # FIX 3 – Stochastic interruptions
130
+ # ─────────────────────────────────────────────────────────────────────────────
131
+ class TestStochasticInterruptions:
132
+ def test_hard_eventually_interrupts(self):
133
+ """Over many seeds, at least one hard episode should fire an interruption."""
134
+ fired = False
135
+ for seed in range(50):
136
+ tasks = generate_tasks("hard", seed=seed)
137
+ env = CLMEnvironment(tasks=tasks, max_steps=50, seed=seed)
138
+ env.reset()
139
+ done = False
140
+ while not done:
141
+ _, _, done, _ = env.step(Action(type="work", task_id=tasks[0].id))
142
+ if env.state.interruption_count > 0:
143
+ fired = True
144
+ break
145
+ assert fired, "Expected at least one interruption across 50 hard seeds"
146
+
147
+ def test_interruptions_respect_budget(self):
148
+ """Hard episodes should never exceed budget=2 interruptions."""
149
+ for seed in range(30):
150
+ tasks = generate_tasks("hard", seed=seed)
151
+ env = CLMEnvironment(tasks=tasks, max_steps=50, seed=seed)
152
+ env.reset()
153
+ done = False
154
+ while not done:
155
+ _, _, done, _ = env.step(Action(type="work", task_id=tasks[0].id))
156
+ assert env.state.interruption_count <= 2, \
157
+ f"Seed {seed}: got {env.state.interruption_count} interruptions, max is 2"
158
+
159
+ def test_no_interruptions_on_easy(self):
160
+ for seed in range(10):
161
+ tasks = generate_tasks("easy", seed=seed)
162
+ env = CLMEnvironment(tasks=tasks, max_steps=50, seed=seed)
163
+ env.reset()
164
+ done = False
165
+ while not done:
166
+ _, _, done, _ = env.step(Action(type="break"))
167
+ assert env.state.interruption_count == 0
168
+
169
+
170
+ # ─────────────────────────────────────────────────────────────────────────────
171
+ # Burnout terminates episode
172
+ # ─────────────────────────────────────────────────────────────────────────────
173
+ class TestBurnout:
174
+ def test_burnout_terminates_episode(self):
175
+ tasks = generate_tasks("easy", seed=0)
176
+ env = CLMEnvironment(tasks=tasks, max_steps=200)
177
+ env.reset()
178
+ env.state.energy = 0.08 # just above burnout threshold
179
+ done = False
180
+ for _ in range(5):
181
+ _, _, done, info = env.step(Action(type="work", task_id="e1"))
182
+ if done:
183
+ break
184
+ assert done, "Episode should terminate on burnout"
185
+
186
+ def test_burnout_applies_penalty(self):
187
+ tasks = generate_tasks("easy", seed=0)
188
+ env = CLMEnvironment(tasks=tasks, max_steps=200)
189
+ env.reset()
190
+ env.state.energy = 0.08
191
+ rewards = []
192
+ done = False
193
+ for _ in range(5):
194
+ _, r, done, _ = env.step(Action(type="work", task_id="e1"))
195
+ rewards.append(r)
196
+ if done:
197
+ break
198
+ assert any(r <= -0.5 for r in rewards), "Burnout should produce a large negative reward"
199
+
200
+
201
+ # ─────────────────────────────────────────────────────────────────────────────
202
+ # Grader score bounds
203
+ # ─────────────────────────────────────────────────────────────────────────────
204
+ class TestGraderBounds:
205
+ def test_grader_always_in_bounds(self):
206
+ for level in ["easy", "medium", "hard", "expert"]:
207
+ for seed in range(10):
208
+ tasks = generate_tasks(level, seed=seed)
209
+ for frac in [0.0, 0.3, 0.7, 1.0]:
210
+ for t in tasks:
211
+ t.progress = frac
212
+ score = deterministic_grader(tasks, time_step=30, final_energy=0.5)
213
+ assert 0.01 <= score <= 0.99, \
214
+ f"Score {score} out of bounds for {level} seed={seed} progress={frac}"
215
+
216
+ def test_grader_higher_completion_scores_higher(self):
217
+ tasks_low = generate_tasks("medium", seed=1)
218
+ tasks_high = generate_tasks("medium", seed=1)
219
+ for t in tasks_low: t.progress = 0.0
220
+ for t in tasks_high: t.progress = 1.0
221
+ assert deterministic_grader(tasks_high, 30, 0.7) > \
222
+ deterministic_grader(tasks_low, 30, 0.7)
223
+
224
+
225
+ # ─────────────────────────────────────────────────────────────────────────────
226
+ # FIX 6 – Partial observability
227
+ # ─────────────────────────────────────────────────────────────────────────────
228
+ class TestPartialObservability:
229
+ def test_observation_has_no_raw_floats(self):
230
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0))
231
+ obs = env.reset()
232
+ vs = obs.visible_state
233
+ # energy_level and stress float must NOT appear in visible state
234
+ assert not hasattr(vs, "energy_level"), "energy_level float should not be in observation"
235
+ assert isinstance(vs.fatigue_level, str)
236
+ assert isinstance(vs.stress_level, str)
237
+
238
+ def test_fatigue_levels_are_valid(self):
239
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0))
240
+ env.reset()
241
+ env.state.energy = 0.1 # should be "high" fatigue
242
+ obs = env._get_observation()
243
+ assert obs.visible_state.fatigue_level == "high"
244
+ env.state.energy = 0.5 # "medium"
245
+ assert env._get_observation().visible_state.fatigue_level == "medium"
246
+ env.state.energy = 0.9 # "low"
247
+ assert env._get_observation().visible_state.fatigue_level == "low"
248
+
249
+ def test_stress_levels_are_valid(self):
250
+ env = CLMEnvironment(tasks=generate_tasks("easy", seed=0))
251
+ env.reset()
252
+ env.state.stress = 0.8
253
+ assert env._get_observation().visible_state.stress_level == "critical"
254
+ env.state.stress = 0.5
255
+ assert env._get_observation().visible_state.stress_level == "elevated"
256
+ env.state.stress = 0.1
257
+ assert env._get_observation().visible_state.stress_level == "calm"