| # Code Review Report: F011 Step 1.3 (`notebooks/compare_methods.ipynb`) |
| |
| **Risk Tier:** Medium |
| **Status:** Passed with Warnings |
| **Verdict:** APPROVE |
| |
| ## Summary |
| |
| `LLMToolCallingPolicy` is implemented per Step 1.3 intent: it builds episode messages, uses chat-template tool calling, forces ANSWER at low budget, and falls back to `parse_error` on unparseable output. No correctness or security blockers were found in the scoped notebook change. |
|
|
| ## Evidence |
|
|
| ### Tests |
| - **Status:** Mixed (targeted checks passed; existing unrelated smoke failures persist) |
| - **Commands:** |
| - `uv run python - <<'PY' ... compile notebook cells ... PY` |
| - `uv run python - <<'PY' ... runtime checks for valid action / budget fallback / parse fallback ... PY` |
| - `uv run pytest tests/test_smoke.py -v` |
| - **Results:** |
| - Notebook code-cell compilation: passed (`Compiled 6 code cells successfully`) |
| - Policy runtime checks: passed (`QUERY` valid path, `ANSWER budget_exhausted`, `ANSWER parse_error`) |
| - Smoke tests: `21 passed, 4 failed` (pre-existing reward expectation mismatches in environment tests) |
|
|
| ### Security (Medium) |
| - **Status:** Clear |
| - **Checks:** Medium-tier quick checks on parsing/generation fallback paths; no secret handling, auth, or privilege-sensitive paths added. |
|
|
| ## Issues |
|
|
| ### Critical |
| None. |
|
|
| ### Important |
| None. |
|
|
| ### Minor |
| 1. **Episode reset heuristic is question-text based and can theoretically leak history if two consecutive episodes start with identical question text.** |
| - **Location:** `notebooks/compare_methods.ipynb:313-316` |
| - **Recommendation:** Consider adding a stronger episode boundary signal (e.g., explicit wrapper reset hook or observation-based reset trigger). |
|
|
| ## Next Actions |
|
|
| 1. Proceed to Step 1.4. |
| 2. Optionally harden reset boundary logic before large-scale eval runs. |
|
|