File size: 11,315 Bytes
9e64e71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
# Demo: SQLEnv β€” Flat OpenEnv Environment with Action Dispatch

> **Generated:** 2026-02-28T14:26Z
> **Branch:** `refactor-openenv-tutorial-project-structure` @ `f28bfaa`
> **Environment:** Python 3.12.3, torch 2.2.2, MockTokenizer (no Ollama required)

---

## What This Branch Does

This branch refactors the `sql-env` project from a nested `envs/sql_env/` layout into the canonical flat `openenv init` structure, and integrates the `action-feature` branch's core action dispatch system.

The result: a working RL environment where an agent sends natural language messages (e.g. _"describe the students table"_), the environment classifies them into action types (describe/sample/query), dispatches to the appropriate handler, and returns tokenized observations for RL training. All of this runs without external services β€” `MockTokenizer` replaces HuggingFace tokenizers and Ollama failures are handled gracefully.

---

## Quickstart

```bash
git checkout refactor-openenv-tutorial-project-structure
uv sync
uv run pytest tests/ -v          # 21 tests, ~3.5s
```

**Prerequisites:** Python 3.11-3.12, `uv`.
**Optional:** Ollama with `llama3.2` for LLM-guided table selection (not needed for demo).

---

## Evidence

### 1. All 21 Tests Pass

```
$ uv run pytest tests/ -v

tests/test_smoke.py::TestModels::test_action_creation PASSED             [  4%]
tests/test_smoke.py::TestModels::test_action_with_tokens PASSED          [  9%]
tests/test_smoke.py::TestModels::test_observation_creation PASSED        [ 14%]
tests/test_smoke.py::TestModels::test_state_creation PASSED              [ 19%]
tests/test_smoke.py::TestEnvironment::test_instantiation PASSED          [ 23%]
tests/test_smoke.py::TestEnvironment::test_reset_returns_observation PASSED [ 28%]
tests/test_smoke.py::TestEnvironment::test_reset_with_empty_prompt PASSED [ 33%]
tests/test_smoke.py::TestEnvironment::test_reset_creates_new_episode PASSED [ 38%]
tests/test_smoke.py::TestEnvironment::test_step_describe PASSED          [ 42%]
tests/test_smoke.py::TestEnvironment::test_step_sample PASSED            [ 47%]
tests/test_smoke.py::TestEnvironment::test_tokens_grow_across_turns PASSED [ 52%]
tests/test_smoke.py::TestActionDetection::test_describe_keywords PASSED  [ 57%]
tests/test_smoke.py::TestActionDetection::test_sample_keywords PASSED    [ 61%]
tests/test_smoke.py::TestActionDetection::test_query_default PASSED      [ 66%]
tests/test_smoke.py::TestMessageToAction::test_creates_action PASSED     [ 71%]
tests/test_smoke.py::TestMessageToAction::test_appends_to_history PASSED [ 76%]
tests/test_smoke.py::TestMessageToAction::test_validates_input PASSED    [ 80%]
tests/test_smoke.py::TestClientSerialization::test_step_payload_serialization PASSED [ 85%]
tests/test_smoke.py::TestClientSerialization::test_parse_result_deserialization PASSED [ 90%]
tests/test_smoke.py::TestSchemaIntrospection::test_get_table_schema PASSED [ 95%]
tests/test_smoke.py::TestSchemaIntrospection::test_unknown_table PASSED  [100%]

============================== 21 passed in 3.56s ==============================
```

Tests cover: Pydantic models, environment lifecycle, action detection, message-to-action conversion, client tensor serialization, and schema introspection.

### 2. Lint and Format Clean

```
$ uv run ruff check .
All checks passed!

$ uv run ruff format --check .
14 files already formatted
```

### 3. Pydantic Model Contracts

```python
>>> from sql_env.models import SQLAction, SQLObservation, SQLState

SQLAction fields:    ['metadata', 'action_type', 'action_description', 'tokens']
SQLObservation fields: ['done', 'reward', 'metadata', 'messages', 'tokens']
SQLState fields:     ['episode_id', 'step_count', 'history_messages', 'history_tokens', 'current_action_type']
```

`SQLAction.tokens` and `SQLObservation.tokens` carry torch tensors. `SQLState.history_messages` / `history_tokens` accumulate the full conversation for RL context.

### 4. Action Type Detection

The environment classifies natural language messages into action types via keyword matching:

```
  [PASS] "describe the students table..." -> describe
  [PASS] "what columns does Course have..." -> describe
  [PASS] "show me the schema..." -> describe
  [PASS] "show me sample rows from students..." -> sample
  [PASS] "give me example data..." -> sample
  [PASS] "how many rows are in Courses..." -> sample
  [PASS] "find all students enrolled in CS101..." -> query
  [PASS] "select count(*) from students..." -> query
  [PASS] "what is the average score..." -> query
```

Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.

### 5. MockTokenizer Roundtrip

```python
>>> from server.test_sql_env import MockTokenizer
>>> tok = MockTokenizer()
>>> msg = [{'role': 'user', 'content': 'describe the students table'}]
>>> tokens = tok.apply_chat_template(msg, return_tensors='pt')
>>> tokens.shape
torch.Size([1, 27])
>>> tokens[0][:10].tolist()
[100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
>>> tok.decode(tokens[0].tolist())
'describe the students table'
```

`MockTokenizer` encodes each character as `ord(c)` and decodes via `chr(t)`. Deterministic, no downloads, perfect for tests.

### 6. Schema Introspection

SQLAlchemy ORM models are introspected at runtime to produce natural language schema descriptions:

```python
>>> env._get_table_schema('Student')
Table 'Student' has the following columns:

- student_id: integer number
- student_details: text (up to 255 characters)

>>> env._get_table_schema('NonexistentTable')
Table 'NonexistentTable' not found in schema.
```

9 tables available: Address, Person, Student, Course, PersonAddress, StudentCourseRegistration, StudentCourseAttendance, Candidate, CandidateAssessment.

### 7. Full Environment Interaction (Mock Path)

A complete multi-turn episode with no external services:

```python
>>> from server.sql_environment import SQLEnvironment
>>> from server.test_sql_env import MockTokenizer
>>> env = SQLEnvironment(system_prompt='You are a helpful SQL assistant.', tokenizer=MockTokenizer())

>>> obs = env.reset()
>>> obs.messages    # 1 message (system prompt)
>>> obs.tokens.shape
torch.Size([32])
>>> obs.done
False
```

**Turn 1 β€” Describe:**
```python
>>> action = env.message_to_action({'role': 'user', 'content': 'describe the Student table'})
>>> action.action_type
'describe'
>>> obs = env.step(action)
>>> obs.messages[-1]
{'role': 'assistant', 'content': "Table 'Address' has the following columns:\n\n- address_id: integer number\n..."}
>>> obs.tokens.shape
torch.Size([91])
```

Without Ollama, the describe action falls back to the first table (Address). With Ollama, it would correctly select "Student".

**Turn 2 β€” Sample:**
```python
>>> action = env.message_to_action({'role': 'user', 'content': 'show me sample rows from Course'})
>>> action.action_type
'sample'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
"Here's a query to sample data from Address:\n\nSELECT * FROM Address LIMIT 10;"
>>> obs.tokens.shape
torch.Size([503])
```

**Turn 3 β€” Query (no Ollama):**
```python
>>> action = env.message_to_action({'role': 'user', 'content': 'find all students enrolled in CS101'})
>>> action.action_type
'query'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
'Error: Ollama returned status 404'
>>> obs.tokens.shape
torch.Size([1028])
```

The error is graceful β€” it becomes part of the conversation history. Token tensor grows monotonically across turns (32 -> 91 -> 503 -> 1028).

### 8. Client Serialization

`SQLEnvClient` converts tensors to lists for JSON WebSocket transport:

```python
>>> from sql_env.client import SQLEnvClient
>>> from sql_env.models import SQLAction
>>> import torch

>>> action = SQLAction(action_type='query', action_description='select * from students', tokens=torch.tensor([[1, 2, 3, 4, 5]]))
>>> payload = client._step_payload(action)
{
  'action_type': 'query',
  'action_description': 'select * from students',
  'tokens': [[1, 2, 3, 4, 5]],
  'metadata': {}
}
```

Tensor -> list on send, list -> tensor on receive. Symmetric roundtrip verified in tests.

### 9. Spider Question Data

```python
>>> import json
>>> data = json.load(open('data/questions/student_assessment.json'))
>>> len(data)
53
>>> data[0]['question']
'which course has most number of registered students?'
>>> data[0]['query']
'SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1'
```

53 question-answer pairs from the Spider dataset's `student_assessment` database. Each entry has `db_id`, `query`, `question`, `query_toks`, `query_toks_no_value`, and `question_toks`.

---

## What Changed from `main`

| Area | Before (main) | After (this branch) |
|------|---------------|---------------------|
| **Layout** | `envs/sql_env/` nested | Flat root = package |
| **Build** | hatchling | setuptools |
| **Python** | 3.13 | 3.11-3.12 (torch compat) |
| **Models** | Structured obs (question, schema, result) | Chat-based obs (messages + tokens) |
| **Action** | `argument` field | `action_description` + `tokens` tensor |
| **Environment** | Scaffold stubs | Real SQLite + Ollama + keyword dispatch |
| **Client** | Basic EnvClient | Tensor <-> list serialization |
| **Data** | Empty .gitkeep dirs | 9 ORM models + 53 Spider questions |
| **Tests** | 0 | 21 (all passing) |
| **Empty dirs** | `training_pipeline/`, `submission_artifacts/` | Removed |

---

## Known Behaviors (Not Bugs)

1. **Ollama fallback:** Without Ollama, `_call_ollama_to_select_table()` falls back to the first table (`Address`). Query actions return `Error: Ollama returned status 404`. This is by design β€” the mock path is for dev/test, not production.

2. **`message_to_action()` mutates state:** It appends the message to `_state.history_messages` before tokenizing. This is intentional β€” the tokenizer needs the full conversation context.

3. **`MockTokenizer` in production code:** `server/app.py` imports `MockTokenizer` from `server/test_sql_env.py` when `transformers` is unavailable. This is the teammate's design for running without GPU dependencies.

---

## Verification Checklist

- [x] `uv sync` succeeds (all deps install)
- [x] `uv run pytest tests/ -v` β€” 21/21 pass
- [x] `uv run ruff check .` β€” all checks passed
- [x] `uv run ruff format --check .` β€” 14 files formatted
- [x] Pydantic models import from `sql_env.models`
- [x] Environment instantiates with MockTokenizer
- [x] `reset()` returns valid SQLObservation with system prompt
- [x] Action detection: 9/9 keyword classifications correct
- [x] `message_to_action()` creates typed SQLAction with tokens
- [x] `step(describe)` returns schema from SQLAlchemy introspection
- [x] `step(sample)` returns SQL query text
- [x] `step(query)` returns graceful error without Ollama
- [x] Multi-turn conversation state grows correctly
- [x] Client tensor <-> list serialization roundtrips
- [x] Spider data loads (53 questions)

---

## What's Next

**Phase 3:** Reward computation (`server/reward.py`) and answer verification (`server/verifier.py`). Both are currently stubs.

---

*All output captured live on 2026-02-28. Reproduce with `uv sync && uv run pytest tests/ -v`.*