Agentic-Service-Data-Eyond-Catalog

Running

App Files Files Community

Agentic-Service-Data-Eyond-Catalog / PHASE1_TO_PHASE2_REPORT.md

ishaq101

feat/Catalog Retrieval System (#1)

6bff5d9 3 days ago

preview code

raw

history blame contribute delete

19.6 kB

Phase 1 → Phase 2 Migration Report

A walkthrough of what changed between the original retrieval-style backend (Phase 1) and the current catalog-driven backend (Phase 2). Intended as a hand-off for the lead.

1. The conceptual change

Phase 1 was a single retrieval-style RAG pipeline. Every question — whether it pointed at a database, a spreadsheet, or a PDF — went through the same primitive: chunk + embed + top-K over PGVector. Schema and tabular columns were embedded as chunks and ranked alongside prose. When the question needed SQL, the LLM wrote the SQL string directly (via query_executor).

Phase 2 splits the system into two paths governed by an LLM router:

Path	Primitive	Why
Unstructured (PDF / DOCX / TXT)	Dense similarity over prose chunks (PGVector)	Right primitive for free text
Structured (DB / CSV / XLSX / Parquet)	Per-user data catalog → LLM emits a JSON IR of intent → deterministic compiler → executor (SQL or pandas)	A column lookup shouldn't go through a similarity ranking lottery; the LLM emits intent, never SQL syntax

Three explicit LLM call sites only:

Intent router (classifies the user message into chat / unstructured / structured)
Query planner (turns the question + catalog into a Pydantic-validated QueryIR)
Chatbot agent (formats the final answer, streamed over SSE)

Everything else — IR validation, SQL/pandas compilation, execution — is deterministic Python.

2. File-by-file changes

2.1 Deleted (Phase 1 only)

Phase 1 path	Reason it was removed
`src/rag/base.py`, `src/rag/retriever.py`, `src/rag/router.py`	Replaced by `src/retrieval/`
`src/rag/retrievers/baseline.py`, `schema.py`, `document.py`	Schema retrieval gone (catalog replaces it); document retriever rewritten in `src/retrieval/document.py`
`src/tools/search.py` (whole `tools/` folder)	Only consumer was `rag/router.py`
`src/query/base.py`	Duplicate of `query/executor/base.py`
`src/query/query_executor.py`	Replaced by `src/query/service.py`
`src/query/executors/db_executor.py`	Replaced by `src/query/executor/db.py`
`src/query/executors/tabular.py`	Replaced by `src/query/executor/tabular.py`
`src/agents/chatbot.py` (Phase 1 LangChain chatbot)	Phase 2 `ChatbotAgent` lives at the same path now — see §2.2
`src/api/v1/knowledge.py`	Fake `/knowledge/rebuild` endpoint, never wired
`src/config/agents/system_prompt.md`, `guardrails_prompt.md`	Replaced by `src/config/prompts/{chatbot_system,guardrails}.md`
`src/models/structured_output.py` (`IntentClassification`)	Replaced by `IntentRouterDecision` Pydantic model inside `agents/orchestration.py`
`src/models/sql_query.py`	LLM no longer emits SQL; IR replaces it
`src/pipeline/orchestrator.py` (empty stub)	Redundant — `StructuredPipeline` takes the introspector at `run()` time

2.2 Renamed / moved (same role, new home)

Phase 1 location	Phase 2 location	Notes
`src/agents/chatbot.py` (Phase 1) → deleted, then `src/agents/answer_agent.py` (`AnswerAgent`) → renamed	`src/agents/chatbot.py::ChatbotAgent`	Final answer formation; streams via `astream`
`src/knowledge/parquet_service.py`	`src/storage/parquet.py`	Parquet upload/download helper
`src/pipeline/document_pipeline/document_pipeline.py` (folder)	`src/pipeline/document_pipeline.py` (flat)	Single module
`src/rag/retrievers/document.py`	`src/retrieval/document.py`	`DocumentRetriever` migrated; tabular file types filtered out of results
`src/rag/router.py`	`src/retrieval/router.py`	`RetrievalRouter`, Redis-cached, unstructured-only; dead `db: AsyncSession` + `source_hint` params removed
`src/rag/base.py` (`RetrievalResult`, `BaseRetriever`)	`src/retrieval/base.py`	Same dataclass + ABC

Heads-up on the intent router: the Phase 1 file src/agents/orchestration.py and its class OrchestratorAgent were kept in place for Phase 2 — but the body was fully rewritten. The class now emits IntentRouterDecision(needs_search, source_hint ∈ {chat, unstructured, structured}, rewritten_query). The prompt file and test file use the intent_router name (config/prompts/intent_router.md, tests/agents/test_intent_router.py), but the source module is still orchestration.py and the class is still OrchestratorAgent. Existing imports continue to work; only the behavior changed.

2.3 Added (Phase 2 new)

Catalog subsystem (whole new concept)

Path	Role
`src/catalog/models.py`	Pydantic: `Catalog → Source[] → Table[] → Column[]`, `ForeignKey`, `ColumnStats.top_values`
`src/catalog/introspect/base.py`	`BaseIntrospector` ABC
`src/catalog/introspect/database.py`	DB introspector — wraps Phase 1 `db_pipeline/extractor.py` (`get_schema`, `profile_column`, `get_row_count`)
`src/catalog/introspect/tabular.py`	CSV / XLSX / Parquet introspector — one `Table` per XLSX sheet
`src/catalog/render.py`	Renders a `Source` for the planner prompt
`src/catalog/validator.py`	Unique-ID + foreign-key-ref invariants
`src/catalog/store.py`	Postgres `jsonb` upsert keyed by `user_id` (table `data_catalog`)
`src/catalog/reader.py`	Loads + filters catalog by `source_hint`
`src/catalog/pii_detector.py`	Flags PII columns at ingestion → suppresses `sample_values`
`src/security/pii_patterns.py`	Name patterns + value regex used by the detector

JSON IR + query subsystem

Path	Role
`src/query/ir/models.py`	`QueryIR` Pydantic schema
`src/query/ir/operators.py`	`ALLOWED_FILTER_OPS`, `ALLOWED_AGG_FNS`, `LIMIT_HARD_CAP`, `TYPE_COMPATIBILITY`
`src/query/ir/validator.py`	Catalog-aware IR validation (rejects unknown column ids, bad ops, type mismatches, oversize limits)
`src/query/planner/service.py`	`QueryPlannerService.plan(question, catalog, previous_error)` — Azure OpenAI structured output → `QueryIR`
`src/query/planner/prompt.py`	Builds the planner prompt from catalog text
`src/query/compiler/base.py`	Compiler ABC
`src/query/compiler/sql.py`	`SqlCompiler` (Postgres) — all 12 filter ops, params as a dict
`src/query/compiler/pandas.py`	`PandasCompiler` — returns `CompiledPandas(apply, output_columns)`
`src/query/executor/base.py`	`BaseExecutor` + `QueryResult`
`src/query/executor/db.py`	`DbExecutor` — sqlglot SELECT-only guard, RO txn, 30 s `statement_timeout`, 10 k row cap
`src/query/executor/tabular.py`	`TabularExecutor` — Parquet via blob, `asyncio.to_thread`, 10 k cap
`src/query/executor/dispatcher.py`	`ExecutorDispatcher.pick(ir)` — picks by `source.source_type`
`src/query/service.py`	`QueryService.run(user_id, question, catalog)` — plan → validate → retry (max 3) → dispatch → execute

Agents

Path	Role
`src/agents/orchestration.py`	`OrchestratorAgent` — Phase 1 file/class name preserved; Phase 2 body. Emits `IntentRouterDecision`
`src/agents/chatbot.py`	`ChatbotAgent` — formerly `AnswerAgent` in `agents/answer_agent.py`; renamed in Cleanup PR
`src/agents/chat_handler.py`	`ChatHandler.handle(...)` — top-level orchestrator; yields `intent` / `chunk` / `done` / `error` SSE events

Pipelines & API

Path	Role
`src/pipeline/structured_pipeline.py`	DB / tabular ingestion: introspect → merge → validate → upsert
`src/pipeline/triggers.py`	`on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested`
`src/api/v1/data_catalog.py`	`GET /api/v1/data-catalog/{user_id}` + `POST /api/v1/data-catalog/rebuild`
`src/models/api/catalog.py`	Catalog request/response models
`src/config/prompts/intent_router.md`, `query_planner.md`, `chatbot_system.md`, `guardrails.md`	New prompts. `guardrails.md` is appended to `chatbot_system.md` at load time
`src/db/postgres/models.py` (added `Catalog` SQLAlchemy class)	Stores the per-user jsonb document in `data_catalog`

2.4 Rewired API endpoints

Endpoint	Phase 1 wiring	Phase 2 wiring
`POST /api/v1/chat/stream`	Inline in `chat.py`: `OrchestratorAgent` → `retriever` → `query_executor` → `chatbot`	Delegates to `ChatHandler.handle()`. Redis cache, fast intent, history load, and message persistence stay in the endpoint
`POST /api/v1/database-clients/{id}/ingest`	Called `db_pipeline_service.run()` and dual-wrote vectors	Calls only `on_db_registered` (catalog build). Failure → HTTP 500
`POST /api/v1/document/process`	Always pushed to vector store	PDF/DOCX/TXT → `knowledge_processor` (vectors); CSV/XLSX → `on_tabular_uploaded` (catalog only, no vector embedding)
`POST /api/v1/document/upload`	Storage + DB row	Same, plus `on_document_uploaded` trigger
`POST /api/v1/data-catalog/rebuild`	—	New: iterates all sources, re-runs per-source trigger
`GET /api/v1/data-catalog/{user_id}`	—	New: returns `list[CatalogIndexEntry]`

2.5 Phase 1 files still in production use

These were not rewritten — Phase 2 imports them directly:

src/database_client/database_client_service.py
src/utils/db_credential_encryption.py (decrypt_credentials_dict) — src/security/credentials.py is still a stub
src/pipeline/db_pipeline/db_pipeline_service.py (engine_scope context manager — used by both the introspector and DbExecutor)
src/pipeline/db_pipeline/extractor.py (get_schema, profile_column, get_row_count)
src/knowledge/processing_service.py (PDF / DOCX / TXT extraction + embedding)
src/db/postgres/{connection,init_db,vector_store}.py, src/storage/az_blob/, src/middlewares/, src/security/auth.py

3. End-to-end flow (current state)

3.1 Ingestion

User action                Pipeline                                Storage
──────────────             ────────────────────────────             ─────────────────
upload PDF/DOCX/TXT  →   DocumentPipeline                       →  Azure Blob + PGVector
                          (extract → chunk → embed)                 (table: langchain_pg_embedding)
                          + on_document_uploaded                    + retrieval cache invalidate

upload CSV/XLSX     →    TabularIntrospector                   →  Azure Blob (Parquet)
                          (sheets / columns + sample + stats)       + data_catalog jsonb row
                          → CatalogValidator → CatalogStore         (NO vector store — catalog only)
                          via on_tabular_uploaded

register DB         →    DatabaseIntrospector                  →  data_catalog jsonb row
                          (information_schema + sample + FKs)
                          → validate → store
                          via on_db_registered

3.2 Query (per user message → SSE stream)

POST /api/v1/chat/stream
        │
        ├── Redis cache check (24h TTL) — hit returns cached stream
        ├── _fast_intent (greetings / goodbyes) — bypass LLM
        ├── load history from chat_messages
        │
        └── ChatHandler.handle(message, user_id, history)         [src/agents/chat_handler.py]
                │
                ├─ OrchestratorAgent.classify()                    [agents/orchestration.py]
                │     → needs_search, source_hint, rewritten_query
                │
                ├── source_hint == "chat"
                │     → ChatbotAgent.astream() → yield chunk events
                │
                ├── source_hint == "unstructured"
                │     → RetrievalRouter.retrieve()                 [retrieval/router.py, Redis-cached]
                │         → DocumentRetriever (PGVector MMR/cosine/etc.)
                │     → ChatbotAgent.astream(chunks=...)
                │
                └── source_hint == "structured"
                      → CatalogReader.read(user_id, "structured")  [catalog/reader.py]
                      → QueryService.run(user_id, question, catalog)   [query/service.py]
                           │
                           ├─ QueryPlannerService.plan(...)        [query/planner/service.py]
                           │    LLM(catalog, question, prev_error?) → QueryIR
                           │
                           ├─ IRValidator.validate(ir, catalog)    [query/ir/validator.py]
                           │    fail → loop back to planner with error context (max 3)
                           │
                           ├─ ExecutorDispatcher.pick(ir)          [query/executor/dispatcher.py]
                           │    schema source  → DbExecutor
                           │    tabular source → TabularExecutor
                           │
                           ├─ DbExecutor.run(ir):                  [query/executor/db.py]
                           │    SqlCompiler → (sql, params)
                           │      → sqlglot SELECT-only guard
                           │      → engine_scope (Phase 1 utility) in asyncio.to_thread
                           │      → RO txn + statement_timeout=30s + 10k cap
                           │
                           ├─ TabularExecutor.run(ir):             [query/executor/tabular.py]
                           │    resolve Parquet blob path
                           │      → download → PandasCompiler.apply(df)
                           │      → asyncio.to_thread → 10k cap
                           │
                           └─ QueryResult { rows, columns, row_count,
                                            truncated, source_id, error?, elapsed_ms }
                      →
                      ChatbotAgent.astream(query_result=...)
                            → yield chunk events
                │
                └── final events: done / error
        │
        └── persist user + assistant messages to chat_messages
        └── populate Redis cache

Safety invariants for the structured path (read-only at every layer):

IR validated against the catalog before reaching the compiler
Identifiers come from the catalog (trusted; inlined as quoted identifiers)
Values from IR.filters are always parameterized
Compiler is deterministic — no LLM in the hot path
sqlglot rejects anything that isn't a pure SELECT
DB connection is read-only with a 30 s statement_timeout
Hard 10 000 row cap on both executors; neither raises — errors go in QueryResult.error

4. Summary table for review

Concern	Phase 1 — where it lived	Phase 2 — where it lives	Change type
Intent classification	`agents/orchestration.py::OrchestratorAgent` (free-text intent)	Same path + same class name — body rewritten to emit `IntentRouterDecision`	Body rewrite only
Top-level chat orchestration	Inline in `api/v1/chat.py`	`agents/chat_handler.py::ChatHandler`	Extracted to a reusable module
Final answer formation	`agents/chatbot.py` (Phase 1 LangChain)	`agents/chatbot.py::ChatbotAgent` (was `AnswerAgent` in `answer_agent.py` mid-cycle)	Rewritten + renamed
Schema retrieval (DB / tabular)	`rag/retrievers/schema.py` + PGVector chunks	Removed. Replaced by catalog (`catalog/store.py` jsonb) loaded verbatim into planner prompt	Whole concept replaced
Doc retrieval (PDF / DOCX / TXT)	`rag/retrievers/document.py`, `rag/router.py`	`retrieval/document.py`, `retrieval/router.py`	Moved; Redis cache restored; tabular files filtered
Query writing	`query/query_executor.py` + `models/sql_query.py` (LLM writes SQL)	`query/planner/service.py` (LLM writes IR) + `query/compiler/sql.py` (deterministic)	LLM emits intent, not SQL
DB execution	`query/executors/db_executor.py`	`query/executor/db.py::DbExecutor`	Folder renamed (`executors` → `executor`); sqlglot guard + RO txn + 30 s timeout kept
Tabular execution	`query/executors/tabular.py`	`query/executor/tabular.py::TabularExecutor`	Parquet-only; pandas compiler split out
Executor selection	Hard-coded in `query_executor.py`	`query/executor/dispatcher.py::ExecutorDispatcher`	New; routes by `source.source_type`
Catalog (NEW)	—	`catalog/` (models, introspect/, validator, store, reader, pii_detector, render)	New subsystem
Catalog persistence	(data was embedded in PGVector)	Postgres jsonb table `data_catalog`, keyed by `user_id`	New table
Ingestion triggers	Inline in API endpoints	`pipeline/triggers.py` (`on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested`)	Centralized event entry points
Structured pipeline	`pipeline/db_pipeline/db_pipeline_service.py` (still present for `engine_scope` + extractor reuse)	`pipeline/structured_pipeline.py` (orchestrator) — reuses Phase 1 extractor	New orchestrator wraps Phase 1 introspection helpers
Document pipeline	`pipeline/document_pipeline/document_pipeline.py` (folder)	`pipeline/document_pipeline.py` (file)	Flattened; CSV / XLSX now skip the vector store
Parquet helper	`knowledge/parquet_service.py`	`storage/parquet.py`	Moved into `storage/`
Prompts	`config/agents/system_prompt.md`, `guardrails_prompt.md`	`config/prompts/{intent_router,query_planner,chatbot_system,guardrails}.md`	Folder renamed; split into four files; guardrails appended to `chatbot_system` at load
PII detection	—	`catalog/pii_detector.py` + `security/pii_patterns.py`	New. Columns flagged `pii_flag=true` get `sample_values: null` so PII never enters prompts
Chat endpoint	`api/v1/chat.py` (does everything inline)	`api/v1/chat.py` (cache + history + persistence) → delegates to `ChatHandler`	Slimmed; SSE event shape is `intent` / `chunk` / `done` / `error`
DB ingest endpoint	`api/v1/db_client.py::ingest` (Phase 1 `db_pipeline_service.run()`)	`api/v1/db_client.py::ingest` (calls `on_db_registered` only)	Phase 1 dual-write removed
Document process endpoint	`api/v1/document.py::process` (always vectorize)	`api/v1/document.py::process` (PDF/DOCX/TXT → vectors; CSV/XLSX → catalog via `on_tabular_uploaded`)	Routing by file type
Catalog management API	—	`api/v1/data_catalog.py` (GET index + POST rebuild)	New

Bottom line. Every Phase 1 file under src/rag/, src/tools/, src/query/executors/, src/query/query_executor.py, src/query/base.py, src/api/v1/knowledge.py, and src/config/agents/ is gone. Phase 1 introspection helpers under src/pipeline/db_pipeline/ and src/database_client/ are still imported by Phase 2 — they were not rewritten, just wrapped. The three LLM call sites are now explicit and the SQL-writing one no longer exists; the planner emits a Pydantic-validated QueryIR instead.

The one filename gotcha to remember: the intent router still lives at src/agents/orchestration.py as class OrchestratorAgent (Phase 1 name kept for import-site compatibility, Phase 2 body). The matching prompt and tests use the intent_router name, but the source module does not.