Spaces: MSGEncrypted / lesson-agent-dev

MSG committed 8 days ago

Commit e7fd66f (1 parent: 59e2c8a)

Feat/research tab agent skills (#5)

* research agent plan

* init research mind config

* rag memory

* store stuff

* skills research mind

* agent libs

* search web agent

* wip app research mind

* url validate tools

* skills fix

* search url

* research wip fix

* chat rag wip

* rag wip

* citations rag chunk check

* wip test

* clean response wip

* fix clean response

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.cursor/plans/researchmind_rag_agent_7390b536.plan.md +366 -0
.env.example +8 -0
.gitignore +3 -1
README.md +17 -3
apps/gradio-space/src/gradio_space/app.py +8 -5
apps/gradio-space/src/gradio_space/model_loading.py +3 -1
apps/gradio-space/src/gradio_space/research_helpers.py +196 -0
apps/gradio-space/src/gradio_space/tabs/__init__.py +2 -1
apps/gradio-space/src/gradio_space/tabs/chat.py +56 -9
apps/gradio-space/src/gradio_space/tabs/research_mind.py +366 -0
libs/agent/pyproject.toml +2 -0
libs/agent/src/agent/models.py +52 -0
libs/agent/src/agent/research_prompts.py +36 -0
libs/agent/src/agent/runner.py +257 -1
libs/agent/src/agent/skills.py +5 -0
libs/agent/src/agent/tools/research_tools.py +93 -0
libs/agent/src/agent/tools_registry.py +38 -1
libs/agent/tests/test_research_runner.py +107 -0
libs/inference/src/inference/response_clean.py +87 -0
libs/inference/tests/test_response_clean.py +34 -0
libs/researchmind/README.md +9 -0
libs/researchmind/pyproject.toml +25 -0
libs/researchmind/src/researchmind/__init__.py +11 -0
libs/researchmind/src/researchmind/chunking.py +46 -0
libs/researchmind/src/researchmind/citations.py +92 -0
libs/researchmind/src/researchmind/config.py +32 -0
libs/researchmind/src/researchmind/embeddings.py +32 -0
libs/researchmind/src/researchmind/extract.py +36 -0
libs/researchmind/src/researchmind/ingest.py +105 -0
libs/researchmind/src/researchmind/retrieve.py +57 -0
libs/researchmind/src/researchmind/scrape_pdf.py +30 -0
libs/researchmind/src/researchmind/scrape_web.py +38 -0
libs/researchmind/src/researchmind/search_urls.py +89 -0
libs/researchmind/src/researchmind/store.py +381 -0
libs/researchmind/src/researchmind/url_suggest.py +68 -0
libs/researchmind/src/researchmind/url_validate.py +118 -0
libs/researchmind/tests/test_chunking.py +15 -0
libs/researchmind/tests/test_citations.py +67 -0
libs/researchmind/tests/test_retrieve.py +95 -0
libs/researchmind/tests/test_search_queries.py +29 -0
libs/researchmind/tests/test_store.py +57 -0
libs/researchmind/tests/test_url_validate.py +65 -0
pyproject.toml +2 -0
skills/extract-content/SKILL.md +16 -0
skills/extract-content/references/chunking-policy.md +9 -0
skills/extract-content/scripts/chunk_and_index.py +35 -0
skills/research-mind/SKILL.md +30 -0
skills/research-mind/references/citation-format.md +6 -0
skills/research-mind/references/ingest-modes.md +9 -0
skills/research-mind/scripts/ask.py +33 -0

.cursor/plans/researchmind_rag_agent_7390b536.plan.md ADDED

	@@ -0,0 +1,366 @@

+---
+name: ResearchMind RAG Agent
+overview: "Add ResearchMind: ingest skills (web/PDF/extract) with references and scripts, a persistent MemRAG store (SQLite + embeddings), an agent runner with citation-backed Q&A, and a new Gradio tab. Topic mode suggests URLs via the local model (user confirms); optional auto-search mode via app dropdown and skill flags."
+todos:
+  - id: pkg-researchmind
+    content: "Create libs/researchmind package: MemRAGStore (SQLite), chunking, sentence-transformers embeddings, retrieve + citations"
+    status: completed
+  - id: skills-scrape-extract
+    content: Add skills/scrape-web, scrape-pdf, extract-content, research-mind with references/ and scripts/ CLIs
+    status: completed
+  - id: agent-runner
+    content: Extend SkillRegistry (flags), ToolRegistry (5 tools), AgentRunner ingest/chat with suggest_urls + auto_search boolean
+    status: completed
+  - id: gradio-tab
+    content: "Add research_mind.py tab: topic/URL/file ingest, mode dropdown, URL confirm, session chat, trace accordion"
+    status: completed
+  - id: tests-docs
+    content: Unit tests for store/retrieve/runner; update .env.example and README for ResearchMind offline Q&A
+    status: completed
+isProject: false
+---
+# ResearchMind — Scraper + RAG + MemRAG Plan
+## Goal
+Ship a **Backyard AI** research agent that:
+1. Accepts a **topic**, **URL**, or **PDF/doc** upload
+2. **Ingests once** (scrape → extract → chunk → embed → graph persist)
+3. Answers questions **offline** across sessions with **citations**
+4. Uses the **active local preset** from [`models.yaml`](models.yaml) (no new training in MVP)
+## Architecture
+```mermaid
+flowchart TB
+  subgraph gradio [Gradio Research Tab]
+    Input[Topic / URL / File]
+    Mode[Ingest mode dropdown]
+    Confirm[URL confirm list]
+    Chat[Research chat]
+  end
+  subgraph skills [skills/]
+    SW[scrape-web]
+    SP[scrape-pdf]
+    EX[extract-content]
+    RM[research-mind]
+  end
+  subgraph lib [libs/researchmind]
+    Ingest[IngestPipeline]
+    Store[MemRAGStore]
+    Retrieve[Retriever]
+    Cite[CitationFormatter]
+  end
+  subgraph agent [libs/agent]
+    Runner[AgentRunner.run_researchmind]
+    Tools[ToolRegistry]
+    Trace[TraceRecorder]
+  end
+  Input --> Runner
+  Mode --> Runner
+  Runner --> SW
+  Runner --> SP
+  Runner --> EX
+  SW --> Ingest
+  SP --> Ingest
+  EX --> Ingest
+  Ingest --> Store
+  Chat --> Retrieve
+  Retrieve --> Store
+  Runner --> Cite
+  Cite --> Chat
+  Runner --> Trace
+```
+**Separation of concerns**
+- **Skills** (`skills/*/SKILL.md` + `references/` + `scripts/`) — workflow docs and thin CLIs the agent/humans can invoke
+- **`libs/researchmind/`** — real Python library: scrape, extract, chunk, embed, SQLite MemRAG, retrieval
+- **`libs/agent/`** — orchestration: `AgentRunner.run_researchmind()`, tool handlers, prompts with citations
+- **`apps/gradio-space/`** — third top-level tab wired like [`education_pptx.py`](apps/gradio-space/src/gradio_space/tabs/education_pptx.py)
+**Not in MVP scope:** wiring [`research/ensemble/src/ensemble/memory.py`](research/ensemble/src/ensemble/memory.py) toy `Embedder` (token-id bound, research-only). Production path uses **sentence-transformers** (`all-MiniLM-L6-v2`) for arbitrary text, fully offline after first model download.
+---
+## 1. New package: `libs/researchmind/`
+Add workspace member in root [`pyproject.toml`](pyproject.toml) and depend from `agent` + `gradio-space`.
+| Module | Responsibility |
+|--------|----------------|
+| `store.py` | **MemRAGStore** — SQLite at `$RESEARCHMIND_DATA_DIR/memory.db` |
+| `ingest.py` | **IngestPipeline** — normalize → chunk → embed → graph edges |
+| `scrape_web.py` | `httpx` + `trafilatura` fetch/clean HTML |
+| `scrape_pdf.py` | `pypdf` text extraction; optional OCR hook stub |
+| `extract.py` | Unified `ExtractedDocument` (title, url, mime, text, metadata) |
+| `chunking.py` | Sliding-window chunks (~512 tokens / 128 overlap) with stable IDs |
+| `embeddings.py` | Lazy-load `SentenceTransformer`, batch encode, L2-normalize |
+| `retrieve.py` | Top-k cosine search + optional graph expansion (same-doc neighbors) |
+| `citations.py` | Map chunks → `[1]` footnotes with source title/URL/page |
+| `search_urls.py` | Optional DuckDuckGo search (`duckduckgo-search`) when `auto_search=True` |
+| `url_suggest.py` | LLM prompt: topic → JSON list of suggested URLs (default path) |
+### MemRAG graph schema (SQLite)
+```
+documents(id, source_type, uri, title, ingested_at, content_hash)
+chunks(id, doc_id, ordinal, text, embedding_blob, meta_json)
+edges(src_id, dst_id, rel)   -- doc->chunk, chunk->next_chunk, chunk->cites
+sessions(id, topic, created_at)
+session_messages(session_id, role, content, chunk_ids_json)
+```
+- **Persistence** enables cross-session memory: chat loads `session_id` or creates new; retrieval searches all ingested docs unless filtered by session/topic tag
+- **Dedup**: skip re-ingest when `content_hash` matches
+- **Graph expansion (light MemRAG)**: when retrieving chunk `k`, also pull adjacent chunks (`chunk->next_chunk`) from same document for context window assembly
+### Dependencies (add to `libs/researchmind/pyproject.toml`)
+- `httpx`, `trafilatura` — web scrape
+- `pypdf` — PDF
+- `python-docx` — already in agent; reuse for `.docx` uploads
+- `sentence-transformers` — offline embeddings
+- `duckduckgo-search` — optional auto-search mode
+- `numpy` — vector ops (or store as bytes in SQLite)
+Env vars (extend [`.env.example`](.env.example)):
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `RESEARCHMIND_DATA_DIR` | `outputs/researchmind` | DB + raw snapshots |
+| `RESEARCHMIND_EMBED_MODEL` | `all-MiniLM-L6-v2` | Embedding model |
+| `RESEARCHMIND_AUTO_SEARCH` | `false` | Global default for auto-search |
+| `RESEARCHMIND_TOP_K` | `5` | Retrieval depth |
+---
+## 2. Skills layout (with references + scripts)
+Create four skill folders under [`skills/`](skills/), mirroring Cursor skill layout but using existing [`SkillRegistry`](libs/agent/src/agent/skills.py) frontmatter (`name`, `description`, `task`, `tools`):
+### `skills/scrape-web/`
+```
+scrape-web/
+├── SKILL.md
+├── references/
+│   ├── allowed-domains.md      # robots.txt / rate-limit notes
+│   └── html-cleanup.md         # trafilatura settings
+└── scripts/
+    └── scrape_url.py           # CLI: python scripts/scrape_url.py <url> --out ...
+```
+- **tools:** `scrape_web`
+- Script calls `researchmind.scrape_web.fetch_and_extract`
+### `skills/scrape-pdf/`
+```
+scrape-pdf/
+├── SKILL.md
+├── references/
+│   └── pdf-limits.md           # max pages, scanned PDF note
+└── scripts/
+    └── extract_pdf.py
+```
+- **tools:** `scrape_pdf`
+### `skills/extract-content/`
+```
+extract-content/
+├── SKILL.md
+├── references/
+│   └── chunking-policy.md
+└── scripts/
+    └── chunk_and_index.py      # ingest into MemRAGStore
+```
+- **tools:** `extract_and_index`
+### `skills/research-mind/` (orchestrator)
+```
+research-mind/
+├── SKILL.md
+├── references/
+│   ├── ingest-modes.md         # suggest / auto_search / direct_url
+│   └── citation-format.md
+└── scripts/
+    ├── suggest_urls.py
+    ├── ingest.py
+    └── ask.py                  # CLI Q&A with citations
+```
+Frontmatter additions (parsed as optional YAML fields in extended `Skill` dataclass):
+```yaml
+---
+name: research-mind
+task: research
+tools:
+  - suggest_urls
+  - scrape_web
+  - scrape_pdf
+  - extract_and_index
+  - research_answer
+flags:
+  auto_search: false   # skill default; overridden by agent + Gradio
+---
+```
+Extend [`libs/agent/src/agent/skills.py`](libs/agent/src/agent/skills.py) to read optional `flags:` dict without breaking existing skills.
+---
+## 3. Agent orchestration
+### New tools in [`libs/agent/src/agent/tools_registry.py`](libs/agent/src/agent/tools_registry.py)
+| Tool | Handler |
+|------|---------|
+| `suggest_urls` | `url_suggest.suggest(topic, backend)` → list[str] |
+| `scrape_web` | fetch + return `ExtractedDocument` |
+| `scrape_pdf` | extract PDF path/bytes |
+| `extract_and_index` | chunk + embed + `MemRAGStore.add_document` |
+| `research_answer` | retrieve + RAG prompt + `backend.chat` → answer + citations |
+### New runner methods in [`libs/agent/src/agent/runner.py`](libs/agent/src/agent/runner.py)
+```python
+def run_researchmind_ingest(
+    *, topic: str | None, urls: list[str], files: list[Path],
+    auto_search: bool, session_id: str | None,
+    model_key: str, backend: InferenceBackend,
+) -> ResearchIngestResult: ...
+def run_researchmind_chat(
+    *, question: str, session_id: str,
+    model_key: str, backend: InferenceBackend,
+) -> ResearchChatResult: ...
+```
+**Ingest flow (default — Option C)**
+1. If `topic` and no URLs/files: call `suggest_urls` (local LLM returns JSON URL list)
+2. Return suggested URLs to UI for **user confirmation** (Gradio checkbox group)
+3. On confirm: scrape each URL / PDF / doc → `extract_and_index`
+4. If `auto_search=True`: skip LLM suggest; run DuckDuckGo `search_urls(topic, n=5)` and ingest without confirmation
+**Chat flow**
+1. `retrieve(question, top_k)` from `MemRAGStore`
+2. Build system prompt from `skills/research-mind/SKILL.md` body + `references/citation-format.md`
+3. Inject numbered context blocks; instruct model to cite `[n]`
+4. `TraceRecorder` logs retrieval chunk IDs + LLM I/O (Sharing is Caring badge)
+### Pydantic models in [`libs/agent/src/agent/models.py`](libs/agent/src/agent/models.py)
+- `ResearchIngestInput`, `ResearchChatInput`, `Citation`, `ResearchChatResult`
+---
+## 4. Gradio tab: Research Agent
+New file: [`apps/gradio-space/src/gradio_space/tabs/research_mind.py`](apps/gradio-space/src/gradio_space/tabs/research_mind.py)
+Register in [`app.py`](apps/gradio-space/src/gradio_space/app.py) and [`tabs/__init__.py`](apps/gradio-space/src/gradio_space/tabs/__init__.py).
+### UI layout
+```
+Research Agent tab
+├── Markdown intro (offline-after-ingest, citations)
+├── Session: dropdown of past sessions + "New session"
+├── Ingest section
+│   ├── Textbox: topic (optional)
+│   ├── Textbox: URLs (one per line, optional)
+│   ├── File: PDF/DOCX upload (optional)
+│   ├── Dropdown: ingest mode
+│   │   ├── "Suggest URLs (confirm)"  [default]
+│   │   └── "Auto search & ingest"
+│   ├── Button: "Discover sources" → shows CheckboxGroup of suggested URLs
+│   └── Button: "Ingest selected" → status + doc count
+├── Chat section
+│   ├── Chatbot (history)
+│   ├── Textbox: question
+│   └── Button: Ask
+└── Accordion: trace JSON + ingested sources table
+```
+**Handler pattern:** mirror `generate_lesson_slides()` — `ensure_model_loaded()`, `AgentRunner()`, try/except with user-visible errors, `gradio_allowed_paths()` extended for `RESEARCHMIND_DATA_DIR`.
+Update app header in `app.py` to mention ResearchMind alongside Lesson Agent.
+---
+## 5. Offline-after-ingest guarantee
+| Phase | Network |
+|-------|---------|
+| Ingest (scrape/search) | May use network |
+| Embed model first run | HuggingFace download once |
+| Q&A / chat | **No network** — only SQLite + local LLM |
+Raw HTML/PDF snapshots saved under `RESEARCHMIND_DATA_DIR/raw/{doc_id}/` for audit and re-chunk without re-scrape.
+---
+## 6. Tests
+| Location | Coverage |
+|----------|----------|
+| `libs/researchmind/tests/test_store.py` | SQLite CRUD, dedup hash |
+| `libs/researchmind/tests/test_chunking.py` | chunk boundaries |
+| `libs/researchmind/tests/test_retrieve.py` | top-k with fixture embeddings |
+| `libs/agent/tests/test_research_runner.py` | mock backend; ingest + chat happy path |
+| `libs/researchmind/tests/fixtures/` | small HTML snippet + 1-page PDF |
+Use offline fixtures for CI; mark optional network tests `@pytest.mark.network`.
+---
+## 7. Docker / Space considerations
+- Add `sentence-transformers` + embedding model to Docker image **or** lazy-download on first ingest (document in README)
+- `allowed_paths` must include `RESEARCHMIND_DATA_DIR` for any file previews
+- Embeddings do not require a GPU (MiniLM is small enough to run on CPU); the same GPU preset works for chat
+---
+## 8. Implementation order
+1. **`libs/researchmind`** core: store, chunk, embed, retrieve, citations
+2. **Skills** skeleton: four folders with SKILL.md + references + script stubs calling library
+3. **Agent tools + runner** methods
+4. **Gradio tab** with suggest-confirm flow + auto-search dropdown
+5. **Tests + `.env.example` + README** section under Backyard AI track
+---
+## Key files to modify
+| File | Change |
+|------|--------|
+| [`pyproject.toml`](pyproject.toml) | Add `researchmind` workspace member |
+| [`libs/agent/pyproject.toml`](libs/agent/pyproject.toml) | Depend on `researchmind` |
+| [`apps/gradio-space/pyproject.toml`](apps/gradio-space/pyproject.toml) | Transitive via `agent` |
+| [`libs/agent/src/agent/skills.py`](libs/agent/src/agent/skills.py) | Optional `flags` in frontmatter |
+| [`libs/agent/src/agent/runner.py`](libs/agent/src/agent/runner.py) | `run_researchmind_*` |
+| [`apps/gradio-space/src/gradio_space/app.py`](apps/gradio-space/src/gradio_space/app.py) | Third tab |
+| [`.env.example`](.env.example) | ResearchMind env vars |
+| [`README.md`](README.md) | ResearchMind usage blurb |
+---
+## Future (post-MVP, not in this PR)
+- LoRA distillation on ingested corpus via [`research/finetune.py`](research/finetune.py)
+- Bridge to [`research/ensemble`](research/ensemble/) for ablation experiments
+- Entity extraction edges in MemRAG graph (true knowledge graph)
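
The chunking policy named in the plan above (sliding windows of roughly 512 tokens with 128 overlap, and stable IDs) could be sketched as follows. This is a minimal illustration, not the contents of `chunking.py`, which this view does not show in full. It assumes whitespace-separated words stand in for tokens, and the `Chunk` dataclass, the `chunk_text` name, and the SHA-256 ID scheme are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # stable: derived from doc id, ordinal, and chunk text
    ordinal: int   # position of the chunk within its document
    text: str


def chunk_text(doc_id: str, text: str, size: int = 512, overlap: int = 128) -> list[Chunk]:
    """Split text into overlapping windows of whitespace tokens (an assumption)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap  # each window starts this many tokens after the last
    chunks: list[Chunk] = []
    for ordinal, start in enumerate(range(0, len(tokens), step)):
        body = " ".join(tokens[start:start + size])
        # Hypothetical ID scheme: hashing the same input always yields the same ID.
        digest = hashlib.sha256(f"{doc_id}:{ordinal}:{body}".encode("utf-8")).hexdigest()[:16]
        chunks.append(Chunk(chunk_id=digest, ordinal=ordinal, text=body))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Because the ID depends only on the document ID, the ordinal, and the text, re-chunking an unchanged document produces the same IDs, which is one plausible way to keep `chunk->next_chunk` edges valid across re-ingests.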

.env.example CHANGED

@@ -9,6 +9,14 @@ ALLOW_MODEL_SWITCH=false
 # AGENT_TRACES_DIR=outputs/traces
 # SKILLS_DIR=./skills
+# --- ResearchMind (MemRAG + scraper) ---
+# RESEARCHMIND_DATA_DIR=outputs/researchmind
+# RESEARCHMIND_EMBED_MODEL=all-MiniLM-L6-v2
+# RESEARCHMIND_AUTO_SEARCH=false
+# RESEARCHMIND_TOP_K=5
+# RESEARCHMIND_CHUNK_SIZE=512
+# RESEARCHMIND_CHUNK_OVERLAP=128
 # --- Legacy single-model overrides (optional; applied to ACTIVE_MODEL only) ---
 # INFERENCE_BACKEND=transformers
 # MODEL_ID=openbmb/MiniCPM5-1B
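
A loader for the `RESEARCHMIND_*` variables above might look like the sketch below. The variable names and defaults come from `.env.example`. The `ResearchMindSettings` dataclass, `load_settings`, and the truthy-string rule are assumptions made for illustration, and the actual `researchmind/config.py` may differ.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ResearchMindSettings:  # hypothetical container, not the library's class
    data_dir: Path
    embed_model: str
    auto_search: bool
    top_k: int
    chunk_size: int
    chunk_overlap: int


def _as_bool(raw: str) -> bool:
    # Assumed rule: common truthy spellings count as true, anything else as false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] | None = None) -> ResearchMindSettings:
    """Read RESEARCHMIND_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = ResearchMindSettings(
        data_dir=Path(env.get("RESEARCHMIND_DATA_DIR", "outputs/researchmind")),
        embed_model=env.get("RESEARCHMIND_EMBED_MODEL", "all-MiniLM-L6-v2"),
        auto_search=_as_bool(env.get("RESEARCHMIND_AUTO_SEARCH", "false")),
        top_k=int(env.get("RESEARCHMIND_TOP_K", "5")),
        chunk_size=int(env.get("RESEARCHMIND_CHUNK_SIZE", "512")),
        chunk_overlap=int(env.get("RESEARCHMIND_CHUNK_OVERLAP", "128")),
    )
    # An overlap at least as large as the window would make chunking stall.
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("RESEARCHMIND_CHUNK_OVERLAP must be >= 0 and < RESEARCHMIND_CHUNK_SIZE")
    return settings
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the process environment.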

.gitignore CHANGED

@@ -12,4 +12,6 @@ build/
 outputs/traces
-/results
+/results
+outputs/researchmind

README.md CHANGED

@@ -32,7 +32,10 @@ cp .env.example .env   # optional: edit model settings
 uv run --package gradio-space python -m gradio_space.app
 ```
-Open [http://localhost:7860](http://localhost:7860). Use the **Lesson slides** tab: enter a topic, grade, and slide count. The model loads on first generate.
+Open [http://localhost:7860](http://localhost:7860).
+- **Lesson slides** — topic, grade, slide count → downloadable PowerPoint
+- **Research Agent** — scrape/index sources into MemRAG, then ask questions offline with citations
 ## How it works
@@ -42,13 +45,21 @@ Open [http://localhost:7860](http://localhost:7860). Use the **Lesson slides** t
 4. **Trace** — JSON log saved under `outputs/traces/` for the Sharing is Caring badge
 ```text
-apps/gradio-space/   # Gradio tabs (Lesson slides + Chat debug)
+apps/gradio-space/   # Gradio tabs (Lesson slides, Research Agent, Chat debug)
 libs/agent/          # Skill agent runner, tools, trace recorder
+libs/researchmind/   # Scraper, chunk/embed, MemRAG SQLite store, retrieval
 libs/inference/      # Transformers + llama.cpp backends
-skills/              # SKILL.md task definitions
+skills/              # SKILL.md + references/ + scripts/ per task
 research/            # Fine-tune, ensemble experiments, agentic evals (optional)
 ```
+### ResearchMind (offline after ingest)
+1. **Skills** — `skills/scrape-web`, `scrape-pdf`, `extract-content`, `research-mind`
+2. **Ingest** — URL/PDF/DOCX or topic → (optional LLM URL suggest + confirm, or auto search) → chunk + embed → SQLite
+3. **Q&A** — local model + retrieved chunks with `[n]` citations (no network at chat time)
+4. **Memory** — persists under `RESEARCHMIND_DATA_DIR` (default `outputs/researchmind`)
 Optional research tooling (not required for the Space): see [research/USAGE.md](research/USAGE.md).
 ## Environment variables
@@ -59,6 +70,9 @@ Optional research tooling (not required for the Space): see [research/USAGE.md](
 | `AGENT_OUTPUTS_DIR` | `/tmp/agent_outputs` | Generated `.pptx` files |
 | `AGENT_TRACES_DIR` | `outputs/traces` | Agent trace JSON |
 | `SKILLS_DIR` | `./skills` | Skill definitions root |
+| `RESEARCHMIND_DATA_DIR` | `outputs/researchmind` | MemRAG DB and raw snapshots |
+| `RESEARCHMIND_EMBED_MODEL` | `all-MiniLM-L6-v2` | Sentence embedding model |
+| `RESEARCHMIND_AUTO_SEARCH` | `false` | Default auto DuckDuckGo ingest |
 See [`.env.example`](.env.example) and [`models.yaml`](models.yaml) for model presets.
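
The Q&A step in the README text above (retrieved chunks presented to the model with `[n]` citations) could be sketched roughly as below. It is an illustration under stated assumptions: `RetrievedChunk`, `build_context`, and `cited_numbers` are hypothetical names, and the real prompt assembly in `citations.py` and `runner.py` is not visible in this view.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:  # hypothetical shape of one retrieval hit
    title: str
    uri: str
    text: str


def build_context(chunks: list[RetrievedChunk]) -> tuple[str, list[str]]:
    """Number each chunk so the model can cite it as [n]; also return footnotes."""
    blocks: list[str] = []
    footnotes: list[str] = []
    for n, chunk in enumerate(chunks, start=1):
        blocks.append(f"[{n}] {chunk.text}")
        footnotes.append(f"[{n}] {chunk.title} ({chunk.uri})")
    return "\n\n".join(blocks), footnotes


def cited_numbers(answer: str, available: int) -> list[int]:
    """Return the distinct [n] markers in the answer that point at a real chunk."""
    found = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in found if 1 <= n <= available)
```

Filtering the answer's markers against the number of chunks actually supplied is one way to drop citations the model invented. Whether the commit's "citations rag chunk check" works this way is not shown here.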

apps/gradio-space/src/gradio_space/app.py CHANGED

@@ -3,8 +3,9 @@ import os
 import gradio as gr
 from gradio_space.model_loading import preload_active_model
-from gradio_space.tabs import build_chat_tab, build_education_pptx_tab
+from gradio_space.tabs import build_chat_tab, build_education_pptx_tab, build_research_mind_tab
 from gradio_space.tabs.education_pptx import gradio_allowed_paths
+from gradio_space.tabs.research_mind import researchmind_allowed_paths
 from inference.config import get_app_config
 _app_config = get_app_config()
@@ -18,12 +19,12 @@ def build_demo() -> gr.Blocks:
         else "Using built-in presets (models.yaml not found)."
     )
-    with gr.Blocks(title="Lesson Agent — Build Small Hackathon") as demo:
+    with gr.Blocks(title="Lesson Agent + ResearchMind — Build Small Hackathon") as demo:
         gr.Markdown(
             f"""
-# Lesson Agent
-Local skill-based agent for teachers — **topic in, PowerPoint out**.
+# Lesson Agent + ResearchMind
+Local skill-based agents — **lesson slides** and **research with MemRAG** (offline Q&A after ingest).
 - **Model:** `{active.key}` — {active.label}
 - **Backend:** `{active.backend}`
@@ -36,6 +37,8 @@ Part of the [Build Small Hackathon](https://huggingface.co/build-small-hackathon
         with gr.Tabs():
             with gr.Tab("Lesson slides"):
                 build_education_pptx_tab()
+            with gr.Tab("ResearchMind"):
+                build_research_mind_tab()
             with gr.Tab("Chat (debug)"):
                 build_chat_tab()
@@ -48,7 +51,7 @@ def main() -> None:
     demo.launch(
         server_name="0.0.0.0",
         server_port=int(os.environ.get("PORT", "7860")),
-        allowed_paths=gradio_allowed_paths(),
+        allowed_paths=[*gradio_allowed_paths(), *researchmind_allowed_paths()],
     )

apps/gradio-space/src/gradio_space/model_loading.py CHANGED

@@ -1,5 +1,6 @@
 from inference.config import get_app_config, get_model_config
 from inference.factory import get_backend, reset_backend
+from inference.response_clean import strip_reasoning_output
 _app_config = get_app_config()
 _current_model_key: str | None = None
@@ -111,4 +112,5 @@ def chat(message: str, history: list, model_key: str) -> str:
     messages = _history_to_messages(history)
     messages.append({"role": "user", "content": message})
-    return get_backend(model_key).chat(messages)
+    reply = get_backend(model_key).chat(messages)
+    return strip_reasoning_output(reply)

apps/gradio-space/src/gradio_space/research_helpers.py ADDED

	@@ -0,0 +1,196 @@

+from __future__ import annotations
+import json
+from pathlib import Path
+import gradio as gr
+from agent.models import ResearchIngestResult
+from agent.runner import AgentRunner
+from gradio_space.model_loading import chat, ensure_model_loaded, get_active_model_key
+from inference.factory import get_backend
+from researchmind.ingest import IngestPipeline
+def list_session_choices() -> list[tuple[str, str]]:
+    store = IngestPipeline().store
+    sessions = store.list_sessions()
+    choices: list[tuple[str, str]] = [("New session (chat only)", "")]
+    for s in sessions:
+        label = f"{s.topic or 'Untitled'} ({s.id})"
+        choices.append((label, s.id))
+    return choices
+def refresh_sessions(current: str):
+    choices = list_session_choices()
+    values = [c[1] for c in choices]
+    value = current if current in values else ""
+    return gr.update(choices=choices, value=value)
+def list_doc_choices(session_id: str | None) -> list[tuple[str, str]]:
+    store = IngestPipeline().store
+    docs = store.list_documents(session_id=session_id or None)
+    choices: list[tuple[str, str]] = []
+    for d in docs:
+        label = f"{d.title} ({d.source_type})"
+        if len(d.uri) > 60:
+            label += f" — {d.uri[:57]}…"
+        else:
+            label += f" — {d.uri}"
+        choices.append((label, d.id))
+    return choices
+def refresh_doc_choices(session_id: str, current: list[str] | None):
+    choices = list_doc_choices(session_id or None)
+    valid = {c[1] for c in choices}
+    selected = [doc_id for doc_id in (current or []) if doc_id in valid]
+    default_selected = [c[1] for c in choices] if choices and not selected else selected
+    return gr.update(choices=choices, value=default_selected)
+def load_trace_json(trace_path: str) -> str:
+    if not trace_path:
+        return ""
+    if trace_path.strip().startswith("{"):
+        return trace_path
+    path = Path(trace_path)
+    if path.is_file():
+        return path.read_text(encoding="utf-8")
+    return trace_path
+def trace_summary_markdown(trace_path: str) -> str:
+    raw = load_trace_json(trace_path)
+    if not raw or not raw.strip().startswith("{"):
+        return raw or "_No trace yet._"
+    try:
+        data = json.loads(raw)
+    except json.JSONDecodeError:
+        return f"Trace file: `{trace_path}`"
+    lines = [
+        f"**Run** `{data.get('run_id', '?')}` · skill `{data.get('skill', '?')}`",
+        "",
+    ]
+    for step in data.get("steps", []):
+        if step.get("type") != "note":
+            continue
+        msg = step.get("message", "")
+        extra = {k: v for k, v in step.items() if k not in ("type", "message")}
+        detail = ""
+        if extra:
+            detail = " — " + ", ".join(f"{k}={v!r}" for k, v in extra.items())
+        lines.append(f"- {msg}{detail}")
+    if len(lines) <= 2:
+        lines.append("_No notes in trace. See Trace JSON below._")
+    return "\n".join(lines)
+def format_ingest_status(result: ResearchIngestResult) -> str:
+    lines = [result.message, ""]
+    if result.ingested:
+        lines.append("**Ingested**")
+        lines.extend(f"- {url}" for url in result.ingested)
+        lines.append("")
+    if result.skipped:
+        lines.append("**Skipped (duplicate)**")
+        lines.extend(f"- {url}" for url in result.skipped)
+        lines.append("")
+    if result.failures:
+        lines.append("**Failed**")
+        for failure in result.failures:
+            lines.append(f"- `{failure.url}` — _{failure.stage}_: {failure.reason}")
+        lines.append("")
+        lines.append("_Open the **Trace** tab for full JSON._")
+    return "\n".join(lines).strip()
+def memory_summary(session_id: str) -> str:
+    store = IngestPipeline().store
+    docs = store.list_documents(session_id=session_id or None)
+    chunks = store.count_chunks()
+    if not docs:
+        return f"_No documents indexed yet._ Total chunks in store: **{chunks}**."
+    scope = f"session `{session_id}`" if session_id else "all sessions"
+    lines = [f"**{len(docs)}** document(s) in {scope} · **{chunks}** total chunks in store\n"]
+    for d in docs:
+        lines.append(f"- **{d.title}** (`{d.source_type}`) — {d.uri}")
+    return "\n".join(lines)
+def rag_scope_hint(session_id: str, doc_ids: list[str] | None) -> str:
+    if doc_ids:
+        return f"RAG scope: **{len(doc_ids)}** selected document(s)."
+    if session_id:
+        n = len(IngestPipeline().store.list_documents(session_id=session_id))
+        return f"RAG scope: all **{n}** document(s) in session `{session_id}`."
+    return "RAG scope: **entire** indexed corpus (all sessions)."
+def run_research_question(
+    question: str,
+    *,
+    session_id: str,
+    doc_ids: list[str] | None,
+    model_key: str | None = None,
+) -> tuple[str, str, str]:
+    """Returns (answer_markdown, trace_json, trace_summary_md)."""
+    key = model_key or get_active_model_key()
+    load_error = ensure_model_loaded(key)
+    if load_error:
+        return load_error, load_error, load_error
+    if not question.strip():
+        return "Enter a question.", "", ""
+    sid = session_id
+    if not sid:
+        sid = IngestPipeline().store.create_session().id
+    runner = AgentRunner()
+    result = runner.run_researchmind_chat(
+        question=question,
+        session_id=sid,
+        doc_ids=doc_ids or None,
+        model_key=key,
+        backend=get_backend(key),
+    )
+    trace_json = json.dumps(
+        {
+            "trace_path": result.trace_path,
+            "citations": [c.model_dump() for c in result.citations],
+            "scope": {
+                "session_id": sid,
+                "doc_ids": doc_ids or [],
+            },
+        },
+        indent=2,
+    )
+    return (
+        result.answer,
+        trace_json,
+        trace_summary_markdown(result.trace_path),
+    )
+def rag_aware_chat(
+    message: str,
+    history: list,
+    model_key: str,
+    use_rag: bool,
+    session_id: str,
+    doc_ids: list[str] | None,
+) -> str:
+    if not use_rag:
+        return chat(message, history, model_key)
+    answer, _, _ = run_research_question(
+        message,
+        session_id=session_id,
+        doc_ids=doc_ids,
+        model_key=model_key,
+    )
+    return answer

apps/gradio-space/src/gradio_space/tabs/__init__.py CHANGED

@@ -1,4 +1,5 @@
 from gradio_space.tabs.chat import build_chat_tab
 from gradio_space.tabs.education_pptx import build_education_pptx_tab
-__all__ = ["build_chat_tab", "build_education_pptx_tab"]
+from gradio_space.tabs.research_mind import build_research_mind_tab
+__all__ = ["build_chat_tab", "build_education_pptx_tab", "build_research_mind_tab"]

apps/gradio-space/src/gradio_space/tabs/chat.py CHANGED

@@ -1,6 +1,13 @@
 import gradio as gr
-from gradio_space.model_loading import chat, model_status
 from inference.config import get_app_config
 _app_config = get_app_config()
@@ -11,12 +18,29 @@ def build_chat_tab() -> None:
         """
 ### Model chat (debug)
-Test the active local model with a simple chat interface.
 """
     )
     model_key = _app_config.active_model
     if _app_config.allow_model_switch and len(_app_config.models) > 1:
         model_dropdown = gr.Dropdown(
             choices=_app_config.model_choices(),
@@ -26,19 +50,42 @@ Test the active local model with a simple chat interface.
         status = gr.Markdown(model_status(model_key))
         model_dropdown.change(fn=model_status, inputs=model_dropdown, outputs=status)
         gr.ChatInterface(
-            fn=chat,
-            additional_inputs=[model_dropdown],
             examples=[
-                ["Hello! What can you help me with?", _app_config.active_model],
-                ["Explain photosynthesis in one sentence.", _app_config.active_model],
             ],
         )
     else:
         status = gr.Markdown(model_status(model_key))
         gr.ChatInterface(
-            fn=lambda message, history: chat(message, history, model_key),
             examples=[
-                "Hello! What can you help me with?",
-                "Explain photosynthesis in one sentence.",
             ],
         )

 import gradio as gr
+from gradio_space.model_loading import model_status
+from gradio_space.research_helpers import (
+    list_session_choices,
+    rag_aware_chat,
+    rag_scope_hint,
+    refresh_doc_choices,
+    refresh_sessions,
+)
 from inference.config import get_app_config
 _app_config = get_app_config()
         """
 ### Model chat (debug)
+Test the active local model. Enable **ResearchMind RAG** to answer from ingested sessions and documents with citations.
 """
     )
     model_key = _app_config.active_model
+    with gr.Row():
+        use_rag = gr.Checkbox(label="Use ResearchMind RAG", value=False)
+        session_dd = gr.Dropdown(
+            label="Session",
+            choices=list_session_choices(),
+            value="",
+            interactive=True,
+        )
+        refresh_sessions_btn = gr.Button("Refresh", size="sm")
+    doc_dd = gr.CheckboxGroup(
+        label="Documents to search (empty = all docs in session, or entire corpus if no session)",
+        choices=[],
+        value=[],
+    )
+    rag_hint = gr.Markdown(value=rag_scope_hint("", []))
     if _app_config.allow_model_switch and len(_app_config.models) > 1:
         model_dropdown = gr.Dropdown(
             choices=_app_config.model_choices(),
         status = gr.Markdown(model_status(model_key))
         model_dropdown.change(fn=model_status, inputs=model_dropdown, outputs=status)
         gr.ChatInterface(
+            fn=rag_aware_chat,
+            additional_inputs=[model_dropdown, use_rag, session_dd, doc_dd],
             examples=[
+                ["What do my ingested sources say about AI agents?", _app_config.active_model, True, "", []],
+                ["Hello! What can you help me with?", _app_config.active_model, False, "", []],
             ],
         )
     else:
         status = gr.Markdown(model_status(model_key))
+        def _chat(message, history, use_rag_flag, sid, docs):
+            return rag_aware_chat(message, history, model_key, use_rag_flag, sid, docs)
         gr.ChatInterface(
+            fn=_chat,
+            additional_inputs=[use_rag, session_dd, doc_dd],
             examples=[
+                ["What do my ingested sources say about AI agents?", True, "", []],
+                ["Hello! What can you help me with?", False, "", []],
             ],
         )
+    def _update_hint(sid: str, docs: list[str] | None, rag_on: bool) -> str:
+        if not rag_on:
+            return "_Plain chat — model only, no document retrieval._"
+        return rag_scope_hint(sid, docs)
+    refresh_sessions_btn.click(fn=refresh_sessions, inputs=[session_dd], outputs=[session_dd])
+    session_dd.change(
+        fn=refresh_doc_choices,
+        inputs=[session_dd, doc_dd],
+        outputs=[doc_dd],
+    ).then(
+        fn=_update_hint,
+        inputs=[session_dd, doc_dd, use_rag],
+        outputs=[rag_hint],
+    )
+    doc_dd.change(fn=_update_hint, inputs=[session_dd, doc_dd, use_rag], outputs=[rag_hint])
+    use_rag.change(fn=_update_hint, inputs=[session_dd, doc_dd, use_rag], outputs=[rag_hint])

apps/gradio-space/src/gradio_space/tabs/research_mind.py ADDED

	@@ -0,0 +1,366 @@

+from __future__ import annotations
+import logging
+from pathlib import Path
+import gradio as gr
+from agent.runner import AgentRunner
+from gradio_space.model_loading import ensure_model_loaded, get_active_model_key, model_status
+from gradio_space.research_helpers import (
+    format_ingest_status,
+    list_session_choices,
+    load_trace_json,
+    memory_summary,
+    rag_scope_hint,
+    refresh_doc_choices,
+    refresh_sessions,
+    run_research_question,
+    trace_summary_markdown,
+)
+from inference.factory import get_backend
+from researchmind.config import get_config
+from researchmind.ingest import IngestPipeline
+logger = logging.getLogger(__name__)
+INGEST_MODES = [
+    ("Suggest URLs (confirm)", "suggest"),
+    ("Auto search & ingest", "auto"),
+]
+def discover_sources(
+    topic: str,
+    ingest_mode: str,
+    session_id: str,
+) -> tuple[str, gr.Update, str, str, str, str, object]:
+    model_key = get_active_model_key()
+    load_error = ensure_model_loaded(model_key)
+    if load_error:
+        return (
+            load_error,
+            gr.update(choices=[], value=[]),
+            session_id,
+            load_error,
+            load_error,
+            memory_summary(session_id),
+            refresh_doc_choices(session_id, []),
+        )
+    if not topic.strip():
+        msg = "Enter a topic to discover sources."
+        return (
+            msg,
+            gr.update(choices=[], value=[]),
+            session_id,
+            msg,
+            msg,
+            memory_summary(session_id),
+            refresh_doc_choices(session_id, []),
+        )
+    auto_search = ingest_mode == "auto"
+    try:
+        runner = AgentRunner()
+        if auto_search:
+            result = runner.run_researchmind_ingest(
+                topic=topic,
+                urls=[],
+                files=[],
+                auto_search=True,
+                session_id=session_id or None,
+                model_key=model_key,
+                backend=get_backend(model_key),
+            )
+            trace_json = load_trace_json(result.trace_path)
+            return (
+                format_ingest_status(result),
+                gr.update(choices=[], value=[]),
+                result.session_id,
+                trace_summary_markdown(result.trace_path),
+                trace_json,
+                memory_summary(result.session_id),
+                refresh_doc_choices(result.session_id, []),
+            )
+        discover = runner.run_researchmind_discover(
+            topic=topic,
+            auto_search=False,
+            session_id=session_id or None,
+            model_key=model_key,
+            backend=get_backend(model_key),
+        )
+        choices = discover.suggested_urls
+        if not choices:
+            summary = (
+                "No verified URLs found. Try a more specific topic, paste URLs manually, "
+                "or switch to **Auto search & ingest**."
+            )
+        else:
+            summary = (
+                f"Found **{len(choices)} verified URL(s)** via web search "
+                "(Google + fallbacks). Select sources and click **Ingest selected**."
+            )
+        trace_json = load_trace_json(discover.trace_path)
+        return (
+            summary,
+            gr.update(choices=choices, value=choices),
+            discover.session_id,
+            trace_summary_markdown(discover.trace_path),
+            trace_json,
+            memory_summary(discover.session_id),
+            refresh_doc_choices(discover.session_id, []),
+        )
+    except Exception as exc:  # noqa: BLE001
+        logger.exception("Discover failed")
+        msg = f"Discover error: {exc}"
+        return (
+            msg,
+            gr.update(choices=[], value=[]),
+            session_id,
+            msg,
+            msg,
+            memory_summary(session_id),
+            refresh_doc_choices(session_id, []),
+        )
+def ingest_selected(
+    topic: str,
+    urls_text: str,
+    selected_urls: list[str],
+    upload_files: list[str] | None,
+    session_id: str,
+) -> tuple[str, str, str, str, object, object]:
+    model_key = get_active_model_key()
+    load_error = ensure_model_loaded(model_key)
+    if load_error:
+        return (
+            load_error,
+            memory_summary(session_id),
+            load_error,
+            load_error,
+            refresh_sessions(session_id),
+            refresh_doc_choices(session_id, []),
+        )
+    direct_urls = [ln.strip() for ln in urls_text.splitlines() if ln.strip()]
+    all_urls = list(dict.fromkeys([*direct_urls, *(selected_urls or [])]))
+    files = [Path(p) for p in (upload_files or [])]
+    if not all_urls and not files:
+        msg = "Provide URLs, select suggested sources, or upload a file."
+        return (
+            msg,
+            memory_summary(session_id),
+            msg,
+            msg,
+            refresh_sessions(session_id),
+            refresh_doc_choices(session_id, []),
+        )
+    try:
+        logger.info("Ingesting %d URL(s) and %d file(s)", len(all_urls), len(files))
+        runner = AgentRunner()
+        result = runner.run_researchmind_ingest(
+            topic=topic or None,
+            urls=all_urls,
+            files=files,
+            auto_search=False,
+            session_id=session_id or None,
+            model_key=model_key,
+            backend=get_backend(model_key),
+        )
+        trace_json = load_trace_json(result.trace_path)
+        return (
+            format_ingest_status(result),
+            memory_summary(result.session_id),
+            trace_json,
+            trace_summary_markdown(result.trace_path),
+            refresh_sessions(result.session_id),
+            refresh_doc_choices(result.session_id, []),
+        )
+    except Exception as exc:  # noqa: BLE001
+        logger.exception("Ingest failed")
+        msg = f"**Ingest error:** {exc}"
+        return (
+            msg,
+            memory_summary(session_id),
+            msg,
+            msg,
+            refresh_sessions(session_id),
+            refresh_doc_choices(session_id, []),
+        )
+def ask_question(
+    question: str,
+    session_id: str,
+    doc_ids: list[str] | None,
+    chat_history: list[dict],
+) -> tuple[list[dict], str, str, str]:
+    if not question.strip():
+        return chat_history or [], "Enter a question.", "", rag_scope_hint(session_id, doc_ids)
+    try:
+        answer, trace_json, trace_summary = run_research_question(
+            question,
+            session_id=session_id,
+            doc_ids=doc_ids,
+        )
+        history = list(chat_history or [])
+        history.append({"role": "user", "content": question})
+        history.append({"role": "assistant", "content": answer})
+        return history, trace_json, trace_summary, rag_scope_hint(session_id, doc_ids)
+    except Exception as exc:  # noqa: BLE001
+        logger.exception("Research chat failed")
+        history = list(chat_history or [])
+        history.append({"role": "user", "content": question})
+        err = f"Chat error: {exc}"
+        history.append({"role": "assistant", "content": err})
+        return history, err, err, rag_scope_hint(session_id, doc_ids)
+def build_research_mind_tab() -> None:
+    """ResearchMind UI — ingest, memory, trace, and corpus chat."""
+    model_key = get_active_model_key()
+    cfg = get_config()
+    gr.Markdown(
+        """
+### ResearchMind
+Scrape sources once, index into **MemRAG** (local SQLite + embeddings), then ask questions **offline** with citations.
+"""
+    )
+    gr.Markdown(model_status(model_key))
+    gr.Markdown(f"Memory store: `{cfg.data_dir.resolve()}`")
+    with gr.Row():
+        session_dd = gr.Dropdown(
+            label="Session",
+            choices=list_session_choices(),
+            value="",
+            interactive=True,
+        )
+        refresh_btn = gr.Button("Refresh sessions", size="sm")
+    with gr.Tabs():
+        with gr.Tab("Ingest"):
+            gr.Markdown(
+                """
+- **Suggest mode:** Google web search → verified URLs → you confirm → ingest
+- **Auto search:** same search, ingests top verified URLs immediately
+- **Direct:** paste URLs or upload PDF/DOCX
+"""
+            )
+            with gr.Row():
+                topic = gr.Textbox(
+                    label="Topic (optional)",
+                    placeholder="e.g. Photosynthesis, American Revolution",
+                )
+                ingest_mode = gr.Dropdown(
+                    label="Ingest mode",
+                    choices=[m[0] for m in INGEST_MODES],
+                    value=INGEST_MODES[0][0],
+                )
+            urls_text = gr.Textbox(
+                label="URLs (one per line, optional)",
+                lines=3,
+                placeholder="https://en.wikipedia.org/wiki/...",
+            )
+            upload_files = gr.File(
+                label="Upload PDF or DOCX",
+                file_count="multiple",
+                file_types=[".pdf", ".docx"],
+            )
+            discover_btn = gr.Button("Discover sources", variant="secondary")
+            url_choices = gr.CheckboxGroup(label="Suggested URLs to ingest", choices=[])
+            ingest_btn = gr.Button("Ingest selected", variant="primary")
+            ingest_status = gr.Markdown()
+        with gr.Tab("Memory"):
+            gr.Markdown("Indexed documents and chunk counts for the selected session.")
+            memory_md = gr.Markdown(value=memory_summary(""))
+            refresh_memory_btn = gr.Button("Refresh memory view", size="sm")
+        with gr.Tab("Trace"):
+            trace_summary = gr.Markdown()
+            trace_box = gr.Textbox(label="Trace JSON", lines=14, interactive=False)
+    gr.Markdown("---")
+    gr.Markdown("### Chat with your corpus")
+    gr.Markdown(
+        "Ask questions about ingested sources. Limit search to specific documents below, "
+        "or leave all checked to search the whole session."
+    )
+    rag_hint = gr.Markdown(value=rag_scope_hint("", []))
+    doc_dd = gr.CheckboxGroup(
+        label="Documents in session",
+        choices=[],
+        value=[],
+    )
+    chatbot = gr.Chatbot(label="Research chat", height=360)
+    question = gr.Textbox(
+        label="Question",
+        placeholder="What do these sources say about AI agents?",
+    )
+    ask_btn = gr.Button("Ask", variant="primary")
+    refresh_btn.click(fn=refresh_sessions, inputs=[session_dd], outputs=[session_dd])
+    refresh_memory_btn.click(fn=memory_summary, inputs=[session_dd], outputs=[memory_md])
+    session_dd.change(fn=memory_summary, inputs=[session_dd], outputs=[memory_md])
+    session_dd.change(
+        fn=refresh_doc_choices,
+        inputs=[session_dd, doc_dd],
+        outputs=[doc_dd],
+    ).then(
+        fn=rag_scope_hint,
+        inputs=[session_dd, doc_dd],
+        outputs=[rag_hint],
+    )
+    doc_dd.change(fn=rag_scope_hint, inputs=[session_dd, doc_dd], outputs=[rag_hint])
+    discover_btn.click(
+        fn=lambda topic, mode, sid: discover_sources(
+            topic,
+            "auto" if mode == INGEST_MODES[1][0] else "suggest",
+            sid,
+        ),
+        inputs=[topic, ingest_mode, session_dd],
+        outputs=[
+            ingest_status,
+            url_choices,
+            session_dd,
+            trace_summary,
+            trace_box,
+            memory_md,
+            doc_dd,
+        ],
+    )
+    ingest_btn.click(
+        fn=ingest_selected,
+        inputs=[topic, urls_text, url_choices, upload_files, session_dd],
+        outputs=[ingest_status, memory_md, trace_box, trace_summary, session_dd, doc_dd],
+    )
+    ask_btn.click(
+        fn=ask_question,
+        inputs=[question, session_dd, doc_dd, chatbot],
+        outputs=[chatbot, trace_box, trace_summary, rag_hint],
+    )
+    question.submit(
+        fn=ask_question,
+        inputs=[question, session_dd, doc_dd, chatbot],
+        outputs=[chatbot, trace_box, trace_summary, rag_hint],
+    )
+def researchmind_allowed_paths() -> list[str]:
+    cfg = get_config()
+    root = cfg.data_dir.resolve()
+    root.mkdir(parents=True, exist_ok=True)
+    return [str(root)]
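
`ingest_selected` merges pasted URLs with the suggested ones that were ticked and removes duplicates while preserving order. A minimal sketch of that step in isolation follows; `merge_urls` is a hypothetical helper name and the `example.com` URLs are placeholders.

```python
from __future__ import annotations


def merge_urls(urls_text: str, selected_urls: list[str] | None) -> list[str]:
    """Combine pasted and selected URLs, dropping blanks and duplicates."""
    # One URL per line; surrounding whitespace and empty lines are ignored.
    direct = [ln.strip() for ln in urls_text.splitlines() if ln.strip()]
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys([*direct, *(selected_urls or [])]))


merged = merge_urls(
    "https://example.com/a\n\n  https://example.com/b  \n",
    ["https://example.com/b", "https://example.com/c"],
)
print(merged)
# ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
```

Pasted URLs come first, so they keep priority over suggestions when the same address appears in both lists.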

libs/agent/pyproject.toml CHANGED

@@ -9,6 +9,7 @@ authors = [
 requires-python = ">=3.12"
 dependencies = [
     "inference",
     "pillow>=10.0.0",
     "pydantic>=2.0.0",
     "python-docx>=1.1.0",
@@ -18,6 +19,7 @@ dependencies = [
 [tool.uv.sources]
 inference = { workspace = true }
 [build-system]
 requires = ["uv_build>=0.8.13,<0.9.0"]

 requires-python = ">=3.12"
 dependencies = [
     "inference",
+    "researchmind",
     "pillow>=10.0.0",
     "pydantic>=2.0.0",
     "python-docx>=1.1.0",
 [tool.uv.sources]
 inference = { workspace = true }
+researchmind = { workspace = true }
 [build-system]
 requires = ["uv_build>=0.8.13,<0.9.0"]

libs/agent/src/agent/models.py CHANGED

@@ -18,3 +18,55 @@ class EducationPptxInput(BaseModel):
     topic: str
     grade: str
     slide_count: int = Field(ge=3, le=8)

     topic: str
     grade: str
     slide_count: int = Field(ge=3, le=8)
+class Citation(BaseModel):
+    index: int
+    chunk_id: str
+    doc_title: str
+    doc_uri: str
+    excerpt: str
+class ResearchIngestInput(BaseModel):
+    topic: str = ""
+    urls: list[str] = Field(default_factory=list)
+    auto_search: bool = False
+    session_id: str | None = None
+class ResearchChatInput(BaseModel):
+    question: str
+    session_id: str
+    doc_ids: list[str] = Field(default_factory=list)
+class ResearchDiscoverResult(BaseModel):
+    suggested_urls: list[str]
+    session_id: str
+    trace_path: str
+class IngestFailure(BaseModel):
+    url: str
+    reason: str
+    stage: str = "unknown"
+class ResearchIngestResult(BaseModel):
+    session_id: str
+    ingested: list[str]
+    skipped: list[str]
+    failures: list[IngestFailure] = Field(default_factory=list)
+    doc_count: int
+    chunk_count: int
+    trace_path: str
+    message: str
+class ResearchChatResult(BaseModel):
+    answer: str
+    citations: list[Citation]
+    references_markdown: str
+    session_id: str
+    trace_path: str

libs/agent/src/agent/research_prompts.py ADDED

	@@ -0,0 +1,36 @@

+from __future__ import annotations
+from pathlib import Path
+def _load_reference(skill_path: Path, rel: str) -> str:
+    ref = skill_path.parent / rel
+    if ref.is_file():
+        return ref.read_text(encoding="utf-8")
+    return ""
+def research_answer_system(skill_body: str, skill_path: Path) -> str:
+    citation_ref = _load_reference(skill_path, "references/citation-format.md")
+    parts = [
+        "You are ResearchMind, a local research assistant.",
+        "Answer ONLY from the provided context.",
+        "Each context block is numbered [1], [2], … — one number per source document.",
+        "Cite with those numbers only (e.g. [1]). Use at most a few citations per answer.",
+        "Ignore any [n] markers inside source text; never list citation numbers in a row.",
+        skill_body,
+    ]
+    if citation_ref:
+        parts.append(citation_ref)
+    return "\n\n".join(parts)
+def research_answer_user(question: str, context: str) -> str:
+    return f"""Context:
+{context}
+Question: {question}
+Write a concise answer with inline [n] citations (one index per source document).
+Do not append a References section — it is added automatically.
+If context is insufficient, say so."""
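
The system prompt appends the citation reference only when the file exists next to the skill. The sketch below reproduces that behaviour against a temporary directory. It is a trimmed illustration: `build_system_prompt` is a hypothetical, shortened version of `research_answer_system`, and the `SKILL.md` file name is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def load_reference(skill_path: Path, rel: str) -> str:
    """Return the reference text next to the skill file, or "" if missing."""
    ref = skill_path.parent / rel
    return ref.read_text(encoding="utf-8") if ref.is_file() else ""


def build_system_prompt(skill_body: str, skill_path: Path) -> str:
    parts = ["You are ResearchMind, a local research assistant.", skill_body]
    citation_ref = load_reference(skill_path, "references/citation-format.md")
    if citation_ref:  # an absent reference adds nothing to the prompt
        parts.append(citation_ref)
    return "\n\n".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    skill_path = Path(tmp) / "SKILL.md"  # assumed file name; only its parent matters
    without_ref = build_system_prompt("Skill body.", skill_path)

    refs_dir = Path(tmp) / "references"
    refs_dir.mkdir()
    (refs_dir / "citation-format.md").write_text("Cite as [n].", encoding="utf-8")
    with_ref = build_system_prompt("Skill body.", skill_path)

print(without_ref)
print(with_ref)
```

Returning an empty string for a missing file keeps the prompt builder free of error handling, at the cost of silently skipping a misnamed reference.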

libs/agent/src/agent/runner.py CHANGED

@@ -3,11 +3,23 @@ from __future__ import annotations
 import json
 import re
 from dataclasses import dataclass
 from typing import Any
 from inference.base import InferenceBackend
-from agent.models import EducationPptxInput, SlideOutline, SlideSpec
 from agent.preview import outline_to_html, render_slide_images
 from agent.prompts import (
     education_outline_repair,
@@ -21,6 +33,7 @@ from agent.tools_registry import ToolRegistry
 from agent.trace import TraceRecorder
 EDUCATION_PPTX_SKILL = "education-pptx"
 @dataclass
@@ -225,3 +238,246 @@ class AgentRunner:
             if start >= 0 and end > start:
                 cleaned = cleaned[start : end + 1]
         return json.loads(cleaned)

 import json
 import re
 from dataclasses import dataclass
+from pathlib import Path
 from typing import Any
 from inference.base import InferenceBackend
+from researchmind.extract import extract_docx
+from researchmind.ingest import IngestPipeline
+from agent.models import (
+    Citation,
+    EducationPptxInput,
+    ResearchChatInput,
+    ResearchChatResult,
+    ResearchDiscoverResult,
+    ResearchIngestResult,
+    SlideOutline,
+    SlideSpec,
+)
 from agent.preview import outline_to_html, render_slide_images
 from agent.prompts import (
     education_outline_repair,
 from agent.trace import TraceRecorder
 EDUCATION_PPTX_SKILL = "education-pptx"
+RESEARCH_MIND_SKILL = "research-mind"
 @dataclass
             if start >= 0 and end > start:
                 cleaned = cleaned[start : end + 1]
         return json.loads(cleaned)
+    def _research_skill(self) -> Any:
+        return self._skills.get(RESEARCH_MIND_SKILL)
+    def _ensure_session(
+        self,
+        store: Any,
+        session_id: str | None,
+        topic: str = "",
+    ) -> str:
+        if session_id and store.get_session(session_id):
+            return session_id
+        return store.create_session(topic=topic).id
+    def run_researchmind_discover(
+        self,
+        *,
+        topic: str,
+        auto_search: bool,
+        session_id: str | None,
+        model_key: str,
+        backend: InferenceBackend,
+    ) -> ResearchDiscoverResult:
+        skill = self._research_skill()
+        pipeline = IngestPipeline()
+        store = pipeline.store
+        sid = self._ensure_session(store, session_id, topic=topic)
+        trace = TraceRecorder(
+            skill=skill.name,
+            model=model_key,
+            user_input={"topic": topic, "auto_search": auto_search, "phase": "discover"},
+        )
+        backend.load()
+        search_tool = self._tools.get("search_urls")
+        urls = search_tool.handler(topic, n=8)
+        trace.log_tool(
+            "search_urls",
+            {"topic": topic, "n": 8, "queries": "google+ddg"},
+            json.dumps(urls),
+        )
+        if not urls:
+            suggest_tool = self._tools.get("suggest_urls")
+            from researchmind.url_validate import filter_valid_urls
+            raw_llm = suggest_tool.handler(topic, backend)
+            urls = filter_valid_urls(raw_llm, check_reachable=True, max_results=5)
+            trace.log_tool("suggest_urls", {"topic": topic, "fallback": True}, json.dumps(urls))
+        trace_path = str(trace.save())
+        return ResearchDiscoverResult(
+            suggested_urls=urls,
+            session_id=sid,
+            trace_path=trace_path,
+        )
+    def run_researchmind_ingest(
+        self,
+        *,
+        topic: str | None,
+        urls: list[str],
+        files: list[Path],
+        auto_search: bool,
+        session_id: str | None,
+        model_key: str,
+        backend: InferenceBackend,
+    ) -> ResearchIngestResult:
+        skill = self._research_skill()
+        pipeline = IngestPipeline()
+        store = pipeline.store
+        sid = self._ensure_session(store, session_id, topic=topic or "")
+        trace = TraceRecorder(
+            skill=skill.name,
+            model=model_key,
+            user_input={
+                "topic": topic,
+                "urls": urls,
+                "files": [str(f) for f in files],
+                "auto_search": auto_search,
+                "session_id": sid,
+            },
+        )
+        backend.load()
+        targets = [u.strip() for u in urls if u.strip()]
+        if auto_search and topic and not targets and not files:
+            discover = self.run_researchmind_discover(
+                topic=topic,
+                auto_search=True,
+                session_id=sid,
+                model_key=model_key,
+                backend=backend,
+            )
+            targets = discover.suggested_urls
+        from agent.models import IngestFailure
+        ingested: list[str] = []
+        skipped: list[str] = []
+        failures: list[IngestFailure] = []
+        scrape_web = self._tools.get("scrape_web")
+        extract_index = self._tools.get("extract_and_index")
+        from researchmind.url_validate import validate_url
+        for url in targets:
+            ok, reason, normalized = validate_url(url, check_reachable=False)
+            if not ok:
+                trace.log_note(f"Skipped invalid URL {url}", reason=reason, stage="validate")
+                failures.append(IngestFailure(url=url, reason=reason, stage="validate"))
+                continue
+            try:
+                doc = scrape_web.handler(normalized)
+                if not (doc.text or "").strip():
+                    msg = "empty content after scrape"
+                    trace.log_note(f"Ingest failed for {url}", error=msg, stage="scrape")
+                    failures.append(IngestFailure(url=url, reason=msg, stage="scrape"))
+                    continue
+                doc_id, is_new = extract_index.handler(doc, session_id=sid)
+                trace.log_tool("scrape_web", {"url": url}, doc.title)
+                trace.log_tool(
+                    "extract_and_index",
+                    {"uri": doc.uri},
+                    f"{doc_id} new={is_new}",
+                )
+                (ingested if is_new else skipped).append(url)
+            except Exception as exc:  # noqa: BLE001
+                trace.log_note(f"Ingest failed for {url}", error=str(exc), stage="ingest")
+                failures.append(IngestFailure(url=url, reason=str(exc), stage="ingest"))
+        for file_path in files:
+            path = Path(file_path)
+            try:
+                if path.suffix.lower() == ".pdf":
+                    doc = self._tools.get("scrape_pdf").handler(path)
+                elif path.suffix.lower() == ".docx":
+                    doc = extract_docx(path)
+                else:
+                    text = path.read_text(encoding="utf-8", errors="replace")
+                    from researchmind.extract import ExtractedDocument
+                    doc = ExtractedDocument(
+                        source_type="file",
+                        uri=str(path.resolve()),
+                        title=path.stem,
+                        text=text,
+                    )
+                doc_id, is_new = extract_index.handler(doc, session_id=sid)
+                trace.log_tool("extract_and_index", {"file": str(path)}, f"{doc_id} new={is_new}")
+                label = path.name
+                (ingested if is_new else skipped).append(label)
+            except Exception as exc:  # noqa: BLE001
+                trace.log_note(f"Ingest failed for {path}", error=str(exc), stage="ingest")
+                failures.append(IngestFailure(url=str(path), reason=str(exc), stage="ingest"))
+        doc_count = len(store.list_documents(session_id=sid))
+        chunk_count = store.count_chunks()
+        fail_n = len(failures)
+        message = (
+            f"Ingested {len(ingested)} source(s), skipped/duplicate {len(skipped)}, "
+            f"failed {fail_n}. Session `{sid}` has {doc_count} document(s); "
+            f"{chunk_count} total chunks."
+        )
+        trace.log_note(message, failures=[f.model_dump() for f in failures])
+        trace_path = str(trace.save())
+        return ResearchIngestResult(
+            session_id=sid,
+            ingested=ingested,
+            skipped=skipped,
+            failures=failures,
+            doc_count=doc_count,
+            chunk_count=chunk_count,
+            trace_path=trace_path,
+            message=message,
+        )
+    def run_researchmind_chat(
+        self,
+        *,
+        question: str,
+        session_id: str,
+        model_key: str,
+        backend: InferenceBackend,
+        doc_ids: list[str] | None = None,
+    ) -> ResearchChatResult:
+        skill = self._research_skill()
+        req = ResearchChatInput(
+            question=question.strip(),
+            session_id=session_id,
+            doc_ids=doc_ids or [],
+        )
+        trace = TraceRecorder(
+            skill=skill.name,
+            model=model_key,
+            user_input=req.model_dump(),
+        )
+        backend.load()
+        answer_tool = self._tools.get("research_answer")
+        raw_answer, citations, refs = answer_tool.handler(
+            req.question,
+            backend,
+            skill_body=skill.body,
+            skill_path=skill.path,
+            session_id=req.session_id,
+            doc_ids=req.doc_ids or None,
+        )
+        trace.log_llm(req.question, raw_answer)
+        trace.log_note(
+            "citations",
+            count=len(citations),
+            session_id=req.session_id,
+            doc_ids=req.doc_ids,
+        )
+        full_answer = raw_answer
+        if refs:
+            full_answer = f"{raw_answer}\n\n{refs}"
+        trace_path = str(trace.save())
+        pydantic_citations = [
+            Citation(
+                index=c.index,
+                chunk_id=c.chunk_id,
+                doc_title=c.doc_title,
+                doc_uri=c.doc_uri,
+                excerpt=c.excerpt,
+            )
+            for c in citations
+        ]
+        return ResearchChatResult(
+            answer=full_answer,
+            citations=pydantic_citations,
+            references_markdown=refs,
+            session_id=req.session_id,
+            trace_path=trace_path,
+        )
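
The URL loop in `run_researchmind_ingest` sorts every target into one of three buckets: newly ingested, skipped as a duplicate, or failed at a named stage. The sketch below shows that control flow with stubbed tools. Everything in it (`IngestOutcome`, `ingest_urls`, and the `validate`, `scrape`, and `index` stubs) is hypothetical and only mirrors the shape of the real handlers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IngestOutcome:
    ingested: list[str] = field(default_factory=list)
    skipped: list[str] = field(default_factory=list)
    # Each failure is (url, reason, stage).
    failures: list[tuple[str, str, str]] = field(default_factory=list)


def ingest_urls(urls, validate, scrape, index) -> IngestOutcome:
    out = IngestOutcome()
    for url in urls:
        ok, reason, normalized = validate(url)
        if not ok:
            out.failures.append((url, reason, "validate"))
            continue
        try:
            text = scrape(normalized)
            if not (text or "").strip():
                out.failures.append((url, "empty content after scrape", "scrape"))
                continue
            _doc_id, is_new = index(normalized, text)
            # Already-indexed documents count as skipped, not failed.
            (out.ingested if is_new else out.skipped).append(url)
        except Exception as exc:  # any scrape or index error becomes a failure entry
            out.failures.append((url, str(exc), "ingest"))
    return out


# Stub tools standing in for validate_url, scrape_web, and extract_and_index.
seen: set[str] = set()


def validate(url):
    ok = url.startswith("https://")
    return ok, ("" if ok else "unsupported scheme"), url


def scrape(url):
    if url.endswith("/boom"):
        raise RuntimeError("fetch failed")
    return "" if url.endswith("/empty") else f"text of {url}"


def index(url, text):
    is_new = url not in seen
    seen.add(url)
    return f"doc-{len(seen)}", is_new


outcome = ingest_urls(
    [
        "https://example.com/a",
        "https://example.com/a",
        "ftp://example.com/x",
        "https://example.com/empty",
        "https://example.com/boom",
    ],
    validate,
    scrape,
    index,
)
print(outcome)
```

Recording the stage alongside the reason is what lets the UI status line distinguish rejected URLs from pages that scraped to nothing.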

libs/agent/src/agent/skills.py CHANGED

@@ -15,6 +15,7 @@ class Skill:
     task: str
     tools: list[str]
     model_hints: list[str]
     body: str
     path: Path
@@ -44,12 +45,16 @@ def _parse_skill_md(path: Path) -> Skill:
     meta: dict[str, Any] = yaml.safe_load(match.group(1)) or {}
     body = match.group(2).strip()
     return Skill(
         name=str(meta.get("name", path.parent.name)),
         description=str(meta.get("description", "")),
         task=str(meta.get("task", "")),
         tools=[str(t) for t in meta.get("tools", [])],
         model_hints=[str(m) for m in meta.get("model_hints", [])],
         body=body,
         path=path,
     )

     task: str
     tools: list[str]
     model_hints: list[str]
+    flags: dict[str, Any]
     body: str
     path: Path
     meta: dict[str, Any] = yaml.safe_load(match.group(1)) or {}
     body = match.group(2).strip()
+    raw_flags = meta.get("flags") or {}
+    flags = {str(k): v for k, v in raw_flags.items()} if isinstance(raw_flags, dict) else {}
     return Skill(
         name=str(meta.get("name", path.parent.name)),
         description=str(meta.get("description", "")),
         task=str(meta.get("task", "")),
         tools=[str(t) for t in meta.get("tools", [])],
         model_hints=[str(m) for m in meta.get("model_hints", [])],
+        flags=flags,
         body=body,
         path=path,
     )
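
The new `flags` handling tolerates a missing key, a null value, or a non-mapping value in the skill front matter. A small sketch of that normalization on already-parsed metadata follows; `normalize_flags` is a hypothetical name for the two added lines and the flag keys are invented for illustration.

```python
from __future__ import annotations

from typing import Any


def normalize_flags(meta: dict[str, Any]) -> dict[str, Any]:
    """Return flags as a str-keyed dict, or {} when absent or malformed."""
    raw = meta.get("flags") or {}
    return {str(k): v for k, v in raw.items()} if isinstance(raw, dict) else {}


print(normalize_flags({}))                                # {}
print(normalize_flags({"flags": None}))                   # {}
print(normalize_flags({"flags": ["not", "a", "dict"]}))   # {}
print(normalize_flags({"flags": {1: True, "x": "y"}}))    # {'1': True, 'x': 'y'}
```

Coercing keys to `str` matters because YAML can parse bare keys such as `1` or `yes` into non-string types.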

libs/agent/src/agent/tools/research_tools.py ADDED

	@@ -0,0 +1,93 @@

+from __future__ import annotations
+from pathlib import Path
+from typing import Any
+from researchmind.citations import Citation, clean_model_answer, format_context_block, format_references
+from researchmind.config import get_config
+from researchmind.extract import ExtractedDocument
+from researchmind.ingest import IngestPipeline
+from researchmind.retrieve import retrieve
+from researchmind.scrape_pdf import extract_pdf
+from researchmind.scrape_web import fetch_and_extract
+from researchmind.search_urls import search_urls
+from researchmind.store import MemRAGStore
+from researchmind.url_suggest import suggest_urls as llm_suggest_urls
+from agent.research_prompts import research_answer_system, research_answer_user
+def get_store() -> MemRAGStore:
+    return IngestPipeline().store
+def tool_suggest_urls(topic: str, backend: Any) -> list[str]:
+    return llm_suggest_urls(topic, backend)
+def tool_scrape_web(url: str) -> ExtractedDocument:
+    return fetch_and_extract(url)
+def tool_scrape_pdf(path: Path) -> ExtractedDocument:
+    return extract_pdf(path)
+def tool_extract_and_index(
+    doc: ExtractedDocument,
+    *,
+    session_id: str | None = None,
+) -> tuple[str, bool]:
+    pipeline = IngestPipeline()
+    return pipeline.ingest_document(doc, session_id=session_id)
+def tool_research_answer(
+    question: str,
+    backend: Any,
+    *,
+    skill_body: str,
+    skill_path: Path,
+    session_id: str | None = None,
+    doc_ids: list[str] | None = None,
+) -> tuple[str, list[Citation], str]:
+    cfg = get_config()
+    store = get_store()
+    scope_session = session_id if session_id and not doc_ids else None
+    scope_docs = doc_ids if doc_ids else None
+    chunks = retrieve(
+        question,
+        store,
+        config=cfg,
+        session_id=scope_session,
+        doc_ids=scope_docs,
+    )
+    if not chunks:
+        if doc_ids:
+            hint = "No chunks for the selected document(s). Try other sources or re-ingest."
+        elif session_id:
+            hint = "No indexed sources in this session yet. Ingest URLs or files first."
+        else:
+            hint = "No indexed sources yet. Ingest URLs or documents first."
+        return hint, [], ""
+    context, citations = format_context_block(chunks)
+    system = research_answer_system(skill_body, skill_path)
+    user = research_answer_user(question, context)
+    messages = [
+        {"role": "system", "content": system},
+        {"role": "user", "content": user},
+    ]
+    answer = clean_model_answer(
+        backend.chat(messages, max_tokens=1024, temperature=0.3)
+    )
+    refs = format_references(citations)
+    if session_id:
+        store.add_message(session_id, "user", question, [c.chunk_id for c in citations])
+        store.add_message(session_id, "assistant", answer, [c.chunk_id for c in citations])
+    return answer, citations, refs
+def tool_search_urls(topic: str, *, n: int = 5, check_reachable: bool = True) -> list[str]:
+    return search_urls(topic, n=n, check_reachable=check_reachable)

libs/agent/src/agent/tools_registry.py CHANGED

@@ -6,7 +6,14 @@ from typing import Any
 from agent.models import SlideOutline
 from agent.tools.pptx import create_pptx
+from agent.tools.research_tools import (
+    tool_extract_and_index,
+    tool_research_answer,
+    tool_scrape_pdf,
+    tool_scrape_web,
+    tool_search_urls,
+    tool_suggest_urls,
+)
 @dataclass(frozen=True)
 class ToolSpec:
@@ -23,6 +30,36 @@ class ToolRegistry:
             "Create a PowerPoint file from a validated SlideOutline",
             self._handle_create_pptx,
         )
+        self.register(
+            "suggest_urls",
+            "Suggest research URLs for a topic using the local LLM",
+            tool_suggest_urls,
+        )
+        self.register(
+            "scrape_web",
+            "Fetch and extract text from a web URL",
+            tool_scrape_web,
+        )
+        self.register(
+            "scrape_pdf",
+            "Extract text from a PDF file path",
+            tool_scrape_pdf,
+        )
+        self.register(
+            "extract_and_index",
+            "Chunk, embed, and index an ExtractedDocument into MemRAG",
+            tool_extract_and_index,
+        )
+        self.register(
+            "research_answer",
+            "Answer a question with RAG citations from MemRAG",
+            tool_research_answer,
+        )
+        self.register(
+            "search_urls",
+            "Web search for URLs on a topic (Google first, DuckDuckGo fallback)",
+            tool_search_urls,
+        )
     def register(self, name: str, description: str, handler: Callable[..., Any]) -> None:
         self._tools[name] = ToolSpec(name=name, description=description, handler=handler)

libs/agent/tests/test_research_runner.py ADDED

	@@ -0,0 +1,107 @@

+from __future__ import annotations
+from pathlib import Path
+import numpy as np
+import pytest
+from agent.runner import AgentRunner
+from researchmind.config import ResearchMindConfig
+from researchmind.extract import ExtractedDocument
+from researchmind.store import MemRAGStore
+class MockBackend:
+    def load(self) -> None:
+        return None
+    def chat(self, messages, *, max_tokens=512, temperature=0.7):
+        user = messages[-1]["content"]
+        if "Topic:" in user:
+            return '["https://example.com/a", "https://example.com/b"]'
+        return "Plants use photosynthesis [1]."
+    def generate(self, prompt, *, max_tokens=512, temperature=0.7):
+        return self.chat([{"role": "user", "content": prompt}], max_tokens=max_tokens)
+@pytest.fixture
+def research_env(tmp_path, monkeypatch):
+    cfg = ResearchMindConfig(
+        data_dir=tmp_path / "rm",
+        embed_model="test",
+        auto_search=False,
+        top_k=2,
+        max_context_chunks=8,
+        chunk_size=50,
+        chunk_overlap=10,
+    )
+    monkeypatch.setenv("RESEARCHMIND_DATA_DIR", str(cfg.data_dir))
+    def fake_embed(texts, *, model_name):
+        vecs = []
+        for t in texts:
+            vecs.append(np.array([1.0, 0.0, 0.0], dtype=np.float32))
+        return np.stack(vecs) if vecs else np.zeros((0, 3), dtype=np.float32)
+    monkeypatch.setattr("researchmind.ingest.embed_texts", fake_embed)
+    monkeypatch.setattr("researchmind.retrieve.embed_texts", fake_embed)
+    def fake_scrape(url: str):
+        return ExtractedDocument(
+            source_type="web",
+            uri=url,
+            title="Example",
+            text="Photosynthesis converts light to energy in plants.",
+        )
+    monkeypatch.setattr("agent.tools.research_tools.fetch_and_extract", fake_scrape)
+    def fake_search(topic, *, n=5, check_reachable=True):
+        return [f"https://example.com/{topic.replace(' ', '-')}"]
+    monkeypatch.setattr("agent.tools.research_tools.search_urls", fake_search)
+    def fake_validate(url, *, check_reachable=True):
+        normalized = url if url.startswith("http") else f"https://{url}"
+        return True, "ok", normalized
+    monkeypatch.setattr("researchmind.url_validate.validate_url", fake_validate)
+    return cfg
+def test_discover_urls(research_env):
+    runner = AgentRunner()
+    result = runner.run_researchmind_discover(
+        topic="photosynthesis",
+        auto_search=False,
+        session_id=None,
+        model_key="test",
+        backend=MockBackend(),
+    )
+    assert len(result.suggested_urls) >= 1
+    assert result.session_id
+def test_ingest_and_chat(research_env):
+    runner = AgentRunner()
+    ingest = runner.run_researchmind_ingest(
+        topic=None,
+        urls=["https://example.com/a"],
+        files=[],
+        auto_search=False,
+        session_id=None,
+        model_key="test",
+        backend=MockBackend(),
+    )
+    assert ingest.doc_count >= 1
+    assert ingest.chunk_count >= 1
+    chat = runner.run_researchmind_chat(
+        question="How do plants make energy?",
+        session_id=ingest.session_id,
+        model_key="test",
+        backend=MockBackend(),
+    )
+    assert "photosynthesis" in chat.answer.lower() or "[1]" in chat.answer
+    assert chat.session_id == ingest.session_id

libs/inference/src/inference/response_clean.py ADDED

	@@ -0,0 +1,87 @@

+from __future__ import annotations
+import re
+_RT_OPEN = "<" + "redacted_thinking" + ">"
+_RT_CLOSE = "</" + "redacted_thinking" + ">"
+_THINK_OPEN = "<" + "think" + ">"
+_THINK_CLOSE = "</" + "think" + ">"
+_THINK_BLOCKS = re.compile(
+    "|".join(
+        (
+            re.escape(_RT_OPEN) + r".*?" + re.escape(_RT_CLOSE),
+            re.escape(_THINK_OPEN) + r".*?" + re.escape(_THINK_CLOSE),
+            r"<thinking>.*?</thinking>",
+        )
+    ),
+    re.DOTALL | re.IGNORECASE,
+)
+_MALFORMED_THINK_OPEN = re.compile(r"^think>\s*", re.IGNORECASE)
+_ANSWER_SPLITS = [
+    re.compile(r"(?:Let's draft:|Draft:)\s*", re.IGNORECASE),
+    re.compile(r"\nSummary:\s*", re.IGNORECASE),
+    re.compile(r"\nAnswer:\s*", re.IGNORECASE),
+    re.compile(r"\n\n(?:In summary|To summarize)[,:]\s*", re.IGNORECASE),
+]
+_META_TAIL = re.compile(
+    r"\n\n(?:Now,|We need|Also,|But we|However,|The instruction|So we|"
+    r"That means|We must|We should|We have|We can)\b",
+    re.IGNORECASE,
+)
+_REASONING_OPENERS = (
+    "we need to",
+    "first,",
+    "the user",
+    "let me",
+    "okay,",
+    "now, let",
+    "i need to",
+)
+def _normalize_extracted(text: str) -> str:
+    cleaned = text.strip()
+    cleaned = re.sub(r"^Summary:\s*", "", cleaned, flags=re.IGNORECASE)
+    cleaned = re.sub(r"^Answer:\s*", "", cleaned, flags=re.IGNORECASE)
+    return cleaned.strip()
+def _extract_answer_from_reasoning(text: str) -> str | None:
+    for pattern in _ANSWER_SPLITS:
+        match = pattern.search(text)
+        if not match:
+            continue
+        rest = _normalize_extracted(text[match.end() :])
+        rest = _META_TAIL.split(rest, maxsplit=1)[0].strip()
+        if len(rest) >= 40:
+            return rest
+    return None
+def looks_like_reasoning_only(text: str) -> bool:
+    sample = text[:240].lower()
+    return any(sample.startswith(opener) for opener in _REASONING_OPENERS)
+def strip_reasoning_output(text: str) -> str:
+    """Remove model chain-of-thought / thinking traces from user-visible replies."""
+    cleaned = text.strip()
+    if not cleaned:
+        return ""
+    cleaned = _THINK_BLOCKS.sub("", cleaned).strip()
+    if _MALFORMED_THINK_OPEN.match(cleaned):
+        body = _MALFORMED_THINK_OPEN.sub("", cleaned, count=1).strip()
+        extracted = _extract_answer_from_reasoning(body)
+        if extracted:
+            return extracted
+        cleaned = body
+    if looks_like_reasoning_only(cleaned):
+        extracted = _extract_answer_from_reasoning(cleaned)
+        if extracted:
+            return extracted
+    return cleaned
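Reviewer note: the first step of `strip_reasoning_output` is the tag-stripping substitution. Below is a minimal sketch of that step alone, assuming only the single think-tag variant. `strip_think_blocks` and the constant names are hypothetical and not part of `response_clean.py`; the heuristics for malformed prefixes and reasoning openers are deliberately left out.

```python
import re

# Tags are assembled by concatenation, mirroring the module under review.
OPEN = "<" + "think" + ">"
CLOSE = "</" + "think" + ">"

# Lazy match across newlines, case-insensitive: one complete block at a time.
THINK_BLOCK = re.compile(
    re.escape(OPEN) + r".*?" + re.escape(CLOSE),
    re.DOTALL | re.IGNORECASE,
)


def strip_think_blocks(text: str) -> str:
    """Remove complete thinking blocks and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = f"{OPEN}\nplanning...\n{CLOSE}\n\nAgents use memory [1]."
print(strip_think_blocks(raw))
```

A block that is opened but never closed does not match this pattern and passes through unchanged, which is one reason a tag-only approach may not be enough on its own.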

libs/inference/tests/test_response_clean.py ADDED

	@@ -0,0 +1,34 @@

+from __future__ import annotations
+from inference.response_clean import strip_reasoning_output
+_RT_OPEN = "<" + "redacted_thinking" + ">"
+_RT_CLOSE = "</" + "redacted_thinking" + ">"
+_THINK_OPEN = "<" + "think" + ">"
+_THINK_CLOSE = "</" + "think" + ">"
+def test_strips_redacted_thinking_block():
+    raw = f"{_RT_OPEN}\nplanning...\n{_RT_CLOSE}\n\nThe capital of France is Paris."
+    assert strip_reasoning_output(raw) == "The capital of France is Paris."
+def test_strips_think_block():
+    raw = f"{_THINK_OPEN}\nplanning...\n{_THINK_CLOSE}\n\nAgents use memory [1]."
+    assert strip_reasoning_output(raw) == "Agents use memory [1]."
+def test_strips_malformed_think_prefix_and_extracts_summary():
+    raw = """think> We need to summarize the document. First, identify sources.
+Let's draft:
+Summary: This review covers AI agent applications, evaluation, and future work [1]."""
+    out = strip_reasoning_output(raw)
+    assert out.startswith("This review covers")
+    assert "We need to summarize" not in out
+def test_preserves_normal_answer():
+    text = "AI agents combine perception, planning, and action [1]."
+    assert strip_reasoning_output(text) == text

libs/researchmind/README.md ADDED

	@@ -0,0 +1,9 @@

+# researchmind
+Local ingest, MemRAG persistence, and retrieval for the ResearchMind agent.
+- Scrape web (httpx + trafilatura), PDF (pypdf), DOCX (python-docx)
+- Chunk, embed (sentence-transformers), store in SQLite
+- Top-k retrieval with graph neighbor expansion and citation formatting
+Set `RESEARCHMIND_DATA_DIR` (default `outputs/researchmind`) for the memory database and raw snapshots.

libs/researchmind/pyproject.toml ADDED

	@@ -0,0 +1,25 @@

+[project]
+name = "researchmind"
+version = "0.1.0"
+description = "Local scraper + RAG + MemRAG store for ResearchMind agent"
+readme = "README.md"
+authors = [
+    { name = "MSGhais", email = "msghais135@gmail.com" }
+]
+requires-python = ">=3.12"
+dependencies = [
+    "inference",
+    "ddgs>=9.0.0",
+    "googlesearch-python>=1.3.0",
+    "httpx>=0.28.0",
+    "numpy>=2.0.0",
+    "pydantic>=2.0.0",
+    "pypdf>=5.0.0",
+    "python-docx>=1.1.0",
+    "sentence-transformers>=3.0.0",
+    "trafilatura>=2.0.0",
+]
+[build-system]
+requires = ["uv_build>=0.8.13,<0.9.0"]
+build-backend = "uv_build"

libs/researchmind/src/researchmind/__init__.py ADDED

	@@ -0,0 +1,11 @@

+from researchmind.config import get_config
+from researchmind.extract import ExtractedDocument
+from researchmind.ingest import IngestPipeline
+from researchmind.store import MemRAGStore
+__all__ = [
+    "ExtractedDocument",
+    "IngestPipeline",
+    "MemRAGStore",
+    "get_config",
+]

libs/researchmind/src/researchmind/chunking.py ADDED

	@@ -0,0 +1,46 @@

+from __future__ import annotations
+import hashlib
+import re
+from dataclasses import dataclass
+@dataclass(frozen=True)
+class TextChunk:
+    chunk_id: str
+    ordinal: int
+    text: str
+def _approx_tokens(text: str) -> int:
+    return len(re.findall(r"\S+", text))
+def chunk_text(
+    text: str,
+    *,
+    doc_id: str,
+    chunk_size: int = 512,
+    chunk_overlap: int = 128,
+) -> list[TextChunk]:
+    words = text.split()
+    if not words:
+        return []
+    chunks: list[TextChunk] = []
+    start = 0
+    ordinal = 0
+    step = max(1, chunk_size - chunk_overlap)
+    while start < len(words):
+        end = min(len(words), start + chunk_size)
+        piece = " ".join(words[start:end]).strip()
+        if piece:
+            digest = hashlib.sha256(f"{doc_id}:{ordinal}:{piece}".encode()).hexdigest()[:16]
+            chunks.append(TextChunk(chunk_id=f"{doc_id}_{ordinal}_{digest}", ordinal=ordinal, text=piece))
+            ordinal += 1
+        if end >= len(words):
+            break
+        start += step
+    return chunks
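Reviewer note: the window arithmetic in `chunk_text` is easier to check in isolation. The sketch below reproduces only the start/end bookkeeping (word counts, no hashing or `TextChunk` objects). `window_bounds` is a hypothetical helper written for illustration, not a function in `chunking.py`.

```python
def window_bounds(n_words: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) word indices for each sliding window."""
    # Same guard as chunk_text: the step never drops below 1, so the loop
    # always advances even if chunk_overlap >= chunk_size.
    step = max(1, chunk_size - chunk_overlap)
    bounds: list[tuple[int, int]] = []
    start = 0
    while start < n_words:
        end = min(n_words, start + chunk_size)
        bounds.append((start, end))
        if end >= n_words:
            break  # last window reached the end of the text
        start += step
    return bounds


# 10 words, windows of 5 with an overlap of 2 -> the step is 3.
print(window_bounds(10, 5, 2))  # [(0, 5), (3, 8), (6, 10)]
```

With the defaults in the diff (512 words per chunk, 128 overlap) each window starts 384 words after the previous one.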

libs/researchmind/src/researchmind/citations.py ADDED

	@@ -0,0 +1,92 @@

+from __future__ import annotations
+import re
+from dataclasses import dataclass
+from inference.response_clean import looks_like_reasoning_only, strip_reasoning_output
+from researchmind.store import StoredChunk
+_EXCERPT_LEN = 400
+_PASSAGE_LEN = 700
+_CITATION_RUN = re.compile(r"(?:\[\d{1,4}\]\s*){3,}")
+@dataclass(frozen=True)
+class Citation:
+    index: int
+    chunk_id: str
+    doc_title: str
+    doc_uri: str
+    excerpt: str
+def _clean_passage(text: str) -> str:
+    """Collapse long runs of in-text [n] markers from scraped papers."""
+    cleaned = _CITATION_RUN.sub("[…] ", text)
+    cleaned = re.sub(r"\s+", " ", cleaned).strip()
+    if len(cleaned) > _PASSAGE_LEN:
+        return cleaned[:_PASSAGE_LEN] + "…"
+    return cleaned
+def format_context_block(chunks: list[StoredChunk]) -> tuple[str, list[Citation]]:
+    """Build LLM context with one citation index per source document."""
+    groups: list[tuple[str, str, list[StoredChunk]]] = []
+    seen_uris: set[str] = set()
+    for chunk in chunks:
+        if chunk.doc_uri in seen_uris:
+            for uri, _title, group in groups:
+                if uri == chunk.doc_uri:
+                    group.append(chunk)
+                    break
+        else:
+            seen_uris.add(chunk.doc_uri)
+            groups.append((chunk.doc_uri, chunk.doc_title, [chunk]))
+    citations: list[Citation] = []
+    blocks: list[str] = []
+    for i, (uri, title, doc_chunks) in enumerate(groups, start=1):
+        passages = [_clean_passage(c.text) for c in doc_chunks if c.text.strip()]
+        merged = "\n\n".join(passages)
+        excerpt = merged[:_EXCERPT_LEN] + ("..." if len(merged) > _EXCERPT_LEN else "")
+        citations.append(
+            Citation(
+                index=i,
+                chunk_id=doc_chunks[0].id,
+                doc_title=title,
+                doc_uri=uri,
+                excerpt=excerpt,
+            )
+        )
+        blocks.append(f"[{i}] **{title}**\n{uri}\n\n{merged}")
+    context = "\n\n---\n\n".join(blocks)
+    return context, citations
+def format_references(citations: list[Citation]) -> str:
+    if not citations:
+        return ""
+    lines = ["**References**"]
+    for c in citations:
+        lines.append(f"- [{c.index}] {c.doc_title} — {c.doc_uri}")
+    return "\n".join(lines)
+def clean_model_answer(answer: str) -> str:
+    """Remove thinking traces, duplicate references, and citation spam from model output."""
+    text = strip_reasoning_output(answer)
+    if "**References**" in text:
+        text = text.split("**References**", maxsplit=1)[0].rstrip()
+    if "\nReferences\n" in text:
+        text = text.split("\nReferences\n", maxsplit=1)[0].rstrip()
+    text = _CITATION_RUN.sub("", text)
+    text = re.sub(r"\n{3,}", "\n\n", text)
+    text = text.strip()
+    if not text or looks_like_reasoning_only(text):
+        return (
+            "The model returned planning text without a final answer. "
+            "Try asking again or switch to a non-reasoning model preset."
+        )
+    return text
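Reviewer note: `_CITATION_RUN` only fires on three or more consecutive `[n]` markers, so ordinary one-off or paired citations survive. The sketch below copies that pattern and applies the same replacement `_clean_passage` uses. `collapse_citation_runs` is a hypothetical wrapper name and the sample sentence is made up for illustration.

```python
import re

# Same pattern as _CITATION_RUN in the diff: 3+ bracketed numbers in a row,
# each optionally followed by whitespace.
CITATION_RUN = re.compile(r"(?:\[\d{1,4}\]\s*){3,}")


def collapse_citation_runs(text: str) -> str:
    """Replace long runs of [n] markers with a single placeholder."""
    return CITATION_RUN.sub("[…] ", text)


sample = "Prior work [1] [2] [3] [4] shows gains, see also [5] [6]."
out = collapse_citation_runs(sample)
print(out)  # Prior work […] shows gains, see also [5] [6].
```

Note that `clean_model_answer` uses the same pattern with an empty replacement, so runs in model output are dropped rather than collapsed.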

libs/researchmind/src/researchmind/config.py ADDED

	@@ -0,0 +1,32 @@

+from __future__ import annotations
+import os
+from dataclasses import dataclass
+from pathlib import Path
+@dataclass(frozen=True)
+class ResearchMindConfig:
+    data_dir: Path
+    embed_model: str
+    auto_search: bool
+    top_k: int
+    max_context_chunks: int
+    chunk_size: int
+    chunk_overlap: int
+def get_config() -> ResearchMindConfig:
+    data_dir = Path(
+        os.environ.get("RESEARCHMIND_DATA_DIR", "outputs/researchmind")
+    ).expanduser()
+    return ResearchMindConfig(
+        data_dir=data_dir,
+        embed_model=os.environ.get("RESEARCHMIND_EMBED_MODEL", "all-MiniLM-L6-v2"),
+        auto_search=os.environ.get("RESEARCHMIND_AUTO_SEARCH", "false").lower()
+        in ("1", "true", "yes"),
+        top_k=int(os.environ.get("RESEARCHMIND_TOP_K", "5")),
+        max_context_chunks=int(os.environ.get("RESEARCHMIND_MAX_CONTEXT_CHUNKS", "8")),
+        chunk_size=int(os.environ.get("RESEARCHMIND_CHUNK_SIZE", "512")),
+        chunk_overlap=int(os.environ.get("RESEARCHMIND_CHUNK_OVERLAP", "128")),
+    )

libs/researchmind/src/researchmind/embeddings.py ADDED

	@@ -0,0 +1,32 @@

+from __future__ import annotations
+import numpy as np
+_embedder = None
+_embedder_model_name: str | None = None
+def get_embedder(model_name: str):
+    global _embedder, _embedder_model_name
+    if _embedder is None or _embedder_model_name != model_name:
+        from sentence_transformers import SentenceTransformer
+        _embedder = SentenceTransformer(model_name)
+        _embedder_model_name = model_name
+    return _embedder
+def embed_texts(texts: list[str], *, model_name: str) -> np.ndarray:
+    if not texts:
+        return np.zeros((0, 0), dtype=np.float32)
+    model = get_embedder(model_name)
+    vectors = model.encode(texts, normalize_embeddings=True, show_progress_bar=False)
+    return np.asarray(vectors, dtype=np.float32)
+def embedding_to_bytes(vector: np.ndarray) -> bytes:
+    return vector.astype(np.float32).tobytes()
+def bytes_to_embedding(data: bytes, dim: int) -> np.ndarray:
+    return np.frombuffer(data, dtype=np.float32).reshape(dim)

libs/researchmind/src/researchmind/extract.py ADDED

	@@ -0,0 +1,36 @@

+from __future__ import annotations
+from pathlib import Path
+from pydantic import BaseModel, Field
+class ExtractedDocument(BaseModel):
+    source_type: str
+    uri: str
+    title: str
+    text: str
+    mime: str = "text/plain"
+    metadata: dict[str, str] = Field(default_factory=dict)
+def extract_docx(path: Path) -> ExtractedDocument:
+    from docx import Document
+    doc = Document(path)
+    paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
+    text = "\n\n".join(paragraphs)
+    title = path.stem
+    for para in doc.paragraphs:
+        if para.style and para.style.name and "Heading" in para.style.name:
+            if para.text.strip():
+                title = para.text.strip()
+                break
+    return ExtractedDocument(
+        source_type="docx",
+        uri=str(path.resolve()),
+        title=title,
+        text=text or path.name,
+        mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+        metadata={"filename": path.name},
+    )

libs/researchmind/src/researchmind/ingest.py ADDED

	@@ -0,0 +1,105 @@

+from __future__ import annotations
+from pathlib import Path
+from typing import Any
+import numpy as np
+from researchmind.chunking import chunk_text
+from researchmind.config import ResearchMindConfig, get_config
+from researchmind.embeddings import embed_texts
+from researchmind.extract import ExtractedDocument, extract_docx
+from researchmind.scrape_pdf import extract_pdf
+from researchmind.scrape_web import fetch_and_extract
+from researchmind.store import MemRAGStore
+class IngestPipeline:
+    def __init__(
+        self,
+        store: MemRAGStore | None = None,
+        config: ResearchMindConfig | None = None,
+    ) -> None:
+        self._config = config or get_config()
+        self._store = store or MemRAGStore(self._config)
+    @property
+    def store(self) -> MemRAGStore:
+        return self._store
+    def ingest_document(
+        self,
+        doc: ExtractedDocument,
+        *,
+        session_id: str | None = None,
+        raw_snapshot: str | None = None,
+    ) -> tuple[str, bool]:
+        doc_id_prefix = self._store.content_hash(doc.text)[:12]
+        chunks = chunk_text(
+            doc.text,
+            doc_id=doc_id_prefix,
+            chunk_size=self._config.chunk_size,
+            chunk_overlap=self._config.chunk_overlap,
+        )
+        if not chunks and doc.text.strip():
+            from researchmind.chunking import TextChunk
+            chunks = [
+                TextChunk(
+                    chunk_id=f"{doc_id_prefix}_0",
+                    ordinal=0,
+                    text=doc.text[: self._config.chunk_size],
+                )
+            ]
+        chunks_text = [c.text for c in chunks]
+        embeddings = embed_texts(chunks_text, model_name=self._config.embed_model)
+        chunk_tuples: list[tuple[str, int, str, np.ndarray, dict[str, Any]]] = []
+        for chunk, emb in zip(chunks, embeddings, strict=True):
+            chunk_tuples.append(
+                (
+                    chunk.chunk_id,
+                    chunk.ordinal,
+                    chunk.text,
+                    emb,
+                    {"source_type": doc.source_type},
+                )
+            )
+        return self._store.add_document(
+            source_type=doc.source_type,
+            uri=doc.uri,
+            title=doc.title,
+            text=doc.text,
+            chunks=chunk_tuples,
+            session_id=session_id,
+            raw_snapshot=raw_snapshot or doc.text[:100_000],
+        )
+    def ingest_url(self, url: str, *, session_id: str | None = None) -> tuple[str, bool]:
+        doc = fetch_and_extract(url)
+        return self.ingest_document(doc, session_id=session_id, raw_snapshot=doc.text)
+    def ingest_pdf(self, path: Path, *, session_id: str | None = None) -> tuple[str, bool]:
+        doc = extract_pdf(path)
+        return self.ingest_document(doc, session_id=session_id)
+    def ingest_docx(self, path: Path, *, session_id: str | None = None) -> tuple[str, bool]:
+        doc = extract_docx(path)
+        return self.ingest_document(doc, session_id=session_id)
+    def ingest_path(self, path: Path, *, session_id: str | None = None) -> tuple[str, bool]:
+        suffix = path.suffix.lower()
+        if suffix == ".pdf":
+            return self.ingest_pdf(path, session_id=session_id)
+        if suffix == ".docx":
+            return self.ingest_docx(path, session_id=session_id)
+        text = path.read_text(encoding="utf-8", errors="replace")
+        doc = ExtractedDocument(
+            source_type="file",
+            uri=str(path.resolve()),
+            title=path.stem,
+            text=text,
+            mime="text/plain",
+        )
+        return self.ingest_document(doc, session_id=session_id)

libs/researchmind/src/researchmind/retrieve.py ADDED

	@@ -0,0 +1,57 @@

+from __future__ import annotations
+import numpy as np
+from researchmind.config import ResearchMindConfig, get_config
+from researchmind.embeddings import embed_texts
+from researchmind.store import MemRAGStore, StoredChunk
+def retrieve(
+    query: str,
+    store: MemRAGStore,
+    *,
+    config: ResearchMindConfig | None = None,
+    top_k: int | None = None,
+    expand_neighbors: bool = True,
+    session_id: str | None = None,
+    doc_ids: list[str] | None = None,
+) -> list[StoredChunk]:
+    cfg = config or get_config()
+    k = top_k if top_k is not None else cfg.top_k
+    all_chunks = store.get_chunks_with_embeddings(
+        session_id=session_id,
+        doc_ids=doc_ids,
+    )
+    if not all_chunks:
+        return []
+    q_vec = embed_texts([query], model_name=cfg.embed_model)[0]
+    scored: list[tuple[float, StoredChunk]] = []
+    for chunk, emb in all_chunks:
+        sim = float(np.dot(q_vec, emb))
+        scored.append((sim, chunk))
+    max_chunks = cfg.max_context_chunks
+    scored.sort(key=lambda x: x[0], reverse=True)
+    selected: list[StoredChunk] = []
+    seen_ids: set[str] = set()
+    for _, chunk in scored[:k]:
+        if len(selected) >= max_chunks:
+            break
+        if chunk.id not in seen_ids:
+            selected.append(chunk)
+            seen_ids.add(chunk.id)
+        if expand_neighbors and len(selected) < max_chunks:
+            for nid in store.get_neighbor_chunk_ids(chunk.id)[:1]:
+                if len(selected) >= max_chunks:
+                    break
+                if nid not in seen_ids:
+                    neighbors = store.get_chunks_by_ids([nid])
+                    for n in neighbors:
+                        selected.append(n)
+                        seen_ids.add(n.id)
+                        break
+    return selected[:max_chunks]
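Reviewer note: `retrieve` scores chunks with a plain dot product. That equals cosine similarity only because `embed_texts` asks the encoder for normalized vectors (`normalize_embeddings=True`). The sketch below shows the same ranking idea with the standard library, normalizing explicitly. `normalize`, `top_k`, and the toy vectors are hypothetical, and neighbor expansion is omitted.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: list[float], items: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Rank (chunk_id, vector) pairs by cosine similarity to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), chunk_id)
        for chunk_id, vec in items
    ]
    # Highest similarity first, then keep the first k ids.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


items = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.1], items, 2))  # ['a', 'c']
```

If vectors were ever stored without normalization, the dot product in `retrieve` would favor longer vectors, so the normalization flag in `embeddings.py` is load-bearing.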

libs/researchmind/src/researchmind/scrape_pdf.py ADDED

	@@ -0,0 +1,30 @@

+from __future__ import annotations
+from pathlib import Path
+from pypdf import PdfReader
+from researchmind.extract import ExtractedDocument
+def extract_pdf(path: Path, *, max_pages: int = 200) -> ExtractedDocument:
+    reader = PdfReader(str(path))
+    pages: list[str] = []
+    for page in reader.pages[:max_pages]:
+        page_text = (page.extract_text() or "").strip()
+        if page_text:
+            pages.append(page_text)
+    text = "\n\n".join(pages)
+    title = path.stem
+    if reader.metadata and reader.metadata.title:
+        title = str(reader.metadata.title)
+    return ExtractedDocument(
+        source_type="pdf",
+        uri=str(path.resolve()),
+        title=title,
+        text=text or path.name,
+        mime="application/pdf",
+        metadata={"page_count": str(min(len(reader.pages), max_pages))},
+    )

libs/researchmind/src/researchmind/scrape_web.py ADDED

	@@ -0,0 +1,38 @@

+from __future__ import annotations
+import httpx
+import trafilatura
+from researchmind.extract import ExtractedDocument
+def fetch_and_extract(url: str, *, timeout: float = 30.0) -> ExtractedDocument:
+    headers = {
+        "User-Agent": "ResearchMind/0.1 (local research agent; hackathon)",
+    }
+    with httpx.Client(follow_redirects=True, timeout=timeout, headers=headers) as client:
+        response = client.get(url)
+        response.raise_for_status()
+        html = response.text
+    extracted = trafilatura.extract(
+        html,
+        url=url,
+        include_comments=False,
+        include_tables=True,
+        output_format="txt",
+    )
+    metadata = trafilatura.extract_metadata(html, default_url=url)
+    title = (metadata.title if metadata and metadata.title else url) or url
+    text = (extracted or "").strip()
+    if not text:
+        text = html[:50_000]
+    return ExtractedDocument(
+        source_type="web",
+        uri=url,
+        title=title,
+        text=text,
+        mime="text/html",
+        metadata={"final_url": str(response.url)},
+    )

libs/researchmind/src/researchmind/search_urls.py ADDED

	@@ -0,0 +1,89 @@

+from __future__ import annotations
+import logging
+from researchmind.url_validate import filter_valid_urls, normalize_url
+logger = logging.getLogger(__name__)
+def build_search_queries(topic: str) -> list[str]:
+    """Craft Google-friendly queries for a research topic."""
+    t = topic.strip()
+    if not t:
+        return []
+    return [
+        f"{t} site:wikipedia.org",
+        f'"{t}" introduction overview',
+        f"{t} tutorial guide site:.edu OR site:.gov",
+        f"{t} research paper site:arxiv.org",
+        f"what is {t}",
+    ]
+def _google_search(query: str, *, n: int) -> list[str]:
+    urls: list[str] = []
+    try:
+        from googlesearch import search
+        for item in search(query, num_results=n, lang="en", timeout=15):
+            if isinstance(item, str):
+                urls.append(item)
+            else:
+                href = getattr(item, "url", None) or getattr(item, "link", None)
+                if href:
+                    urls.append(str(href))
+    except Exception as exc:  # noqa: BLE001
+        logger.warning("Google search failed for %r: %s", query, exc)
+    return urls
+def _duckduckgo_search(query: str, *, n: int) -> list[str]:
+    urls: list[str] = []
+    try:
+        try:
+            from ddgs import DDGS
+        except ImportError:
+            from duckduckgo_search import DDGS
+        ddgs = DDGS()
+        results = ddgs.text(query, max_results=n)
+        if results is None:
+            return urls
+        for item in results:
+            if not isinstance(item, dict):
+                continue
+            href = item.get("href") or item.get("link")
+            if href:
+                urls.append(str(href))
+    except Exception as exc:  # noqa: BLE001
+        logger.warning("DuckDuckGo search failed for %r: %s", query, exc)
+    return urls
+def _collect_candidates(topic: str, *, per_query: int = 4) -> list[str]:
+    candidates: list[str] = []
+    seen: set[str] = set()
+    for query in build_search_queries(topic):
+        batch = _google_search(query, n=per_query)
+        if not batch:
+            batch = _duckduckgo_search(query, n=per_query)
+        for raw in batch:
+            normalized = normalize_url(raw)
+            if normalized and normalized not in seen:
+                seen.add(normalized)
+                candidates.append(normalized)
+    return candidates
+def search_urls(
+    topic: str,
+    *,
+    n: int = 5,
+    check_reachable: bool = True,
+) -> list[str]:
+    """
+    Search the web (Google first, DuckDuckGo fallback) and return verified URLs.
+    """
+    candidates = _collect_candidates(topic, per_query=max(n, 4))
+    return filter_valid_urls(candidates, check_reachable=check_reachable, max_results=n)
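Reviewer note: the control flow of `_collect_candidates` (primary engine per query, fallback only when the primary returns nothing, order-preserving dedupe) can be exercised without network access by injecting the search functions. The sketch below is an assumption-laden stand-in: `collect_candidates`, `SearchFn`, and the fake engines are hypothetical, and the strip-and-trim-slash step merely stands in for `normalize_url`, whose real behavior is not shown in this diff.

```python
from collections.abc import Callable

# Hypothetical signature: (query, max_results) -> list of URLs.
SearchFn = Callable[[str, int], list[str]]


def collect_candidates(
    queries: list[str],
    primary: SearchFn,
    fallback: SearchFn,
    per_query: int = 4,
) -> list[str]:
    """Query the primary engine first; use the fallback only on an empty batch."""
    candidates: list[str] = []
    seen: set[str] = set()
    for query in queries:
        batch = primary(query, per_query)
        if not batch:
            batch = fallback(query, per_query)
        for raw in batch:
            # Placeholder normalization (assumption, not normalize_url).
            key = raw.strip().rstrip("/")
            if key and key not in seen:
                seen.add(key)
                candidates.append(key)
    return candidates


def fake_primary(query: str, n: int) -> list[str]:
    # Simulates an engine that fails for one of the queries.
    return [] if query == "b" else ["https://x.org/", "https://x.org"]


def fake_fallback(query: str, n: int) -> list[str]:
    return ["https://y.org"]


print(collect_candidates(["a", "b"], fake_primary, fake_fallback))
```

Injecting the engines this way would also let the unit tests cover the fallback path directly instead of patching `search_urls` as a whole.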

libs/researchmind/src/researchmind/store.py ADDED

	@@ -0,0 +1,381 @@

+from __future__ import annotations
+import hashlib
+import json
+import sqlite3
+import uuid
+from dataclasses import dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any
+import numpy as np
+from researchmind.config import ResearchMindConfig, get_config
+from researchmind.embeddings import bytes_to_embedding, embedding_to_bytes
+@dataclass(frozen=True)
+class StoredDocument:
+    id: str
+    source_type: str
+    uri: str
+    title: str
+    ingested_at: str
+    content_hash: str
+@dataclass(frozen=True)
+class StoredChunk:
+    id: str
+    doc_id: str
+    ordinal: int
+    text: str
+    doc_title: str
+    doc_uri: str
+    metadata: dict[str, Any]
+@dataclass(frozen=True)
+class SessionInfo:
+    id: str
+    topic: str
+    created_at: str
+class MemRAGStore:
+    def __init__(self, config: ResearchMindConfig | None = None) -> None:
+        self._config = config or get_config()
+        self._config.data_dir.mkdir(parents=True, exist_ok=True)
+        (self._config.data_dir / "raw").mkdir(parents=True, exist_ok=True)
+        self._db_path = self._config.data_dir / "memory.db"
+        self._embed_dim: int | None = None
+        self._init_db()
+    @property
+    def db_path(self) -> Path:
+        return self._db_path
+    @property
+    def embed_dim(self) -> int:
+        if self._embed_dim is None:
+            row = self._conn().execute(
+                "SELECT dim FROM embed_meta LIMIT 1"
+            ).fetchone()
+            self._embed_dim = int(row[0]) if row else 384
+        return self._embed_dim
+    def _conn(self) -> sqlite3.Connection:
+        conn = sqlite3.connect(self._db_path)
+        conn.row_factory = sqlite3.Row
+        return conn
+    def _init_db(self) -> None:
+        with self._conn() as conn:
+            conn.executescript(
+                """
+                CREATE TABLE IF NOT EXISTS embed_meta (
+                    dim INTEGER NOT NULL
+                );
+                CREATE TABLE IF NOT EXISTS documents (
+                    id TEXT PRIMARY KEY,
+                    source_type TEXT NOT NULL,
+                    uri TEXT NOT NULL,
+                    title TEXT NOT NULL,
+                    ingested_at TEXT NOT NULL,
+                    content_hash TEXT NOT NULL UNIQUE,
+                    session_id TEXT
+                );
+                CREATE TABLE IF NOT EXISTS chunks (
+                    id TEXT PRIMARY KEY,
+                    doc_id TEXT NOT NULL,
+                    ordinal INTEGER NOT NULL,
+                    text TEXT NOT NULL,
+                    embedding_blob BLOB NOT NULL,
+                    meta_json TEXT NOT NULL DEFAULT '{}',
+                    FOREIGN KEY (doc_id) REFERENCES documents(id)
+                );
+                CREATE TABLE IF NOT EXISTS edges (
+                    src_id TEXT NOT NULL,
+                    dst_id TEXT NOT NULL,
+                    rel TEXT NOT NULL,
+                    PRIMARY KEY (src_id, dst_id, rel)
+                );
+                CREATE TABLE IF NOT EXISTS sessions (
+                    id TEXT PRIMARY KEY,
+                    topic TEXT NOT NULL,
+                    created_at TEXT NOT NULL
+                );
+                CREATE TABLE IF NOT EXISTS session_messages (
+                    id INTEGER PRIMARY KEY AUTOINCREMENT,
+                    session_id TEXT NOT NULL,
+                    role TEXT NOT NULL,
+                    content TEXT NOT NULL,
+                    chunk_ids_json TEXT NOT NULL DEFAULT '[]',
+                    created_at TEXT NOT NULL,
+                    FOREIGN KEY (session_id) REFERENCES sessions(id)
+                );
+                CREATE INDEX IF NOT EXISTS idx_chunks_doc ON chunks(doc_id);
+                CREATE INDEX IF NOT EXISTS idx_documents_session ON documents(session_id);
+                """
+            )
+    def set_embed_dim(self, dim: int) -> None:
+        with self._conn() as conn:
+            conn.execute("DELETE FROM embed_meta")
+            conn.execute("INSERT INTO embed_meta (dim) VALUES (?)", (dim,))
+        self._embed_dim = dim
+    @staticmethod
+    def content_hash(text: str) -> str:
+        return hashlib.sha256(text.encode()).hexdigest()
+    def create_session(self, topic: str = "") -> SessionInfo:
+        session_id = uuid.uuid4().hex[:12]
+        created_at = datetime.now(UTC).isoformat()
+        with self._conn() as conn:
+            conn.execute(
+                "INSERT INTO sessions (id, topic, created_at) VALUES (?, ?, ?)",
+                (session_id, topic, created_at),
+            )
+        return SessionInfo(id=session_id, topic=topic, created_at=created_at)
+    def list_sessions(self) -> list[SessionInfo]:
+        with self._conn() as conn:
+            rows = conn.execute(
+                "SELECT id, topic, created_at FROM sessions ORDER BY created_at DESC"
+            ).fetchall()
+        return [SessionInfo(id=r["id"], topic=r["topic"], created_at=r["created_at"]) for r in rows]
+    def get_session(self, session_id: str) -> SessionInfo | None:
+        with self._conn() as conn:
+            row = conn.execute(
+                "SELECT id, topic, created_at FROM sessions WHERE id = ?",
+                (session_id,),
+            ).fetchone()
+        if not row:
+            return None
+        return SessionInfo(id=row["id"], topic=row["topic"], created_at=row["created_at"])
+    def document_exists(self, content_hash: str) -> str | None:
+        with self._conn() as conn:
+            row = conn.execute(
+                "SELECT id FROM documents WHERE content_hash = ?",
+                (content_hash,),
+            ).fetchone()
+        return row["id"] if row else None
+    def add_document(
+        self,
+        *,
+        source_type: str,
+        uri: str,
+        title: str,
+        text: str,
+        chunks: list[tuple[str, int, str, np.ndarray, dict[str, Any]]],
+        session_id: str | None = None,
+        raw_snapshot: str | None = None,
+    ) -> tuple[str, bool]:
+        """Returns (doc_id, was_new). Skips if content_hash already indexed."""
+        c_hash = self.content_hash(text)
+        existing = self.document_exists(c_hash)
+        if existing:
+            return existing, False
+        doc_id = uuid.uuid4().hex[:12]
+        ingested_at = datetime.now(UTC).isoformat()
+        if chunks:
+            dim = int(chunks[0][3].shape[0])
+            self.set_embed_dim(dim)
+        with self._conn() as conn:
+            conn.execute(
+                """
+                INSERT INTO documents (id, source_type, uri, title, ingested_at, content_hash, session_id)
+                VALUES (?, ?, ?, ?, ?, ?, ?)
+                """,
+                (doc_id, source_type, uri, title, ingested_at, c_hash, session_id),
+            )
+            for chunk_id, ordinal, chunk_text, emb, meta in chunks:
+                conn.execute(
+                    """
+                    INSERT INTO chunks (id, doc_id, ordinal, text, embedding_blob, meta_json)
+                    VALUES (?, ?, ?, ?, ?, ?)
+                    """,
+                    (
+                        chunk_id,
+                        doc_id,
+                        ordinal,
+                        chunk_text,
+                        embedding_to_bytes(emb),
+                        json.dumps(meta),
+                    ),
+                )
+                conn.execute(
+                    "INSERT OR IGNORE INTO edges (src_id, dst_id, rel) VALUES (?, ?, ?)",
+                    (doc_id, chunk_id, "doc_has_chunk"),
+                )
+            for i in range(len(chunks) - 1):
+                conn.execute(
+                    "INSERT OR IGNORE INTO edges (src_id, dst_id, rel) VALUES (?, ?, ?)",
+                    (chunks[i][0], chunks[i + 1][0], "chunk_next"),
+                )
+        if raw_snapshot is not None:
+            raw_dir = self._config.data_dir / "raw" / doc_id
+            raw_dir.mkdir(parents=True, exist_ok=True)
+            (raw_dir / "snapshot.txt").write_text(raw_snapshot, encoding="utf-8")
+        return doc_id, True
+    def list_documents(self, session_id: str | None = None) -> list[StoredDocument]:
+        query = "SELECT id, source_type, uri, title, ingested_at, content_hash FROM documents"
+        params: tuple[Any, ...] = ()
+        if session_id:
+            query += " WHERE session_id = ?"
+            params = (session_id,)
+        query += " ORDER BY ingested_at DESC"
+        with self._conn() as conn:
+            rows = conn.execute(query, params).fetchall()
+        return [
+            StoredDocument(
+                id=r["id"],
+                source_type=r["source_type"],
+                uri=r["uri"],
+                title=r["title"],
+                ingested_at=r["ingested_at"],
+                content_hash=r["content_hash"],
+            )
+            for r in rows
+        ]
+    def get_chunks_with_embeddings(
+        self,
+        *,
+        session_id: str | None = None,
+        doc_ids: list[str] | None = None,
+    ) -> list[tuple[StoredChunk, np.ndarray]]:
+        dim = self.embed_dim
+        query = """
+                SELECT c.id, c.doc_id, c.ordinal, c.text, c.embedding_blob, c.meta_json,
+                       d.title AS doc_title, d.uri AS doc_uri
+                FROM chunks c
+                JOIN documents d ON d.id = c.doc_id
+                WHERE 1=1
+                """
+        params: list[Any] = []
+        if session_id:
+            query += " AND d.session_id = ?"
+            params.append(session_id)
+        if doc_ids:
+            placeholders = ",".join("?" * len(doc_ids))
+            query += f" AND d.id IN ({placeholders})"
+            params.extend(doc_ids)
+        with self._conn() as conn:
+            rows = conn.execute(query, params).fetchall()
+        result: list[tuple[StoredChunk, np.ndarray]] = []
+        for r in rows:
+            chunk = StoredChunk(
+                id=r["id"],
+                doc_id=r["doc_id"],
+                ordinal=r["ordinal"],
+                text=r["text"],
+                doc_title=r["doc_title"],
+                doc_uri=r["doc_uri"],
+                metadata=json.loads(r["meta_json"] or "{}"),
+            )
+            emb = bytes_to_embedding(r["embedding_blob"], dim)
+            result.append((chunk, emb))
+        return result
+    def get_neighbor_chunk_ids(self, chunk_id: str) -> list[str]:
+        ids: list[str] = []
+        with self._conn() as conn:
+            for row in conn.execute(
+                "SELECT dst_id FROM edges WHERE src_id = ? AND rel = 'chunk_next'",
+                (chunk_id,),
+            ):
+                ids.append(row["dst_id"])
+            for row in conn.execute(
+                "SELECT src_id FROM edges WHERE dst_id = ? AND rel = 'chunk_next'",
+                (chunk_id,),
+            ):
+                ids.append(row["src_id"])
+        return ids
+    def get_chunks_by_ids(self, chunk_ids: list[str]) -> list[StoredChunk]:
+        if not chunk_ids:
+            return []
+        placeholders = ",".join("?" for _ in chunk_ids)
+        with self._conn() as conn:
+            rows = conn.execute(
+                f"""
+                SELECT c.id, c.doc_id, c.ordinal, c.text, c.meta_json,
+                       d.title AS doc_title, d.uri AS doc_uri
+                FROM chunks c
+                JOIN documents d ON d.id = c.doc_id
+                WHERE c.id IN ({placeholders})
+                """,
+                chunk_ids,
+            ).fetchall()
+        by_id = {
+            r["id"]: StoredChunk(
+                id=r["id"],
+                doc_id=r["doc_id"],
+                ordinal=r["ordinal"],
+                text=r["text"],
+                doc_title=r["doc_title"],
+                doc_uri=r["doc_uri"],
+                metadata=json.loads(r["meta_json"] or "{}"),
+            )
+            for r in rows
+        }
+        return [by_id[cid] for cid in chunk_ids if cid in by_id]
+    def add_message(
+        self,
+        session_id: str,
+        role: str,
+        content: str,
+        chunk_ids: list[str] | None = None,
+    ) -> None:
+        with self._conn() as conn:
+            conn.execute(
+                """
+                INSERT INTO session_messages (session_id, role, content, chunk_ids_json, created_at)
+                VALUES (?, ?, ?, ?, ?)
+                """,
+                (
+                    session_id,
+                    role,
+                    content,
+                    json.dumps(chunk_ids or []),
+                    datetime.now(UTC).isoformat(),
+                ),
+            )
+    def get_messages(self, session_id: str) -> list[dict[str, Any]]:
+        with self._conn() as conn:
+            rows = conn.execute(
+                """
+                SELECT role, content, chunk_ids_json, created_at
+                FROM session_messages
+                WHERE session_id = ?
+                ORDER BY id ASC
+                """,
+                (session_id,),
+            ).fetchall()
+        return [
+            {
+                "role": r["role"],
+                "content": r["content"],
+                "chunk_ids": json.loads(r["chunk_ids_json"] or "[]"),
+                "created_at": r["created_at"],
+            }
+            for r in rows
+        ]
+    def count_chunks(self) -> int:
+        with self._conn() as conn:
+            row = conn.execute("SELECT COUNT(*) AS n FROM chunks").fetchone()
+        return int(row["n"])
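Review note on connection handling in `store.py`: in the standard library, `with sqlite3.connect(...) as conn:` only commits or rolls back the transaction; it does not close the connection. The sketch below shows one way to get both behaviors with `contextlib.closing`. The helper names `count_rows` and `still_open_after_with` are hypothetical and not part of this commit.

```python
import sqlite3
from contextlib import closing


def still_open_after_with() -> bool:
    """Show that leaving a `with conn:` block does not close the connection."""
    conn = sqlite3.connect(":memory:")
    with conn:  # transaction scope only: commit on success, rollback on error
        conn.execute("SELECT 1")
    try:
        conn.execute("SELECT 1")  # still usable, so the connection is open
        return True
    finally:
        conn.close()


def count_rows(db_path: str) -> int:
    """Hypothetical helper: open, write, read, and always close."""
    with closing(sqlite3.connect(db_path)) as conn:  # closes on exit
        with conn:  # commits the INSERT on success
            conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
            conn.execute("INSERT INTO t (x) VALUES (1)")
        return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Whether this matters for `MemRAGStore` depends on how often its methods are called; connections left open are normally closed when they are garbage collected.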

libs/researchmind/src/researchmind/url_suggest.py ADDED

	@@ -0,0 +1,68 @@

+from __future__ import annotations
+import json
+import re
+from typing import Protocol
+class ChatBackend(Protocol):
+    def chat(
+        self,
+        messages: list[dict[str, str]],
+        *,
+        max_tokens: int = 512,
+        temperature: float = 0.7,
+    ) -> str: ...
+SUGGEST_SYSTEM = """You suggest reputable web URLs for research on a topic.
+Return ONLY a JSON array of 3-5 full https URLs as strings.
+No markdown, no explanation. Example: ["https://example.com/a", "https://example.com/b"]
+"""
+def suggest_urls(topic: str, backend: ChatBackend, *, max_urls: int = 5) -> list[str]:
+    messages = [
+        {"role": "system", "content": SUGGEST_SYSTEM},
+        {"role": "user", "content": f"Topic: {topic.strip()}"},
+    ]
+    raw = backend.chat(messages, max_tokens=512, temperature=0.2)
+    return _parse_url_list(raw, max_urls=max_urls)
+def _parse_url_list(raw: str, *, max_urls: int) -> list[str]:
+    cleaned = raw.strip()
+    fence = re.search(r"```(?:json)?\s*(\[.*?\])\s*```", cleaned, re.DOTALL)
+    if fence:
+        cleaned = fence.group(1)
+    else:
+        start = cleaned.find("[")
+        end = cleaned.rfind("]")
+        if start >= 0 and end > start:
+            cleaned = cleaned[start : end + 1]
+    try:
+        data = json.loads(cleaned)
+    except json.JSONDecodeError:
+        urls = re.findall(r"https?://[^\s\"'<>]+", raw)
+        return _dedupe_urls(urls, max_urls)
+    if not isinstance(data, list):
+        return []
+    urls = [str(u).strip() for u in data if str(u).strip().startswith("http")]
+    return _dedupe_urls(urls, max_urls)
+def _dedupe_urls(urls: list[str], max_urls: int) -> list[str]:
+    seen: set[str] = set()
+    out: list[str] = []
+    for u in urls:
+        if u not in seen:
+            seen.add(u)
+            out.append(u)
+        if len(out) >= max_urls:
+            break
+    return out

libs/researchmind/src/researchmind/url_validate.py ADDED

	@@ -0,0 +1,118 @@

+from __future__ import annotations
+import re
+from urllib.parse import urlparse
+import httpx
+# arXiv IDs look like 2301.00001 or 2301.00001v2
+_ARXIV_ABS = re.compile(
+    r"^https?://(?:www\.)?arxiv\.org/abs/(\d{4}\.\d{4,5})(?:v\d+)?/?$",
+    re.IGNORECASE,
+)
+def normalize_url(url: str) -> str:
+    cleaned = url.strip().strip("\"'<>")
+    if not cleaned:
+        return ""
+    if cleaned.startswith("//"):
+        cleaned = "https:" + cleaned
+    if not cleaned.startswith(("http://", "https://")):
+        cleaned = "https://" + cleaned
+    parsed = urlparse(cleaned)
+    if not parsed.netloc:
+        return ""
+    return parsed.geturl().split("#")[0].rstrip("/")
+def is_well_formed(url: str) -> tuple[bool, str]:
+    if not url:
+        return False, "empty url"
+    if "..." in url or "…" in url:
+        return False, "truncated placeholder"
+    if " " in url:
+        return False, "contains spaces"
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        return False, f"unsupported scheme {parsed.scheme!r}"
+    # hostname drops any port or userinfo, so "127.0.0.1:8000" is still rejected below
+    host = (parsed.hostname or "").lower()
+    if not host or "." not in host:
+        return False, "missing host"
+    if host in ("localhost", "127.0.0.1"):
+        return False, "local url"
+    path = parsed.path or ""
+    if "arxiv.org" in host and "/abs/" in path:
+        if not _ARXIV_ABS.match(url):
+            return False, "invalid arxiv abs url"
+    if "ieeexplore.ieee.org" in host and path.rstrip("/") in ("", "/document"):
+        return False, "incomplete ieee document url"
+    if _is_tracking_or_junk_url(host, path, parsed.query):
+        return False, "tracking or redirect link (not a content page)"
+    return True, "ok"
+def _is_tracking_or_junk_url(host: str, path: str, query: str) -> bool:
+    """Reject ad/click trackers and other non-content URLs from search results."""
+    if "bing.com" in host and "/aclick" in path:
+        return True
+    if "google." in host and ("/aclk" in path or "googleadservices" in host):
+        return True
+    if "doubleclick.net" in host or "googlesyndication.com" in host:
+        return True
+    if host.endswith("bing.com") and path.startswith("/ck/"):
+        return True
+    # Search result redirect wrappers, not stable content URLs
+    if "google." in host and path.rstrip("/") == "/url" and "q=" in query:
+        return True
+    return False
+def probe_url_reachable(url: str, *, timeout: float = 12.0) -> tuple[bool, str]:
+    headers = {"User-Agent": "ResearchMind/0.1 (url-validator)"}
+    try:
+        with httpx.Client(follow_redirects=True, timeout=timeout, headers=headers) as client:
+            response = client.head(url)
+            if response.status_code in (405, 501):
+                response = client.get(url)
+            if response.status_code >= 400:
+                return False, f"http {response.status_code}"
+        return True, "ok"
+    except httpx.HTTPError as exc:
+        return False, str(exc)
+def validate_url(url: str, *, check_reachable: bool = True) -> tuple[bool, str, str]:
+    """Return (ok, reason, normalized_url)."""
+    normalized = normalize_url(url)
+    ok, reason = is_well_formed(normalized)
+    if not ok:
+        return False, reason, normalized
+    if check_reachable:
+        ok, reason = probe_url_reachable(normalized)
+        if not ok:
+            return False, reason, normalized
+    return True, "ok", normalized
+def filter_valid_urls(
+    urls: list[str],
+    *,
+    check_reachable: bool = True,
+    max_results: int = 5,
+) -> list[str]:
+    seen: set[str] = set()
+    valid: list[str] = []
+    for raw in urls:
+        ok, _reason, normalized = validate_url(raw, check_reachable=check_reachable)
+        if ok and normalized not in seen:
+            seen.add(normalized)
+            valid.append(normalized)
+        if len(valid) >= max_results:
+            break
+    return valid

libs/researchmind/tests/test_chunking.py ADDED

	@@ -0,0 +1,15 @@

+from __future__ import annotations
+from researchmind.chunking import chunk_text
+def test_chunk_text_splits_long_document():
+    words = ["word"] * 600
+    text = " ".join(words)
+    chunks = chunk_text(text, doc_id="doc1", chunk_size=100, chunk_overlap=20)
+    assert len(chunks) > 1
+    assert chunks[0].ordinal == 0
+def test_chunk_text_empty():
+    assert chunk_text("", doc_id="x") == []

libs/researchmind/tests/test_citations.py ADDED

	@@ -0,0 +1,67 @@

+from __future__ import annotations
+from researchmind.citations import (
+    clean_model_answer,
+    format_context_block,
+    format_references,
+)
+from researchmind.store import StoredChunk
+def _chunk(chunk_id: str, doc_uri: str, text: str) -> StoredChunk:
+    return StoredChunk(
+        id=chunk_id,
+        doc_id="doc1",
+        ordinal=0,
+        text=text,
+        doc_title="AI Agents Review",
+        doc_uri=doc_uri,
+        metadata={},
+    )
+def test_format_context_groups_chunks_by_document():
+    chunks = [
+        _chunk("c1", "https://example.com/paper", "First passage about agents."),
+        _chunk("c2", "https://example.com/paper", "Second passage about planning."),
+    ]
+    context, citations = format_context_block(chunks)
+    assert context.count("[1]") == 1
+    assert "[2]" not in context
+    assert len(citations) == 1
+    assert "First passage" in context
+    assert "Second passage" in context
+def test_format_references_one_line_per_source():
+    _, citations = format_context_block(
+        [
+            _chunk("c1", "https://a.test", "alpha"),
+            _chunk("c2", "https://a.test", "beta"),
+        ]
+    )
+    refs = format_references(citations)
+    assert refs.count("https://a.test") == 1
+def test_clean_passage_collapses_citation_runs():
+    chunks = [_chunk("c1", "https://a.test", "[1] [2] [3] [4] [5] actual content")]
+    context, _ = format_context_block(chunks)
+    assert "[1] [2] [3] [4] [5]" not in context
+    assert "actual content" in context
+def test_clean_model_answer_strips_reference_spam():
+    raw = "Summary here [1][2][3][4][5].\n\n**References**\n- [1] dup"
+    cleaned = clean_model_answer(raw)
+    assert "**References**" not in cleaned
+    assert "[1][2][3]" not in cleaned
+    assert "Summary here" in cleaned
+def test_clean_model_answer_strips_thinking_block():
+    think_open = "<" + "think" + ">"
+    think_close = "</" + "think" + ">"
+    raw = f"{think_open}\nplan\n{think_close}\n\nAgents use tools and memory [1]."
+    cleaned = clean_model_answer(raw)
+    assert cleaned == "Agents use tools and memory [1]."

libs/researchmind/tests/test_retrieve.py ADDED

	@@ -0,0 +1,95 @@

+from __future__ import annotations
+import numpy as np
+from researchmind.config import ResearchMindConfig
+from researchmind.retrieve import retrieve
+from researchmind.store import MemRAGStore
+def _fake_embed(monkeypatch):
+    def fake_embed_texts(texts, *, model_name):
+        out = []
+        for t in texts:
+            if "photosynthesis" in t.lower():
+                out.append(np.array([1.0, 0.0], dtype=np.float32))
+            else:
+                out.append(np.array([0.0, 1.0], dtype=np.float32))
+        return np.stack(out)
+    monkeypatch.setattr("researchmind.retrieve.embed_texts", fake_embed_texts)
+def test_retrieve_ranks_by_similarity(tmp_path, monkeypatch):
+    _fake_embed(monkeypatch)
+    cfg = ResearchMindConfig(
+        data_dir=tmp_path,
+        embed_model="test",
+        auto_search=False,
+        top_k=1,
+        max_context_chunks=8,
+        chunk_size=512,
+        chunk_overlap=128,
+    )
+    store = MemRAGStore(cfg)
+    store.set_embed_dim(2)
+    store.add_document(
+        source_type="test",
+        uri="a",
+        title="A",
+        text="photosynthesis in plants",
+        chunks=[("c1", 0, "photosynthesis in plants", np.array([1.0, 0.0], dtype=np.float32), {})],
+    )
+    store.add_document(
+        source_type="test",
+        uri="b",
+        title="B",
+        text="fractions math",
+        chunks=[("c2", 0, "fractions math", np.array([0.0, 1.0], dtype=np.float32), {})],
+    )
+    hits = retrieve("photosynthesis", store, config=cfg, top_k=1, expand_neighbors=False)
+    assert len(hits) == 1
+    assert "photosynthesis" in hits[0].text
+def test_retrieve_filters_by_session(tmp_path, monkeypatch):
+    _fake_embed(monkeypatch)
+    cfg = ResearchMindConfig(
+        data_dir=tmp_path,
+        embed_model="test",
+        auto_search=False,
+        top_k=2,
+        max_context_chunks=8,
+        chunk_size=512,
+        chunk_overlap=128,
+    )
+    store = MemRAGStore(cfg)
+    store.set_embed_dim(2)
+    sid_a = store.create_session(topic="a").id
+    sid_b = store.create_session(topic="b").id
+    store.add_document(
+        source_type="test",
+        uri="a",
+        title="Plants",
+        text="photosynthesis in plants",
+        chunks=[("c1", 0, "photosynthesis in plants", np.array([1.0, 0.0], dtype=np.float32), {})],
+        session_id=sid_a,
+    )
+    store.add_document(
+        source_type="test",
+        uri="b",
+        title="Math",
+        text="fractions math",
+        chunks=[("c2", 0, "fractions math", np.array([0.0, 1.0], dtype=np.float32), {})],
+        session_id=sid_b,
+    )
+    scoped = retrieve(
+        "photosynthesis",
+        store,
+        config=cfg,
+        top_k=2,
+        expand_neighbors=False,
+        session_id=sid_a,
+    )
+    assert len(scoped) == 1
+    assert "photosynthesis" in scoped[0].text

libs/researchmind/tests/test_search_queries.py ADDED

	@@ -0,0 +1,29 @@

+from __future__ import annotations
+from researchmind.search_urls import build_search_queries, search_urls
+def test_build_search_queries_includes_wikipedia_and_arxiv():
+    queries = build_search_queries("AI agent")
+    joined = " ".join(queries).lower()
+    assert "wikipedia" in joined
+    assert "arxiv" in joined
+    assert "ai agent" in joined
+def test_search_urls_uses_validated_results(monkeypatch):
+    monkeypatch.setattr(
+        "researchmind.search_urls._collect_candidates",
+        lambda topic, per_query=4: [
+            "https://en.wikipedia.org/wiki/Intelligent_agent",
+            "https://arxiv.org/abs/quantcomm/2021/10.0",
+        ],
+    )
+    def fake_filter(urls, *, check_reachable=True, max_results=5):
+        return [u for u in urls if "wikipedia" in u][:max_results]
+    monkeypatch.setattr("researchmind.search_urls.filter_valid_urls", fake_filter)
+    out = search_urls("AI agent", n=3, check_reachable=False)
+    assert len(out) == 1
+    assert "wikipedia" in out[0]

libs/researchmind/tests/test_store.py ADDED

	@@ -0,0 +1,57 @@

+from __future__ import annotations
+import numpy as np
+from researchmind.config import ResearchMindConfig
+from researchmind.store import MemRAGStore
+def test_store_dedup_and_chunks(tmp_path):
+    cfg = ResearchMindConfig(
+        data_dir=tmp_path,
+        embed_model="test",
+        auto_search=False,
+        top_k=3,
+        max_context_chunks=8,
+        chunk_size=512,
+        chunk_overlap=128,
+    )
+    store = MemRAGStore(cfg)
+    emb = np.array([1.0, 0.0, 0.0], dtype=np.float32)
+    chunks = [("c1", 0, "hello world", emb, {})]
+    doc_id, is_new = store.add_document(
+        source_type="test",
+        uri="test://a",
+        title="A",
+        text="hello world",
+        chunks=chunks,
+    )
+    assert is_new
+    doc_id2, is_new2 = store.add_document(
+        source_type="test",
+        uri="test://a",
+        title="A",
+        text="hello world",
+        chunks=chunks,
+    )
+    assert not is_new2
+    assert doc_id == doc_id2
+    assert store.count_chunks() == 1
+def test_session_messages(tmp_path):
+    cfg = ResearchMindConfig(
+        data_dir=tmp_path,
+        embed_model="test",
+        auto_search=False,
+        top_k=3,
+        max_context_chunks=8,
+        chunk_size=512,
+        chunk_overlap=128,
+    )
+    store = MemRAGStore(cfg)
+    session = store.create_session(topic="test topic")
+    store.add_message(session.id, "user", "hi", [])
+    msgs = store.get_messages(session.id)
+    assert len(msgs) == 1
+    assert msgs[0]["role"] == "user"

libs/researchmind/tests/test_url_validate.py ADDED

	@@ -0,0 +1,65 @@

+from __future__ import annotations
+from researchmind.url_validate import (
+    filter_valid_urls,
+    is_well_formed,
+    normalize_url,
+    validate_url,
+)
+def test_rejects_truncated_and_bad_arxiv():
+    ok, reason = is_well_formed("https://arxiv.org/abs/quantcomm/2021/10.0")
+    assert not ok
+    assert "arxiv" in reason
+    ok, reason = is_well_formed("https://ieeexplore.ieee.org/document/...")
+    assert not ok
+def test_accepts_valid_arxiv():
+    ok, _ = is_well_formed("https://arxiv.org/abs/2301.00001")
+    assert ok
+def test_normalize_adds_scheme():
+    assert normalize_url("en.wikipedia.org/wiki/AI_agent").startswith("https://")
+def test_validate_url_does_not_shadow_probe(monkeypatch):
+    """Regression: check_reachable=True must not call the bool parameter."""
+    def fake_probe(url, *, timeout=12.0):
+        return True, "ok"
+    monkeypatch.setattr("researchmind.url_validate.probe_url_reachable", fake_probe)
+    ok, reason, normalized = validate_url(
+        "https://en.wikipedia.org/wiki/Agent",
+        check_reachable=True,
+    )
+    assert ok
+    assert reason == "ok"
+    assert "wikipedia" in normalized
+def test_rejects_bing_tracking_links():
+    ok, reason = is_well_formed(
+        "https://www.bing.com/aclick?id=abc&u=aHR0cHM6Ly9leGFtcGxlLmNvbQ"
+    )
+    assert not ok
+    assert "tracking" in reason
+def test_filter_valid_urls_skips_bad(monkeypatch):
+    def fake_validate(url, *, check_reachable=True):
+        if "bad" in url:
+            return False, "bad", url
+        return True, "ok", url
+    monkeypatch.setattr("researchmind.url_validate.validate_url", fake_validate)
+    out = filter_valid_urls(
+        ["https://good.example/a", "https://bad.example/b"],
+        check_reachable=False,
+        max_results=5,
+    )
+    assert out == ["https://good.example/a"]

pyproject.toml CHANGED

@@ -9,6 +9,7 @@ dependencies = [
     "ensemble",
     "gradio-space",
     "inference",
+    "researchmind",
 ]
 [dependency-groups]
@@ -46,4 +47,5 @@ agent = { workspace = true }
 ensemble = { workspace = true }
 gradio-space = { workspace = true }
 inference = { workspace = true }
+researchmind = { workspace = true }
 slm-evals = { workspace = true }

skills/extract-content/SKILL.md ADDED

	@@ -0,0 +1,16 @@

+---
+name: extract-content
+description: Chunk, embed, and index extracted text into MemRAG
+task: research
+tools:
+  - extract_and_index
+---
+## Workflow
+1. Receive an `ExtractedDocument` (from web, PDF, or DOCX scrape).
+2. Call `extract_and_index` with optional `session_id`.
+3. Chunks are embedded with sentence-transformers and stored in SQLite.
+4. Duplicate content (same hash) is skipped.
+See `references/chunking-policy.md` for chunk size and overlap defaults.

skills/extract-content/references/chunking-policy.md ADDED

	@@ -0,0 +1,9 @@

+# Chunking policy
+| Setting | Env var | Default |
+|---------|---------|---------|
+| Chunk size (words) | `RESEARCHMIND_CHUNK_SIZE` | 512 |
+| Overlap (words) | `RESEARCHMIND_CHUNK_OVERLAP` | 128 |
+| Embedding model | `RESEARCHMIND_EMBED_MODEL` | `all-MiniLM-L6-v2` |
+Chunks link via `chunk_next` edges for neighbor expansion at retrieval time.
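Review note on the chunking defaults above: a minimal sketch of word-window chunking with overlap, assuming chunks are cut on whitespace-separated words. The function `chunk_words` is hypothetical; the real `chunk_text` in `researchmind/chunking.py` is not shown in this part of the diff and may differ.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 384 words, so consecutive chunks repeat 128 words of context.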

skills/extract-content/scripts/chunk_and_index.py ADDED

	@@ -0,0 +1,35 @@

+#!/usr/bin/env python3
+"""CLI: ingest a text file or URL into MemRAG."""
+from __future__ import annotations
+import argparse
+import sys
+from pathlib import Path
+from researchmind.extract import ExtractedDocument
+from researchmind.ingest import IngestPipeline
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Chunk and index content")
+    parser.add_argument("--url", help="Scrape and index URL")
+    parser.add_argument("--file", type=Path, help="Index local file")
+    parser.add_argument("--session", help="Session id to tag document")
+    args = parser.parse_args()
+    pipeline = IngestPipeline()
+    if args.url:
+        doc_id, is_new = pipeline.ingest_url(args.url, session_id=args.session)
+    elif args.file:
+        doc_id, is_new = pipeline.ingest_path(args.file, session_id=args.session)
+    else:
+        parser.error("Provide --url or --file")
+    status = "indexed" if is_new else "deduplicated"
+    print(f"Document {doc_id} ({status}), chunks in store: {pipeline.store.count_chunks()}")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())

skills/research-mind/SKILL.md ADDED

	@@ -0,0 +1,30 @@

+---
+name: research-mind
+description: Local research agent — scrape, index, and answer with citations
+task: research
+tools:
+  - suggest_urls
+  - scrape_web
+  - scrape_pdf
+  - extract_and_index
+  - research_answer
+flags:
+  auto_search: false
+---
+## Workflow
+### Ingest
+1. **Topic only (default):** run `search_urls` (Google + verification) → user confirms URLs → scrape → `extract_and_index`.
+2. **Auto search:** when `auto_search` is true, same search pipeline ingests top verified URLs without confirmation.
+3. **Direct URL / file:** scrape and index immediately.
+### Q&A (offline after ingest)
+1. Call `research_answer` with the user question and `session_id`.
+2. Retrieve top-k chunks from MemRAG, expand neighbors.
+3. Answer using the local model with inline `[n]` citations.
+4. Append a single **References** list, formatted per `references/citation-format.md`.
+See `references/ingest-modes.md` for mode details.
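
The "expand neighbors" part of Q&A step 2 can be sketched as below, assuming chunks of one document are indexed consecutively so that following `chunk_next` edges in either direction means moving to index ± 1. The helper name `expand_neighbors` and the `radius` parameter are hypothetical, and the actual logic in `researchmind.retrieve` may differ.

```python
from __future__ import annotations


def expand_neighbors(hits: list[int], n_chunks: int, radius: int = 1) -> list[int]:
    """Add up to `radius` neighbors on each side of every retrieved chunk.

    The result is ordered by hit rank and contains no duplicates.
    """
    seen: set[int] = set()
    expanded: list[int] = []
    for idx in hits:
        lo = max(0, idx - radius)
        hi = min(n_chunks, idx + radius + 1)
        for j in range(lo, hi):
            if j not in seen:
                seen.add(j)
                expanded.append(j)
    return expanded
```

Expanding keeps the text around a hit in the prompt, which helps when the relevant sentence straddles a chunk boundary.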

skills/research-mind/references/citation-format.md ADDED

	@@ -0,0 +1,6 @@

+# Citation format
+- Context uses **one number per source document**: `[1]`, `[2]`, …
+- Cite inline sparingly (typically 1–3 markers per answer), not after every phrase.
+- Bracket numbers inside scraped paper text are not citation indices — ignore them.
+- Do not output long runs of `[1][2][3]…` or duplicate **References** lists.
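
The first rule (one number per source document) can be illustrated with a short sketch that numbers documents in order of first appearance among the retrieved chunks. The `(doc_id, text)` tuple shape and the function names are assumptions for illustration, not the interface of `researchmind.citations`.

```python
from __future__ import annotations


def number_sources(chunks: list[tuple[str, str]]) -> dict[str, int]:
    """Map each doc_id to a citation number, in order of first appearance."""
    numbers: dict[str, int] = {}
    for doc_id, _text in chunks:
        if doc_id not in numbers:
            numbers[doc_id] = len(numbers) + 1
    return numbers


def build_context(chunks: list[tuple[str, str]]) -> str:
    """Prefix every chunk with its document's number, not a per-chunk number."""
    numbers = number_sources(chunks)
    return "\n\n".join(f"[{numbers[doc_id]}] {text}" for doc_id, text in chunks)
```

Two chunks from the same document share one marker, so the number of distinct citations the model sees equals the number of sources.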

skills/research-mind/references/ingest-modes.md ADDED

	@@ -0,0 +1,9 @@

+# Ingest modes
+| Mode | `auto_search` | Behavior |
+|------|---------------|----------|
+| Suggest URLs (confirm) | `false` | Google search + URL verification; user checks boxes before ingest |
+| Auto search & ingest | `true` | Same search pipeline; ingests verified URLs without confirmation |
+| Direct URL / file | n/a | Skip discovery; ingest provided sources |
+Global default: `RESEARCHMIND_AUTO_SEARCH=false`. Gradio dropdown and skill `flags.auto_search` override per run.
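
One way to resolve the effective `auto_search` value per run is sketched below. The precedence (Gradio choice, then skill flag, then environment default) is an assumption: the text above says both override the global default but does not rank them against each other. The function name `resolve_auto_search` and the set of accepted truthy strings are also hypothetical.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


def resolve_auto_search(
    ui_choice: bool | None = None,
    skill_flag: bool | None = None,
    env: Mapping[str, str] | None = None,
) -> bool:
    """Pick the per-run auto_search value; `None` means "not set"."""
    if ui_choice is not None:  # assumption: the UI selection wins
        return ui_choice
    if skill_flag is not None:  # then the skill's flags.auto_search
        return skill_flag
    env = os.environ if env is None else env
    raw = env.get("RESEARCHMIND_AUTO_SEARCH", "false")
    return raw.strip().lower() in _TRUTHY
```

Passing `env` explicitly keeps the function easy to test without touching the process environment.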

skills/research-mind/scripts/ask.py ADDED

	@@ -0,0 +1,33 @@

+#!/usr/bin/env python3
+"""CLI stub: Q&A requires a loaded inference backend (use Gradio/agent)."""
+from __future__ import annotations
+import argparse
+import sys
+from researchmind.config import get_config
+from researchmind.ingest import IngestPipeline
+from researchmind.retrieve import retrieve
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Preview retrieval for a question")
+    parser.add_argument("question", help="Question to retrieve context for")
+    parser.add_argument("--top-k", type=int, default=None)
+    args = parser.parse_args()
+    cfg = get_config()
+    store = IngestPipeline().store
+    chunks = retrieve(args.question, store, config=cfg, top_k=args.top_k)
+    if not chunks:
+        print("No chunks in store. Ingest sources first.")
+        return 1
+    for i, c in enumerate(chunks, 1):
+        print(f"\n--- [{i}] {c.doc_title} ---\n{c.text[:500]}...")
+    print("\nUse AgentRunner.run_researchmind_chat() for a full cited answer.")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())