BabaK07 committed on
Commit 62d2116 · 1 Parent(s): 90b7bb0

feat(document-matching): implement two-stage document matching with LLM reranking


- Add hybrid keyword scoring + LLM semantic reranking for document matching
- Implement Stage 1 fast keyword scoring with weighted phrase and word-level matching
- Implement Stage 2 LLM-based semantic reranking for top candidates (up to 8 docs)
- Update README with detailed explanation of two-stage matching system
- Refactor resolve_relevant_document_hashes to use new scoring algorithm
- Add _llm_verify_document_hashes method for LLM-based document ranking
- Remove deployment configs (fly.toml, render.yaml) no longer in use
- Update test documents with new sample file
- Improve retrieval accuracy by balancing speed (keyword filtering) with semantic understanding (LLM reranking)
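The weighted scoring summarized above (exact-phrase hits outrank word-level hits; filename outranks summary, which outranks preview) can be sketched in isolation. `keyword_score` and its flat-string signature are illustrative stand-ins for this sketch; the committed method operates on ORM document rows:

```python
def keyword_score(query: str, filename: str, summary: str, preview: str) -> float:
    """Stage 1 sketch: weighted exact-phrase and word-level matching."""
    q = query.lower()
    # (text, phrase_weight, word_weight) mirroring filename > summary > preview
    fields = [
        (filename.lower(), 10.0, 3.0),
        (summary.lower(), 5.0, 1.5),
        (preview.lower(), 2.0, 0.5),
    ]
    score = 0.0
    for text, phrase_w, word_w in fields:
        if q and q in text:
            score += phrase_w  # exact phrase match (highest priority)
        for word in q.split():
            if len(word) > 2 and word in text:  # skip very short words
                score += word_w
    return score
```

With these weights, a single filename phrase hit (10.0) outweighs a phrase hit in summary and preview combined (5.0 + 2.0).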

Files changed (4)
  1. README.md +17 -8
  2. app/services/document_service.py +100 -36
  3. fly.toml +0 -22
  4. render.yaml +0 -32
README.md CHANGED
@@ -27,9 +27,10 @@ Uploaded PDFs are parsed page by page and split into chunks.
 Each chunk is stored with metadata (document, page number, chunk index) and embedded into `pgvector`.

 At question time:
-1. The app searches relevant chunks from the user’s accessible documents.
-2. The agent answers from those chunks when possible.
-3. If evidence is weak, the agent uses web search and cites external URLs.
+1. Document matching uses keyword scoring + LLM semantic reranking
+2. Relevant chunks are retrieved from matched documents via vector search
+3. The agent answers from those chunks when possible
+4. If evidence is weak, the agent uses web search and cites external URLs

 ## Chunking Strategy

@@ -62,7 +63,7 @@ Each turn stores/returns source metadata separately from the answer body.
 - Vector source cards include:
   - document name
   - page number
-  - excerpt (short snippet from retrieved chunk)
+  - snippet (short snippet from retrieved chunk)
 - Web source cards include:
   - title
   - URL
@@ -83,6 +84,13 @@ Why I chose this:
 - avoids duplicate indexing,
 - keeps retrieval secure per user.

+I also implemented a two-stage document matching system:
+
+- Stage 1: Fast keyword scoring checks exact phrase matches and word-level matches across filename, summary, and preview text with weighted scoring (filename matches score higher than preview matches).
+- Stage 2: LLM semantic reranking takes the top scored candidates (up to 8) and reranks them based on semantic similarity to the query.
+
+This hybrid approach balances speed and accuracy - keyword filtering is fast and catches obvious matches, while the LLM handles nuanced semantic understanding without processing every document.
+
 ## Challenges I Ran Into

 1. Heavy embedding dependencies made deployment images too large.
@@ -94,10 +102,11 @@ Why I chose this:

 ## If I Had More Time

-- Add reranking (cross-encoder) for better precision on long multi-doc queries.
-- Add automated citation-faithfulness checks.
-- Add Alembic migrations for cleaner schema evolution.
-- Add stronger eval/observability for routing and retrieval quality.
+- Add conversation history UI to display past chat sessions
+- Add reranking (cross-encoder) for better precision on long multi-doc queries
+- Add automated citation-faithfulness checks
+- Add Alembic migrations for cleaner schema evolution
+- Add stronger eval/observability for routing and retrieval quality

 ## Local Setup

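The merge step that combines the two stages described in the README amounts to: LLM-ranked hashes first, keyword-ranked hashes as fallback, deduplicated and capped at the limit. A minimal sketch, with `merge_rankings` as an illustrative name:

```python
def merge_rankings(llm_hashes: list[str], keyword_hashes: list[str], limit: int) -> list[str]:
    # LLM results lead; keyword results fill remaining slots without duplicates.
    merged = llm_hashes + [h for h in keyword_hashes if h not in llm_hashes]
    return merged[:limit]
```

Because the LLM list may be empty (no API key, or a parse failure), the keyword ranking alone still produces a usable result.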
app/services/document_service.py CHANGED
@@ -1,4 +1,6 @@
 import hashlib
+import json
+import re

 from fastapi import UploadFile
 from langchain_groq import ChatGroq
@@ -18,6 +20,7 @@ class DocumentService:
         self.storage = StorageService()
         self.vector_store = VectorStoreService()
         self.summarizer = None
+        self.matcher_llm = None

     async def save_upload(self, upload: UploadFile) -> tuple[bytes, str]:
         content = await upload.read()
@@ -129,45 +132,106 @@ class DocumentService:
         }

     def resolve_relevant_document_hashes(self, db: Session, *, user: User, query: str, limit: int = 5) -> list[str]:
-        stopwords = {
-            "the",
-            "and",
-            "for",
-            "with",
-            "from",
-            "that",
-            "this",
-            "what",
-            "who",
-            "how",
-            "are",
-            "was",
-            "were",
-            "is",
-            "of",
-            "about",
-            "tell",
-            "more",
-            "please",
-            "can",
-            "you",
-            "your",
-        }
-        terms = [term.strip() for term in query.lower().split() if len(term.strip()) > 2 and term.strip() not in stopwords]
         docs = self.list_user_documents(db, user)
-        scored: list[tuple[int, str]] = []
+        if not docs:
+            return []
+
+        query_lower = query.lower()
+        scored: list[tuple[float, str, Document]] = []
+
         for doc in docs:
-            haystack = f"{doc.filename} {doc.summary} {doc.extracted_preview}".lower()
-            filename_score = sum(3 for term in terms if term in (doc.filename or "").lower())
-            body_score = sum(1 for term in terms if term in haystack)
-            score = filename_score + body_score
+            score = 0.0
+
+            # Exact phrase matching (highest priority)
+            if query_lower in (doc.filename or "").lower():
+                score += 10.0
+            if query_lower in (doc.summary or "").lower():
+                score += 5.0
+            if query_lower in (doc.extracted_preview or "").lower():
+                score += 2.0
+
+            # Word-level matching
+            query_words = query_lower.split()
+            filename_lower = (doc.filename or "").lower()
+            summary_lower = (doc.summary or "").lower()
+            preview_lower = (doc.extracted_preview or "").lower()
+
+            for word in query_words:
+                if len(word) > 2:  # Skip very short words
+                    if word in filename_lower:
+                        score += 3.0
+                    if word in summary_lower:
+                        score += 1.5
+                    if word in preview_lower:
+                        score += 0.5
+
             if score > 0:
-                scored.append((score, doc.file_hash))
-        scored.sort(reverse=True)
-        hashes = [file_hash for _, file_hash in scored[:limit]]
-        if hashes:
-            return hashes
-        return [doc.file_hash for doc in docs[:limit]]
+                scored.append((score, doc.file_hash, doc))
+
+        # Sort by score
+        scored.sort(reverse=True, key=lambda x: x[0])
+
+        # Take top candidates for LLM (up to 8)
+        candidates_count = min(max(limit * 2, 8), len(scored)) if scored else min(limit, len(docs))
+
+        if scored:
+            ranked_docs = [doc for _, _, doc in scored[:candidates_count]]
+            ranked_hashes = [file_hash for _, file_hash, _ in scored[:candidates_count]]
+        else:
+            # No keyword matches, use all docs up to limit
+            ranked_docs = docs[:candidates_count]
+            ranked_hashes = [doc.file_hash for doc in ranked_docs]
+
+        # LLM reranking
+        llm_ranked_hashes = self._llm_verify_document_hashes(query=query, candidates=ranked_docs, limit=limit)
+
+        # Merge: LLM results first, then keyword fallback
+        merged = llm_ranked_hashes + [h for h in ranked_hashes if h not in llm_ranked_hashes]
+        return merged[:limit]
+
+
+
+    def _llm_verify_document_hashes(self, *, query: str, candidates: list[Document], limit: int) -> list[str]:
+        if not self.settings.groq_api_key or not candidates:
+            return []
+        if self.matcher_llm is None:
+            self.matcher_llm = ChatGroq(api_key=self.settings.groq_api_key, model=self.settings.model_name, temperature=0)
+
+        payload = []
+        for doc in candidates[:8]:
+            payload.append(
+                {
+                    "file_hash": doc.file_hash,
+                    "filename": doc.filename,
+                    "summary": (doc.summary or "")[:1000],
+                    "preview": (doc.extracted_preview or "")[:1200],
+                }
+            )
+
+        prompt = (
+            "Rank the most relevant documents for the user query based on semantic similarity.\n"
+            "Return ONLY valid JSON with this exact schema:\n"
+            '{"file_hashes": ["<hash1>", "<hash2>"]}\n'
+            f"Return at most {limit} hashes ordered by relevance.\n\n"
+            f"User query:\n{query}\n\n"
+            f"Candidates:\n{json.dumps(payload, ensure_ascii=True)}"
+        )
+        try:
+            response = self.matcher_llm.invoke(prompt)
+            content = response.content if isinstance(response.content, str) else str(response.content)
+
+            # Handle markdown code blocks
+            if "```json" in content:
+                content = content.split("```json")[1].split("```")[0].strip()
+            elif "```" in content:
+                content = content.split("```")[1].split("```")[0].strip()
+
+            data = json.loads(content)
+            hashes = data.get("file_hashes", [])
+            valid = {item.get("file_hash", "") for item in payload}
+            return [value for value in hashes if isinstance(value, str) and value in valid][:limit]
+        except Exception:
+            return []

     def ensure_page_metadata_for_user(self, *, db: Session, user: User) -> None:
         docs = self.list_user_documents(db, user)
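The defensive response parsing inside `_llm_verify_document_hashes` can be isolated as below. `parse_hashes` is an illustrative stand-in (the real method also builds the prompt and calls Groq), and the fence marker is constructed indirectly so this sketch can itself live in markdown:

```python
import json

FENCE = "`" * 3  # the three-backtick markdown fence, built indirectly here

def parse_hashes(content: str, valid: set[str], limit: int) -> list[str]:
    # Strip an optional markdown code fence the model may wrap around its JSON.
    if FENCE + "json" in content:
        content = content.split(FENCE + "json")[1].split(FENCE)[0].strip()
    elif FENCE in content:
        content = content.split(FENCE)[1].split(FENCE)[0].strip()
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    hashes = data.get("file_hashes", [])
    # Keep only hashes that belong to the candidate set, in model order.
    return [h for h in hashes if isinstance(h, str) and h in valid][:limit]
```

Restricting the output to known candidate hashes is what keeps a hallucinated hash from the model out of the final ranking.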
fly.toml DELETED
@@ -1,22 +0,0 @@
-app = "docsbot-kbaba7"
-primary_region = "bom"
-
-[build]
-  dockerfile = "Dockerfile"
-
-[env]
-  STORAGE_BACKEND = "supabase"
-  WEB_SEARCH_PROVIDER = "tavily"
-
-[http_service]
-  internal_port = 8080
-  force_https = true
-  auto_stop_machines = "stop"
-  auto_start_machines = true
-  min_machines_running = 0
-  processes = ["app"]
-
-[[vm]]
-  cpu_kind = "shared"
-  cpus = 1
-  memory_mb = 1024
render.yaml DELETED
@@ -1,32 +0,0 @@
-services:
-  - type: web
-    name: docsbot
-    runtime: python
-    plan: free
-    autoDeploy: true
-    buildCommand: pip install -e .
-    startCommand: uvicorn app.main:app --host 0.0.0.0 --port $PORT
-    healthCheckPath: /
-    envVars:
-      - key: PYTHON_VERSION
-        value: 3.12.2
-      - key: SECRET_KEY
-        sync: false
-      - key: DATABASE_URL
-        sync: false
-      - key: GROQ_API_KEY
-        sync: false
-      - key: STORAGE_BACKEND
-        value: supabase
-      - key: SUPABASE_URL
-        sync: false
-      - key: SUPABASE_SERVICE_ROLE_KEY
-        sync: false
-      - key: SUPABASE_STORAGE_BUCKET
-        value: documents
-      - key: SUPABASE_STORAGE_PREFIX
-        value: docsqa
-      - key: WEB_SEARCH_PROVIDER
-        value: duckduckgo
-      - key: TAVILY_API_KEY
-        sync: false