SPARKNET Document Analysis - Testing Guide

✅ Backend Status: Running and Ready

Your enhanced fallback extraction code is now active!


🧪 Test #1: Sample Patent (Best Case)

File to Upload:

/home/mhamdan/SPARKNET/uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt

Expected Results with Fallback Extraction:

| Field | Expected Value |
|---|---|
| Title | "AI-Powered Drug Discovery Platform Using Machine Learning" |
| Abstract | Full abstract (300+ chars) about AI drug discovery |
| Patent ID | US20210123456 |
| TRL Level | 6 |
| Claims | 7 numbered claims |
| Inventors | Dr. Sarah Chen, Dr. Michael Rodriguez, Dr. Yuki Tanaka |
| Technical Domains | AI/ML, pharmaceutical chemistry, computational biology |

How to Test:

  1. Open SPARKNET frontend (http://localhost:3000)
  2. Click "Upload Patent"
  3. Select: uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
  4. Wait for analysis to complete (~2-3 minutes)
  5. Check results match expected values above

🧪 Test #2: Existing Non-Patent Files (Fallback Extraction)

Files Already Uploaded:

uploads/patents/*.pdf

These are NOT actual patents (Microsoft documentation, etc.), but with your enhanced fallback extraction they should now display a meaningful title and abstract instead of generic placeholders.

Expected Behavior:

Before your enhancement:

  • Title: "Patent Analysis" (generic)
  • Abstract: "Abstract not available" (generic)

After your enhancement:

  • Title: First substantial line from document (e.g., "Windows Principles: Twelve Tenets to Promote Competition")
  • Abstract: First ~300 characters of document text
  • Document validator warning in backend logs: "❌ is NOT a valid patent"

How to Test:

  1. Upload any existing PDF from uploads/patents/
  2. Check if title shows actual document title (not "Patent Analysis")
  3. Check if abstract shows document summary (not "Abstract not available")
  4. Check backend logs for validation warnings

📊 Verification Checklist

After uploading the sample patent:

  • Title shows: "AI-Powered Drug Discovery Platform..."
  • Abstract shows actual content (not "Abstract not available")
  • TRL level is 6 with justification
  • Claims section populated with 7 claims
  • Innovations section shows 3+ innovations
  • No "Patent Analysis" generic title
  • Analysis quality > 85%
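
The checklist above can also be evaluated in code. The following is a minimal sketch, assuming the analysis result is available as a dictionary. The key names (`title`, `abstract`, `trl_level`, `claims`, `innovations`, `quality`) and the function name `checklist_failures` are hypothetical and not taken from the SPARKNET API, so adjust them to the actual response.

```python
GENERIC_TITLE = "Patent Analysis"
GENERIC_ABSTRACT = "Abstract not available"


def checklist_failures(result: dict) -> list[str]:
    """Return the checklist items the result fails (an empty list means all pass).

    The dictionary keys used here are assumptions, not the real SPARKNET schema.
    """
    title = result.get("title") or ""
    abstract = result.get("abstract") or ""
    checks = {
        "title starts with 'AI-Powered Drug Discovery Platform'":
            title.startswith("AI-Powered Drug Discovery Platform"),
        "title is not the generic placeholder": title != GENERIC_TITLE,
        "abstract has real content":
            bool(abstract) and abstract != GENERIC_ABSTRACT,
        "TRL level is 6": result.get("trl_level") == 6,
        "7 claims": len(result.get("claims") or []) == 7,
        "3+ innovations": len(result.get("innovations") or []) >= 3,
        "analysis quality > 85%": (result.get("quality") or 0.0) > 0.85,
    }
    return [name for name, passed in checks.items() if not passed]
```

Printing the returned list after each upload gives a quick pass/fail summary without inspecting every field in the UI.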

🔍 How the Enhanced Code Works

Your fallback extraction (_extract_fallback_title_abstract) activates when:

# Condition 1: LLM extraction returned no title, or only the generic placeholder
if not title or title == 'Patent Analysis':
    pass  # fallback: use the first substantial line as the title

# Condition 2: LLM extraction returned no abstract, or only the generic placeholder
if not abstract or abstract == 'Abstract not available':
    pass  # fallback: use the first ~300 characters as the abstract

Fallback Logic:

  1. Title: First substantial line (10-200 chars) from document
  2. Abstract: First few paragraphs after title, truncated to ~300 chars

This ensures something meaningful is displayed even for non-patent documents!
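
The fallback logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `_extract_fallback_title_abstract` implementation: the function names `fallback_title_abstract` and `with_fallback`, the whitespace handling, and the hard cut at 300 characters are all assumptions.

```python
def fallback_title_abstract(text: str, max_abstract: int = 300) -> tuple[str, str]:
    """Use the first 10-200 character line as title and the text after it as abstract."""
    lines = [line.strip() for line in text.splitlines()]
    title = ""
    rest_start = 0
    for index, line in enumerate(lines):
        if 10 <= len(line) <= 200:
            title = line
            rest_start = index + 1
            break
    # Join the remaining non-empty lines into one paragraph, then truncate.
    body = " ".join(part for part in lines[rest_start:] if part)
    return title, body[:max_abstract].rstrip()


def with_fallback(title: str, abstract: str, text: str) -> tuple[str, str]:
    """Replace empty or placeholder fields with values taken from the raw text."""
    fb_title, fb_abstract = fallback_title_abstract(text)
    if not title or title == "Patent Analysis":
        title = fb_title or title
    if not abstract or abstract == "Abstract not available":
        abstract = fb_abstract or abstract
    return title, abstract
```

Fields that the LLM extracted successfully pass through `with_fallback` unchanged; only empty or placeholder values are replaced.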


🐛 Debugging Tips

Check Backend Logs for Validation

# View live backend logs
screen -r Sparknet-backend

# Or dump the window, including the scrollback buffer (-h), to a file
screen -S Sparknet-backend -X hardcopy -h /tmp/backend.log
tail -n 100 /tmp/backend.log

# Look for:
# ✅ "appears to be a valid patent" (good)
# ❌ "is NOT a valid patent" (non-patent uploaded)
# ℹ️  "Using fallback title/abstract extraction" (fallback triggered)
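
For longer logs, counting the three messages may be easier than reading them by eye. The sketch below assumes the log was dumped to `/tmp/backend.log` as shown above. The function name `count_markers` and the labels are illustrative; the marker strings are the ones listed in this guide.

```python
from collections import Counter
from pathlib import Path

# Marker substrings copied from the log messages listed above.
MARKERS = {
    "valid": "appears to be a valid patent",
    "invalid": "is NOT a valid patent",
    "fallback": "Using fallback title/abstract extraction",
}


def count_markers(log_text: str) -> Counter:
    """Count how many log lines contain each marker substring."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, marker in MARKERS.items():
            if marker in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    log_path = Path("/tmp/backend.log")  # file written by the hardcopy command
    if log_path.exists():
        print(dict(count_markers(log_path.read_text(errors="replace"))))
```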

Expected Log Sequence for Sample Patent:

📄 Analyzing patent: uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
Extracting patent structure...
Assessing technology and commercialization potential...
✅ Patent analysis complete: TRL 6, 3 innovations identified
✅ appears to be a valid patent

Expected Log Sequence for Non-Patent (with fallback):

📄 Analyzing patent: uploads/patents/microsoft_doc.pdf
Extracting patent structure...
❌ is NOT a valid patent
   Detected type: Microsoft Windows documentation
   Issues: Only 1 patent keywords found, Missing required sections: abstract, claim
ℹ️  Using fallback title/abstract extraction
Fallback extraction: title='Windows Principles: Twelve Tenets...', abstract length=287
✅ Patent analysis complete: TRL 5, 2 innovations identified
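
To confirm that a captured log follows the expected non-patent sequence, an ordered-substring check is usually enough. This is a sketch only: the name `appears_in_order` is hypothetical, and the fragments are shortened from the sample log above so that file names and counts do not need to match exactly.

```python
from typing import Iterable, Sequence

# Fragments taken from the expected non-patent log sequence above.
NON_PATENT_SEQUENCE = [
    "Analyzing patent",
    "Extracting patent structure",
    "is NOT a valid patent",
    "Using fallback title/abstract extraction",
    "Patent analysis complete",
]


def appears_in_order(log_lines: Iterable[str], expected: Sequence[str]) -> bool:
    """Return True if each fragment occurs in the log after the previous one.

    A single shared iterator is consumed, so each fragment must be found on a
    later line than the fragment before it.
    """
    remaining = iter(log_lines)
    return all(
        any(fragment in line for line in remaining) for fragment in expected
    )
```

Unrelated lines between the expected messages are ignored, so the check tolerates extra logging from other components.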

🎯 Quick Test Commands

Check if backend has new code loaded:

# Health check: confirms the backend is up, not that document_validator is importable (see the manual validator test below)
curl -s http://localhost:8000/api/health
# Should return: "status": "healthy"
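
The same health check can be scripted in Python. This is a minimal sketch, assuming the endpoint returns a JSON object with a `status` field, as the expected output above suggests. The function names `is_healthy` and `check_backend` are illustrative.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True if the JSON body is an object whose status is "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_backend(url: str = "http://localhost:8000/api/health",
                  timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; return False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.read().decode("utf-8", errors="replace"))
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


if __name__ == "__main__":
    print("backend healthy:", check_backend())
```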

Manually test document validator:

# Run from the repository root so that the src package is importable
python << 'EOF'
from src.utils.document_validator import validate_and_log

# Read the sample patent, then validate its text
with open('uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt', 'r', encoding='utf-8') as f:
    text = f.read()

is_valid = validate_and_log(text, "sample_patent.txt")
print(f"Valid patent: {is_valid}")
EOF

Check uploaded files:

# List all uploaded patents
ls -lh uploads/patents/

# Check if sample patent exists
ls -lh uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt

🚀 Next Steps

Immediate Testing:

  1. Upload SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt through UI
  2. Verify results show actual patent information
  3. Check backend logs for validation messages

Download Real Patents for Testing:

Option 1: Google Patents

  1. Visit: https://patents.google.com/
  2. Search: "artificial intelligence" or "machine learning"
  3. Download any patent PDF
  4. Upload to SPARKNET

Option 2: USPTO Direct

# Example: Download US patent 10,123,456
curl -o real_patent.pdf "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10123456"

Option 3: EPO (European Patents)

# Example: European patent
curl -o ep_patent.pdf "https://data.epo.org/publication-server/rest/v1.0/publication-dates/20210601/patents/EP1234567/document.pdf"

Clear Non-Patent Uploads (Optional):

# Backup existing uploads
mkdir -p uploads/patents_backup
cp uploads/patents/*.pdf uploads/patents_backup/

# Remove all top-level PDFs (this deletes every PDF, including any real patents; the .txt sample is untouched)
find uploads/patents/ -maxdepth 1 -name "*.pdf" -type f -delete

# Confirm the sample patent is still present
ls uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
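
A Python alternative to the shell cleanup is sketched below. It deletes each PDF only after its backup copy exists, so a failed copy cannot lose data. The function name `backup_and_remove_pdfs` is hypothetical, and the directory names in the usage note simply mirror the commands above.

```python
import shutil
from pathlib import Path


def backup_and_remove_pdfs(source: Path, backup: Path) -> list[Path]:
    """Copy each top-level PDF in source to backup, then delete the original.

    Returns the paths of the backup copies. Non-PDF files are left untouched.
    """
    backup.mkdir(parents=True, exist_ok=True)
    backed_up = []
    for pdf in sorted(source.glob("*.pdf")):
        if not pdf.is_file():
            continue
        target = backup / pdf.name
        shutil.copy2(pdf, target)  # copy2 also preserves timestamps
        if target.exists():
            pdf.unlink()
            backed_up.append(target)
    return backed_up
```

For example, `backup_and_remove_pdfs(Path("uploads/patents"), Path("uploads/patents_backup"))` would clear the PDFs while keeping the `.txt` sample in place.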

📈 Performance Expectations

Analysis Time:

  • Sample Patent: ~2-3 minutes (first run)
  • With fallback: +5-10 seconds (fallback extraction is fast)
  • Subsequent analyses: ~1-2 minutes (memory cached)

Success Criteria:

  • Valid Patents: >90% accuracy on title/abstract extraction
  • Non-Patents: Fallback shows meaningful title/abstract (not generic placeholders)
  • Overall: System doesn't crash, always returns results

✅ Success! What You've Fixed

Before:

  • ❌ Generic "Patent Analysis" title
  • ❌ "Abstract not available"
  • ❌ No indication document wasn't a patent

After (with your enhancements):

  • ✅ Actual document title extracted (even for non-patents)
  • ✅ Document summary shown as abstract
  • ✅ Validation warnings in logs
  • ✅ Better user experience

Date: November 10, 2025
Status: ✅ Ready for Testing
Backend: Running on port 8000
Frontend: Running on port 3000 (assumed)

Your Next Action: Upload SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt through the UI! 🚀