
MHamdan

Initial commit: SPARKNET framework

a9dc537 21 days ago


SPARKNET Document Analysis - Testing Guide

✅ Backend Status: Running and Ready

Your enhanced fallback extraction code is now active!

🧪 Test #1: Sample Patent (Best Case)

File to Upload:

/home/mhamdan/SPARKNET/uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt

Expected Results with Fallback Extraction:

Field	Expected Value
Title	"AI-Powered Drug Discovery Platform Using Machine Learning"
Abstract	Full abstract (300+ chars) about AI drug discovery
Patent ID	US20210123456
TRL Level	6
Claims	7 numbered claims
Inventors	Dr. Sarah Chen, Dr. Michael Rodriguez, Dr. Yuki Tanaka
Technical Domains	AI/ML, pharmaceutical chemistry, computational biology

How to Test:

Open SPARKNET frontend (http://localhost:3000)
Click "Upload Patent"
Select: uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
Wait for analysis to complete (~2-3 minutes)
Check results match expected values above

🧪 Test #2: Existing Non-Patent Files (Fallback Extraction)

Files Already Uploaded:

uploads/patents/*.pdf

These files are not actual patents (Microsoft documentation and similar). With your enhanced fallback extraction, they should now behave as follows.

Expected Behavior:

Before your enhancement:

Title: "Patent Analysis" (generic)
Abstract: "Abstract not available" (generic)

After your enhancement:

Title: First substantial line from document (e.g., "Windows Principles: Twelve Tenets to Promote Competition")
Abstract: First ~300 characters of document text
Document validator warning in backend logs: "❌ is NOT a valid patent"

How to Test:

Upload any existing PDF from uploads/patents/
Check if title shows actual document title (not "Patent Analysis")
Check if abstract shows document summary (not "Abstract not available")
Check backend logs for validation warnings

📊 Verification Checklist

After uploading the sample patent:

Title shows: "AI-Powered Drug Discovery Platform..."
Abstract shows actual content (not "Abstract not available")
TRL level is 6 with justification
Claims section populated with 7 claims
Innovations section shows 3+ innovations
No "Patent Analysis" generic title
Analysis quality > 85%
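If you prefer to check these items programmatically, the checklist can be expressed as a small function. This is only a sketch: the guide does not show the analysis response schema, so the dictionary keys (`title`, `abstract`, `trl_level`, `claims`, `innovations`, `quality`) and the representation of quality as a fraction between 0 and 1 are assumptions you will need to adapt.

```python
GENERIC_TITLE = "Patent Analysis"
GENERIC_ABSTRACT = "Abstract not available"


def checklist_failures(result):
    """Return the checklist items that fail for the sample patent.

    `result` is assumed to be a dict; all key names are hypothetical.
    """
    failures = []
    title = result.get("title") or ""
    abstract = result.get("abstract") or ""

    if not title.startswith("AI-Powered Drug Discovery Platform"):
        failures.append("title does not match the sample patent")
    if title == GENERIC_TITLE:
        failures.append("generic 'Patent Analysis' title")
    if not abstract or abstract == GENERIC_ABSTRACT:
        failures.append("abstract missing or generic")
    if result.get("trl_level") != 6:
        failures.append("TRL level is not 6")
    if len(result.get("claims") or []) != 7:
        failures.append("expected 7 claims")
    if len(result.get("innovations") or []) < 3:
        failures.append("expected at least 3 innovations")
    # Assumption: quality is a fraction (0.85 == 85%).
    if (result.get("quality") or 0) <= 0.85:
        failures.append("analysis quality not above 85%")
    return failures
```

An empty list means every checklist item passed; otherwise each entry names one item to investigate.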

🔍 How the Enhanced Code Works

Your fallback extraction (_extract_fallback_title_abstract) activates when:

# Condition 1: LLM extraction returned no title, or only the generic placeholder
if not title or title == 'Patent Analysis':
    ...  # Use fallback: extract first substantial line as title

# Condition 2: LLM extraction returned no abstract, or only the generic placeholder
if not abstract or abstract == 'Abstract not available':
    ...  # Use fallback: extract first ~300 chars as abstract

Fallback Logic:

Title: First substantial line (10-200 chars) from document
Abstract: First few paragraphs after title, truncated to ~300 chars

This ensures something meaningful is displayed even for non-patent documents!
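The fallback logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual body of _extract_fallback_title_abstract. The function name `fallback_title_abstract`, its parameters, and the whitespace handling are assumptions; only the 10-200 character title window and the ~300 character abstract limit come from this guide.

```python
def fallback_title_abstract(text, min_len=10, max_len=200, abstract_chars=300):
    """Hypothetical sketch of the fallback described in this guide.

    Title: first line whose stripped length is within [min_len, max_len].
    Abstract: the text after that line, whitespace-collapsed and
    truncated to at most abstract_chars characters.
    """
    lines = text.splitlines()
    title = None
    rest_start = 0
    for i, line in enumerate(lines):
        candidate = line.strip()
        if min_len <= len(candidate) <= max_len:
            title = candidate
            rest_start = i + 1
            break

    # Collapse runs of whitespace so the abstract reads as one paragraph.
    body = " ".join(" ".join(lines[rest_start:]).split())
    abstract = body[:abstract_chars].rstrip()
    return title, abstract
```

If no line falls inside the length window, this sketch returns `None` for the title and builds the abstract from the start of the document; the real implementation may handle that case differently.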

🐛 Debugging Tips

Check Backend Logs for Validation

# View live backend logs
screen -r Sparknet-backend

# Or dump the screen contents to a file (-h also includes the scrollback buffer)
screen -S Sparknet-backend -X hardcopy -h /tmp/backend.log
tail -100 /tmp/backend.log

# Look for:
# ✅ "appears to be a valid patent" (good)
# ❌ "is NOT a valid patent" (non-patent uploaded)
# ℹ️  "Using fallback title/abstract extraction" (fallback triggered)
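Instead of scanning the log by eye, you can count how often each of the three messages appears. The sketch below assumes the log has been saved as plain text and that the message fragments match the ones listed above; `count_markers` and `MARKERS` are illustrative names, not part of SPARKNET.

```python
from collections import Counter

# Fragments taken from the messages listed in this guide.
MARKERS = {
    "valid": "appears to be a valid patent",
    "not_valid": "is NOT a valid patent",
    "fallback": "Using fallback title/abstract extraction",
}


def count_markers(log_text):
    """Count log lines containing each known validation message."""
    counts = Counter({name: 0 for name in MARKERS})
    for line in log_text.splitlines():
        for name, fragment in MARKERS.items():
            if fragment in line:
                counts[name] += 1
    return dict(counts)
```

For example, `count_markers(Path("/tmp/backend.log").read_text(errors="replace"))` (with `from pathlib import Path`) summarizes the hardcopy file written by the commands above.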

Expected Log Sequence for Sample Patent:

📄 Analyzing patent: uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
Extracting patent structure...
Assessing technology and commercialization potential...
✅ Patent analysis complete: TRL 6, 3 innovations identified
✅ appears to be a valid patent

Expected Log Sequence for Non-Patent (with fallback):

📄 Analyzing patent: uploads/patents/microsoft_doc.pdf
Extracting patent structure...
❌ is NOT a valid patent
   Detected type: Microsoft Windows documentation
   Issues: Only 1 patent keywords found, Missing required sections: abstract, claim
ℹ️  Using fallback title/abstract extraction
Fallback extraction: title='Windows Principles: Twelve Tenets...', abstract length=287
✅ Patent analysis complete: TRL 5, 2 innovations identified
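To confirm that the messages appear in the expected order, not just that they appear, a simple in-order check is enough. This is a sketch under the assumption that each expected message is a substring of some log line; `first_missing_in_order` and `NON_PATENT_SEQUENCE` are hypothetical names.

```python
# Fragments from the non-patent log sequence shown above, in order.
NON_PATENT_SEQUENCE = [
    "Analyzing patent:",
    "Extracting patent structure",
    "is NOT a valid patent",
    "Using fallback title/abstract extraction",
    "Patent analysis complete",
]


def first_missing_in_order(log_lines, expected_fragments):
    """Return the first expected fragment not found in order, or None.

    The shared iterator is consumed as matches are found, so each
    fragment must appear after the previous one.
    """
    remaining = iter(log_lines)
    for fragment in expected_fragments:
        if not any(fragment in line for line in remaining):
            return fragment
    return None
```

A return value of `None` means the whole sequence was found in order; any other value names the first message that is missing or out of place.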

🎯 Quick Test Commands

Check if backend has new code loaded:

# Health check: confirms the backend is responding.
# It does not by itself prove that document_validator was imported.
curl -s http://localhost:8000/api/health
# Should return: "status": "healthy"
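The same health check can be scripted so that it fails loudly in a test run. This sketch assumes the endpoint returns a JSON object with a `status` field, which is suggested but not confirmed by the expected output above; `is_healthy` and `fetch_health` are illustrative helpers.

```python
import json
import urllib.request


def is_healthy(payload):
    """Return True if payload is a JSON object with status == "healthy"."""
    try:
        data = json.loads(payload)
    except (ValueError, TypeError):
        return False
    return isinstance(data, dict) and data.get("status") == "healthy"


def fetch_health(url="http://localhost:8000/api/health", timeout=5):
    """Fetch the raw health response body from the backend."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the backend running, `is_healthy(fetch_health())` should evaluate to `True`.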

Manually test document validator:

python << 'EOF'
from src.utils.document_validator import validate_and_log

# Test with sample patent
with open('uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt', 'r') as f:
    text = f.read()
    is_valid = validate_and_log(text, "sample_patent.txt")
    print(f"Valid patent: {is_valid}")
EOF

Check uploaded files:

# List all uploaded patents
ls -lh uploads/patents/

# Check if sample patent exists
ls -lh uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
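A scripted version of the same check can be handy when you want a quick summary rather than a full listing. This is a sketch that assumes it is run from the SPARKNET project root; `summarize_uploads` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path

SAMPLE_NAME = "SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt"


def summarize_uploads(directory="uploads/patents"):
    """Count uploaded files by extension and check for the sample patent."""
    root = Path(directory)
    files = [p for p in root.iterdir() if p.is_file()] if root.is_dir() else []
    by_suffix = Counter(p.suffix.lower() or "(none)" for p in files)
    return {
        "total": len(files),
        "by_suffix": dict(by_suffix),
        "sample_present": (root / SAMPLE_NAME).is_file(),
    }
```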

🚀 Next Steps

Immediate Testing:

Upload SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt through UI
Verify results show actual patent information
Check backend logs for validation messages

Download Real Patents for Testing:

Option 1: Google Patents

Visit: https://patents.google.com/
Search: "artificial intelligence" or "machine learning"
Download any patent PDF
Upload to SPARKNET

Option 2: USPTO Direct

# Example: Download US patent 10,123,456
curl -o real_patent.pdf "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10123456"

Option 3: EPO (European Patents)

# Example: European patent
curl -o ep_patent.pdf "https://data.epo.org/publication-server/rest/v1.0/publication-dates/20210601/patents/EP1234567/document.pdf"

Clear Non-Patent Uploads (Optional):

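# Backup existing uploads
mkdir -p uploads/patents_backup
cp uploads/patents/*.pdf uploads/patents_backup/

# Remove the PDFs that were just backed up (top level only, matching the cp above).
# This deletes every PDF, including any real patents; the .txt sample is not affected.
find uploads/patents/ -maxdepth 1 -name "*.pdf" -type f -delete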

# Keep the sample patent
ls uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
# Should exist
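If you want the cleanup to delete a file only after its backup copy has been confirmed, a short script is safer than separate shell commands. The sketch below is one way to do that; `backup_and_remove_pdfs` is a hypothetical helper, and the size comparison is a basic sanity check, not a full integrity verification.

```python
import shutil
from pathlib import Path


def backup_and_remove_pdfs(src="uploads/patents", dst="uploads/patents_backup"):
    """Copy each top-level PDF to dst, then delete the original.

    A file is removed only if its backup exists with the same size.
    Returns the names of the files that were removed.
    """
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    removed = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        if not pdf.is_file():
            continue
        copy = dst_dir / pdf.name
        shutil.copy2(pdf, copy)
        if copy.is_file() and copy.stat().st_size == pdf.stat().st_size:
            pdf.unlink()
            removed.append(pdf.name)
    return removed
```

Because only `*.pdf` files are matched, SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt stays in place.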

📈 Performance Expectations

Analysis Time:

Sample Patent: ~2-3 minutes (first run)
With fallback: +5-10 seconds (fallback extraction is fast)
Subsequent analyses: ~1-2 minutes (memory cached)

Success Criteria:

Valid Patents: >90% accuracy on title/abstract extraction
Non-Patents: Fallback shows meaningful title/abstract (not generic placeholders)
Overall: System doesn't crash, always returns results
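The guide does not define how the 90% accuracy figure is measured. One possible reading, shown here purely as an assumption, is the share of test documents whose extracted field matches the expected value after normalizing case and whitespace. `normalize` and `extraction_accuracy` are illustrative names.

```python
def normalize(text):
    """Lowercase and collapse whitespace for a lenient comparison."""
    return " ".join(text.lower().split())


def extraction_accuracy(cases):
    """Fraction of (extracted, expected) pairs that match after normalizing.

    Assumption: a normalized exact match counts as correct. Returns 0.0
    for an empty list of cases.
    """
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(normalize(got) == normalize(want) for got, want in cases)
    return hits / len(cases)
```

Under this definition, a run meets the criterion for valid patents when `extraction_accuracy(cases) > 0.9` for both the title pairs and the abstract pairs.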

✅ Success! What You've Fixed

Before:

❌ Generic "Patent Analysis" title
❌ "Abstract not available"
❌ No indication document wasn't a patent

After (with your enhancements):

✅ Actual document title extracted (even for non-patents)
✅ Document summary shown as abstract
✅ Validation warnings in logs
✅ Better user experience

Date: November 10, 2025
Status: ✅ Ready for Testing
Backend: Running on port 8000
Frontend: Running on port 3000 (assumed)

Your Next Action: Upload SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt through the UI! 🚀