# SPARKNET Document Analysis - Testing Guide
## ✅ Backend Status: Running and Ready
Your enhanced fallback extraction code is now active!
---
## 🧪 Test #1: Sample Patent (Best Case)
### File to Upload:
```
/home/mhamdan/SPARKNET/uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
```
### Expected Results with Fallback Extraction:
| Field | Expected Value |
|-------|----------------|
| **Title** | "AI-Powered Drug Discovery Platform Using Machine Learning" |
| **Abstract** | Full abstract (300+ chars) about AI drug discovery |
| **Patent ID** | US20210123456 |
| **TRL Level** | 6 |
| **Claims** | 7 numbered claims |
| **Inventors** | Dr. Sarah Chen, Dr. Michael Rodriguez, Dr. Yuki Tanaka |
| **Technical Domains** | AI/ML, pharmaceutical chemistry, computational biology |
### How to Test:
1. Open SPARKNET frontend (http://localhost:3000)
2. Click "Upload Patent"
3. Select: `uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt`
4. Wait for analysis to complete (~2-3 minutes)
5. Check results match expected values above
---
## 🧪 Test #2: Existing Non-Patent Files (Fallback Extraction)
### Files Already Uploaded:
```
uploads/patents/*.pdf
```
These are **not actual patents** (Microsoft docs, etc.). With your **enhanced fallback extraction**, they should now show a meaningful title and abstract instead of generic placeholders.
### Expected Behavior:
**Before your enhancement:**
- Title: "Patent Analysis" (generic)
- Abstract: "Abstract not available" (generic)
**After your enhancement:**
- Title: First substantial line from document (e.g., "Windows Principles: Twelve Tenets to Promote Competition")
- Abstract: First ~300 characters of document text
- Document validator warning in backend logs: "❌ NOT a valid patent"
### How to Test:
1. Upload any existing PDF from `uploads/patents/`
2. Check if title shows actual document title (not "Patent Analysis")
3. Check if abstract shows document summary (not "Abstract not available")
4. Check backend logs for validation warnings
---
## 📊 Verification Checklist
After uploading the sample patent:
- [ ] Title shows: "AI-Powered Drug Discovery Platform..."
- [ ] Abstract shows actual content (not "Abstract not available")
- [ ] TRL level is 6 with justification
- [ ] Claims section populated with 7 claims
- [ ] Innovations section shows 3+ innovations
- [ ] No "Patent Analysis" generic title
- [ ] Analysis quality > 85%
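The checklist above can also be checked programmatically once the analysis result is available as a dictionary. The following is a minimal sketch, not SPARKNET's actual API: the result keys (`title`, `abstract`, `trl_level`, `trl_justification`, `claims`, `innovations`, `quality_score`) and the 0-1 scale for quality are assumptions made for illustration.

```python
# Hypothetical result shape: every key name below is an assumption for
# illustration, not the confirmed SPARKNET response format.
GENERIC_TITLE = "Patent Analysis"
GENERIC_ABSTRACT = "Abstract not available"


def run_checklist(result):
    """Return a dict mapping each checklist item to True (pass) or False."""
    title = result.get("title", "")
    abstract = result.get("abstract", "")
    return {
        "title matches sample": title.startswith("AI-Powered Drug Discovery Platform"),
        "abstract has real content": bool(abstract) and abstract != GENERIC_ABSTRACT,
        "TRL level is 6 with justification": (
            result.get("trl_level") == 6 and bool(result.get("trl_justification"))
        ),
        "7 claims": len(result.get("claims", [])) == 7,
        "3+ innovations": len(result.get("innovations", [])) >= 3,
        "no generic title": title != GENERIC_TITLE,
        # Assumes quality is reported as a fraction between 0 and 1.
        "quality > 85%": result.get("quality_score", 0) > 0.85,
    }


def failed_items(result):
    """List the names of the checklist items that did not pass."""
    return [name for name, ok in run_checklist(result).items() if not ok]
```

An empty list from `failed_items(result)` would mean every item on the checklist passed.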
---
## 🔍 How the Enhanced Code Works
Your fallback extraction (`_extract_fallback_title_abstract`) activates when:
```python
# Condition 1: LLM extraction returned no title, or only the generic placeholder
if not title or title == 'Patent Analysis':
    # Use fallback: extract the first substantial line as the title
    pass
# Condition 2: LLM extraction returned no abstract, or only the generic placeholder
if not abstract or abstract == 'Abstract not available':
    # Use fallback: extract the first ~300 chars as the abstract
    pass
```
**Fallback Logic:**
1. **Title**: First substantial line (10-200 chars) from document
2. **Abstract**: First few paragraphs after title, truncated to ~300 chars
This ensures **something meaningful** is displayed even for non-patent documents!
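The fallback logic above can be sketched as follows. This is an illustrative approximation, not the actual body of `_extract_fallback_title_abstract`: the function name `fallback_title_abstract`, its parameters, and the whitespace handling are assumptions. Only the 10-200 character title window and the ~300 character abstract limit come from the description above.

```python
import re


def fallback_title_abstract(text, min_len=10, max_len=200, abstract_len=300):
    """Return (title, abstract) using a first-substantial-line heuristic.

    Hypothetical helper for illustration; the real implementation may differ.
    """
    lines = [line.strip() for line in text.splitlines()]

    # Title: the first line whose length falls inside the "substantial" window.
    title = None
    title_index = -1
    for i, line in enumerate(lines):
        if min_len <= len(line) <= max_len:
            title, title_index = line, i
            break

    # Abstract: the text after the title line, whitespace collapsed,
    # truncated to roughly abstract_len characters.
    body = " ".join(line for line in lines[title_index + 1:] if line)
    body = re.sub(r"\s+", " ", body).strip()
    abstract = body[:abstract_len].rstrip()

    return title, abstract or None
```

Short lines such as page numbers or one-word headers fall below the minimum length and are skipped, which is why the first *substantial* line is used rather than simply the first line.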
---
## 🐛 Debugging Tips
### Check Backend Logs for Validation
```bash
# View live backend logs
screen -r Sparknet-backend
# Or hardcopy to file
screen -S Sparknet-backend -X hardcopy /tmp/backend.log
tail -100 /tmp/backend.log
# Look for:
# ✅ "appears to be a valid patent" (good)
# ❌ "is NOT a valid patent" (non-patent uploaded)
# ℹ️ "Using fallback title/abstract extraction" (fallback triggered)
```
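Scanning the hardcopy log by eye is error-prone, so a small script can count the three messages listed above. This is a minimal sketch: the marker strings are taken from this guide, while the function name `summarize_log` and the shape of its return value are assumptions for illustration.

```python
# Substrings to look for, as quoted in this guide.
MARKERS = {
    "valid": "appears to be a valid patent",
    "not_valid": "is NOT a valid patent",
    "fallback": "Using fallback title/abstract extraction",
}


def summarize_log(lines):
    """Count how many log lines contain each validation marker."""
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, marker in MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts
```

For example, after writing the hardcopy to `/tmp/backend.log`, calling `summarize_log(open("/tmp/backend.log", errors="replace"))` returns the count for each message.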
### Expected Log Sequence for Sample Patent:
```
📄 Analyzing patent: uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
Extracting patent structure...
Assessing technology and commercialization potential...
✅ Patent analysis complete: TRL 6, 3 innovations identified
✅ appears to be a valid patent
```
### Expected Log Sequence for Non-Patent (with fallback):
```
📄 Analyzing patent: uploads/patents/microsoft_doc.pdf
Extracting patent structure...
❌ is NOT a valid patent
Detected type: Microsoft Windows documentation
Issues: Only 1 patent keywords found, Missing required sections: abstract, claim
ℹ️ Using fallback title/abstract extraction
Fallback extraction: title='Windows Principles: Twelve Tenets...', abstract length=287
✅ Patent analysis complete: TRL 5, 2 innovations identified
```
---
## 🎯 Quick Test Commands
### Check if backend has new code loaded:
```bash
# Health check: confirms the backend is up and responding
curl -s http://localhost:8000/api/health
# Should return: "status": "healthy"
```
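The same health check can be scripted in Python, which is handy before an automated test run. This is a sketch under one assumption: the endpoint returns a JSON object containing `"status": "healthy"`, as the expected output above suggests. The helper names `is_healthy` and `check_backend` are hypothetical.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "http://localhost:8000/api/health"


def is_healthy(body):
    """Return True if the response body is JSON with status == 'healthy'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_backend(url=HEALTH_URL, timeout=5):
    """Query the health endpoint; any connection error counts as unhealthy."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return is_healthy(response.read().decode("utf-8"))
    except OSError:
        # URLError, HTTPError, and timeouts are all subclasses of OSError.
        return False
```

Keeping the parsing in `is_healthy` separate from the network call makes the check easy to test without a running backend.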
### Manually test document validator:
```bash
python << 'EOF'
from src.utils.document_validator import validate_and_log
# Test with sample patent
with open('uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt', 'r') as f:
    text = f.read()
is_valid = validate_and_log(text, "sample_patent.txt")
print(f"Valid patent: {is_valid}")
EOF
```
### Check uploaded files:
```bash
# List all uploaded patents
ls -lh uploads/patents/
# Check if sample patent exists
ls -lh uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
```
---
## 🚀 Next Steps
### Immediate Testing:
1. Upload `SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt` through UI
2. Verify results show actual patent information
3. Check backend logs for validation messages
### Download Real Patents for Testing:
**Option 1: Google Patents**
1. Visit: https://patents.google.com/
2. Search: "artificial intelligence" or "machine learning"
3. Download any patent PDF
4. Upload to SPARKNET
**Option 2: USPTO Direct**
```bash
# Example: Download US patent 10,123,456
curl -o real_patent.pdf "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/10123456"
```
**Option 3: EPO (European Patents)**
```bash
# Example: European patent
curl -o ep_patent.pdf "https://data.epo.org/publication-server/rest/v1.0/publication-dates/20210601/patents/EP1234567/document.pdf"
```
### Clear Non-Patent Uploads (Optional):
```bash
# Backup existing uploads
mkdir -p uploads/patents_backup
cp uploads/patents/*.pdf uploads/patents_backup/
# Remove non-patents (keep only sample)
find uploads/patents/ -name "*.pdf" -type f -delete
# Keep the sample patent
ls uploads/patents/SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt
# Should exist
```
---
## 📈 Performance Expectations
### Analysis Time:
- **Sample Patent**: ~2-3 minutes (first run)
- **With fallback**: +5-10 seconds (fallback extraction is fast)
- **Subsequent analyses**: ~1-2 minutes (memory cached)
### Success Criteria:
- **Valid Patents**: >90% accuracy on title/abstract extraction
- **Non-Patents**: Fallback shows meaningful title/abstract (not generic placeholders)
- **Overall**: System doesn't crash, always returns results
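One way to measure the ">90% accuracy" criterion is to compare extracted titles or abstracts against hand-labeled expected values. The sketch below assumes that "accuracy" means the fraction of documents whose extracted field matches the expected one after case and whitespace normalization. That definition, and the function names, are assumptions rather than anything this guide specifies.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def extraction_accuracy(pairs):
    """Fraction of (extracted, expected) pairs that match after normalization."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(normalize(extracted) == normalize(expected)
               for extracted, expected in pairs)
    return hits / len(pairs)
```

A stricter or looser matching rule (for example, prefix matching for truncated abstracts) may suit real data better. Exact match after normalization is only the simplest starting point.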
---
## ✅ Success! What You've Fixed
### Before:
- ❌ Generic "Patent Analysis" title
- ❌ "Abstract not available"
- ❌ No indication document wasn't a patent
### After (with your enhancements):
- ✅ Actual document title extracted (even for non-patents)
- ✅ Document summary shown as abstract
- ✅ Validation warnings in logs
- ✅ Better user experience
---
**Date**: November 10, 2025
**Status**: ✅ Ready for Testing
**Backend**: Running on port 8000
**Frontend**: Running on port 3000 (assumed)
**Your Next Action**: Upload `SAMPLE_AI_DRUG_DISCOVERY_PATENT.txt` through the UI! 🚀