# DocGenie API
FastAPI-based REST API for generating synthetic documents using LLMs. This API is **optimized for ML dataset creation** with comprehensive handwriting and visual element support.
## Features
- πŸš€ **Simple REST API** - Easy to integrate with any frontend
- πŸ–ΌοΈ **URL-based seed images** - Provide seed images via URLs
- 🎨 **Customizable prompts** - Control document type, language, and ground truth format
- ✍️ **Handwriting Generation** - WordStylist diffusion model with 339 author styles
- 🎯 **Visual Elements** - Stamps, logos, barcodes, photos, figures
- πŸ“Š **ML-Ready Datasets** - Individual token images with complete metadata
- πŸ“„ **Complete output** - Returns PDF, HTML, CSS, and bounding boxes
- ⚑ **Async processing** - Fast and efficient document generation
## ML Dataset Creation
The API is **fully equipped for ML training dataset creation** with `output_detail: "dataset"` mode:
### βœ… Handwriting Data
- **Individual token images**: Each handwriting field saved as separate PNG (`hw0.png`, `hw1.png`, ...)
- **Author style IDs**: 339 unique writer styles (0-338) for style-consistent generation
- **Text content**: Original text for each handwriting field
- **Position data**: Precise bounding boxes (x, y, width, height) in mm
- **Signature detection**: Boolean flag for signature vs regular handwriting
- **Image dimensions**: Width and height for each generated token
### βœ… Visual Element Data
- **Stamps**: Generated with realistic textures, borders, and rotations
- Text content preserved
- Red/green color variants
- Circle/rectangle shapes
- **Logos**: Random selection from 6+ logo prefabs
- **Barcodes**: Code128 format with customizable content
- **Photos**: Random selection from 5+ photo prefabs
- **Figures/Charts**: Random selection from 6+ chart/diagram prefabs
- **Individual images**: Each element saved as separate PNG with transparency
### βœ… Dataset Metadata
- **Token mapping JSON**: Complete mapping with:
- Token IDs and references
- Style IDs for handwriting
- Element types for visual elements
- Position rectangles
- Image filenames
- Content text
- **Ground truth annotations**: QA pairs, classification labels, NER tags
- **Bounding boxes**: Word, segment, and layout-level bboxes
- **Normalized coordinates**: [0,1] scaled for ML frameworks
- **Msgpack export**: Compatible with datadings library
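For orientation, a `token_mapping.json` entry might look roughly like the sketch below. The top-level `handwriting` and `visual_elements` keys (each with a `total_count`) match what the API returns; the per-token field names are illustrative rather than the exact schema:

```json
{
  "handwriting": {
    "total_count": 2,
    "tokens": [
      {
        "token_id": "hw0",
        "style_id": 127,
        "content": "John Smith",
        "rect": {"x": 30.5, "y": 210.0, "width": 42.0, "height": 8.5},
        "image": "handwriting_tokens/hw0.png",
        "is_signature": true
      }
    ]
  },
  "visual_elements": {
    "total_count": 1,
    "elements": [
      {
        "token_id": "logo_0",
        "element_type": "logo",
        "rect": {"x": 10.0, "y": 12.0, "width": 25.0, "height": 25.0},
        "image": "visual_elements/logo_0.png"
      }
    ]
  }
}
```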
### βœ… Additional ML Features
- **OCR results**: Word-level bboxes and text for Document AI training
- **Layout elements**: Document structure annotations
- **Page dimensions**: Physical measurements (mm) and pixel dimensions
- **Reproducibility**: Seed-based generation for consistent results
## Pipeline Overview
The API implements a simplified version of the DocGenie generation pipeline:
1. **Download seed images** from URLs
2. **Convert to base64** for LLM input
3. **Build custom prompt** with user parameters
4. **Call Claude API** to generate HTML documents
5. **Extract HTML/CSS** and ground truth from response
6. **Render to PDF** using Playwright
7. **Extract bounding boxes** from PDF
8. **Return results** as JSON with base64-encoded PDF
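Steps 2–3 of the pipeline can be sketched in a few lines of Python; the helper names and prompt wording below are illustrative, not the API's actual template:

```python
import base64

def image_to_data_url(image_bytes: bytes, media_type: str = "image/jpeg") -> str:
    # Step 2: seed images are base64-encoded before being passed to the LLM.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{media_type};base64,{encoded}"

def build_prompt(language: str, doc_type: str, num_solutions: int) -> str:
    # Step 3: user parameters are interpolated into the generation prompt.
    return (
        f"Generate {num_solutions} synthetic {doc_type} documents in {language} "
        "as standalone HTML/CSS, followed by a ground-truth JSON block."
    )
```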
## Installation
### Prerequisites
- Python 3.10+
- DocGenie main package installed
- Playwright browsers installed
### Setup
1. Install dependencies (all API dependencies are included in the main project):
```bash
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .
# Or install API-specific dependencies
cd api/
pip install -r requirements.txt
```
**Note**: For async endpoint support, ensure you have:
- `redis>=5.0.0` and `rq>=1.15.0` (job queue)
- `supabase>=2.0.0` (database)
- `google-api-python-client>=2.100.0` (Google Drive integration)
2. Install Playwright browsers:
```bash
playwright install chromium
```
3. Install Tesseract OCR (for local OCR support):
```bash
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows
# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
```
4. Set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY="your-api-key-here"
```
5. Configure OCR in `.env`:
```bash
cp .env.example .env
# Edit .env and set:
OCR_SERVICE_ENABLED=true
OCR_USE_LOCAL=true # Use local Tesseract (recommended)
```
## Running the API
### Development Mode
```bash
cd api
python main.py
```
The API will be available at `http://localhost:8000`.
### Production Mode
```bash
cd api
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```
## API Endpoints
### Health Check
```http
GET /health
```
**Response:**
```json
{
  "status": "healthy",
  "version": "1.0.0"
}
```
### Generate Documents
```http
POST /generate
```
**Request Body:**
```json
{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "gt_type": "Multiple questions about each document, with their answers taken **verbatim** from the document.",
    "gt_format": "{\"<Text of question 1>\": \"<Answer to question 1>\", \"<Text of question 2>\": \"<Answer to question 2>\", ...}",
    "num_solutions": 3
  },
  "model": "claude-sonnet-4-5-20250929",
  "api_key": "optional-api-key"
}
```
**Response:**
```json
{
  "success": true,
  "message": "Successfully generated 3 documents",
  "total_documents": 3,
  "documents": [
    {
      "document_id": "uuid-123_0",
      "html": "<!DOCTYPE html>...",
      "css": "body { ... }",
      "ground_truth": {
        "What is the invoice number?": "INV-12345",
        "What is the total amount?": "$1,234.56"
      },
      "pdf_base64": "JVBERi0xLjQK...",
      "bboxes": [
        {
          "text": "Invoice",
          "x": 0.1,
          "y": 0.05,
          "width": 0.2,
          "height": 0.03,
          "page": 0
        }
      ],
      "page_width_mm": 210.0,
      "page_height_mm": 297.0
    }
  ]
}
```
### Generate Documents (Async) - **Recommended for Production**
```http
POST /generate/async
```
**🎯 Cost Optimization**: This endpoint uses Claude's **Batch API** for **50% cost savings** ($2.50 vs $5.00 per 1M input tokens).
**⏱️ Latency**: 5-30 minutes (vs 30-120 seconds for direct API)
**βœ… Best For**: Multi-user production systems with non-realtime requirements
**Request Body:**
```json
{
  "user_id": 123,
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "num_solutions": 3,
    "enable_handwriting": true,
    "enable_visual_elements": true,
    "enable_ocr": true,
    "output_detail": "dataset"
  }
}
```
**Response:**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "estimated_time_minutes": 10,
  "poll_url": "/jobs/550e8400-e29b-41d4-a716-446655440000/status",
  "created_at": "2025-01-15T12:00:00Z"
}
```
**Workflow:**
1. Submit generation request β†’ Get `request_id`
2. Poll status endpoint every 30-60 seconds
3. When `status: "completed"`, download from Google Drive
4. Results uploaded to user's Google Drive with shareable link
### Check Job Status
```http
GET /jobs/{request_id}/status
```
**Response (Queued):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:00:00Z"
}
```
**Response (Processing):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:05:00Z",
  "progress": "Creating batch request..."
}
```
**Response (Completed):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:15:00Z",
  "download_url": "https://drive.google.com/file/d/abc123xyz/view?usp=sharing",
  "file_size_mb": 15.4,
  "document_count": 3
}
```
**Response (Failed):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "failed",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:08:00Z",
  "error_message": "Batch processing timeout"
}
```
**Status Values:**
- `queued`: Job submitted, waiting for worker
- `processing`: Worker picked up job, creating batch
- `generating`: Batch submitted to Claude, waiting for completion
- `completed`: Documents generated and uploaded to Google Drive
- `failed`: Error occurred (see `error_message`)
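The five states split into in-flight and terminal sets, which is all a polling client needs to know. A tiny helper (illustrative, not part of the API) keeps that logic in one place:

```python
TERMINAL_STATES = {"completed", "failed"}
ACTIVE_STATES = {"queued", "processing", "generating"}

def should_keep_polling(status: str) -> bool:
    # Clients can stop polling once a job reaches a terminal state.
    if status not in TERMINAL_STATES | ACTIVE_STATES:
        raise ValueError(f"unknown job status: {status}")
    return status in ACTIVE_STATES
```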
### List User Jobs
```http
GET /jobs/user/{user_id}?limit=50&offset=0
```
**Response:**
```json
{
  "user_id": 123,
  "jobs": [
    {
      "request_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "created_at": "2025-01-15T12:00:00Z",
      "download_url": "https://drive.google.com/...",
      "document_count": 3
    },
    {
      "request_id": "660e8400-e29b-41d4-a716-446655440111",
      "status": "processing",
      "created_at": "2025-01-15T12:30:00Z"
    }
  ],
  "count": 2,
  "limit": 50,
  "offset": 0
}
```
## Usage Examples
### cURL
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": [
      "https://example.com/receipt1.jpg",
      "https://example.com/receipt2.jpg"
    ],
    "prompt_params": {
      "language": "English",
      "doc_type": "receipts",
      "num_solutions": 2
    }
  }'
```
### Python (Direct API)
```python
import requests
import base64
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "seed_images": [
            "https://example.com/seed1.jpg",
            "https://example.com/seed2.jpg"
        ],
        "prompt_params": {
            "language": "English",
            "doc_type": "business forms",
            "num_solutions": 3
        }
    }
)
result = response.json()

# Save first PDF
if result["success"]:
    pdf_data = base64.b64decode(result["documents"][0]["pdf_base64"])
    with open("generated_doc.pdf", "wb") as f:
        f.write(pdf_data)
```
### Python (Async API with Polling) - **Recommended**
```python
import requests
import time
# Step 1: Submit job
response = requests.post(
    "http://localhost:8000/generate/async",
    json={
        "user_id": 123,
        "seed_images": [
            "https://example.com/seed1.jpg",
            "https://example.com/seed2.jpg"
        ],
        "prompt_params": {
            "language": "English",
            "doc_type": "receipts and invoices",
            "num_solutions": 5,
            "enable_handwriting": True,
            "enable_visual_elements": True,
            "enable_ocr": True,
            "output_detail": "dataset"
        }
    }
)
job = response.json()
request_id = job["request_id"]
print(f"βœ“ Job submitted: {request_id}")
print(f"  Estimated time: {job['estimated_time_minutes']} minutes")

# Step 2: Poll status until complete
while True:
    status_response = requests.get(
        f"http://localhost:8000/jobs/{request_id}/status"
    )
    status = status_response.json()
    print(f"  Status: {status['status']}", end="")
    if status.get("progress"):
        print(f" - {status['progress']}")
    else:
        print()
    if status["status"] == "completed":
        print("βœ“ Generation complete!")
        print(f"  Download: {status['download_url']}")
        print(f"  Size: {status.get('file_size_mb', 0):.1f} MB")
        print(f"  Documents: {status.get('document_count', 0)}")
        break
    elif status["status"] == "failed":
        print(f"βœ— Generation failed: {status.get('error_message')}")
        break
    # Wait 30 seconds before next poll
    time.sleep(30)

# Step 3: Download from Google Drive (if completed)
if status["status"] == "completed":
    # User can download from their Google Drive using the shareable link
    print(f"\nDownload your documents at:\n{status['download_url']}")
```
### JavaScript
```javascript
const response = await fetch('http://localhost:8000/generate', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    seed_images: [
      'https://example.com/seed1.jpg',
      'https://example.com/seed2.jpg'
    ],
    prompt_params: {
      language: 'English',
      doc_type: 'invoices',
      num_solutions: 2
    }
  })
});
const result = await response.json();

// Convert base64 PDF to blob
const pdfBlob = await fetch(`data:application/pdf;base64,${result.documents[0].pdf_base64}`)
  .then(res => res.blob());
```
## Configuration
### Prompt Parameters
- **language**: Language for generated documents (default: "English")
- **doc_type**: Type of documents to generate (e.g., "business and administrative", "receipts", "forms")
- **gt_type**: Description of ground truth type to generate
- **gt_format**: Format specification for ground truth JSON
- **num_solutions**: Number of document variations (1-5)
### Stage 3-5 Advanced Features
The API supports advanced document synthesis and dataset packaging:
#### Stage 3: Handwriting & Visual Elements
- **enable_handwriting**: Add handwritten text using diffusion model (default: false)
- **handwriting_ratio**: Percentage of text to convert to handwriting 0-1 (default: 0.5)
- **enable_visual_elements**: Add stamps, barcodes, logos (default: false)
- **visual_element_types**: Types of elements to add: ["stamp", "logo", "figure", "barcode", "photo"] (default: all types)
#### Stage 4: OCR
- **enable_ocr**: Perform OCR on generated document (default: false)
- **ocr_language**: OCR language code (default: "en")
#### Stage 5: Dataset Packaging
- **enable_bbox_normalization**: Normalize bboxes to [0,1] scale (default: false)
- **enable_gt_verification**: Verify ground truth quality (default: false)
- **enable_analysis**: Generate dataset statistics (default: false)
- **enable_debug_visualization**: Create bbox overlay images (default: false)
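`enable_bbox_normalization` divides the millimetre bounding boxes by the page dimensions (210 Γ— 297 mm for A4, as in the response above). A minimal sketch of that transform, assuming the A4 defaults; the function is illustrative, not the API's internal code:

```python
def normalize_bbox(x_mm, y_mm, w_mm, h_mm, page_w_mm=210.0, page_h_mm=297.0):
    # Scale absolute mm coordinates to the [0, 1] range ML frameworks expect.
    return (x_mm / page_w_mm, y_mm / page_h_mm, w_mm / page_w_mm, h_mm / page_h_mm)
```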
#### Dataset Export (Msgpack Format)
- **enable_dataset_export**: Export as msgpack dataset format (default: false)
- **dataset_export_format**: Export format - only "msgpack" is supported (default: "msgpack")
**Note**: Only msgpack format is implemented in the current pipeline. COCO and HuggingFace export formats mentioned in some documentation are not yet available.
#### Output Detail Level
- **output_detail**: Controls how much data is returned/saved (default: "minimal")
- `"minimal"` (default): Final outputs only (PDFs, images, metadata) - 2-5 MB per document
- `"dataset"`: Includes individual token images for ML training - 10-20 MB per document
- Individual handwriting token images (`handwriting_tokens/hw0.png`, ...)
- Individual visual element images (`visual_elements/logo_0.png`, ...)
- Token mapping JSON with style IDs and positions
- `"complete"`: All intermediate files and debug info - 20-50 MB per document
- Everything from `dataset` mode
- Intermediate PDFs from each processing stage
- Generation logs
- ⚠️ **Warning**: Can result in 50+ MB JSON responses for `/generate` endpoint
**Recommendation**: Use `"minimal"` for production, `"dataset"` for ML research, `"complete"` for debugging (only with `/generate/pdf`).
**Example with dataset output detail:**
```python
import base64
import json
import os

import requests

# Generate ML training dataset
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "seed_images": ["https://example.com/seed.jpg"],
        "prompt_params": {
            "language": "English",
            "doc_type": "receipts and invoices",
            "num_solutions": 5,
            # Enable handwriting and visual elements
            "enable_handwriting": True,
            "handwriting_ratio": 0.4,
            "enable_visual_elements": True,
            "visual_element_types": ["stamp", "logo", "figure", "barcode", "photo"],  # All types by default
            # Enable dataset features
            "enable_ocr": True,
            "enable_bbox_normalization": True,
            "enable_dataset_export": True,
            # IMPORTANT: Set output_detail to "dataset" for ML training
            "output_detail": "dataset",
            # Use seed for reproducibility
            "seed": 42
        }
    }
)
result = response.json()

# Process each generated document
for doc in result["documents"]:
    doc_id = doc["document_id"]
    os.makedirs(f"dataset/{doc_id}", exist_ok=True)
    print(f"\nProcessing {doc_id}:")

    # 1. Save individual handwriting token images
    if doc.get("handwriting_token_images"):
        print(f"  - Handwriting tokens: {len(doc['handwriting_token_images'])}")
        for hw_id, img_b64 in doc["handwriting_token_images"].items():
            with open(f"dataset/{doc_id}/{hw_id}.png", "wb") as f:
                f.write(base64.b64decode(img_b64))

    # 2. Save individual visual element images
    if doc.get("visual_element_images"):
        print(f"  - Visual elements: {len(doc['visual_element_images'])}")
        for ve_id, img_b64 in doc["visual_element_images"].items():
            with open(f"dataset/{doc_id}/{ve_id}.png", "wb") as f:
                f.write(base64.b64decode(img_b64))

    # 3. Save token mapping for ML training
    if doc.get("token_mapping"):
        mapping = doc["token_mapping"]
        print(f"  - Mapping: {mapping['handwriting']['total_count']} HW + {mapping['visual_elements']['total_count']} VE")
        with open(f"dataset/{doc_id}/token_mapping.json", "w") as f:
            json.dump(mapping, f, indent=2)

    # 4. Save ground truth annotations
    if doc.get("ground_truth"):
        with open(f"dataset/{doc_id}/ground_truth.json", "w") as f:
            json.dump(doc["ground_truth"], f, indent=2)

    # 5. Save bounding boxes (normalized coordinates)
    if doc.get("normalized_bboxes_word"):
        with open(f"dataset/{doc_id}/bboxes_normalized.json", "w") as f:
            json.dump(doc["normalized_bboxes_word"], f, indent=2)

    # 6. Save final document image
    if doc.get("image_base64"):
        with open(f"dataset/{doc_id}/final_image.png", "wb") as f:
            f.write(base64.b64decode(doc["image_base64"]))

    # 7. Save msgpack dataset file
    if doc.get("dataset_export") and doc["dataset_export"].get("msgpack_base64"):
        with open(f"dataset/{doc_id}/dataset.msgpack", "wb") as f:
            f.write(base64.b64decode(doc["dataset_export"]["msgpack_base64"]))

print(f"\nβœ… Generated {len(result['documents'])} ML-ready documents")
```
### PDF Generation Endpoint (Recommended for Large Datasets)
For bulk generation with comprehensive file outputs, use `/generate/pdf`:
```bash
curl -X POST http://localhost:8000/generate/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": ["https://example.com/seed1.jpg"],
    "prompt_params": {
      "num_solutions": 3,
      "enable_handwriting": true,
      "enable_ocr": true,
      "enable_bbox_normalization": true,
      "enable_dataset_export": true,
      "output_detail": "dataset"
    }
  }' \
  --output documents.zip
```
#### ZIP File Contents
The ZIP contents vary by `output_detail` level:
**Minimal (default):**
- `document_<id>.pdf` - Generated PDF files
- `document_<id>/` - Per-document directories with:
- `document.html`, `document.css` - Source files
- `ground_truth.json`, `bboxes.json` - Annotations
- `final_image.png` - Final rendered image (if Stage 3 enabled)
- `handwriting_regions.json`, `visual_elements.json` - Stage 3 metadata (if enabled)
- `ocr_results.json` - OCR word-level data (if OCR enabled)
- `README.md` - Package documentation
- `metadata.json` - Combined metadata
**Dataset (for ML training):**
- All files from "minimal" level, plus:
- `handwriting_tokens/` - Individual token images (`hw0.png`, `hw1.png`, ...)
- `visual_elements/` - Individual element images (`logo_0.png`, `stamp_1.png`, ...)
- `token_mapping.json` - Complete mapping with style IDs and positions
- `dataset.msgpack` - Msgpack dataset file (if export enabled)
- `normalized_bboxes_word.json` - Normalized coordinates (if Stage 5 enabled)
**Complete (for debugging):**
- All files from "dataset" level, plus:
- Intermediate PDFs from each processing stage
- Generation logs with timing information
- `debug_visualization.png` - Bbox overlay images
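Once downloaded, the package can be inspected or unpacked without touching disk layout assumptions; a short sketch using the standard library (the entry names follow the layout listed above):

```python
import io
import zipfile

def list_package(zip_bytes: bytes) -> list:
    # Enumerate entries in a /generate/pdf package, e.g. to check that
    # dataset-level files (token_mapping.json, handwriting_tokens/...) exist.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return sorted(zf.namelist())

def extract_package(zip_bytes: bytes, dest: str) -> None:
    # Unpack the whole package into dest/.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
```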
### Supported Models
- `claude-sonnet-4-5-20250929` (default, recommended)
- `claude-3-5-sonnet-20241022`
### Environment Variables
- `ANTHROPIC_API_KEY`: Your Anthropic API key (required if not provided in request)
## API Documentation
Interactive API documentation is available when the server is running:
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## Error Handling
The API returns appropriate HTTP status codes:
- `200 OK`: Successful generation
- `400 Bad Request`: Invalid input (e.g., invalid image URLs)
- `401 Unauthorized`: Missing or invalid API key
- `500 Internal Server Error`: Processing error
Error response format:
```json
{
  "detail": "Error message describing what went wrong"
}
```
## Performance Considerations
- **Concurrent requests**: The API can handle multiple requests concurrently
- **Image size**: Larger seed images take longer to process
- **Number of solutions**: More solutions = longer processing time
- **Model selection**: Sonnet is slower but higher quality than Haiku
## Limitations
- Maximum 10 seed images per request
- Maximum 5 document variations (`num_solutions`)
- Single-page documents only
- Timeout: 60 seconds per PDF render
## Troubleshooting
### Playwright browser not found
```bash
playwright install chromium
```
### API key not working
Make sure your API key is set correctly:
```bash
echo $ANTHROPIC_API_KEY
```
### PDF rendering fails
Ensure Chromium and its system dependencies are installed:
```bash
playwright install --with-deps chromium
```
## Integration with Frontend
Example React integration:
```jsx
const [loading, setLoading] = useState(false);
const [result, setResult] = useState(null);

const generateDocuments = async () => {
  setLoading(true);
  try {
    const response = await fetch('http://localhost:8000/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        seed_images: seedImageUrls,
        prompt_params: {
          language: 'English',
          doc_type: documentType,
          num_solutions: 3
        }
      })
    });
    const data = await response.json();
    setResult(data);
  } catch (error) {
    console.error('Generation failed:', error);
  } finally {
    setLoading(false);
  }
};
```
### React Integration (Async API with Progress)
```jsx
import { useState, useEffect } from 'react';

function DocumentGenerator({ userId, seedImages }) {
  const [requestId, setRequestId] = useState(null);
  const [status, setStatus] = useState(null);
  const [progress, setProgress] = useState(0);
  const [downloadUrl, setDownloadUrl] = useState(null);

  // Submit job
  const handleGenerate = async () => {
    const response = await fetch('http://localhost:8000/generate/async', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        user_id: userId,
        seed_images: seedImages,
        prompt_params: {
          language: 'English',
          doc_type: 'receipts',
          num_solutions: 3,
          enable_handwriting: true,
          output_detail: 'dataset'
        }
      })
    });
    const job = await response.json();
    setRequestId(job.request_id);
    setStatus('queued');
  };

  // Poll job status
  useEffect(() => {
    if (!requestId || status === 'completed' || status === 'failed') return;
    const interval = setInterval(async () => {
      const response = await fetch(`http://localhost:8000/jobs/${requestId}/status`);
      const jobStatus = await response.json();
      setStatus(jobStatus.status);

      // Update progress bar
      const progressMap = {
        queued: 10,
        processing: 30,
        generating: 60,
        completed: 100,
        failed: 0
      };
      setProgress(progressMap[jobStatus.status] || 0);

      if (jobStatus.status === 'completed') {
        // Keep the Google Drive link for the download button
        setDownloadUrl(jobStatus.download_url);
      }
    }, 30000); // Poll every 30 seconds
    return () => clearInterval(interval);
  }, [requestId, status]);

  return (
    <div>
      <button onClick={handleGenerate} disabled={status && status !== 'completed'}>
        Generate Documents
      </button>
      {status && (
        <div className="progress-container">
          <div className="progress-bar" style={{ width: `${progress}%` }} />
          <p>Status: {status}</p>
          {status === 'completed' && downloadUrl && (
            <a href={downloadUrl} target="_blank" rel="noreferrer">
              Download Results
            </a>
          )}
        </div>
      )}
    </div>
  );
}
```
## Background Processing Setup
The async endpoints (`/generate/async`) require a background worker system for job processing.
### Prerequisites
1. **Redis** - Job queue storage
2. **Supabase** - Database for job tracking and user data
3. **Google Drive OAuth** - For uploading results to user's Drive
### Installing Redis
**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install redis-server
sudo systemctl start redis
sudo systemctl enable redis
```
**macOS:**
```bash
brew install redis
brew services start redis
```
**Docker:**
```bash
docker run -d -p 6379:6379 --name redis redis:7-alpine
```
**Verify Redis is running:**
```bash
redis-cli ping
# Should return: PONG
```
### Configuring Supabase
1. Create a Supabase project at [supabase.com](https://supabase.com)
2. Create the required tables in your Supabase SQL Editor:
```sql
-- Required for uuid_generate_v4()
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Document generation requests
CREATE TABLE document_requests (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  user_id INTEGER NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('queued', 'processing', 'generating', 'completed', 'failed')),
  request_metadata JSONB NOT NULL,
  error_message TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Generated documents
CREATE TABLE generated_documents (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  request_id UUID NOT NULL REFERENCES document_requests(id),
  document_id TEXT NOT NULL,
  file_url TEXT,
  zip_url TEXT,
  file_size_mb DECIMAL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- User integrations (Google Drive OAuth)
CREATE TABLE user_integrations (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  user_id INTEGER NOT NULL,
  integration_type TEXT NOT NULL CHECK (integration_type IN ('google_drive', 'dropbox')),
  access_token TEXT NOT NULL,
  refresh_token TEXT,
  token_expiry TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE(user_id, integration_type)
);

-- Analytics events
CREATE TABLE analytics_events (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  user_id INTEGER,
  event_type TEXT NOT NULL,
  entity_id UUID,
  event_data JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Indexes for performance
CREATE INDEX idx_document_requests_user_id ON document_requests(user_id);
CREATE INDEX idx_document_requests_status ON document_requests(status);
CREATE INDEX idx_generated_documents_request_id ON generated_documents(request_id);
CREATE INDEX idx_user_integrations_user_id ON user_integrations(user_id);
CREATE INDEX idx_analytics_events_user_id ON analytics_events(user_id);
```
3. Add your Supabase credentials to `.env`:
```bash
# In api/.env
SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_KEY=your-anon-or-service-role-key
```
### Configuring Google Drive OAuth
Users need to connect their Google Drive account for result storage:
1. Create a Google Cloud Project at [console.cloud.google.com](https://console.cloud.google.com)
2. Enable Google Drive API
3. Create OAuth 2.0 credentials (Web application)
4. Add authorized redirect URIs (e.g., `http://localhost:3000/auth/google/callback`)
5. Download credentials JSON
6. Users authenticate via OAuth flow (implement in your frontend):
```python
# Example OAuth flow (implement in your auth system)
from google_auth_oauthlib.flow import Flow
flow = Flow.from_client_config(
    client_config={
        "web": {
            "client_id": "YOUR_CLIENT_ID",
            "client_secret": "YOUR_CLIENT_SECRET",
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://oauth2.googleapis.com/token",
            "redirect_uris": ["http://localhost:3000/auth/google/callback"]
        }
    },
    scopes=["https://www.googleapis.com/auth/drive.file"]
)

# User visits auth URL, gets redirected back with code
authorization_url, state = flow.authorization_url(
    access_type='offline', include_granted_scopes='true'
)

# Exchange code for tokens
flow.fetch_token(code=authorization_code)
credentials = flow.credentials

# Store in Supabase user_integrations table
supabase.table('user_integrations').insert({
    'user_id': user_id,
    'integration_type': 'google_drive',
    'access_token': credentials.token,
    'refresh_token': credentials.refresh_token,
    # Serialize the expiry datetime so it is JSON-compatible
    'token_expiry': credentials.expiry.isoformat() if credentials.expiry else None
}).execute()
```
### Starting the Background Worker
1. Configure environment variables in `api/.env`:
```bash
# Redis Configuration
REDIS_URL=redis://localhost:6379/0
RQ_QUEUE_NAME=docgenie
# Batch Processing
BATCH_POLL_INTERVAL=30 # seconds
BATCH_DATA_DIR=/tmp/docgenie_batches
MESSAGE_DATA_DIR=/tmp/docgenie_messages
# Google Drive
GOOGLE_DRIVE_FOLDER_NAME=DocGenie Documents
# Supabase (already configured above)
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your_key_here
# Claude API
ANTHROPIC_API_KEY=your_api_key_here
```
2. Start the worker:
```bash
cd api/
./start_worker.sh
```
The worker will:
- βœ“ Check Redis connection
- βœ“ Validate Supabase configuration
- βœ“ Verify Claude API key
- βœ“ Create temporary directories
- βœ“ Start RQ worker listening on `docgenie` queue
**Output:**
```
πŸš€ Starting DocGenie RQ Worker...
βœ“ Loading .env file...
βœ“ Redis connected
βœ“ Supabase configured
βœ“ Claude API key configured
βœ“ Temporary directories created
============================================
Worker Configuration:
Queue: docgenie
Redis: redis://localhost:6379/0
Batch Data: /tmp/docgenie_batches
Message Data: /tmp/docgenie_messages
============================================
βœ… Starting RQ worker (press Ctrl+C to stop)...
12:00:00 RQ worker 'worker-abc123' started on docgenie queue
```
### Running Multiple Workers (Production)
For production systems with high load, run multiple workers:
```bash
# Terminal 1
./start_worker.sh
# Terminal 2
./start_worker.sh
# Terminal 3
./start_worker.sh
```
Each worker processes jobs independently from the same queue.
**For detailed scaling instructions**, see [SCALING.md](SCALING.md).
### Monitoring Workers
```bash
# View worker and queue status
rq info --url redis://localhost:6379/0
# View a specific queue
rq info docgenie --url redis://localhost:6379/0
# Requeue failed jobs (RQ keeps them in the failed job registry)
rq requeue --queue docgenie --all --url redis://localhost:6379/0
```
### Architecture Overview
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    FastAPI    │───────▢│     Redis     │◀───────│   RQ Workers    β”‚
β”‚    Server     β”‚        β”‚     Queue     β”‚        β”‚ (1-5 instances) β”‚
β”‚               β”‚        β”‚               β”‚        β”‚                 β”‚
β”‚  /generate/   β”‚        β”‚  Job Queue:   β”‚        β”‚ β€’ Downloads     β”‚
β”‚    async      β”‚        β”‚  - queued     β”‚        β”‚ β€’ Claude Batch  β”‚
β”‚               β”‚        β”‚  - pending    β”‚        β”‚ β€’ PDF render    β”‚
β”‚  /jobs/       β”‚        β”‚  - active     β”‚        β”‚ β€’ Handwriting   β”‚
β”‚  {id}/status  β”‚        β”‚               β”‚        β”‚ β€’ OCR           β”‚
β”‚               β”‚        β”‚               β”‚        β”‚ β€’ ZIP creation  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                                                  β”‚
        β”‚                                                  β”‚
        β–Ό                                                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Supabase                          β”‚
β”‚  β€’ document_requests (job tracking)                       β”‚
β”‚  β€’ generated_documents (results metadata)                 β”‚
β”‚  β€’ user_integrations (Google Drive OAuth)                 β”‚
β”‚  β€’ analytics_events (usage tracking)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β”‚ Upload Results
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Google Drive                        β”‚
β”‚  β€’ User's "DocGenie Documents" folder                     β”‚
β”‚  β€’ ZIP files with generated documents                     β”‚
β”‚  β€’ Shareable links returned to API                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Cost Comparison: Direct vs Batched API
| API Type | Cost (Input) | Cost (Output) | Latency | Use Case |
|----------|-------------|---------------|---------|----------|
| Direct | $5.00/1M tokens | $15.00/1M tokens | 30-120s | Real-time, interactive |
| **Batched** | **$2.50/1M tokens** | **$7.50/1M tokens** | 5-30 min | **Background jobs (recommended)** |
**Example Cost Calculation:**
- Generate 100 documents per day
- Each request: 5,000 input tokens, 10,000 output tokens
**Direct API Cost:**
- Input: (100 Γ— 5,000 / 1M) Γ— $5.00 = $2.50/day
- Output: (100 Γ— 10,000 / 1M) Γ— $15.00 = $15.00/day
- **Total: $17.50/day = $525/month**
**Batched API Cost:**
- Input: (100 Γ— 5,000 / 1M) Γ— $2.50 = $1.25/day
- Output: (100 Γ— 10,000 / 1M) Γ— $7.50 = $7.50/day
- **Total: $8.75/day = $262.50/month**
**πŸ’° Savings: $262.50/month (50% reduction)**
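The arithmetic above generalizes to a one-line helper (rates in USD per 1M tokens); the numbers below reproduce the worked example:

```python
def daily_cost_usd(requests_per_day, input_tokens, output_tokens,
                   input_rate, output_rate):
    # Rate arguments are USD per 1M tokens (e.g. 5.00 direct vs 2.50 batched input).
    return requests_per_day * (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

direct = daily_cost_usd(100, 5_000, 10_000, 5.00, 15.00)   # 17.50 USD/day
batched = daily_cost_usd(100, 5_000, 10_000, 2.50, 7.50)   # 8.75 USD/day
```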
## Scaling Workers
The API uses Redis Queue (RQ) workers for background job processing. Scale workers based on load:
| User Load | Workers | Redis RAM | Notes |
|-----------|---------|-----------|-------|
| < 10 req/hr | 1 | 256 MB | Development |
| 10–50 req/hr | 2–3 | 512 MB | Small production |
| 50–200 req/hr | 3–5 | 1 GB | Medium production |
| > 200 req/hr | 5+ | 2+ GB | Large production |
### Starting Workers
```bash
# Single worker (development)
./start_worker.sh
# Multiple workers (production) β€” run in separate terminals
./start_worker.sh # Terminal 1
./start_worker.sh # Terminal 2
# Docker Compose β€” scale to 3 workers
docker-compose up --scale worker=3
# Monitor
rq info --url redis://localhost:6379/0
rq info docgenie --url redis://localhost:6379/0
```
### Railway Multi-Worker (Separate Service)
1. Railway dashboard β†’ New Service β†’ GitHub Repo (same repo)
2. Name: `docgenie-worker`
3. Custom Start Command: `rq worker --url $REDIS_URL`
4. Add the same environment variables as the API service
> For most use cases the **combined** mode (API + worker in one service, see `railway.json`) is sufficient and cheaper.
## Contributing
This API is a simplified interface to the DocGenie pipeline. For the full pipeline with all features, see the main DocGenie documentation.
## License
Same as DocGenie main project.