# DocGenie API

FastAPI-based REST API for generating synthetic documents using LLMs. This API is **optimized for ML dataset creation** with comprehensive handwriting and visual element support.

## Features

- 🚀 **Simple REST API** - Easy to integrate with any frontend
- 🖼️ **URL-based seed images** - Provide seed images via URLs
- 🎨 **Customizable prompts** - Control document type, language, and ground truth format
- ✍️ **Handwriting Generation** - WordStylist diffusion model with 339 author styles
- 🎯 **Visual Elements** - Stamps, logos, barcodes, photos, figures
- 📊 **ML-Ready Datasets** - Individual token images with complete metadata
- 📄 **Complete output** - Returns PDF, HTML, CSS, and bounding boxes
- ⚡ **Async processing** - Fast and efficient document generation

## ML Dataset Creation

The API is **fully equipped for ML training dataset creation** with `output_detail: "dataset"` mode:

### ✅ Handwriting Data
- **Individual token images**: Each handwriting field saved as separate PNG (`hw0.png`, `hw1.png`, ...)
- **Author style IDs**: 339 unique writer styles (0-338) for style-consistent generation
- **Text content**: Original text for each handwriting field
- **Position data**: Precise bounding boxes (x, y, width, height) in mm
- **Signature detection**: Boolean flag for signature vs regular handwriting
- **Image dimensions**: Width and height for each generated token

### ✅ Visual Element Data
- **Stamps**: Generated with realistic textures, borders, and rotations
  - Text content preserved
  - Red/green color variants
  - Circle/rectangle shapes
- **Logos**: Random selection from 6+ logo prefabs
- **Barcodes**: Code128 format with customizable content
- **Photos**: Random selection from 5+ photo prefabs
- **Figures/Charts**: Random selection from 6+ chart/diagram prefabs
- **Individual images**: Each element saved as separate PNG with transparency

### ✅ Dataset Metadata
- **Token mapping JSON**: Complete mapping with:
  - Token IDs and references
  - Style IDs for handwriting
  - Element types for visual elements
  - Position rectangles
  - Image filenames
  - Content text
- **Ground truth annotations**: QA pairs, classification labels, NER tags
- **Bounding boxes**: Word, segment, and layout-level bboxes
- **Normalized coordinates**: [0,1] scaled for ML frameworks
- **Msgpack export**: Compatible with datadings library

### ✅ Additional ML Features
- **OCR results**: Word-level bboxes and text for Document AI training
- **Layout elements**: Document structure annotations
- **Page dimensions**: Physical measurements (mm) and pixel dimensions
- **Reproducibility**: Seed-based generation for consistent results

## Pipeline Overview

The API implements a simplified version of the DocGenie generation pipeline:

1. **Download seed images** from URLs
2. **Convert to base64** for LLM input
3. **Build custom prompt** with user parameters
4. **Call Claude API** to generate HTML documents
5. **Extract HTML/CSS** and ground truth from response
6. **Render to PDF** using Playwright
7. **Extract bounding boxes** from PDF
8. **Return results** as JSON with base64-encoded PDF
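
Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the DocGenie implementation: the helper names `guess_media_type` and `to_image_block` are hypothetical, and the returned dictionary follows the base64 image content block of the Anthropic Messages API. Downloading is left to the caller (for example `requests.get(url).content`), so the sketch covers only the encoding step.

```python
import base64

# Magic-byte prefixes for common seed image formats.
_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF8": "image/gif",
}


def guess_media_type(data: bytes) -> str:
    """Infer the MIME type from the first bytes of the image."""
    for prefix, media_type in _SIGNATURES.items():
        if data.startswith(prefix):
            return media_type
    # WebP files start with "RIFF" and carry "WEBP" at offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unsupported or unrecognized image format")


def to_image_block(data: bytes) -> dict:
    """Wrap raw image bytes as a base64 image content block for an LLM request."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": guess_media_type(data),
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

Detecting the format from the bytes rather than the URL suffix avoids mislabelled images when a URL has no extension or the wrong one.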

## Installation

### Prerequisites

- Python 3.10+
- DocGenie main package installed
- Playwright browsers installed

### Setup

1. Install dependencies (all API dependencies are included in the main project):
```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .

# Or install API-specific dependencies
cd api/
pip install -r requirements.txt
```

**Note**: For async endpoint support, ensure you have:
- `redis>=5.0.0` and `rq>=1.15.0` (job queue)
- `supabase>=2.0.0` (database)
- `google-api-python-client>=2.100.0` (Google Drive integration)

2. Install Playwright browsers:
```bash
playwright install chromium
```

3. Install Tesseract OCR (for local OCR support):
```bash
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
```

4. Set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY="your-api-key-here"
```

5. Configure OCR in `.env`:
```bash
cp .env.example .env
# Edit .env and set:
OCR_SERVICE_ENABLED=true
OCR_USE_LOCAL=true  # Use local Tesseract (recommended)
```

## Running the API

### Development Mode

```bash
cd api
python main.py
```

The API will be available at `http://localhost:8000`

### Production Mode

```bash
cd api
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

## API Endpoints

### Health Check

```http
GET /health
```

**Response:**
```json
{
  "status": "healthy",
  "version": "1.0.0"
}
```

### Generate Documents

```http
POST /generate
```

**Request Body:**
```json
{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "gt_type": "Multiple questions about each document, with their answers taken **verbatim** from the document.",
    "gt_format": "{\"<Text of question 1>\": \"<Answer to question 1>\", \"<Text of question 2>\": \"<Answer to question 2>\", ...}",
    "num_solutions": 3
  },
  "model": "claude-sonnet-4-5-20250929",
  "api_key": "optional-api-key"
}
```

**Response:**
```json
{
  "success": true,
  "message": "Successfully generated 3 documents",
  "total_documents": 3,
  "documents": [
    {
      "document_id": "uuid-123_0",
      "html": "<!DOCTYPE html>...",
      "css": "body { ... }",
      "ground_truth": {
        "What is the invoice number?": "INV-12345",
        "What is the total amount?": "$1,234.56"
      },
      "pdf_base64": "JVBERi0xLjQK...",
      "bboxes": [
        {
          "text": "Invoice",
          "x": 0.1,
          "y": 0.05,
          "width": 0.2,
          "height": 0.03,
          "page": 0
        }
      ],
      "page_width_mm": 210.0,
      "page_height_mm": 297.0
    }
  ]
}
```
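
If the `bboxes` values are normalized to the page, as the sample values above suggest (this is an assumption; the response does not state the unit), they can be converted to millimetres using the page dimensions returned alongside them. The function name `bbox_to_mm` is illustrative and not part of the API.

```python
def bbox_to_mm(bbox: dict, page_width_mm: float, page_height_mm: float) -> dict:
    """Scale a normalized bbox (values in [0, 1]) to millimetres on the page."""
    return {
        "text": bbox.get("text", ""),
        "page": bbox.get("page", 0),
        # Horizontal values scale with the page width, vertical ones with the height.
        "x_mm": bbox["x"] * page_width_mm,
        "y_mm": bbox["y"] * page_height_mm,
        "width_mm": bbox["width"] * page_width_mm,
        "height_mm": bbox["height"] * page_height_mm,
    }


# Usage with the sample values from the response above (A4 page).
sample = {"text": "Invoice", "x": 0.1, "y": 0.05, "width": 0.2, "height": 0.03, "page": 0}
converted = bbox_to_mm(sample, page_width_mm=210.0, page_height_mm=297.0)
print(converted)
```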

### Generate Documents (Async) - **Recommended for Production**

```http
POST /generate/async
```

**🎯 Cost Optimization**: This endpoint uses Claude's **Batch API** for **50% cost savings** ($2.50 vs $5.00 per 1M input tokens).

**⏱️ Latency**: 5-30 minutes (vs 30-120 seconds for direct API)

**✅ Best For**: Multi-user production systems with non-realtime requirements

**Request Body:**
```json
{
  "user_id": 123,
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "num_solutions": 3,
    "enable_handwriting": true,
    "enable_visual_elements": true,
    "enable_ocr": true,
    "output_detail": "dataset"
  }
}
```

**Response:**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "estimated_time_minutes": 10,
  "poll_url": "/jobs/550e8400-e29b-41d4-a716-446655440000/status",
  "created_at": "2025-01-15T12:00:00Z"
}
```

**Workflow:**
1. Submit generation request → Get `request_id`
2. Poll status endpoint every 30-60 seconds
3. The worker uploads the results to the user's Google Drive and records a shareable link
4. When `status` is `"completed"`, download the results from the `download_url` in the status response

### Check Job Status

```http
GET /jobs/{request_id}/status
```

**Response (Queued):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:00:00Z"
}
```

**Response (Processing):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:05:00Z",
  "progress": "Creating batch request..."
}
```

**Response (Completed):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:15:00Z",
  "download_url": "https://drive.google.com/file/d/abc123xyz/view?usp=sharing",
  "file_size_mb": 15.4,
  "document_count": 3
}
```

**Response (Failed):**
```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "failed",
  "created_at": "2025-01-15T12:00:00Z",
  "updated_at": "2025-01-15T12:08:00Z",
  "error_message": "Batch processing timeout"
}
```

**Status Values:**
- `queued`: Job submitted, waiting for worker
- `processing`: Worker picked up job, creating batch
- `generating`: Batch submitted to Claude, waiting for completion
- `completed`: Documents generated and uploaded to Google Drive
- `failed`: Error occurred (see `error_message`)
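
A client can treat `completed` and `failed` as terminal and keep polling for every other status. The sketch below is one way to do that with an overall timeout; `poll_until_done` and its parameters are hypothetical names, and `fetch_status` stands for any callable that returns the parsed JSON of `GET /jobs/{request_id}/status`.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATUSES:
            return status
        # Give up if the next poll would happen after the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"job still '{status.get('status')}' after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the function easy to test without waiting in real time.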

### List User Jobs

```http
GET /jobs/user/{user_id}?limit=50&offset=0
```

**Response:**
```json
{
  "user_id": 123,
  "jobs": [
    {
      "request_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "created_at": "2025-01-15T12:00:00Z",
      "download_url": "https://drive.google.com/...",
      "document_count": 3
    },
    {
      "request_id": "660e8400-e29b-41d4-a716-446655440111",
      "status": "processing",
      "created_at": "2025-01-15T12:30:00Z"
    }
  ],
  "count": 2,
  "limit": 50,
  "offset": 0
}
```
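
To walk through all of a user's jobs, a client can advance `offset` by `limit` until a page comes back short. The sketch below assumes the server returns fewer than `limit` jobs on the last page; `iter_user_jobs` and `fetch_page` are hypothetical names, with `fetch_page(limit, offset)` standing for a call to `GET /jobs/user/{user_id}` that returns the parsed JSON.

```python
from typing import Callable, Iterator


def iter_user_jobs(
    fetch_page: Callable[[int, int], dict],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every job by advancing `offset` until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        jobs = page.get("jobs", [])
        yield from jobs
        if len(jobs) < limit:
            return  # last page reached
        offset += limit
```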

## Usage Examples

### cURL

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": [
      "https://example.com/receipt1.jpg",
      "https://example.com/receipt2.jpg"
    ],
    "prompt_params": {
      "language": "English",
      "doc_type": "receipts",
      "num_solutions": 2
    }
  }'
```

### Python (Direct API)

```python
import requests
import base64

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "seed_images": [
            "https://example.com/seed1.jpg",
            "https://example.com/seed2.jpg"
        ],
        "prompt_params": {
            "language": "English",
            "doc_type": "business forms",
            "num_solutions": 3
        }
    }
)

response.raise_for_status()  # fail early on 4xx/5xx instead of a KeyError below
result = response.json()

# Save first PDF
if result["success"]:
    pdf_data = base64.b64decode(result["documents"][0]["pdf_base64"])
    with open("generated_doc.pdf", "wb") as f:
        f.write(pdf_data)
```

### Python (Async API with Polling) - **Recommended**

```python
import requests
import time

# Step 1: Submit job
response = requests.post(
    "http://localhost:8000/generate/async",
    json={
        "user_id": 123,
        "seed_images": [
            "https://example.com/seed1.jpg",
            "https://example.com/seed2.jpg"
        ],
        "prompt_params": {
            "language": "English",
            "doc_type": "receipts and invoices",
            "num_solutions": 5,
            "enable_handwriting": True,
            "enable_visual_elements": True,
            "enable_ocr": True,
            "output_detail": "dataset"
        }
    }
)

job = response.json()
request_id = job["request_id"]
print(f"✓ Job submitted: {request_id}")
print(f"  Estimated time: {job['estimated_time_minutes']} minutes")

# Step 2: Poll status until complete
while True:
    status_response = requests.get(
        f"http://localhost:8000/jobs/{request_id}/status"
    )
    status = status_response.json()
    
    print(f"  Status: {status['status']}", end="")
    if status.get("progress"):
        print(f" - {status['progress']}")
    else:
        print()
    
    if status["status"] == "completed":
        print(f"✓ Generation complete!")
        print(f"  Download: {status['download_url']}")
        print(f"  Size: {status.get('file_size_mb', 0):.1f} MB")
        print(f"  Documents: {status.get('document_count', 0)}")
        break
    elif status["status"] == "failed":
        print(f"✗ Generation failed: {status.get('error_message')}")
        break
    
    # Wait 30 seconds before next poll
    time.sleep(30)

# Step 3: Download from Google Drive (if completed)
if status["status"] == "completed":
    # User can download from their Google Drive using the shareable link
    print(f"\nDownload your documents at:\n{status['download_url']}")
```

### JavaScript

```javascript
const response = await fetch('http://localhost:8000/generate', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    seed_images: [
      'https://example.com/seed1.jpg',
      'https://example.com/seed2.jpg'
    ],
    prompt_params: {
      language: 'English',
      doc_type: 'invoices',
      num_solutions: 2
    }
  })
});

const result = await response.json();

// Convert base64 PDF to blob
const pdfBlob = await fetch(`data:application/pdf;base64,${result.documents[0].pdf_base64}`)
  .then(res => res.blob());
```

## Configuration

### Prompt Parameters

- **language**: Language for generated documents (default: "English")
- **doc_type**: Type of documents to generate (e.g., "business and administrative", "receipts", "forms")
- **gt_type**: Description of ground truth type to generate
- **gt_format**: Format specification for ground truth JSON
- **num_solutions**: Number of document variations (1-5)

### Stage 3-5 Advanced Features

The API supports advanced document synthesis and dataset packaging:

#### Stage 3: Handwriting & Visual Elements
- **enable_handwriting**: Add handwritten text using diffusion model (default: false)
- **handwriting_ratio**: Fraction of text to convert to handwriting, from 0 to 1 (default: 0.5)
- **enable_visual_elements**: Add visual elements such as stamps, logos, figures, barcodes, and photos (default: false)
- **visual_element_types**: Types of elements to add: ["stamp", "logo", "figure", "barcode", "photo"] (default: all types)

#### Stage 4: OCR
- **enable_ocr**: Perform OCR on generated document (default: false)
- **ocr_language**: OCR language code (default: "en")

#### Stage 5: Dataset Packaging
- **enable_bbox_normalization**: Normalize bboxes to [0,1] scale (default: false)
- **enable_gt_verification**: Verify ground truth quality (default: false)
- **enable_analysis**: Generate dataset statistics (default: false)
- **enable_debug_visualization**: Create bbox overlay images (default: false)

#### Dataset Export (Msgpack Format)
- **enable_dataset_export**: Export as msgpack dataset format (default: false)
- **dataset_export_format**: Export format - only "msgpack" is supported (default: "msgpack")

**Note**: Only msgpack format is implemented in the current pipeline. COCO and HuggingFace export formats mentioned in some documentation are not yet available.

#### Output Detail Level
- **output_detail**: Controls how much data is returned/saved (default: "minimal")
  - `"minimal"` (default): Final outputs only (PDFs, images, metadata) - 2-5 MB per document
  - `"dataset"`: Includes individual token images for ML training - 10-20 MB per document
    - Individual handwriting token images (`handwriting_tokens/hw0.png`, ...)
    - Individual visual element images (`visual_elements/logo_0.png`, ...)
    - Token mapping JSON with style IDs and positions
  - `"complete"`: All intermediate files and debug info - 20-50 MB per document
    - Everything from `dataset` mode
    - Intermediate PDFs from each processing stage
    - Generation logs
    - ⚠️ **Warning**: Can result in 50+ MB JSON responses for `/generate` endpoint

**Recommendation**: Use `"minimal"` for production, `"dataset"` for ML research, `"complete"` for debugging (only with `/generate/pdf`).
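
The per-document ranges listed above give a quick way to estimate payload size before choosing a level. The sketch below only multiplies those ranges by the document count; `estimate_output_mb` is a hypothetical helper, and actual sizes depend on document content.

```python
# Per-document size ranges in MB, copied from the output_detail list above.
SIZE_RANGES_MB = {
    "minimal": (2, 5),
    "dataset": (10, 20),
    "complete": (20, 50),
}


def estimate_output_mb(output_detail: str, num_documents: int) -> tuple[int, int]:
    """Return a (low, high) size estimate in MB for a request."""
    low, high = SIZE_RANGES_MB[output_detail]
    return low * num_documents, high * num_documents


low, high = estimate_output_mb("dataset", num_documents=5)
print(f"Expect roughly {low}-{high} MB")
```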

**Example with dataset output detail:**
```python
import requests
import base64
import json
import os

# Generate ML training dataset
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "seed_images": ["https://example.com/seed.jpg"],
        "prompt_params": {
            "language": "English",
            "doc_type": "receipts and invoices",
            "num_solutions": 5,

            # Enable handwriting and visual elements
            "enable_handwriting": True,
            "handwriting_ratio": 0.4,
            "enable_visual_elements": True,
            "visual_element_types": ["stamp", "logo", "figure", "barcode", "photo"],  # All types by default

            # Enable dataset features
            "enable_ocr": True,
            "enable_bbox_normalization": True,
            "enable_dataset_export": True,

            # IMPORTANT: Set output_detail to "dataset" for ML training
            "output_detail": "dataset",

            # Use seed for reproducibility
            "seed": 42
        }
    }
)

result = response.json()

# Process each generated document
for doc in result["documents"]:
    doc_id = doc["document_id"]
    # Create the per-document output directory before writing files into it
    os.makedirs(f"dataset/{doc_id}", exist_ok=True)
    print(f"\nProcessing {doc_id}:")
    
    # 1. Save individual handwriting token images
    if doc.get("handwriting_token_images"):
        print(f"  - Handwriting tokens: {len(doc['handwriting_token_images'])}")
        for hw_id, img_b64 in doc["handwriting_token_images"].items():
            with open(f"dataset/{doc_id}/{hw_id}.png", "wb") as f:
                f.write(base64.b64decode(img_b64))
    
    # 2. Save individual visual element images
    if doc.get("visual_element_images"):
        print(f"  - Visual elements: {len(doc['visual_element_images'])}")
        for ve_id, img_b64 in doc["visual_element_images"].items():
            with open(f"dataset/{doc_id}/{ve_id}.png", "wb") as f:
                f.write(base64.b64decode(img_b64))
    
    # 3. Save token mapping for ML training
    if doc.get("token_mapping"):
        mapping = doc["token_mapping"]
        print(f"  - Mapping: {mapping['handwriting']['total_count']} HW + {mapping['visual_elements']['total_count']} VE")
        with open(f"dataset/{doc_id}/token_mapping.json", "w") as f:
            json.dump(mapping, f, indent=2)
    
    # 4. Save ground truth annotations
    if doc.get("ground_truth"):
        with open(f"dataset/{doc_id}/ground_truth.json", "w") as f:
            json.dump(doc["ground_truth"], f, indent=2)
    
    # 5. Save bounding boxes (normalized coordinates)
    if doc.get("normalized_bboxes_word"):
        with open(f"dataset/{doc_id}/bboxes_normalized.json", "w") as f:
            json.dump(doc["normalized_bboxes_word"], f, indent=2)
    
    # 6. Save final document image
    if doc.get("image_base64"):
        with open(f"dataset/{doc_id}/final_image.png", "wb") as f:
            f.write(base64.b64decode(doc["image_base64"]))
    
    # 7. Save msgpack dataset file
    if doc.get("dataset_export") and doc["dataset_export"].get("msgpack_base64"):
        with open(f"dataset/{doc_id}/dataset.msgpack", "wb") as f:
            f.write(base64.b64decode(doc["dataset_export"]["msgpack_base64"]))

print(f"\n✅ Generated {len(result['documents'])} ML-ready documents")
```

### PDF Generation Endpoint (Recommended for Large Datasets)

For bulk generation with comprehensive file outputs, use `/generate/pdf`:

```bash
curl -X POST http://localhost:8000/generate/pdf \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": ["https://example.com/seed1.jpg"],
    "prompt_params": {
      "num_solutions": 3,
      "enable_handwriting": true,
      "enable_ocr": true,
      "enable_bbox_normalization": true,
      "enable_dataset_export": true,
      "output_detail": "dataset"
    }
  }' \
  --output documents.zip
```

#### ZIP File Contents

Based on `output_detail` level:

**Minimal (default):**
- `document_<id>.pdf` - Generated PDF files
- `document_<id>/` - Per-document directories with:
  - `document.html`, `document.css` - Source files
  - `ground_truth.json`, `bboxes.json` - Annotations
  - `final_image.png` - Final rendered image (if Stage 3 enabled)
  - `handwriting_regions.json`, `visual_elements.json` - Stage 3 metadata (if enabled)
  - `ocr_results.json` - OCR word-level data (if OCR enabled)
- `README.md` - Package documentation
- `metadata.json` - Combined metadata

**Dataset (for ML training):**
- All files from "minimal" level, plus:
  - `handwriting_tokens/` - Individual token images (`hw0.png`, `hw1.png`, ...)
  - `visual_elements/` - Individual element images (`logo_0.png`, `stamp_1.png`, ...)
  - `token_mapping.json` - Complete mapping with style IDs and positions
  - `dataset.msgpack` - Msgpack dataset file (if export enabled)
  - `normalized_bboxes_word.json` - Normalized coordinates (if Stage 5 enabled)

**Complete (for debugging):**
- All files from "dataset" level, plus:
  - Intermediate PDFs from each processing stage
  - Generation logs with timing information
  - `debug_visualization.png` - Bbox overlay images
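
Once the ZIP has been downloaded, its members can be grouped by top-level directory to see what each document contains. This is a sketch using only the standard library; `group_zip_entries` is a hypothetical helper, and the exact member names depend on the `output_detail` level described above.

```python
import io
import zipfile
from collections import defaultdict


def group_zip_entries(zip_bytes: bytes) -> dict[str, list[str]]:
    """Group archive members by top-level directory ('' holds root-level files)."""
    groups: dict[str, list[str]] = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, rest = name.partition("/")
            if sep:
                groups[top].append(rest)
            else:
                groups[""].append(name)
    return dict(groups)
```

For a file saved by the cURL command above, the bytes can be read with `open("documents.zip", "rb").read()` and passed to the function.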

### Supported Models

- `claude-sonnet-4-5-20250929` (default, recommended)
- `claude-3-5-sonnet-20241022`

### Environment Variables

- `ANTHROPIC_API_KEY`: Your Anthropic API key (required if not provided in request)

## API Documentation

Interactive API documentation is available when the server is running:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

## Error Handling

The API returns appropriate HTTP status codes:

- `200 OK`: Successful generation
- `400 Bad Request`: Invalid input (e.g., invalid image URLs)
- `401 Unauthorized`: Missing or invalid API key
- `500 Internal Server Error`: Processing error

Error response format:
```json
{
  "detail": "Error message describing what went wrong"
}
```
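
A client can turn these responses into a short message for logs or a UI. The sketch below is illustrative: `describe_error` and the hint texts are assumptions, not part of the API. FastAPI's default request-validation errors use status 422 with a list-valued `detail`, so the function formats whatever value it finds rather than assuming a string.

```python
def describe_error(status_code: int, body: dict) -> str:
    """Build a one-line description from an error response."""
    hints = {
        400: "check the request body and seed image URLs",
        401: "check the Anthropic API key",
        500: "processing failed on the server; retrying may help",
    }
    detail = body.get("detail", "no detail provided")
    hint = hints.get(status_code, "unexpected status code")
    return f"HTTP {status_code}: {detail} ({hint})"


print(describe_error(401, {"detail": "Invalid API key"}))
```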

## Performance Considerations

- **Concurrent requests**: The API can handle multiple requests concurrently
- **Image size**: Larger seed images take longer to process
- **Number of solutions**: More solutions = longer processing time
- **Model selection**: Sonnet is slower but higher quality than Haiku

## Limitations

- Maximum 10 seed images per request
- Maximum 5 document variations (`num_solutions`)
- Single-page documents only
- Timeout: 60 seconds per PDF render
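
The request limits above can be checked on the client before sending anything. This is a minimal sketch: `validate_request` is a hypothetical helper, the http(s) check reflects that seed images are passed as URLs, and the fallback of 1 for a missing `num_solutions` is an assumption because the server default is not documented here.

```python
from urllib.parse import urlparse

MAX_SEED_IMAGES = 10
MAX_NUM_SOLUTIONS = 5


def validate_request(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is within the limits."""
    problems: list[str] = []
    seeds = payload.get("seed_images", [])
    if len(seeds) > MAX_SEED_IMAGES:
        problems.append(f"too many seed images: {len(seeds)} > {MAX_SEED_IMAGES}")
    for url in seeds:
        if urlparse(url).scheme not in ("http", "https"):
            problems.append(f"seed image is not an http(s) URL: {url}")
    # Assumed fallback when num_solutions is omitted.
    num = payload.get("prompt_params", {}).get("num_solutions", 1)
    if not 1 <= num <= MAX_NUM_SOLUTIONS:
        problems.append(f"num_solutions must be between 1 and {MAX_NUM_SOLUTIONS}, got {num}")
    return problems
```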

## Troubleshooting

### Playwright browser not found

```bash
playwright install chromium
```

### API key not working

Make sure your API key is set correctly:
```bash
echo $ANTHROPIC_API_KEY
```

### PDF rendering fails

Reinstall Chromium together with its system dependencies:
```bash
playwright install --with-deps chromium
```

## Integration with Frontend

Example React integration:

```jsx
const [loading, setLoading] = useState(false);
const [result, setResult] = useState(null);

const generateDocuments = async () => {
  setLoading(true);
  
  try {
    const response = await fetch('http://localhost:8000/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        seed_images: seedImageUrls,
        prompt_params: {
          language: 'English',
          doc_type: documentType,
          num_solutions: 3
        }
      })
    });
    
    const data = await response.json();
    setResult(data);
  } catch (error) {
    console.error('Generation failed:', error);
  } finally {
    setLoading(false);
  }
};
```

### React Integration (Async API with Progress)

```jsx
import { useState, useEffect } from 'react';

function DocumentGenerator({ userId, seedImages }) {
  const [requestId, setRequestId] = useState(null);
  const [status, setStatus] = useState(null);
  const [progress, setProgress] = useState(0);
  const [downloadUrl, setDownloadUrl] = useState(null);

  // Submit job
  const handleGenerate = async () => {
    const response = await fetch('http://localhost:8000/generate/async', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        user_id: userId,
        seed_images: seedImages,
        prompt_params: {
          language: 'English',
          doc_type: 'receipts',
          num_solutions: 3,
          enable_handwriting: true,
          output_detail: 'dataset'
        }
      })
    });

    const job = await response.json();
    setDownloadUrl(null); // clear the link from any previous job
    setRequestId(job.request_id);
    setStatus('queued');
  };

  // Poll job status
  useEffect(() => {
    if (!requestId || status === 'completed' || status === 'failed') return;

    const interval = setInterval(async () => {
      const response = await fetch(`http://localhost:8000/jobs/${requestId}/status`);
      const jobStatus = await response.json();

      setStatus(jobStatus.status);

      // Update progress bar
      const progressMap = {
        'queued': 10,
        'processing': 30,
        'generating': 60,
        'completed': 100,
        'failed': 0
      };
      setProgress(progressMap[jobStatus.status] || 0);

      if (jobStatus.status === 'completed') {
        // Keep the Google Drive link for the "Download Results" anchor, then open it
        setDownloadUrl(jobStatus.download_url);
        window.open(jobStatus.download_url, '_blank');
      }
    }, 30000); // Poll every 30 seconds

    return () => clearInterval(interval);
  }, [requestId, status]);

  // Allow a new submission once the previous job has finished or failed
  const isRunning = Boolean(status) && status !== 'completed' && status !== 'failed';

  return (
    <div>
      <button onClick={handleGenerate} disabled={isRunning}>
        Generate Documents
      </button>

      {status && (
        <div className="progress-container">
          <div className="progress-bar" style={{ width: `${progress}%` }} />
          <p>Status: {status}</p>
          {status === 'completed' && downloadUrl && (
            <a href={downloadUrl}>
              Download Results
            </a>
          )}
        </div>
      )}
    </div>
  );
}
```

## Background Processing Setup

The async endpoints (`/generate/async`) require a background worker system for job processing.

### Prerequisites

1. **Redis** - Job queue storage
2. **Supabase** - Database for job tracking and user data
3. **Google Drive OAuth** - For uploading results to user's Drive

### Installing Redis

**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install redis-server
sudo systemctl start redis
sudo systemctl enable redis
```

**macOS:**
```bash
brew install redis
brew services start redis
```

**Docker:**
```bash
docker run -d -p 6379:6379 --name redis redis:7-alpine
```

**Verify Redis is running:**
```bash
redis-cli ping
# Should return: PONG
```

### Configuring Supabase

1. Create a Supabase project at [supabase.com](https://supabase.com)

2. Create the required tables in your Supabase SQL Editor:

```sql
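-- uuid_generate_v4() comes from the uuid-ossp extension; enable it if it is not already active
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Document generation requests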
CREATE TABLE document_requests (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  user_id INTEGER NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('queued', 'processing', 'generating', 'completed', 'failed')),
  request_metadata JSONB NOT NULL,
  error_message TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Generated documents
CREATE TABLE generated_documents (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  request_id UUID NOT NULL REFERENCES document_requests(id),
  document_id TEXT NOT NULL,
  file_url TEXT,
  zip_url TEXT,
  file_size_mb DECIMAL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- User integrations (Google Drive OAuth)
CREATE TABLE user_integrations (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  user_id INTEGER NOT NULL,
  integration_type TEXT NOT NULL CHECK (integration_type IN ('google_drive', 'dropbox')),
  access_token TEXT NOT NULL,
  refresh_token TEXT,
  token_expiry TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE(user_id, integration_type)
);

-- Analytics events
CREATE TABLE analytics_events (
  id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
  user_id INTEGER,
  event_type TEXT NOT NULL,
  entity_id UUID,
  event_data JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_document_requests_user_id ON document_requests(user_id);
CREATE INDEX idx_document_requests_status ON document_requests(status);
CREATE INDEX idx_generated_documents_request_id ON generated_documents(request_id);
CREATE INDEX idx_user_integrations_user_id ON user_integrations(user_id);
CREATE INDEX idx_analytics_events_user_id ON analytics_events(user_id);
```

3. Add your Supabase credentials to `.env`:

```bash
# In api/.env
SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_KEY=your-anon-or-service-role-key
```

### Configuring Google Drive OAuth

Users need to connect their Google Drive account for result storage:

1. Create a Google Cloud Project at [console.cloud.google.com](https://console.cloud.google.com)
2. Enable Google Drive API
3. Create OAuth 2.0 credentials (Web application)
4. Add authorized redirect URIs (e.g., `http://localhost:3000/auth/google/callback`)
5. Download credentials JSON

6. Users authenticate via OAuth flow (implement in your frontend):

```python
# Example OAuth flow (implement in your auth system).
# `supabase`, `user_id`, and `authorization_code` are placeholders supplied by your application.
from google_auth_oauthlib.flow import Flow

REDIRECT_URI = "http://localhost:3000/auth/google/callback"

flow = Flow.from_client_config(
    client_config={
        "web": {
            "client_id": "YOUR_CLIENT_ID",
            "client_secret": "YOUR_CLIENT_SECRET",
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://oauth2.googleapis.com/token",
            "redirect_uris": [REDIRECT_URI]
        }
    },
    scopes=["https://www.googleapis.com/auth/drive.file"],
    redirect_uri=REDIRECT_URI,  # must match one of the authorized redirect URIs
)

# User visits auth URL, gets redirected back with code
authorization_url, state = flow.authorization_url(access_type='offline', include_granted_scopes='true')

# Exchange the `code` query parameter from the callback for tokens
flow.fetch_token(code=authorization_code)
credentials = flow.credentials

# Store in Supabase user_integrations table
supabase.table('user_integrations').insert({
    'user_id': user_id,
    'integration_type': 'google_drive',
    'access_token': credentials.token,
    'refresh_token': credentials.refresh_token,
    # datetime objects are not JSON-serializable, so send an ISO 8601 string
    'token_expiry': credentials.expiry.isoformat() if credentials.expiry else None
}).execute()
```

### Starting the Background Worker

1. Configure environment variables in `api/.env`:

```bash
# Redis Configuration
REDIS_URL=redis://localhost:6379/0
RQ_QUEUE_NAME=docgenie

# Batch Processing
BATCH_POLL_INTERVAL=30  # seconds
BATCH_DATA_DIR=/tmp/docgenie_batches
MESSAGE_DATA_DIR=/tmp/docgenie_messages

# Google Drive
GOOGLE_DRIVE_FOLDER_NAME="DocGenie Documents"

# Supabase (already configured above)
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your_key_here

# Claude API
ANTHROPIC_API_KEY=your_api_key_here
```

2. Start the worker:

```bash
cd api/
./start_worker.sh
```

The worker will:
- ✓ Check Redis connection
- ✓ Validate Supabase configuration
- ✓ Verify Claude API key
- ✓ Create temporary directories
- ✓ Start RQ worker listening on `docgenie` queue

**Output:**
```
🚀 Starting DocGenie RQ Worker...
✓ Loading .env file...
✓ Redis connected
✓ Supabase configured
✓ Claude API key configured
✓ Temporary directories created

============================================
Worker Configuration:
  Queue: docgenie
  Redis: redis://localhost:6379/0
  Batch Data: /tmp/docgenie_batches
  Message Data: /tmp/docgenie_messages
============================================

✅ Starting RQ worker (press Ctrl+C to stop)...

12:00:00 RQ worker 'worker-abc123' started on docgenie queue
```
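
The configuration checks the worker performs at startup can be approximated in Python as shown below. This is a sketch, not the contents of `start_worker.sh`: which variables are treated as required, and the fallback values for `RQ_QUEUE_NAME` and `BATCH_POLL_INTERVAL`, are assumptions taken from the example `.env` above.

```python
import os
from typing import Mapping

# Assumed to be mandatory for the worker; adjust to match your deployment.
REQUIRED_VARS = ("REDIS_URL", "SUPABASE_URL", "SUPABASE_KEY", "ANTHROPIC_API_KEY")


def load_worker_config(env: Mapping[str, str] = os.environ) -> dict:
    """Validate required variables and apply defaults for optional ones."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    config["RQ_QUEUE_NAME"] = env.get("RQ_QUEUE_NAME", "docgenie")
    # Poll interval is given in seconds.
    config["BATCH_POLL_INTERVAL"] = int(env.get("BATCH_POLL_INTERVAL", "30"))
    return config
```

Reporting all missing variables at once saves a restart for each one.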

### Running Multiple Workers (Production)

For production systems with high load, run multiple workers:

```bash
# Terminal 1
./start_worker.sh

# Terminal 2
./start_worker.sh

# Terminal 3
./start_worker.sh
```

Each worker processes jobs independently from the same queue.

**For detailed scaling instructions**, see [SCALING.md](SCALING.md).

### Monitoring Workers

```bash
# View worker status
rq info --url redis://localhost:6379/0

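# View queue status (queue names are positional arguments)
rq info docgenie --url redis://localhost:6379/0

# View failed jobs (RQ 1.0+ keeps them in a per-queue FailedJobRegistry, not a "failed" queue)
python -c "from redis import Redis; from rq import Queue; print(Queue('docgenie', connection=Redis.from_url('redis://localhost:6379/0')).failed_job_registry.get_job_ids())"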
```

### Architecture Overview

```
┌─────────────┐        ┌─────────────┐        ┌─────────────────┐
│   FastAPI   │───────▶│    Redis    │◀───────│  RQ Workers     │
│   Server    │        │   Queue     │        │  (1-5 instances)│
│             │        │             │        │                 │
│ /generate/  │        │ Job Queue:  │        │ • Downloads     │
│  async      │        │ - queued    │        │ • Claude Batch  │
│             │        │ - pending   │        │ • PDF render    │
│ /jobs/      │        │ - active    │        │ • Handwriting   │
│  {id}/      │        │             │        │ • OCR           │
│  status     │        │             │        │ • ZIP creation  │
└──────┬──────┘        └─────────────┘        └────────┬────────┘
       │                                               │
       │                                               │
       ▼                                               ▼
┌──────────────────────────────────────────────────────────────┐
│                          Supabase                             │
│  • document_requests (job tracking)                           │
│  • generated_documents (results metadata)                     │
│  • user_integrations (Google Drive OAuth)                     │
│  • analytics_events (usage tracking)                          │
└───────────────────────────────────────────────────────────────┘
       │
       │ Upload Results
       ▼
┌──────────────────────────────────────────────────────────────┐
│                      Google Drive                             │
│  • User's "DocGenie Documents" folder                         │
│  • ZIP files with generated documents                         │
│  • Shareable links returned to API                            │
└──────────────────────────────────────────────────────────────┘
```

### Cost Comparison: Direct vs Batched API

| API Type | Cost (Input) | Cost (Output) | Latency | Use Case |
|----------|-------------|---------------|---------|----------|
| Direct   | $5.00/1M tokens | $15.00/1M tokens | 30-120s | Real-time, interactive |
| **Batched** | **$2.50/1M tokens** | **$7.50/1M tokens** | 5-30 min | **Background jobs (recommended)** |

**Example Cost Calculation:**
- Generate 100 documents per day
- Each request: 5,000 input tokens, 10,000 output tokens

**Direct API Cost:**
- Input: (100 × 5,000 / 1M) × $5.00 = $2.50/day
- Output: (100 × 10,000 / 1M) × $15.00 = $15.00/day
- **Total: $17.50/day = $525/month**

**Batched API Cost:**
- Input: (100 × 5,000 / 1M) × $2.50 = $1.25/day
- Output: (100 × 10,000 / 1M) × $7.50 = $7.50/day
- **Total: $8.75/day = $262.50/month**

**💰 Savings: $262.50/month (50% reduction)**
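
The calculation above can be reproduced with a few lines of Python. The rates are the ones quoted in the table, and the monthly figure assumes a 30-day month; `daily_cost_usd` is an illustrative helper.

```python
def daily_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
) -> float:
    """Daily cost in USD; rates are USD per 1M tokens."""
    input_cost = requests_per_day * input_tokens / 1_000_000 * input_rate
    output_cost = requests_per_day * output_tokens / 1_000_000 * output_rate
    return input_cost + output_cost


direct = daily_cost_usd(100, 5_000, 10_000, input_rate=5.00, output_rate=15.00)
batched = daily_cost_usd(100, 5_000, 10_000, input_rate=2.50, output_rate=7.50)

print(f"Direct:  ${direct:.2f}/day, ${direct * 30:.2f}/month")     # $17.50/day, $525.00/month
print(f"Batched: ${batched:.2f}/day, ${batched * 30:.2f}/month")   # $8.75/day, $262.50/month
```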

## Scaling Workers

The API uses Redis Queue (RQ) workers for background job processing. Scale workers based on load:

| User Load | Workers | Redis RAM | Notes |
|-----------|---------|-----------|-------|
| < 10 req/hr | 1 | 256 MB | Development |
| 10–50 req/hr | 2–3 | 512 MB | Small production |
| 50–200 req/hr | 3–5 | 1 GB | Medium production |
| > 200 req/hr | 5+ | 2+ GB | Large production |
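
The table can be encoded as a small lookup, for example to print a recommendation from measured traffic. `recommend_workers` is a hypothetical helper; the table's ranges share the boundary value 50, which this sketch assigns to the lower tier.

```python
def recommend_workers(requests_per_hour: float) -> dict:
    """Map a request rate to the worker count and Redis RAM from the scaling table."""
    if requests_per_hour < 10:
        return {"workers": "1", "redis_ram": "256 MB", "tier": "Development"}
    if requests_per_hour <= 50:
        return {"workers": "2-3", "redis_ram": "512 MB", "tier": "Small production"}
    if requests_per_hour <= 200:
        return {"workers": "3-5", "redis_ram": "1 GB", "tier": "Medium production"}
    return {"workers": "5+", "redis_ram": "2+ GB", "tier": "Large production"}


print(recommend_workers(120))
```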

### Starting Workers

```bash
# Single worker (development)
./start_worker.sh

# Multiple workers (production) — run in separate terminals
./start_worker.sh   # Terminal 1
./start_worker.sh   # Terminal 2

# Docker Compose — scale to 3 workers
docker-compose up --scale worker=3

# Monitor
rq info --url redis://localhost:6379/0
rq info docgenie --url redis://localhost:6379/0
```

### Railway Multi-Worker (Separate Service)
1. Railway dashboard → New Service → GitHub Repo (same repo)
2. Name: `docgenie-worker`
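3. Custom Start Command: `rq worker --url $REDIS_URL docgenie` (the queue name must match `RQ_QUEUE_NAME`; without it the worker listens on RQ's `default` queue)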
4. Add the same environment variables as the API service

> For most use cases the **combined** mode (API + worker in one service, see `railway.json`) is sufficient and cheaper.

## Contributing

This API is a simplified interface to the DocGenie pipeline. For the full pipeline with all features, see the main DocGenie documentation.

## License

Same as DocGenie main project.