# Complete API Flow Documentation
## Overview
The DocGenie API provides three endpoints for synthetic document generation, implementing a 19-stage pipeline that transforms seed images and prompts into complete datasets with OCR, ground truth, and optional handwriting/visual elements.
**Base URL**: `http://localhost:8000` (development) or Railway deployment
**Documentation**: `/docs` (FastAPI auto-generated Swagger UI)
---
## API Endpoints
### 1. `/generate` - Legacy JSON Response (POST)
**Purpose**: Generate documents and return complete JSON metadata
**Response**: JSON with HTML, PDF (base64), bounding boxes, optional handwriting/visual elements
**Use Case**: Testing, development, full metadata inspection
**Pipeline Stages**: 1-19 (configurable via parameters)
### 2. `/generate/pdf` - Sync PDF+Dataset ZIP (POST)
**Purpose**: Generate documents and return ZIP file with all artifacts
**Response**: ZIP file containing:
- `*.pdf` - Generated document PDFs
- `*_final.pdf` - PDFs with handwriting/visual elements (if enabled)
- `*.msgpack` - Dataset format (if export enabled)
- `metadata.json` - Complete generation metadata
- `handwriting/` - Individual handwriting images
- `visual_elements/` - Individual visual element images
**Use Case**: Production dataset generation, batch processing
**Pipeline Stages**: 1-19 (all features available)
### 3. `/generate/async` - Async Batch Processing (POST)
**Purpose**: Queue large batch jobs via background worker (Redis Queue)
**Response**: Task ID for status polling
**Status Check**: `GET /generate/async/status/{task_id}`
**Result Download**: `GET /generate/async/result/{task_id}` (returns ZIP)
**Use Case**: Large-scale dataset generation (100+ documents)
**Pipeline Stages**: 1-19 (via worker.py)
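For orientation, here is a minimal async client sketch (hypothetical; built only from the endpoints above — the exact status strings and response field names are assumptions based on the examples later in this document):
```python
import time

import requests

BASE_URL = "http://localhost:8000"  # development base URL from this doc

payload = {
    "seed_images": ["https://example.com/seed1.jpg"],
    "prompt_params": {"doc_type": "invoice", "num_solutions": 1},
}

# Queue the batch job and grab its task ID.
task_id = requests.post(f"{BASE_URL}/generate/async", json=payload).json()["task_id"]

# Poll the status endpoint until the job settles (status values assumed).
while True:
    status = requests.get(f"{BASE_URL}/generate/async/status/{task_id}").json()["status"]
    if status in ("completed", "failed"):
        break
    time.sleep(5)

# Download the result ZIP on success.
if status == "completed":
    zip_bytes = requests.get(f"{BASE_URL}/generate/async/result/{task_id}").content
    with open("dataset.zip", "wb") as f:
        f.write(zip_bytes)
```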
---
## Request Parameters
```python
class GenerateDocumentRequest:
    seed_images: List[HttpUrl]                # 1-8 seed images from web URLs
    prompt_params: PromptParameters           # Generation configuration

class PromptParameters:
    # Core Parameters
    language: str = "english"                 # Document language
    doc_type: str = "invoice"                 # Document type (invoice, receipt, form, etc.)
    gt_type: str = "qa"                       # Ground truth format (qa, kie)
    gt_format: str = "json"                   # GT encoding (json, annotation)
    num_solutions: int = 1                    # Documents per seed set

    # Feature Toggles (Stages 07-19)
    enable_handwriting: bool = False          # Stages 07-09, 12
    handwriting_ratio: float = 0.2            # Probabilistic filter (0.0-1.0)
    enable_visual_elements: bool = False      # Stages 08, 10, 13
    visual_element_types: List[str] = []      # Filter types: logo, photo, figure, barcode, etc.
    enable_ocr: bool = True                   # Stage 15
    enable_bbox_normalization: bool = True    # Stage 16
    enable_gt_verification: bool = False      # Stage 17
    enable_analysis: bool = False             # Stage 18
    enable_debug_visualization: bool = False  # Stage 19
    enable_dataset_export: bool = False       # Stage 19 (msgpack format)
    dataset_export_format: str = "msgpack"    # Currently only msgpack supported

    # Reproducibility
    seed: Optional[int] = None                # Random seed (null = random, int = reproducible)
```
---
## Pipeline Architecture: The 19 Stages
The API implements all 19 stages of the original batch pipeline in `docgenie/generation/`. Each stage is mapped to corresponding functions in `api/utils.py`.
### **Phase 1: Core Pipeline (Stages 01-06)**
Generate base documents from seed images and LLM prompts.
#### **Stage 01: Seed Selection & Download**
- **Original**: `pipeline_01_select_seeds.py`
- **API**: `download_seed_images()` in `api/utils.py:117-161`
- **Process**:
1. Accept user-provided seed image URLs (1-8 images)
2. Download with retry logic (3 attempts, exponential backoff)
3. Handle transient HTTP errors (502, 503, 504, 429)
4. Convert to base64 for LLM input
- **Error Handling**: Retry with 2s, 4s, 8s delays; raise `HTTPException` on failure (see the sketch below)
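A minimal sketch of that retry behavior (not the actual `download_seed_images()` code; the transient status codes and backoff schedule are taken from the description above):
```python
import base64
import time

import requests

TRANSIENT = {429, 502, 503, 504}

def download_seed_image(url: str, attempts: int = 3) -> str:
    """Download one seed image with exponential backoff; return base64 for the LLM."""
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code in TRANSIENT:
                raise requests.HTTPError(f"transient {resp.status_code}")
            resp.raise_for_status()
            return base64.b64encode(resp.content).decode("utf-8")
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # caller converts this into an HTTPException
            time.sleep(2 ** (attempt + 1))  # 2s, 4s, 8s backoff schedule
```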
#### **Stage 02: Prompt LLM**
- **Original**: `pipeline_02_prompt_llm.py`
- **API**: `call_claude_api_direct()` in `api/utils.py:550-600`
- **Process**:
1. Load prompt template: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
2. Build prompt with parameters: language, doc_type, gt_type, num_solutions
3. Call Claude API (Anthropic Messages API v1)
- Model: `claude-3-5-sonnet-20241022` (configurable)
- Max tokens: 16,000
- Temperature: 1.0
- Vision: Send base64-encoded seed images
4. Receive HTML documents with embedded ground truth
- **LLM Output Format**: Multiple `<!DOCTYPE html>...</html>` blocks with:
- CSS styling with page dimensions
- HTML elements with semantic classes
- Handwriting markers: `class="handwritten author1"` (author1, author2, etc.)
- Visual element placeholders: `data-placeholder="logo"`, `data-content="company-logo"`
- Ground truth: `<script id="GT">{...json...}</script>`
#### **Stage 03: Process Response & Extract HTML**
- **Original**: `pipeline_03_process_response.py`
- **API**: `extract_html_documents_from_response()` in `api/utils.py:605-635`
- **Process**:
1. Parse LLM response for `<!DOCTYPE html>...</html>` blocks (regex)
2. Prettify HTML with BeautifulSoup
3. Validate HTML structure
4. Extract ground truth JSON from `<script id="GT">` tag
5. Remove GT script tag, clean HTML for rendering
- **Validation**: Check for required elements, CSS, proper structure
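A condensed sketch of steps 1-5 above (regex split, prettify, GT extraction); the real function additionally runs the structural validation just described:
```python
import json
import re

from bs4 import BeautifulSoup

def extract_documents(llm_response: str) -> list[dict]:
    """Split an LLM response into cleaned HTML documents plus their GT JSON."""
    blocks = re.findall(r"<!DOCTYPE html>.*?</html>", llm_response,
                        flags=re.DOTALL | re.IGNORECASE)
    docs = []
    for block in blocks:
        soup = BeautifulSoup(block, "html.parser")
        gt_tag = soup.find("script", id="GT")
        ground_truth = json.loads(gt_tag.string) if gt_tag and gt_tag.string else None
        if gt_tag:
            gt_tag.decompose()  # strip GT so it never renders into the PDF
        docs.append({"html": soup.prettify(), "ground_truth": ground_truth})
    return docs
```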
#### **Stage 04: Render PDF & Extract Geometries**
- **Original**: `pipeline_04_render_pdf_and_extract_geos.py`
- **API**: `render_html_to_pdf()` in `api/utils.py:650-740`
- **Process**:
1. Launch Playwright browser (Chromium)
2. Set page dimensions from CSS `@page` rule
3. Render HTML to PDF via `page.pdf()`
4. Extract element geometries:
- Handwriting elements: `.handwritten` class → `{rect, text, classes, selectorTypes: ["handwriting"]}`
- Visual elements: `[data-placeholder]` attribute → `{rect, dataPlaceholder, dataContent, selectorTypes: ["visual_element"]}`
5. Save PDF and geometries JSON
- **Output**:
- PDF at 72 DPI (PyMuPDF standard)
- Geometries at 96 DPI (browser rendering)
- Dimensions in mm
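A sketch of the render step using Playwright's sync API (illustrative; `GEOMETRY_JS` stands in for the JavaScript that collects `.handwritten` and `[data-placeholder]` rects in the real function):
```python
from playwright.sync_api import sync_playwright

# Placeholder for the JS that walks the DOM and returns element rects (96 DPI).
GEOMETRY_JS = "() => []"

def render_html_to_pdf_sketch(html: str, pdf_path: str) -> list[dict]:
    """Render HTML to a PDF and collect element geometries from the live page."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="networkidle")
        geometries = page.evaluate(GEOMETRY_JS)             # browser coords, 96 DPI
        page.pdf(path=pdf_path, prefer_css_page_size=True)  # honor the @page rule
        browser.close()
    return geometries
```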
#### **Stage 05: Extract Bounding Boxes**
- **Original**: `pipeline_05_extract_bboxes_from_pdf.py`
- **API**: `extract_bboxes_from_rendered_pdf()` in `api/utils.py:750-825`
- **Process**:
1. Open PDF with PyMuPDF (fitz)
2. Extract text at word level: `page.get_text("words")`
3. Structure bboxes as:
```python
{
    "text": "word",
    "x0": float,     # left
    "y0": float,     # top
    "x1": float,     # right (x2)
    "y1": float,     # bottom (y2)
    "block_no": int,
    "line_no": int,
    "word_no": int
}
```
4. Filter whitespace-only text
5. Convert to OCRBox objects for processing
- **Coordinate System**: PDF points (72 DPI), origin top-left
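The extraction itself is a thin wrapper over PyMuPDF's word API; a sketch producing the structure above:
```python
import fitz  # PyMuPDF

def extract_word_bboxes(pdf_path: str) -> list[dict]:
    """Word-level bboxes in PDF points (72 DPI), origin top-left."""
    page = fitz.open(pdf_path)[0]
    words = []
    # Each tuple: (x0, y0, x1, y1, word, block_no, line_no, word_no)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in page.get_text("words"):
        if not text.strip():
            continue  # drop whitespace-only tokens
        words.append({"text": text, "x0": x0, "y0": y0, "x1": x1, "y1": y1,
                      "block_no": block_no, "line_no": line_no, "word_no": word_no})
    return words
```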
#### **Stage 06: Validation**
- **Original**: `pipeline_06_validation.py` (implicit)
- **API**: `validate_html_structure()`, `validate_pdf()`, `validate_bboxes()` in `api/utils.py:830-890`
- **Checks**:
- HTML: Required DOCTYPE, head, body, CSS
- PDF: File readable, page count = 1, has text
- Bboxes: Minimum count (configurable), valid coordinates
---
### **Phase 2: Feature Synthesis (Stages 07-13)**
Add handwriting and visual elements to base documents.
#### **Stage 07: Extract Handwriting Definitions**
- **Original**: `pipeline_07_extract_handwriting.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1150-1235`
- **Process**:
1. Filter geometries: `"handwriting" in geo['selectorTypes']`
2. Parse classes: Extract `author1`, `author2`, etc. from `class="handwritten author1"`
3. **Probabilistic filtering** (handwriting_ratio):
```python
if random.random() > handwriting_ratio:
    continue  # Skip this element
```
- `ratio=0.0`: No handwriting (0%)
- `ratio=0.5`: ~50% of marked elements
- `ratio=1.0`: All marked elements (100%)
4. Match geometries to word bboxes (see the sketch after this list):
- Convert browser coords (96 DPI) to PDF coords (72 DPI): `scale = 72/96 = 0.75`
- Find consecutive word bboxes matching geometry text
- Check bboxes are within geometry rect (threshold: 0.7)
- Track taken bbox indices to avoid duplicates
5. Build handwriting region definitions:
```python
{
    "id": "hw0",
    "text": "Patient Name",
    "author_id": "author1",
    "is_signature": False,
    "rect": {x, y, width, height},  # in points
    "bboxes": ["0_0_0 Patient 10.0 20.0 50.0 35.0", ...]
}
```
- **Reproducibility**: Use `seed + i` for each region to maintain order consistency
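The coordinate conversion and containment check in step 4 reduce to a couple of small helpers; a sketch (the 0.7 threshold is applied by the caller):
```python
SCALE = 72 / 96  # browser CSS pixels (96 DPI) -> PDF points (72 DPI), i.e. 0.75

def rect_to_points(rect: dict) -> dict:
    """Convert a browser-space geometry rect {x, y, width, height} to points."""
    return {k: v * SCALE for k, v in rect.items()}

def overlap_ratio(bbox: dict, rect: dict) -> float:
    """Fraction of a word bbox lying inside a geometry rect (compared to 0.7)."""
    ix0 = max(bbox["x0"], rect["x"])
    iy0 = max(bbox["y0"], rect["y"])
    ix1 = min(bbox["x1"], rect["x"] + rect["width"])
    iy1 = min(bbox["y1"], rect["y"] + rect["height"])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (bbox["x1"] - bbox["x0"]) * (bbox["y1"] - bbox["y0"])
    return inter / area if area else 0.0
```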
#### **Stage 08: Extract Visual Element Definitions**
- **Original**: `pipeline_08_extract_visual_element_definitions.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1237-1275`
- **Process**:
1. Filter geometries: `"visual_element" in geo['selectorTypes']`
2. Parse attributes:
- `data-placeholder`: Element type (logo, photo, figure, chart, barcode, etc.)
- `data-content`: Semantic description (e.g., "company-logo", "product-photo")
3. Normalize types using synonyms:
- "chart" β "figure"
- "image" β "photo"
4. Filter by `visual_element_types` parameter (if specified)
5. Convert coordinates: pixels (96 DPI) → mm
6. Extract rotation from CSS `transform: rotate(Xdeg)`
7. Build visual element definitions:
```python
{
    "id": "ve0",
    "type": "logo",    # normalized
    "content": "company-logo",
    "rect": {x, y, width, height},  # in mm
    "rotation": 0      # degrees
}
```
#### **Stage 09: Create Handwriting Images**
- **Original**: `pipeline_09_create_handwriting_images.py`
- **API**: `call_handwriting_service_batch()` in `api/utils.py:785-920`
- **Handwriting Service**: RunPod serverless endpoint hosting WordStylist diffusion model
- **Service Implementation**: `handwriting_service/handler.py`, `handwriting_service/inference.py`
**Handwriting Service Integration Details:**
##### **Service Architecture**
- **Platform**: RunPod Serverless (GPU: NVIDIA A4000, Cost: ~$0.00025/s active)
- **Model**: WordStylist (Diffusion-based handwriting synthesis)
- Architecture: UNet with conditional style embeddings
- Input: Text (A-Z, a-z only, no spaces), Writer style ID (0-656)
- Output: PNG image with transparent background
- Inference time: ~18s per text on A4000
- Weights: `handwriting_service/WordStylist/models/`
- **Endpoints**:
- `/run` (async): Queue job, return ID, poll `/status/{id}` (10MB limit)
- `/runsync` (sync): Wait for completion, return result (20MB limit, used by API)
##### **Batch Processing (Cost Optimization)**
The API uses TRUE batch processing to minimize RunPod activation overhead:
```python
# ✅ NEW: Batch all texts in ONE request
runpod_request = {
    "input": {
        "texts": [
            {"text": "Hello", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
            {"text": "World", "author_id": 42, "hw_id": "hw0_b0_l0_w1"},
            # ... 10-100 texts
        ],
        "apply_blur": True
    }
}
# Result: 1 worker activation × (N × 18s) = ~40-60% cost savings
```
**Cost Comparison for 10 texts:**
- ❌ OLD (parallel): 10 workers × 18s = 180 worker-seconds + 10× activation fee
- ✅ NEW (batched): 1 worker × 190s = 190 worker-seconds + 1× activation fee
##### **API Processing Flow**
1. **Group by region and line**: Split handwriting regions into word-level requests
```python
# Text: "Patient Name" β 2 word-level generations
texts_to_generate = [
{"text": "Patient", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
{"text": "Name", "author_id": 42, "hw_id": "hw0_b0_l0_w1"}
]
```
2. **Map author IDs to numeric styles**:
```python
# "author1" β WRITER_STYLES[1] = 42 (deterministic)
# "author2" β WRITER_STYLES[2] = 137
# 657 total writer styles available
```
3. **Sanitize text** (WordStylist constraint; a sanitizer sketch follows this list):
```python
# Only A-Z, a-z allowed (no spaces, numbers, punctuation)
"Hello123!" β "Hello"
"first-name" β "firstname"
```
4. **Send batch request** to RunPod `/runsync` endpoint:
```python
POST https://api.runpod.ai/v2/{endpoint_id}/runsync
Authorization: Bearer {RUNPOD_API_KEY}
Content-Type: application/json
{
    "input": {
        "texts": [...],
        "apply_blur": True  # Gaussian blur for realism
    }
}
```
5. **Handle async responses**:
- If `status: "IN_PROGRESS"`: Poll `/status/{job_id}` every 5-10s (max 30 polls)
- If `status: "COMPLETED"`: Extract `output.images[]`
- If `status: "FAILED"`: Raise exception (stops entire generation)
6. **Response format**:
```python
{
    "status": "COMPLETED",
    "output": {
        "images": [
            {
                "image_base64": "iVBORw0KGgoAAAANSU...",
                "width": 200,
                "height": 64,
                "text": "Patient",
                "author_id": 42,
                "hw_id": "hw0_b0_l0_w0"
            },
            ...
        ],
        "total_generated": 2
    }
}
```
7. **Store generated images**: Map `hw_id → image_base64` for insertion
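A one-liner captures the sanitization rule from step 3 (a sketch, not the service's exact code):
```python
import re

def sanitize_for_wordstylist(text: str) -> str:
    """Strip everything outside A-Z/a-z, per the WordStylist input constraint."""
    return re.sub(r"[^A-Za-z]", "", text)

assert sanitize_for_wordstylist("Hello123!") == "Hello"
assert sanitize_for_wordstylist("first-name") == "firstname"
```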
##### **Error Handling**
- **Retry logic**: 3 attempts with exponential backoff (matching seed download)
- **Timeouts**: Dynamic based on batch size: `20s × num_texts + 30s buffer`
- **Failure behavior**: **Raise an exception** (since the session fix)
- ❌ OLD: Silent continue → documents generated without handwriting
- ✅ NEW: Raise exception → generation fails when the user requested handwriting
##### **Service Code Structure**
**`handwriting_service/handler.py`** (RunPod handler):
```python
# Initialize model ONCE at module level (not per request)
generator = HandwritingGenerator(
    model_dir="WordStylist",
    checkpoint_path="WordStylist/models",
    device="cuda"
)

def handler(job):
    """RunPod entry point - supports both /run and /runsync"""
    texts = job["input"]["texts"]  # Batch input
    results = generator.generate_batch(
        texts=[t["text"] for t in texts],
        author_ids=[t["author_id"] for t in texts],
        num_inference_steps=50,
        temperature=1.0,
        apply_blur=True
    )
    return {"images": results, "total_generated": len(results)}
```
**`handwriting_service/inference.py`** (WordStylist wrapper):
```python
class HandwritingGenerator:
    def generate_batch(self, texts, author_ids,
                       num_inference_steps=50, temperature=1.0, apply_blur=True):
        results = []
        for text, author_id in zip(texts, author_ids):
            # Load model checkpoint
            unet = Unet(...)
            unet.load_state_dict(checkpoint)
            # Prepare style condition
            style_id_tensor = torch.tensor([author_id])
            # Diffusion reverse process (50 steps)
            img = self.sample(unet, style_id_tensor, text_length=len(text))
            # Post-process: crop, resize, apply blur
            img_pil = postprocess_image(img)
            if apply_blur:
                img_pil = img_pil.filter(ImageFilter.GaussianBlur(1.2))
            # Encode to base64
            img_base64 = encode_pil_to_base64(img_pil)
            results.append({
                "image_base64": img_base64,
                "width": img_pil.width,
                "height": img_pil.height
            })
        return results
```
#### **Stage 10: Create Visual Element Images**
- **Original**: `pipeline_10_create_visual_elements.py`
- **API**: `generate_visual_element_images()` in `api/utils.py:925-1020`
- **Process**:
1. Load prefab images from `data/visual_element_prefabs/{type}/`:
- `logo/`: Company logos (50+ SVGs)
- `photo/`: Stock photos (100+ JPGs)
- `figure/`: Charts, graphs (30+ PNGs)
- `barcode/`: Generated barcodes
- `qr_code/`, `stamp/`, `signature/`, `checkbox/`, etc.
2. **Random selection** (seed-based if provided):
```python
if seed is not None:
    random.seed(seed)
prefab_path = random.choice(list(prefab_dir.glob("*")))
```
3. **Special handling**:
- **Barcode**: Generate on-the-fly using `python-barcode` library
```python
# Generate random EAN-13 barcode (12 digits + checksum)
barcode_num = random.randint(100000000000, 999999999999)
barcode = EAN13(str(barcode_num), writer=ImageWriter())
```
- **QR Code**: Generate using `qrcode` library
- **Checkbox**: Render checked/unchecked SVG
4. Load and convert to base64:
```python
with open(prefab_path, 'rb') as f:
    img_bytes = f.read()
img_base64 = base64.b64encode(img_bytes).decode('utf-8')
```
5. Return mapping: `ve_id → image_base64`
#### **Stage 11: Make Text Transparent (Implicit)**
- **Original**: `pipeline_11_make_text_transparent.py`
- **API**: Implemented as "whiteout" in `process_stage3_complete()` at `api/utils.py:1415-1427`
- **Process**:
```python
# Draw white rectangles over original text to hide it
for hw_region in handwriting_regions:
    for bbox_str in hw_region['bboxes']:
        bbox = parse_bbox(bbox_str)
        rect = fitz.Rect(bbox.x0, bbox.y0, bbox.x2, bbox.y2)
        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))  # White fill
```
- **Why not transparent?**: PyMuPDF cannot make existing text transparent, so white rectangles are drawn over it instead (visually equivalent on a white page)
#### **Stage 12: Insert Handwriting Images**
- **Original**: `pipeline_12_insert_handwriting_images.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1429-1520`
- **Process**:
1. **Position calculation**:
```python
# Get word bbox from PDF extraction
bbox_w = bbox.x2 - bbox.x0 # Width in points
bbox_h = bbox.y2 - bbox.y0 # Height in points
# Resize handwriting image with aspect ratio
scale = min(bbox_w / img_width, bbox_h / img_height)
new_w = int(img_width * scale * SCALE_UP_FACTOR) # 3x upscale
new_h = int(img_height * scale * SCALE_UP_FACTOR)
# Add random offsets for natural variation
offset_x = random.randint(-MAX_OFFSET_LEFT, MAX_OFFSET_RIGHT) + FIXED_OFFSET
offset_y = random.randint(-MAX_OFFSET_UP, MAX_OFFSET_DOWN)
# Position at bbox coordinates
x0 = bbox.x0 + offset_x
y0 = bbox.y0 + offset_y - y_padding
```
2. **Insert into PDF**:
```python
img_resized = img.resize((new_w, new_h), Image.LANCZOS).convert("RGBA")
img_bytes = pil_to_bytes(img_resized)
rect = fitz.Rect(x0, y0, x0 + bbox_w, y0 + bbox_h)
page.insert_image(rect, stream=img_bytes)
```
3. Save intermediate PDF: `{doc_id}_with_handwriting.pdf`
#### **Stage 13: Insert Visual Elements**
- **Original**: `pipeline_13_insert_visual_elements.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1523-1625`
- **Process**:
1. Convert mm → points: `mm_to_pt = 72 / 25.4` (see the sketch after this list)
2. Resize with aspect ratio preservation (same as handwriting)
3. Center image on white background (maintains bbox size)
4. Insert into PDF at geometry coordinates
5. Save final PDF: `{doc_id}_final.pdf` (includes both handwriting + visual elements)
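The unit conversion in step 1 plus the insertion in step 4 look roughly like this (a sketch; the resizing and centering from steps 2-3 are omitted):
```python
import fitz  # PyMuPDF

MM_TO_PT = 72 / 25.4  # ~2.835 points per millimetre

def insert_visual_element(page, rect_mm: dict, png_bytes: bytes) -> None:
    """Place a visual-element image at a geometry rect given in millimetres."""
    x0 = rect_mm["x"] * MM_TO_PT
    y0 = rect_mm["y"] * MM_TO_PT
    rect = fitz.Rect(x0, y0,
                     x0 + rect_mm["width"] * MM_TO_PT,
                     y0 + rect_mm["height"] * MM_TO_PT)
    page.insert_image(rect, stream=png_bytes)
```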
---
### **Phase 3: Image Finalization & OCR (Stages 14-15)**
Convert final PDF to high-resolution image and extract OCR data.
#### **Stage 14: Render Image**
- **Original**: `pipeline_14_render_image.py`
- **API**: `process_stage4_ocr()` in `api/utils.py:1899-1940`
- **Process**:
```python
# Render PDF page to high-res PNG
page = fitz.open(pdf_path)[0]
pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))  # 3x scale = 216 DPI
img_bytes = pix.tobytes("png")
img_base64 = base64.b64encode(img_bytes).decode('utf-8')
```
- **Output**: Base64-encoded PNG at 216 DPI (3 × the 72 DPI base; configurable via the scale factor)
#### **Stage 15: Perform OCR**
- **Original**: `pipeline_15_perform_ocr.py`
- **API**: `run_paddle_ocr()` in `api/utils.py:1950-2080`
- **OCR Engine**: PaddleOCR v4 (multilingual)
- Models: `PP-OCRv4` detection + recognition
- Languages: Supports 80+ languages
- Accuracy: competitive with the strongest open-source OCR engines
- **Process**:
1. Render PDF to image via `pdf2image` at specified DPI (default: 300)
2. Initialize PaddleOCR with language parameter
3. Run detection + recognition:
```python
ocr = PaddleOCR(lang=language, use_gpu=True)
results = ocr.ocr(img_array, cls=True)
```
4. Parse results into word-level bboxes:
```python
{
    "text": "word",
    "bbox": {
        "x0": float,
        "y0": float,
        "x1": float,  # right
        "y1": float   # bottom
    },
    "confidence": 0.95
}
```
- **Output**: Dictionary with `words` list, image dimensions, OCR engine info
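PaddleOCR returns quadrilaterals; step 4 collapses them to axis-aligned boxes. A sketch of that parsing (result layout per PaddleOCR's classic `.ocr()` API; treat the exact shape as an assumption for your installed version):
```python
def parse_paddle_results(results) -> list[dict]:
    """Flatten PaddleOCR quads into the word-level bbox structure above."""
    words = []
    for detection in results[0] or []:  # first (only) page; None if nothing found
        quad, (text, confidence) = detection
        xs = [point[0] for point in quad]
        ys = [point[1] for point in quad]
        words.append({
            "text": text,
            "bbox": {"x0": min(xs), "y0": min(ys), "x1": max(xs), "y1": max(ys)},
            "confidence": float(confidence),
        })
    return words
```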
---
### **Phase 4: Dataset Packaging (Stages 16-19)**
Normalize, verify, analyze, and export final dataset.
#### **Stage 16: Normalize Bboxes**
- **Original**: `pipeline_16_normalize_bboxes.py`
- **API**: `normalize_bboxes()` in `api/utils.py:2100-2180`
- **Process**:
1. Convert absolute pixel coordinates → normalized [0, 1] range:
```python
norm_bbox = [
    bbox['x0'] / img_width,
    bbox['y0'] / img_height,
    bbox['x1'] / img_width,
    bbox['y1'] / img_height
]
```
2. Clip to [0, 1]: `[max(0, min(1, x)) for x in norm_bbox]`
3. Create word-level and segment-level bboxes
- **Output**: List of `{text, bbox: [x0, y0, x1, y1]}` where bbox is normalized
#### **Stage 17: Ground Truth Verification**
- **Original**: `pipeline_17_gt_preparation_verification.py`
- **API**: `verify_ground_truth()` in `api/utils.py:2185-2250`
- **Checks**:
- GT structure: Valid JSON, required fields
- Text matching: GT text exists in OCR output
- Bbox coverage: GT answers have corresponding bboxes
- **Output**: Verification report with pass/fail status
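In spirit, the text-matching check reduces to something like this (a sketch assuming a flat KIE-style GT dict; the real checks also validate GT structure and bbox coverage):
```python
def verify_gt_text(gt: dict, ocr_words: list[dict]) -> dict:
    """Flag GT answers that cannot be found in the OCR transcript."""
    transcript = " ".join(w["text"] for w in ocr_words).lower()
    checks = [{"field": field, "found_in_ocr": str(answer).lower() in transcript}
              for field, answer in gt.items()]
    return {"passed": all(c["found_in_ocr"] for c in checks), "checks": checks}
```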
#### **Stage 18: Analyze**
- **Original**: `pipeline_18_analyze.py`
- **API**: `analyze_document()` in `api/utils.py:2255-2320`
- **Metrics**:
- Word count, character count
- Average word length
- Handwriting regions count, coverage %
- Visual elements count by type
- OCR confidence statistics (mean, min, max)
- **Output**: Analysis dictionary with computed metrics
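A sketch of how those metrics fall out of the OCR output (illustrative shapes, not the actual `analyze_document()` signature):
```python
from statistics import mean

def compute_metrics(ocr_words, hw_regions, visual_elements) -> dict:
    """Summary metrics over OCR words plus feature counts."""
    texts = [w["text"] for w in ocr_words]
    confidences = [w["confidence"] for w in ocr_words]
    return {
        "word_count": len(texts),
        "char_count": sum(len(t) for t in texts),
        "avg_word_length": mean(len(t) for t in texts) if texts else 0.0,
        "num_handwriting_regions": len(hw_regions),
        "num_visual_elements": len(visual_elements),
        "ocr_confidence": {
            "mean": mean(confidences) if confidences else 0.0,
            "min": min(confidences, default=0.0),
            "max": max(confidences, default=0.0),
        },
    }
```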
#### **Stage 19: Create Debug Data & Export**
- **Original**: `pipeline_19_create_debug_data.py`
- **API**: `export_to_msgpack()` in `api/utils.py:2350-2520`
- **Debug Visualization**:
- Draw bboxes on image with different colors:
- Green: Word bboxes
- Red: Handwriting regions
- Blue: Visual elements
- Yellow: Ground truth target regions
- Save annotated image
- **Dataset Export (msgpack)**:
```python
dataset_entry = {
    "image": img_bytes,  # PNG bytes
    "words": ["hello", "world"],
    "word_bboxes": [[0.1, 0.2, 0.15, 0.25], ...],  # Normalized
    "segment_bboxes": [...],
    "ground_truth": {"question": "answer"},
    "metadata": {
        "document_id": "...",
        "has_handwriting": True,
        "num_visual_elements": 3
    }
}
msgpack.dump(dataset_entry, f)
```
- **Output**: `.msgpack` file compatible with PyTorch DataLoader
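On the consumer side, reading these files back into a PyTorch `Dataset` is straightforward; a sketch assuming one `dataset_entry` per `.msgpack` file, as the export code above suggests:
```python
import msgpack
from torch.utils.data import Dataset

class MsgpackDocDataset(Dataset):
    """Minimal reader for the .msgpack entries described above."""

    def __init__(self, paths: list[str]):
        self.entries = []
        for path in paths:
            with open(path, "rb") as f:
                # raw=False decodes keys/values as str rather than bytes
                self.entries.append(msgpack.unpack(f, raw=False))

    def __len__(self) -> int:
        return len(self.entries)

    def __getitem__(self, idx: int) -> dict:
        return self.entries[idx]
```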
---
## Pipeline Verification: API vs Original Implementation
### ✅ **Stage-by-Stage Mapping**
| Stage | Original File | API Function | Status |
|-------|--------------|--------------|--------|
| 01 | `pipeline_01_select_seeds.py` | `download_seed_images()` | ✅ Mapped (with retry logic) |
| 02 | `pipeline_02_prompt_llm.py` | `call_claude_api_direct()` | ✅ Mapped (uses Messages API) |
| 03 | `pipeline_03_process_response.py` | `extract_html_documents_from_response()` | ✅ Mapped |
| 04 | `pipeline_04_render_pdf_and_extract_geos.py` | `render_html_to_pdf()` | ✅ Mapped (Playwright) |
| 05 | `pipeline_05_extract_bboxes_from_pdf.py` | `extract_bboxes_from_rendered_pdf()` | ✅ Mapped |
| 06 | `pipeline_06_validation.py` | `validate_html_structure()`, `validate_pdf()` | ✅ Mapped |
| 07 | `pipeline_07_extract_handwriting.py` | `process_stage3_complete()` section | ✅ Mapped (with ratio filter) |
| 08 | `pipeline_08_extract_visual_element_definitions.py` | `process_stage3_complete()` section | ✅ Mapped |
| 09 | `pipeline_09_create_handwriting_images.py` | `call_handwriting_service_batch()` | ✅ Mapped (RunPod integration) |
| 10 | `pipeline_10_create_visual_elements.py` | `generate_visual_element_images()` | ✅ Mapped |
| 11 | `pipeline_11_make_text_transparent.py` | `process_stage3_complete()` (whiteout) | ✅ Mapped (white rectangles) |
| 12 | `pipeline_12_insert_handwriting_images.py` | `process_stage3_complete()` section | ✅ Mapped |
| 13 | `pipeline_13_insert_visual_elements.py` | `process_stage3_complete()` section | ✅ Mapped |
| 14 | `pipeline_14_render_image.py` | `process_stage4_ocr()` | ✅ Mapped |
| 15 | `pipeline_15_perform_ocr.py` | `run_paddle_ocr()` | ✅ Mapped |
| 16 | `pipeline_16_normalize_bboxes.py` | `normalize_bboxes()` | ✅ Mapped |
| 17 | `pipeline_17_gt_preparation_verification.py` | `verify_ground_truth()` | ✅ Mapped |
| 18 | `pipeline_18_analyze.py` | `analyze_document()` | ✅ Mapped |
| 19 | `pipeline_19_create_debug_data.py` | `export_to_msgpack()` | ✅ Mapped |
### **Key Differences: API vs Batch Pipeline**
#### **Processing Model**
- **Original**: Batch processing with file-based state management
- Input: CSV of seed selections, prompt parameters in JSON
- Output: Folder structure with intermediate files
- State: JSON logs per document + message
- Resumability: Can restart from any stage
- **API**: Request/response with in-memory processing
- Input: JSON request with seed URLs
- Output: JSON response or ZIP file
- State: Ephemeral (temporary directories)
- Resumability: None (single-shot generation)
#### **Handwriting Generation**
- **Original**: Local GPU with WordStylist model loaded in-process
- Location: `docgenie/generation/handwriting_diffusion/`
- Execution: `generate_handwriting_diffusion_raw.py`
- Cost: Free (local GPU)
- **API**: Remote RunPod serverless endpoint
- Location: `handwriting_service/` (deployed separately)
- Execution: HTTP POST to RunPod API
- Cost: ~$0.00025/s GPU time (pay-per-use)
- Benefit: No local GPU required, scales automatically
#### **Seed Selection**
- **Original**: Pre-crawled dataset with systematic selection
- Seeds stored in: `data/datasets/base_v2/`
- Selection: Clustering algorithm → balanced subset
- Tracking: CSV manifest with seed IDs
- **API**: User-provided URLs
- Seeds: Any publicly accessible image URL
- Selection: User chooses 1-8 images per request
- Tracking: URLs stored in request metadata
#### **Prompt Templates**
- **Original**: Multiple template versions in folders
- Path: `data/prompt_templates/{version}/seed-based-json.txt`
- Versioning: ClaudeRefined1 → ClaudeRefined12
- Selection: Configurable per dataset
- **API**: Fixed template (latest version)
- Path: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
- Hardcoded in: `api/main.py:171`
- **Future improvement**: Make template selectable via API parameter
---
## Complete Request Flow Example
### Example Request (Sync Endpoint)
```http
POST /generate/pdf HTTP/1.1
Content-Type: application/json

{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "english",
    "doc_type": "medical_form",
    "gt_type": "kie",
    "gt_format": "json",
    "num_solutions": 2,
    "enable_handwriting": true,
    "handwriting_ratio": 0.3,
    "enable_visual_elements": true,
    "visual_element_types": ["logo", "signature"],
    "enable_ocr": true,
    "enable_dataset_export": true,
    "seed": 42
  }
}
```
### Processing Flow (Stages Executed)
**Phase 1: Core Document Generation (30-60s)**
1. ✅ Download 2 seed images with retry → `[img1_b64, img2_b64]`
2. ✅ Load prompt template → build prompt for medical_form + KIE
3. ✅ Call Claude API → LLM generates 2 HTML documents (~25s)
4. ✅ Extract HTML + ground truth → 2 clean HTML files with GT JSON
5. ✅ Render each HTML to PDF via Playwright → 2 PDFs + geometries
6. ✅ Extract word bboxes from PDFs → ~200-500 words per document
**Phase 2: Feature Synthesis (120-180s if handwriting enabled)**
7. ✅ Parse geometries for handwriting markers
   - Found: 12 elements with `class="handwritten"`
   - Filtered by ratio: 12 × 0.3 ≈ 4 elements selected (probabilistic)
   - Matched to word bboxes: 4 regions with 15 total words
8. ✅ Parse geometries for visual elements
   - Found: 3 elements (`data-placeholder="logo"`, `"signature"`, `"logo"`)
   - Filtered by types: keep logo + signature, remove others
   - Result: 2 visual element definitions
9. ✅ Generate handwriting images via RunPod
   - **Batch request**: 15 words in ONE API call
   - Map author IDs: `author1 → style 42`, `author2 → style 137`
   - RunPod processing: 1 worker × (15 × 18s) ≈ 270s
   - Result: 15 PNG images (base64-encoded)
10. ✅ Generate visual element images
    - Logo: random selection from `data/visual_element_prefabs/logo/` (seed=42)
    - Signature: generated on-the-fly using signature prefab
    - Result: 2 PNG images
11. ✅ Whiteout original text: draw white rectangles over 15 word positions
12. ✅ Insert handwriting: place 15 generated images at word bboxes with offsets
    - Save: `doc1_with_handwriting.pdf`, `doc2_with_handwriting.pdf`
13. ✅ Insert visual elements: place logo + signature at geometry coords
    - Save: `doc1_final.pdf`, `doc2_final.pdf`
**Phase 3: Image + OCR (5-10s)**
14. ✅ Render each final PDF to a 216 DPI image → 2 PNG files (base64)
15. ✅ Run PaddleOCR on each image
    - Doc1: detected 187 words, avg confidence 0.91
    - Doc2: detected 203 words, avg confidence 0.94
**Phase 4: Dataset Packaging (2-5s)**
16. ✅ Normalize OCR bboxes: convert pixels → [0, 1] range
17. ⏭️ Verify ground truth (enable_gt_verification=false, skipped)
18. ⏭️ Analyze documents (enable_analysis=false, skipped)
19. ✅ Export to msgpack:
    - Doc1: pack image + words + normalized bboxes + GT → `doc1.msgpack`
    - Doc2: pack image + words + normalized bboxes + GT → `doc2.msgpack`
**Final Output: ZIP File Contents**
```
dataset.zip
├── doc1_uuid_0.pdf          # Original rendered PDF
├── doc1_uuid_0_final.pdf    # PDF with handwriting + visual elements
├── doc1_uuid_0.msgpack      # Dataset format
├── doc2_uuid_1.pdf
├── doc2_uuid_1_final.pdf
├── doc2_uuid_1.msgpack
├── metadata.json            # Complete generation metadata
└── handwriting/
    ├── hw0_b0_l0_w0.png     # Individual handwriting images
    ├── hw0_b0_l0_w1.png
    └── ... (13 more)
```
### Response (JSON Metadata)
```json
{
  "task_id": "uuid-here",
  "status": "completed",
  "num_documents": 2,
  "processing_time_seconds": 305.7,
  "stages_completed": [
    "seed_download", "llm_prompt", "html_extraction",
    "pdf_render", "bbox_extraction", "handwriting_extraction",
    "visual_element_extraction", "handwriting_generation",
    "visual_element_generation", "handwriting_insertion",
    "visual_element_insertion", "image_render", "ocr",
    "bbox_normalization", "dataset_export"
  ],
  "documents": [
    {
      "document_id": "doc1_uuid_0",
      "ground_truth": {"patient_name": "John Doe", "date": "2024-01-15"},
      "num_words": 187,
      "num_handwriting_regions": 2,
      "num_visual_elements": 2,
      "ocr_confidence_avg": 0.91
    },
    {
      "document_id": "doc2_uuid_1",
      "ground_truth": {"patient_name": "Jane Smith", "date": "2024-01-16"},
      "num_words": 203,
      "num_handwriting_regions": 2,
      "num_visual_elements": 2,
      "ocr_confidence_avg": 0.94
    }
  ],
  "download_url": "/download/dataset_uuid.zip"
}
```
---
## Configuration & Environment
### Required Environment Variables
```bash
# LLM API
ANTHROPIC_API_KEY=sk-ant-... # Claude API key
CLAUDE_MODEL=claude-3-5-sonnet-20241022 # Default model
# Handwriting Service (RunPod)
HANDWRITING_SERVICE_ENABLED=true
HANDWRITING_SERVICE_URL=https://api.runpod.ai/v2/{endpoint_id}/runsync
RUNPOD_API_KEY=... # RunPod API key
HANDWRITING_APPLY_BLUR=true # Gaussian blur for realism
HANDWRITING_SERVICE_MAX_RETRIES=3
HANDWRITING_SERVICE_TIMEOUT=600 # 10 minutes for large batches
# OCR Configuration
OCR_DPI=300 # Image resolution for OCR
OCR_LANGUAGE=en # PaddleOCR language code
# File Paths
PROMPT_TEMPLATES_DIR=/path/to/data/prompt_templates
VISUAL_ELEMENT_PREFABS_DIR=/path/to/data/visual_element_prefabs
```
### Docker Deployment (Railway)
```dockerfile
# Dockerfile (api service)
FROM python:3.11-slim
# System deps: Chromium for Playwright, libgl/glib for PaddleOCR
RUN apt-get update && apt-get install -y \
    chromium chromium-driver \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
COPY api/ /app/api
COPY docgenie/ /app/docgenie
COPY data/ /app/data
WORKDIR /app/api
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
**Handwriting service**: See `handwriting_service/Dockerfile` (deployed separately to RunPod)
---
## Performance & Costs
### Timing Breakdown (Single Document)
| Stage | Time | Notes |
|-------|------|-------|
| Seed download | 0.5-2s | Depends on image size + network |
| LLM prompt | 20-40s | Claude API latency |
| PDF render | 1-3s | Playwright initialization |
| Handwriting (10 words) | 180s | RunPod: 1 worker × (10 × 18s) |
| Visual elements | 0.5-1s | Local file selection |
| OCR | 3-5s | PaddleOCR inference |
| Dataset export | 0.5-1s | msgpack serialization |
| **TOTAL (no handwriting)** | **25-50s** | |
| **TOTAL (with handwriting)** | **200-230s** | Batched |
### Cost Breakdown (Per Document)
| Component | Cost | Notes |
|-----------|------|-------|
| Claude API | $0.015-0.03 | ~5K input + 16K output tokens |
| RunPod GPU (10 words) | $0.045 | 180s × $0.00025/s |
| Storage | Negligible | Temporary files deleted |
| **TOTAL (no handwriting)** | **$0.015-0.03** | |
| **TOTAL (with handwriting)** | **$0.06-0.08** | |
**Optimization**: Batch multiple documents in ONE RunPod call to share worker activation overhead.
---
## Error Handling & Reliability
### Retry Mechanisms
1. **Seed image download**: 3 attempts, exponential backoff (2s, 4s, 8s)
2. **Handwriting service**: 3 attempts, status polling up to 30 times
3. **LLM API**: Built-in Anthropic SDK retries (rate limits, 529 errors)
### Failure Modes
| Error Type | Behavior | User Impact |
|------------|----------|-------------|
| Seed download failure | Raise HTTP 400 | Request rejected immediately |
| LLM API error | Raise HTTP 500 | No charge, can retry |
| Handwriting service failure | **Raise exception** (NEW) | Generation fails, prevents invalid outputs |
| OCR failure | Log warning, continue | Document generated without OCR data |
| PDF render failure | Raise HTTP 500 | Request fails, no partial results |
### Session Fixes Applied
- ✅ **Handwriting service failure now raises an exception** (previously a silent continue)
- ✅ **Seed parameter defaults to null** (previously 0)
- ✅ **Seed image download retry logic** (handles 503 timeout errors)
- ✅ **API docs show correct examples** (seed: null, not 0)
---
## Future Enhancements
### Short-term
1. **Configurable prompt templates** via API parameter
2. **Async endpoint progress tracking** (websocket or polling)
3. **Batch ZIP download** with multiple documents in one archive
4. **Cost estimation** before generation (preview mode)
### Long-term
1. **Custom visual element upload** (user-provided logos, signatures)
2. **Multi-page document support** (currently single-page only)
3. **Additional export formats** (COCO, YOLO, HuggingFace Datasets)
4. **Fine-tuning handwriting styles** (train on user's handwriting samples)
5. **LLM caching** (reduce cost for similar prompts)
---
## Troubleshooting
### Common Issues
**Q: "Handwriting service not called, but enable_handwriting=true"**
- Check: LLM output contains `class="handwritten"` in HTML
- Check: `handwriting_ratio` > 0 (default 0.2)
- Check: `HANDWRITING_SERVICE_ENABLED=true` in environment
- Debug: Look for the "DEBUG - Handwriting Service Check" line in the logs
**Q: "RunPod job stuck IN_PROGRESS"**
- Cause: Large batch timing out
- Solution: Increase `HANDWRITING_SERVICE_TIMEOUT` (default 600s)
- Or: Reduce batch size by lowering `handwriting_ratio`
**Q: "503 first byte timeout" on seed download**
- Cause: CDN/storage provider temporary unavailability
- Solution: Retry logic automatically handles this (3 attempts)
- If persists: Use different image hosting (imgur, cloudinary)
**Q: "Seed parameter still shows 0 in API docs"**
- Fixed: Added `examples=[None, 42]` to Field definition
- Clear browser cache if seeing old docs
---
## Testing
### Unit Tests
```bash
# Test individual stages
pytest api/tests/test_utils.py::test_download_seed_images
pytest api/tests/test_utils.py::test_handwriting_service_batch
```
### Integration Tests
```bash
# Test sync endpoint (included in repo)
python api/test_sync_pdf_api.py
# Test async endpoint
python api/test_async_api.py
```
### Manual Testing via Docs UI
1. Navigate to `http://localhost:8000/docs`
2. Expand `/generate/pdf` endpoint
3. Click "Try it out"
4. Paste example request JSON
5. Click "Execute"
6. Download resulting ZIP file
### Example Test Request (Minimal)
```json
{
  "seed_images": [
    "https://i.imgur.com/example.jpg"
  ],
  "prompt_params": {
    "language": "english",
    "doc_type": "invoice",
    "num_solutions": 1,
    "enable_handwriting": false,
    "enable_visual_elements": false,
    "enable_ocr": true,
    "enable_dataset_export": true
  }
}
```
---
## Conclusion
The DocGenie API successfully implements all 19 stages of the original batch pipeline in a request/response model suitable for real-time generation. Key architectural differences:
1. **Handwriting generation**: Offloaded to RunPod serverless (cost-efficient batching)
2. **Seed selection**: User-provided URLs instead of pre-crawled dataset
3. **State management**: Ephemeral in-memory processing vs file-based
4. **Scalability**: Horizontal scaling via FastAPI workers + async processing
The API maintains feature parity with the batch pipeline while providing a simpler interface for integration with external systems (web apps, mobile apps, data pipelines).
**Total Processing Time**: 25-50s (no handwriting) or 200-230s (with handwriting)
**Cost Per Document**: $0.015-0.08 depending on features
**Output Formats**: PDF, PNG, msgpack, ZIP archive
For questions or issues, see `api/README.md` or `TESTING.md`.