# Complete API Flow Documentation
## Overview
The DocGenie API provides three endpoints for synthetic document generation, implementing a 19-stage pipeline that transforms seed images and prompts into complete datasets with OCR, ground truth, and optional handwriting/visual elements.
**Base URL**: `http://localhost:8000` (development) or Railway deployment
**Documentation**: `/docs` (FastAPI auto-generated Swagger UI)
---
## API Endpoints
### 1. `/generate` - Legacy JSON Response (POST)
**Purpose**: Generate documents and return complete JSON metadata
**Response**: JSON with HTML, PDF (base64), bounding boxes, optional handwriting/visual elements
**Use Case**: Testing, development, full metadata inspection
**Pipeline Stages**: 1-19 (configurable via parameters)
### 2. `/generate/pdf` - Sync PDF+Dataset ZIP (POST)
**Purpose**: Generate documents and return ZIP file with all artifacts
**Response**: ZIP file containing:
- `*.pdf` - Generated document PDFs
- `*_final.pdf` - PDFs with handwriting/visual elements (if enabled)
- `*.msgpack` - Dataset format (if export enabled)
- `metadata.json` - Complete generation metadata
- `handwriting/` - Individual handwriting images
- `visual_elements/` - Individual visual element images
**Use Case**: Production dataset generation, batch processing
**Pipeline Stages**: 1-19 (all features available)
### 3. `/generate/async` - Async Batch Processing (POST)
**Purpose**: Queue large batch jobs via background worker (Redis Queue)
**Response**: Task ID for status polling
**Status Check**: `GET /generate/async/status/{task_id}`
**Result Download**: `GET /generate/async/result/{task_id}` (returns ZIP)
**Use Case**: Large-scale dataset generation (100+ documents)
**Pipeline Stages**: 1-19 (via worker.py)
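A minimal polling client for this flow might look as follows (a sketch: the `task_id` field matches the response example later in this document, but the exact terminal status strings are assumptions):
```python
import time

import requests

BASE_URL = "http://localhost:8000"

payload = {
    "seed_images": ["https://example.com/seed1.jpg"],
    "prompt_params": {"language": "english", "doc_type": "invoice", "num_solutions": 1},
}

# Submit the batch job; the response carries a task ID for polling.
task_id = requests.post(f"{BASE_URL}/generate/async", json=payload).json()["task_id"]

# Poll the status endpoint until the worker reports a terminal state.
while True:
    status = requests.get(f"{BASE_URL}/generate/async/status/{task_id}").json()["status"]
    if status in ("completed", "failed"):  # exact status values assumed
        break
    time.sleep(10)

# Download the result ZIP once the job completes.
if status == "completed":
    zip_bytes = requests.get(f"{BASE_URL}/generate/async/result/{task_id}").content
    with open("dataset.zip", "wb") as f:
        f.write(zip_bytes)
```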
---
## Request Parameters
```python
class GenerateDocumentRequest:
    seed_images: List[HttpUrl]            # 1-8 seed images from web URLs
    prompt_params: PromptParameters       # Generation configuration

class PromptParameters:
    # Core Parameters
    language: str = "english"             # Document language
    doc_type: str = "invoice"             # Document type (invoice, receipt, form, etc.)
    gt_type: str = "qa"                   # Ground truth format (qa, kie)
    gt_format: str = "json"               # GT encoding (json, annotation)
    num_solutions: int = 1                # Documents per seed set

    # Feature Toggles (Stages 07-19)
    enable_handwriting: bool = False      # Stages 07-09, 12
    handwriting_ratio: float = 0.2        # Probabilistic filter (0.0-1.0)
    enable_visual_elements: bool = False  # Stages 08, 10, 13
    visual_element_types: List[str] = []  # Filter types: logo, photo, figure, barcode, etc.
    enable_ocr: bool = True               # Stage 15
    enable_bbox_normalization: bool = True   # Stage 16
    enable_gt_verification: bool = False     # Stage 17
    enable_analysis: bool = False            # Stage 18
    enable_debug_visualization: bool = False # Stage 19
    enable_dataset_export: bool = False      # Stage 19 (msgpack format)
    dataset_export_format: str = "msgpack"   # Currently only msgpack supported

    # Reproducibility
    seed: Optional[int] = None            # Random seed (null = random, int = reproducible)
```
---
## Pipeline Architecture: The 19 Stages
The API implements all 19 stages of the original batch pipeline in `docgenie/generation/`. Each stage is mapped to corresponding functions in `api/utils.py`.
### **Phase 1: Core Pipeline (Stages 01-06)**
Generate base documents from seed images and LLM prompts.
#### **Stage 01: Seed Selection & Download**
- **Original**: `pipeline_01_select_seeds.py`
- **API**: `download_seed_images()` in `api/utils.py:117-161`
- **Process**:
1. Accept user-provided seed image URLs (1-8 images)
2. Download with retry logic (3 attempts, exponential backoff)
3. Handle transient HTTP errors (502, 503, 504, 429)
4. Convert to base64 for LLM input
- **Error Handling**: Retry with 2s, 4s, 8s delays; raise HTTPException on failure
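A condensed sketch of this retry behavior (illustrative, not the exact code in `api/utils.py`):
```python
import base64
import time

import requests
from fastapi import HTTPException

RETRYABLE = {429, 502, 503, 504}  # transient HTTP errors listed above

def download_seed_image(url: str, max_attempts: int = 3) -> str:
    """Download one seed image and return it base64-encoded for the LLM."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code in RETRYABLE:
                raise requests.HTTPError(f"transient {resp.status_code}")
            resp.raise_for_status()
            return base64.b64encode(resp.content).decode("utf-8")
        except requests.RequestException:
            if attempt < max_attempts - 1:
                time.sleep(2 ** (attempt + 1))  # exponential backoff: 2s, 4s, 8s
    raise HTTPException(status_code=400, detail=f"Failed to download seed image: {url}")
```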
#### **Stage 02: Prompt LLM**
- **Original**: `pipeline_02_prompt_llm.py`
- **API**: `call_claude_api_direct()` in `api/utils.py:550-600`
- **Process**:
1. Load prompt template: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
2. Build prompt with parameters: language, doc_type, gt_type, num_solutions
3. Call Claude API (Anthropic Messages API v1)
- Model: `claude-3-5-sonnet-20241022` (configurable)
- Max tokens: 16,000
- Temperature: 1.0
- Vision: Send base64-encoded seed images
4. Receive HTML documents with embedded ground truth
- **LLM Output Format**: Multiple `<!DOCTYPE html>...</html>` blocks with:
- CSS styling with page dimensions
- HTML elements with semantic classes
- Handwriting markers: `class="handwritten author1"` (author1, author2, etc.)
- Visual element placeholders: `data-placeholder="logo"`, `data-content="company-logo"`
- Ground truth: `<script id="GT">{...json...}</script>`
#### **Stage 03: Process Response & Extract HTML**
- **Original**: `pipeline_03_process_response.py`
- **API**: `extract_html_documents_from_response()` in `api/utils.py:605-635`
- **Process**:
1. Parse LLM response for `<!DOCTYPE html>...</html>` blocks (regex)
2. Prettify HTML with BeautifulSoup
3. Validate HTML structure
4. Extract ground truth JSON from `<script id="GT">` tag
5. Remove GT script tag, clean HTML for rendering
- **Validation**: Check for required elements, CSS, proper structure
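A minimal sketch of this extraction step (illustrative; the real function performs additional validation):
```python
import json
import re

from bs4 import BeautifulSoup

def extract_documents(llm_response: str) -> list[dict]:
    """Split an LLM response into cleaned HTML documents plus their GT JSON."""
    docs = []
    # Non-greedy match so multiple documents in one response are split correctly.
    pattern = r"<!DOCTYPE html>.*?</html>"
    for block in re.findall(pattern, llm_response, re.DOTALL | re.IGNORECASE):
        soup = BeautifulSoup(block, "html.parser")
        gt_tag = soup.find("script", id="GT")
        ground_truth = json.loads(gt_tag.string) if gt_tag and gt_tag.string else None
        if gt_tag:
            gt_tag.decompose()  # strip the GT script before rendering
        docs.append({"html": soup.prettify(), "ground_truth": ground_truth})
    return docs
```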
#### **Stage 04: Render PDF & Extract Geometries**
- **Original**: `pipeline_04_render_pdf_and_extract_geos.py`
- **API**: `render_html_to_pdf()` in `api/utils.py:650-740`
- **Process**:
1. Launch Playwright browser (Chromium)
2. Set page dimensions from CSS `@page` rule
3. Render HTML to PDF via `page.pdf()`
4. Extract element geometries:
   - Handwriting elements: `.handwritten` class → `{rect, text, classes, selectorTypes: ["handwriting"]}`
   - Visual elements: `[data-placeholder]` attribute → `{rect, dataPlaceholder, dataContent, selectorTypes: ["visual_element"]}`
5. Save PDF and geometries JSON
- **Output**:
- PDF at 72 DPI (PyMuPDF standard)
- Geometries at 96 DPI (browser rendering)
- Dimensions in mm
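A minimal Playwright sketch of this render-and-extract step (illustrative; the geometry fields mirror the shapes listed above):
```python
from playwright.sync_api import sync_playwright

def render_pdf_and_geometries(html: str, pdf_path: str) -> list[dict]:
    """Render HTML to PDF and pull marker geometries out of the live page."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="networkidle")
        page.pdf(path=pdf_path, prefer_css_page_size=True)  # honor the CSS @page rule
        # Collect rects for handwriting markers and visual-element placeholders.
        geometries = page.evaluate(
            """() => [...document.querySelectorAll('.handwritten, [data-placeholder]')]
                .map(el => ({
                    rect: el.getBoundingClientRect().toJSON(),
                    text: el.textContent,
                    classes: el.className,
                    dataPlaceholder: el.dataset.placeholder || null,
                    dataContent: el.dataset.content || null,
                }))"""
        )
        browser.close()
    return geometries
```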
#### **Stage 05: Extract Bounding Boxes**
- **Original**: `pipeline_05_extract_bboxes_from_pdf.py`
- **API**: `extract_bboxes_from_rendered_pdf()` in `api/utils.py:750-825`
- **Process**:
1. Open PDF with PyMuPDF (fitz)
2. Extract text at word level: `page.get_text("words")`
3. Structure bboxes as:
```python
{
    "text": "word",
    "x0": float,  # left
    "y0": float,  # top
    "x1": float,  # right (x2)
    "y1": float,  # bottom (y2)
    "block_no": int,
    "line_no": int,
    "word_no": int
}
```
4. Filter whitespace-only text
5. Convert to OCRBox objects for processing
- **Coordinate System**: PDF points (72 DPI), origin top-left
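A minimal PyMuPDF sketch of this stage (the `get_text("words")` tuple layout is standard PyMuPDF):
```python
import fitz  # PyMuPDF

def extract_word_bboxes(pdf_path: str) -> list[dict]:
    """Word-level boxes in PDF points (72 DPI), origin top-left."""
    page = fitz.open(pdf_path)[0]
    words = []
    # Each entry is (x0, y0, x1, y1, text, block_no, line_no, word_no).
    for x0, y0, x1, y1, text, block_no, line_no, word_no in page.get_text("words"):
        if not text.strip():
            continue  # drop whitespace-only entries
        words.append({
            "text": text, "x0": x0, "y0": y0, "x1": x1, "y1": y1,
            "block_no": block_no, "line_no": line_no, "word_no": word_no,
        })
    return words
```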
#### **Stage 06: Validation**
- **Original**: `pipeline_06_validation.py` (implicit)
- **API**: `validate_html_structure()`, `validate_pdf()`, `validate_bboxes()` in `api/utils.py:830-890`
- **Checks**:
- HTML: Required DOCTYPE, head, body, CSS
- PDF: File readable, page count = 1, has text
- Bboxes: Minimum count (configurable), valid coordinates
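A sketch of the PDF checks (the `min_words` threshold is illustrative):
```python
import fitz  # PyMuPDF

def validate_pdf(pdf_path: str, min_words: int = 10) -> bool:
    """Stage 06-style sanity checks: readable, single page, has text."""
    try:
        doc = fitz.open(pdf_path)
    except Exception:
        return False  # file unreadable
    if len(doc) != 1:
        return False  # single-page documents only
    return len(doc[0].get_text("words")) >= min_words
```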
---
### **Phase 2: Feature Synthesis (Stages 07-13)**
Add handwriting and visual elements to base documents.
#### **Stage 07: Extract Handwriting Definitions**
- **Original**: `pipeline_07_extract_handwriting.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1150-1235`
- **Process**:
1. Filter geometries: `"handwriting" in geo['selectorTypes']`
2. Parse classes: Extract `author1`, `author2`, etc. from `class="handwritten author1"`
3. **Probabilistic filtering** (handwriting_ratio):
```python
if random.random() > handwriting_ratio:
    continue  # Skip this element
```
- `ratio=0.0`: No handwriting (0%)
- `ratio=0.5`: ~50% of marked elements
- `ratio=1.0`: All marked elements (100%)
4. Match geometries to word bboxes:
- Convert browser coords (96 DPI) to PDF coords (72 DPI): `scale = 72/96 = 0.75`
- Find consecutive word bboxes matching geometry text
- Check bboxes are within geometry rect (threshold: 0.7)
- Track taken bbox indices to avoid duplicates
5. Build handwriting region definitions:
```python
{
    "id": "hw0",
    "text": "Patient Name",
    "author_id": "author1",
    "is_signature": False,
    "rect": {x, y, width, height},  # in points
    "bboxes": ["0_0_0 Patient 10.0 20.0 50.0 35.0", ...]
}
```
- **Reproducibility**: Use `seed + i` for each region to maintain order consistency
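A sketch of the coordinate conversion and containment test from step 4 (helper names are illustrative; the 0.7 threshold is the one cited above):
```python
BROWSER_DPI = 96  # Playwright geometry coordinates
PDF_DPI = 72      # PyMuPDF word-bbox coordinates
SCALE = PDF_DPI / BROWSER_DPI  # 0.75

def browser_rect_to_pdf(rect: dict) -> dict:
    """Convert a browser getBoundingClientRect() rect to PDF points."""
    return {k: rect[k] * SCALE for k in ("x", "y", "width", "height")}

def bbox_inside_rect(bbox: dict, rect: dict, threshold: float = 0.7) -> bool:
    """Accept a word bbox when >= threshold of its area lies inside the rect."""
    ix0 = max(bbox["x0"], rect["x"])
    iy0 = max(bbox["y0"], rect["y"])
    ix1 = min(bbox["x1"], rect["x"] + rect["width"])
    iy1 = min(bbox["y1"], rect["y"] + rect["height"])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (bbox["x1"] - bbox["x0"]) * (bbox["y1"] - bbox["y0"])
    return area > 0 and inter / area >= threshold
```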
#### **Stage 08: Extract Visual Element Definitions**
- **Original**: `pipeline_08_extract_visual_element_definitions.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1237-1275`
- **Process**:
1. Filter geometries: `"visual_element" in geo['selectorTypes']`
2. Parse attributes:
- `data-placeholder`: Element type (logo, photo, figure, chart, barcode, etc.)
- `data-content`: Semantic description (e.g., "company-logo", "product-photo")
3. Normalize types using synonyms:
- "chart" β†’ "figure"
- "image" β†’ "photo"
4. Filter by `visual_element_types` parameter (if specified)
5. Convert coordinates: pixels (96 DPI) β†’ mm
6. Extract rotation from CSS `transform: rotate(Xdeg)`
7. Build visual element definitions:
```python
{
    "id": "ve0",
    "type": "logo",  # normalized
    "content": "company-logo",
    "rect": {x, y, width, height},  # in mm
    "rotation": 0  # degrees
}
```
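A sketch of the normalization and filtering rules from steps 3-4 (helper names are illustrative):
```python
TYPE_SYNONYMS = {"chart": "figure", "image": "photo"}  # per the rules above

def normalize_ve_type(placeholder: str) -> str:
    """Map a data-placeholder value onto a canonical visual-element type."""
    value = placeholder.strip().lower()
    return TYPE_SYNONYMS.get(value, value)

def keep_element(ve_type: str, allowed_types: list[str]) -> bool:
    """An empty visual_element_types list means keep every element."""
    return not allowed_types or ve_type in allowed_types

assert normalize_ve_type("Chart") == "figure"
assert keep_element("logo", ["logo", "signature"])
```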
#### **Stage 09: Create Handwriting Images**
- **Original**: `pipeline_09_create_handwriting_images.py`
- **API**: `call_handwriting_service_batch()` in `api/utils.py:785-920`
- **Handwriting Service**: RunPod serverless endpoint hosting WordStylist diffusion model
- **Service Implementation**: `handwriting_service/handler.py`, `handwriting_service/inference.py`
**🔄 Handwriting Service Integration Details:**
##### **Service Architecture**
- **Platform**: RunPod Serverless (GPU: NVIDIA A4000, Cost: ~$0.00025/s active)
- **Model**: WordStylist (Diffusion-based handwriting synthesis)
- Architecture: UNet with conditional style embeddings
- Input: Text (A-Z, a-z only, no spaces), Writer style ID (0-656)
- Output: PNG image with transparent background
- Inference time: ~18s per text on A4000
- Weights: `handwriting_service/WordStylist/models/`
- **Endpoints**:
- `/run` (async): Queue job, return ID, poll `/status/{id}` (10MB limit)
- `/runsync` (sync): Wait for completion, return result (20MB limit, used by API)
##### **Batch Processing (Cost Optimization)**
The API uses TRUE batch processing to minimize RunPod activation overhead:
```python
# ✅ NEW: Batch all texts in ONE request
runpod_request = {
    "input": {
        "texts": [
            {"text": "Hello", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
            {"text": "World", "author_id": 42, "hw_id": "hw0_b0_l0_w1"},
            # ... 10-100 texts
        ],
        "apply_blur": True
    }
}
# Result: 1 worker activation for N × 18s of work → ~40-60% cost savings
```
**Cost Comparison for 10 texts:**
- ❌ OLD (parallel): 10 workers × 18s = 180 worker-seconds + 10× activation fee
- ✅ NEW (batched): 1 worker × 190s = 190 worker-seconds + 1× activation fee
##### **API Processing Flow**
1. **Group by region and line**: Split handwriting regions into word-level requests
```python
# Text: "Patient Name" → 2 word-level generations
texts_to_generate = [
    {"text": "Patient", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
    {"text": "Name", "author_id": 42, "hw_id": "hw0_b0_l0_w1"}
]
```
2. **Map author IDs to numeric styles**:
```python
# "author1" β†’ WRITER_STYLES[1] = 42 (deterministic)
# "author2" β†’ WRITER_STYLES[2] = 137
# 657 total writer styles available
```
3. **Sanitize text** (WordStylist constraint):
```python
# Only A-Z, a-z allowed (no spaces, numbers, punctuation)
"Hello123!" β†’ "Hello"
"first-name" β†’ "firstname"
```
4. **Send batch request** to RunPod `/runsync` endpoint:
```python
POST https://api.runpod.ai/v2/{endpoint_id}/runsync
Authorization: Bearer {RUNPOD_API_KEY}
Content-Type: application/json
{
    "input": {
        "texts": [...],
        "apply_blur": True  # Gaussian blur for realism
    }
}
```
5. **Handle async responses**:
- If `status: "IN_PROGRESS"`: Poll `/status/{job_id}` every 5-10s (max 30 polls)
- If `status: "COMPLETED"`: Extract `output.images[]`
- If `status: "FAILED"`: Raise exception (stops entire generation)
6. **Response format**:
```python
{
    "status": "COMPLETED",
    "output": {
        "images": [
            {
                "image_base64": "iVBORw0KGgoAAAANSU...",
                "width": 200,
                "height": 64,
                "text": "Patient",
                "author_id": 42,
                "hw_id": "hw0_b0_l0_w0"
            },
            ...
        ],
        "total_generated": 2
    }
}
```
7. **Store generated images**: Map `hw_id → image_base64` for insertion
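A condensed sketch of the polling behavior from steps 5-6, using the standard RunPod status endpoint (the `RUNPOD_ENDPOINT_ID` variable is an assumption for this sketch; the API itself reads the full URL from `HANDWRITING_SERVICE_URL`):
```python
import os
import time

import requests

# Endpoint ID comes from the RunPod console.
RUNPOD_BASE = f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

def poll_until_done(job_id: str, interval: float = 10, max_polls: int = 30) -> dict:
    """Poll /status/{job_id} until the job reaches a terminal state."""
    for _ in range(max_polls):
        job = requests.get(f"{RUNPOD_BASE}/status/{job_id}", headers=HEADERS).json()
        if job["status"] == "COMPLETED":
            return job["output"]
        if job["status"] == "FAILED":
            raise RuntimeError(f"Handwriting job {job_id} failed")  # stops the generation
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} still IN_PROGRESS after {max_polls} polls")
```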
##### **Error Handling**
- **Retry logic**: 3 attempts with exponential backoff (matching seed download)
- **Timeouts**: Dynamic, based on batch size: `20s × num_texts + 30s buffer`
- **Failure behavior**: raise an exception (since the session fix)
- ❌ OLD: Silent continue → documents generated without handwriting
- ✅ NEW: Raise exception → generation fails when the user requested handwriting
##### **Service Code Structure**
**`handwriting_service/handler.py`** (RunPod handler):
```python
# Initialize model ONCE at module level (not per request)
generator = HandwritingGenerator(
    model_dir="WordStylist",
    checkpoint_path="WordStylist/models",
    device="cuda"
)

def handler(job):
    """RunPod entry point - supports both /run and /runsync"""
    texts = job["input"]["texts"]  # Batch input
    results = generator.generate_batch(
        texts=[t["text"] for t in texts],
        author_ids=[t["author_id"] for t in texts],
        num_inference_steps=50,
        temperature=1.0,
        apply_blur=True
    )
    return {"images": results, "total_generated": len(results)}
```
**`handwriting_service/inference.py`** (WordStylist wrapper):
```python
class HandwritingGenerator:
    def generate_batch(self, texts, author_ids, ...):
        results = []
        for text, author_id in zip(texts, author_ids):
            # Load model checkpoint
            unet = Unet(...)
            unet.load_state_dict(checkpoint)
            # Prepare style condition
            style_id_tensor = torch.tensor([author_id])
            # Diffusion reverse process (50 steps)
            img = self.sample(unet, style_id_tensor, text_length=len(text))
            # Post-process: crop, resize, apply blur
            img_pil = postprocess_image(img)
            if apply_blur:
                img_pil = img_pil.filter(ImageFilter.GaussianBlur(1.2))
            # Encode to base64
            img_base64 = encode_pil_to_base64(img_pil)
            results.append({
                "image_base64": img_base64,
                "width": img_pil.width,
                "height": img_pil.height
            })
        return results
```
#### **Stage 10: Create Visual Element Images**
- **Original**: `pipeline_10_create_visual_elements.py`
- **API**: `generate_visual_element_images()` in `api/utils.py:925-1020`
- **Process**:
1. Load prefab images from `data/visual_element_prefabs/{type}/`:
- `logo/`: Company logos (50+ SVGs)
- `photo/`: Stock photos (100+ JPGs)
- `figure/`: Charts, graphs (30+ PNGs)
- `barcode/`: Generated barcodes
- `qr_code/`, `stamp/`, `signature/`, `checkbox/`, etc.
2. **Random selection** (seed-based if provided):
```python
if seed is not None:
    random.seed(seed)
prefab_path = random.choice(list(prefab_dir.glob("*")))
```
3. **Special handling**:
- **Barcode**: Generate on-the-fly using `python-barcode` library
```python
# Generate random EAN-13 barcode (12 digits + checksum)
barcode_num = random.randint(100000000000, 999999999999)
barcode = EAN13(str(barcode_num), writer=ImageWriter())
```
- **QR Code**: Generate using `qrcode` library
- **Checkbox**: Render checked/unchecked SVG
4. Load and convert to base64:
```python
with open(prefab_path, 'rb') as f:
    img_bytes = f.read()
img_base64 = base64.b64encode(img_bytes).decode('utf-8')
```
5. Return mapping: `ve_id → image_base64`
#### **Stage 11: Make Text Transparent (Implicit)**
- **Original**: `pipeline_11_make_text_transparent.py`
- **API**: Implemented as "whiteout" in `process_stage3_complete()` at `api/utils.py:1415-1427`
- **Process**:
```python
# Draw white rectangles over original text to hide it
for hw_region in handwriting_regions:
    for bbox_str in hw_region['bboxes']:
        bbox = parse_bbox(bbox_str)
        rect = fitz.Rect(bbox.x0, bbox.y0, bbox.x2, bbox.y2)
        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))  # White fill
```
- **Why not transparent?**: PyMuPDF doesn't support making existing text transparent, so we use white rectangles instead (same visual result)
#### **Stage 12: Insert Handwriting Images**
- **Original**: `pipeline_12_insert_handwriting_images.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1429-1520`
- **Process**:
1. **Position calculation**:
```python
# Get word bbox from PDF extraction
bbox_w = bbox.x2 - bbox.x0 # Width in points
bbox_h = bbox.y2 - bbox.y0 # Height in points
# Resize handwriting image with aspect ratio
scale = min(bbox_w / img_width, bbox_h / img_height)
new_w = int(img_width * scale * SCALE_UP_FACTOR) # 3x upscale
new_h = int(img_height * scale * SCALE_UP_FACTOR)
# Add random offsets for natural variation
offset_x = random.randint(-MAX_OFFSET_LEFT, MAX_OFFSET_RIGHT) + FIXED_OFFSET
offset_y = random.randint(-MAX_OFFSET_UP, MAX_OFFSET_DOWN)
# Position at bbox coordinates
x0 = bbox.x0 + offset_x
y0 = bbox.y0 + offset_y - y_padding
```
2. **Insert into PDF**:
```python
img_resized = img.resize((new_w, new_h), Image.LANCZOS).convert("RGBA")
img_bytes = pil_to_bytes(img_resized)
rect = fitz.Rect(x0, y0, x0 + bbox_w, y0 + bbox_h)
page.insert_image(rect, stream=img_bytes)
```
3. Save intermediate PDF: `{doc_id}_with_handwriting.pdf`
#### **Stage 13: Insert Visual Elements**
- **Original**: `pipeline_13_insert_visual_elements.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1523-1625`
- **Process**:
1. Convert mm → points: `mm_to_pt = 72 / 25.4`
2. Resize with aspect ratio preservation (same as handwriting)
3. Center image on white background (maintains bbox size)
4. Insert into PDF at geometry coordinates
5. Save final PDF: `{doc_id}_final.pdf` (includes both handwriting + visual elements)
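A minimal sketch of the unit conversion and insertion (illustrative; the actual code also resizes and centers the image as described above):
```python
import fitz  # PyMuPDF

MM_TO_PT = 72 / 25.4  # 1 mm ≈ 2.835 pt

def insert_visual_element(page: fitz.Page, rect_mm: dict, image_bytes: bytes) -> None:
    """Place a visual-element image at its geometry coordinates."""
    rect = fitz.Rect(
        rect_mm["x"] * MM_TO_PT,
        rect_mm["y"] * MM_TO_PT,
        (rect_mm["x"] + rect_mm["width"]) * MM_TO_PT,
        (rect_mm["y"] + rect_mm["height"]) * MM_TO_PT,
    )
    page.insert_image(rect, stream=image_bytes)
```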
---
### **Phase 3: Image Finalization & OCR (Stages 14-15)**
Convert final PDF to high-resolution image and extract OCR data.
#### **Stage 14: Render Image**
- **Original**: `pipeline_14_render_image.py`
- **API**: `process_stage4_ocr()` in `api/utils.py:1899-1940`
- **Process**:
```python
# Render PDF page to high-res PNG
page = fitz.open(pdf_path)[0]
pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))  # 3x scale = 216 DPI
img_bytes = pix.tobytes("png")
img_base64 = base64.b64encode(img_bytes).decode('utf-8')
```
- **Output**: Base64-encoded PNG at 216 DPI (72 DPI base × 3; configurable via scale factor)
#### **Stage 15: Perform OCR**
- **Original**: `pipeline_15_perform_ocr.py`
- **API**: `run_paddle_ocr()` in `api/utils.py:1950-2080`
- **OCR Engine**: PaddleOCR v4 (multilingual)
- Models: `PP-OCRv4` detection + recognition
- Languages: Supports 80+ languages
- Accuracy: State-of-the-art open-source OCR
- **Process**:
1. Render PDF to image via `pdf2image` at specified DPI (default: 300)
2. Initialize PaddleOCR with language parameter
3. Run detection + recognition:
```python
ocr = PaddleOCR(lang=language, use_gpu=True)
results = ocr.ocr(img_array, cls=True)
```
4. Parse results into word-level bboxes:
```python
{
    "text": "word",
    "bbox": {
        "x0": float,
        "y0": float,
        "x1": float,  # right
        "y1": float   # bottom
    },
    "confidence": 0.95
}
```
- **Output**: Dictionary with `words` list, image dimensions, OCR engine info
---
### **Phase 4: Dataset Packaging (Stages 16-19)**
Normalize, verify, analyze, and export final dataset.
#### **Stage 16: Normalize Bboxes**
- **Original**: `pipeline_16_normalize_bboxes.py`
- **API**: `normalize_bboxes()` in `api/utils.py:2100-2180`
- **Process**:
1. Convert absolute pixel coordinates → normalized [0, 1] range:
```python
norm_bbox = [
    bbox['x0'] / img_width,
    bbox['y0'] / img_height,
    bbox['x1'] / img_width,
    bbox['y1'] / img_height
]
```
2. Clip to [0, 1]: `[max(0, min(1, x)) for x in norm_bbox]`
3. Create word-level and segment-level bboxes
- **Output**: List of `{text, bbox: [x0, y0, x1, y1]}` where bbox is normalized
#### **Stage 17: Ground Truth Verification**
- **Original**: `pipeline_17_gt_preparation_verification.py`
- **API**: `verify_ground_truth()` in `api/utils.py:2185-2250`
- **Checks**:
- GT structure: Valid JSON, required fields
- Text matching: GT text exists in OCR output
- Bbox coverage: GT answers have corresponding bboxes
- **Output**: Verification report with pass/fail status
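A sketch of the text-matching check (illustrative; assumes a flat KIE-style GT dictionary):
```python
def verify_ground_truth(gt: dict, ocr_words: list[dict]) -> dict:
    """Check that each GT value is recoverable from the OCR output."""
    ocr_text = " ".join(w["text"] for w in ocr_words).lower()
    missing = [key for key, value in gt.items() if str(value).lower() not in ocr_text]
    return {
        "passed": not missing,
        "missing_fields": missing,  # GT values the OCR never saw
        "total_fields": len(gt),
    }
```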
#### **Stage 18: Analyze**
- **Original**: `pipeline_18_analyze.py`
- **API**: `analyze_document()` in `api/utils.py:2255-2320`
- **Metrics**:
- Word count, character count
- Average word length
- Handwriting regions count, coverage %
- Visual elements count by type
- OCR confidence statistics (mean, min, max)
- **Output**: Analysis dictionary with computed metrics
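A sketch of the metric computation (illustrative; field names match the OCR output described above):
```python
from statistics import mean

def analyze_document(ocr_words: list[dict], hw_regions: list, visual_elements: list) -> dict:
    """Compute the Stage 18 metrics from already-extracted artifacts."""
    texts = [w["text"] for w in ocr_words]
    confidences = [w["confidence"] for w in ocr_words]
    by_type: dict[str, int] = {}
    for ve in visual_elements:
        by_type[ve["type"]] = by_type.get(ve["type"], 0) + 1
    return {
        "word_count": len(texts),
        "char_count": sum(len(t) for t in texts),
        "avg_word_length": mean(len(t) for t in texts) if texts else 0,
        "num_handwriting_regions": len(hw_regions),
        "visual_elements_by_type": by_type,
        "ocr_confidence": {
            "mean": mean(confidences), "min": min(confidences), "max": max(confidences)
        } if confidences else None,
    }
```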
#### **Stage 19: Create Debug Data & Export**
- **Original**: `pipeline_19_create_debug_data.py`
- **API**: `export_to_msgpack()` in `api/utils.py:2350-2520`
- **Debug Visualization**:
- Draw bboxes on image with different colors:
- Green: Word bboxes
- Red: Handwriting regions
- Blue: Visual elements
- Yellow: Ground truth target regions
- Save annotated image
- **Dataset Export (msgpack)**:
```python
dataset_entry = {
    "image": img_bytes,  # PNG bytes
    "words": ["hello", "world"],
    "word_bboxes": [[0.1, 0.2, 0.15, 0.25], ...],  # Normalized
    "segment_bboxes": [...],
    "ground_truth": {"question": "answer"},
    "metadata": {
        "document_id": "...",
        "has_handwriting": True,
        "num_visual_elements": 3
    }
}
msgpack.dump(dataset_entry, f)
```
- **Output**: `.msgpack` file compatible with PyTorch DataLoader
---
## Pipeline Verification: API vs Original Implementation
### ✅ **Stage-by-Stage Mapping**
| Stage | Original File | API Function | Status |
|-------|--------------|--------------|--------|
| 01 | `pipeline_01_select_seeds.py` | `download_seed_images()` | ✅ Mapped (with retry logic) |
| 02 | `pipeline_02_prompt_llm.py` | `call_claude_api_direct()` | ✅ Mapped (uses Messages API) |
| 03 | `pipeline_03_process_response.py` | `extract_html_documents_from_response()` | ✅ Mapped |
| 04 | `pipeline_04_render_pdf_and_extract_geos.py` | `render_html_to_pdf()` | ✅ Mapped (Playwright) |
| 05 | `pipeline_05_extract_bboxes_from_pdf.py` | `extract_bboxes_from_rendered_pdf()` | ✅ Mapped |
| 06 | `pipeline_06_validation.py` | `validate_html_structure()`, `validate_pdf()` | ✅ Mapped |
| 07 | `pipeline_07_extract_handwriting.py` | `process_stage3_complete()` section | ✅ Mapped (with ratio filter) |
| 08 | `pipeline_08_extract_visual_element_definitions.py` | `process_stage3_complete()` section | ✅ Mapped |
| 09 | `pipeline_09_create_handwriting_images.py` | `call_handwriting_service_batch()` | ✅ Mapped (RunPod integration) |
| 10 | `pipeline_10_create_visual_elements.py` | `generate_visual_element_images()` | ✅ Mapped |
| 11 | `pipeline_11_make_text_transparent.py` | `process_stage3_complete()` (whiteout) | ✅ Mapped (white rectangles) |
| 12 | `pipeline_12_insert_handwriting_images.py` | `process_stage3_complete()` section | ✅ Mapped |
| 13 | `pipeline_13_insert_visual_elements.py` | `process_stage3_complete()` section | ✅ Mapped |
| 14 | `pipeline_14_render_image.py` | `process_stage4_ocr()` | ✅ Mapped |
| 15 | `pipeline_15_perform_ocr.py` | `run_paddle_ocr()` | ✅ Mapped |
| 16 | `pipeline_16_normalize_bboxes.py` | `normalize_bboxes()` | ✅ Mapped |
| 17 | `pipeline_17_gt_preparation_verification.py` | `verify_ground_truth()` | ✅ Mapped |
| 18 | `pipeline_18_analyze.py` | `analyze_document()` | ✅ Mapped |
| 19 | `pipeline_19_create_debug_data.py` | `export_to_msgpack()` | ✅ Mapped |
### 📊 **Key Differences: API vs Batch Pipeline**
#### **Processing Model**
- **Original**: Batch processing with file-based state management
- Input: CSV of seed selections, prompt parameters in JSON
- Output: Folder structure with intermediate files
- State: JSON logs per document + message
- Resumability: Can restart from any stage
- **API**: Request/response with in-memory processing
- Input: JSON request with seed URLs
- Output: JSON response or ZIP file
- State: Ephemeral (temporary directories)
- Resumability: None (single-shot generation)
#### **Handwriting Generation**
- **Original**: Local GPU with WordStylist model loaded in-process
- Location: `docgenie/generation/handwriting_diffusion/`
- Execution: `generate_handwriting_diffusion_raw.py`
- Cost: Free (local GPU)
- **API**: Remote RunPod serverless endpoint
- Location: `handwriting_service/` (deployed separately)
- Execution: HTTP POST to RunPod API
- Cost: ~$0.00025/s GPU time (pay-per-use)
- Benefit: No local GPU required, scales automatically
#### **Seed Selection**
- **Original**: Pre-crawled dataset with systematic selection
- Seeds stored in: `data/datasets/base_v2/`
- Selection: Clustering algorithm → balanced subset
- Tracking: CSV manifest with seed IDs
- **API**: User-provided URLs
- Seeds: Any publicly accessible image URL
- Selection: User chooses 1-8 images per request
- Tracking: URLs stored in request metadata
#### **Prompt Templates**
- **Original**: Multiple template versions in folders
- Path: `data/prompt_templates/{version}/seed-based-json.txt`
- Versioning: ClaudeRefined1 → ClaudeRefined12
- Selection: Configurable per dataset
- **API**: Fixed template (latest version)
- Path: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
- Hardcoded in: `api/main.py:171`
- **Future improvement**: Make template selectable via API parameter
---
## Complete Request Flow Example
### Example Request (Sync Endpoint)
```bash
POST /generate/pdf HTTP/1.1
Content-Type: application/json
{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "english",
    "doc_type": "medical_form",
    "gt_type": "kie",
    "gt_format": "json",
    "num_solutions": 2,
    "enable_handwriting": true,
    "handwriting_ratio": 0.3,
    "enable_visual_elements": true,
    "visual_element_types": ["logo", "signature"],
    "enable_ocr": true,
    "enable_dataset_export": true,
    "seed": 42
  }
}
```
### Processing Flow (Stages Executed)
**Phase 1: Core Document Generation (30-60s)**
1. ✅ Download 2 seed images with retry → `[img1_b64, img2_b64]`
2. ✅ Load prompt template → Build prompt for medical_form + KIE
3. ✅ Call Claude API → LLM generates 2 HTML documents (~25s)
4. ✅ Extract HTML + ground truth → 2 clean HTML files with GT JSON
5. ✅ Render each HTML to PDF via Playwright → 2 PDFs + geometries
6. ✅ Extract word bboxes from PDFs → ~200-500 words per document
**Phase 2: Feature Synthesis (roughly 2-5 minutes if handwriting enabled)**
7. ✅ Parse geometries for handwriting markers
    - Found: 12 elements with `class="handwritten"`
    - Filtered by ratio: 12 × 0.3 = ~4 elements selected (probabilistic)
    - Matched to word bboxes: 4 regions with 15 total words
8. ✅ Parse geometries for visual elements
    - Found: 3 elements (`data-placeholder="logo"`, `"signature"`, `"logo"`)
    - Filtered by types: Keep logo + signature, remove others
    - Result: 2 visual element definitions
9. ✅ Generate handwriting images via RunPod
    - **Batch request**: 15 words in ONE API call
    - Map author IDs: `author1 → style 42`, `author2 → style 137`
    - RunPod processing: 1 worker × (15 × 18s) = ~270s
    - Result: 15 PNG images (base64-encoded)
10. ✅ Generate visual element images
    - Logo: Random selection from `data/visual_element_prefabs/logo/` (seed=42)
    - Signature: Random selection from the signature prefabs
    - Result: 2 PNG images
11. ✅ Whiteout original text: Draw white rectangles over 15 word positions
12. ✅ Insert handwriting: Place 15 generated images at word bboxes with offsets
    - Save: `doc1_with_handwriting.pdf`, `doc2_with_handwriting.pdf`
13. ✅ Insert visual elements: Place logo + signature at geometry coords
    - Save: `doc1_final.pdf`, `doc2_final.pdf`
**Phase 3: Image + OCR (5-10s)**
14. ✅ Render each final PDF to a 216 DPI image → 2 PNG files (base64)
15. ✅ Run PaddleOCR on each image
    - Doc1: Detected 187 words, avg confidence 0.91
    - Doc2: Detected 203 words, avg confidence 0.94
**Phase 4: Dataset Packaging (2-5s)**
16. ✅ Normalize OCR bboxes: Convert pixels → [0,1] range
17. ✅ Verify ground truth: Check GT fields match OCR output (enabled=false, skipped)
18. ✅ Analyze documents: Compute metrics (enabled=false, skipped)
19. ✅ Export to msgpack:
    - Doc1: Pack image + words + normalized bboxes + GT → `doc1.msgpack`
    - Doc2: Pack image + words + normalized bboxes + GT → `doc2.msgpack`
**Final Output: ZIP File Contents**
```
dataset.zip
├── doc1_uuid_0.pdf          # Original rendered PDF
├── doc1_uuid_0_final.pdf    # PDF with handwriting + visual elements
├── doc1_uuid_0.msgpack      # Dataset format
├── doc2_uuid_1.pdf
├── doc2_uuid_1_final.pdf
├── doc2_uuid_1.msgpack
├── metadata.json            # Complete generation metadata
└── handwriting/
    ├── hw0_b0_l0_w0.png     # Individual handwriting images
    ├── hw0_b0_l0_w1.png
    └── ... (13 more)
```
### Response (JSON Metadata)
```json
{
  "task_id": "uuid-here",
  "status": "completed",
  "num_documents": 2,
  "processing_time_seconds": 305.7,
  "stages_completed": [
    "seed_download", "llm_prompt", "html_extraction",
    "pdf_render", "bbox_extraction", "handwriting_extraction",
    "visual_element_extraction", "handwriting_generation",
    "visual_element_generation", "handwriting_insertion",
    "visual_element_insertion", "image_render", "ocr",
    "bbox_normalization", "dataset_export"
  ],
  "documents": [
    {
      "document_id": "doc1_uuid_0",
      "ground_truth": {"patient_name": "John Doe", "date": "2024-01-15"},
      "num_words": 187,
      "num_handwriting_regions": 2,
      "num_visual_elements": 2,
      "ocr_confidence_avg": 0.91
    },
    {
      "document_id": "doc2_uuid_1",
      "ground_truth": {"patient_name": "Jane Smith", "date": "2024-01-16"},
      "num_words": 203,
      "num_handwriting_regions": 2,
      "num_visual_elements": 2,
      "ocr_confidence_avg": 0.94
    }
  ],
  "download_url": "/download/dataset_uuid.zip"
}
```
---
## Configuration & Environment
### Required Environment Variables
```bash
# LLM API
ANTHROPIC_API_KEY=sk-ant-... # Claude API key
CLAUDE_MODEL=claude-3-5-sonnet-20241022 # Default model
# Handwriting Service (RunPod)
HANDWRITING_SERVICE_ENABLED=true
HANDWRITING_SERVICE_URL=https://api.runpod.ai/v2/{endpoint_id}/runsync
RUNPOD_API_KEY=... # RunPod API key
HANDWRITING_APPLY_BLUR=true # Gaussian blur for realism
HANDWRITING_SERVICE_MAX_RETRIES=3
HANDWRITING_SERVICE_TIMEOUT=600 # 10 minutes for large batches
# OCR Configuration
OCR_DPI=300 # Image resolution for OCR
OCR_LANGUAGE=en # PaddleOCR language code
# File Paths
PROMPT_TEMPLATES_DIR=/path/to/data/prompt_templates
VISUAL_ELEMENT_PREFABS_DIR=/path/to/data/visual_element_prefabs
```
### Docker Deployment (Railway)
```dockerfile
# Dockerfile (api service)
FROM python:3.11-slim
# chromium/chromium-driver: Playwright dependencies; libgl1/libglib2.0-0: PaddleOCR dependencies
RUN apt-get update && apt-get install -y \
    chromium chromium-driver \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
COPY api/ /app/api
COPY docgenie/ /app/docgenie
COPY data/ /app/data
WORKDIR /app/api
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
**Handwriting service**: See `handwriting_service/Dockerfile` (deployed separately to RunPod)
---
## Performance & Costs
### Timing Breakdown (Single Document)
| Stage | Time | Notes |
|-------|------|-------|
| Seed download | 0.5-2s | Depends on image size + network |
| LLM prompt | 20-40s | Claude API latency |
| PDF render | 1-3s | Playwright initialization |
| Handwriting (10 words) | 180s | RunPod: 1 worker × (10 × 18s) |
| Visual elements | 0.5-1s | Local file selection |
| OCR | 3-5s | PaddleOCR inference |
| Dataset export | 0.5-1s | msgpack serialization |
| **TOTAL (no handwriting)** | **25-50s** | |
| **TOTAL (with handwriting)** | **200-230s** | Batched |
### Cost Breakdown (Per Document)
| Component | Cost | Notes |
|-----------|------|-------|
| Claude API | $0.015-0.03 | ~5K input + 16K output tokens |
| RunPod GPU (10 words) | $0.045 | 180s × $0.00025/s |
| Storage | Negligible | Temporary files deleted |
| **TOTAL (no handwriting)** | **$0.015-0.03** | |
| **TOTAL (with handwriting)** | **$0.06-0.08** | |
**Optimization**: Batch multiple documents in ONE RunPod call to share worker activation overhead.
---
## Error Handling & Reliability
### Retry Mechanisms
1. **Seed image download**: 3 attempts, exponential backoff (2s, 4s, 8s)
2. **Handwriting service**: 3 attempts, status polling up to 30 times
3. **LLM API**: Built-in Anthropic SDK retries (rate limits, 529 errors)
### Failure Modes
| Error Type | Behavior | User Impact |
|------------|----------|-------------|
| Seed download failure | Raise HTTP 400 | Request rejected immediately |
| LLM API error | Raise HTTP 500 | No charge, can retry |
| Handwriting service failure | **Raise exception** (NEW) | Generation fails, prevents invalid outputs |
| OCR failure | Log warning, continue | Document generated without OCR data |
| PDF render failure | Raise HTTP 500 | Request fails, no partial results |
### Session Fixes Applied
- ✅ **Handwriting service failure now raises exception** (previously silent)
- ✅ **Seed parameter defaults to null** (previously 0)
- ✅ **Seed image download retry logic** (handles 503 timeout errors)
- ✅ **API docs show correct examples** (seed: null, not 0)
---
## Future Enhancements
### Short-term
1. **Configurable prompt templates** via API parameter
2. **Async endpoint progress tracking** (websocket or polling)
3. **Batch ZIP download** with multiple documents in one archive
4. **Cost estimation** before generation (preview mode)
### Long-term
1. **Custom visual element upload** (user-provided logos, signatures)
2. **Multi-page document support** (currently single-page only)
3. **Additional export formats** (COCO, YOLO, HuggingFace Datasets)
4. **Fine-tuning handwriting styles** (train on user's handwriting samples)
5. **LLM caching** (reduce cost for similar prompts)
---
## Troubleshooting
### Common Issues
**Q: "Handwriting service not called, but enable_handwriting=true"**
- Check: LLM output contains `class="handwritten"` in HTML
- Check: `handwriting_ratio` > 0 (default 0.2)
- Check: `HANDWRITING_SERVICE_ENABLED=true` in environment
- Debug: Look for "🔍 DEBUG - Handwriting Service Check" in logs
**Q: "RunPod job stuck IN_PROGRESS"**
- Cause: Large batch timing out
- Solution: Increase `HANDWRITING_SERVICE_TIMEOUT` (default 600s)
- Or: Reduce batch size by lowering `handwriting_ratio`
**Q: "503 first byte timeout" on seed download**
- Cause: CDN/storage provider temporary unavailability
- Solution: Retry logic automatically handles this (3 attempts)
- If persists: Use different image hosting (imgur, cloudinary)
**Q: "Seed parameter still shows 0 in API docs"**
- Fixed: Added `examples=[None, 42]` to Field definition
- Clear browser cache if seeing old docs
---
## Testing
### Unit Tests
```bash
# Test individual stages
pytest api/tests/test_utils.py::test_download_seed_images
pytest api/tests/test_utils.py::test_handwriting_service_batch
```
### Integration Tests
```bash
# Test sync endpoint (included in repo)
python api/test_sync_pdf_api.py
# Test async endpoint
python api/test_async_api.py
```
### Manual Testing via Docs UI
1. Navigate to `http://localhost:8000/docs`
2. Expand `/generate/pdf` endpoint
3. Click "Try it out"
4. Paste example request JSON
5. Click "Execute"
6. Download resulting ZIP file
### Example Test Request (Minimal)
```json
{
  "seed_images": [
    "https://i.imgur.com/example.jpg"
  ],
  "prompt_params": {
    "language": "english",
    "doc_type": "invoice",
    "num_solutions": 1,
    "enable_handwriting": false,
    "enable_visual_elements": false,
    "enable_ocr": true,
    "enable_dataset_export": true
  }
}
```
---
## Conclusion
The DocGenie API successfully implements all 19 stages of the original batch pipeline in a request/response model suitable for real-time generation. Key architectural differences:
1. **Handwriting generation**: Offloaded to RunPod serverless (cost-efficient batching)
2. **Seed selection**: User-provided URLs instead of pre-crawled dataset
3. **State management**: Ephemeral in-memory processing vs file-based
4. **Scalability**: Horizontal scaling via FastAPI workers + async processing
The API maintains feature parity with the batch pipeline while providing a simpler interface for integration with external systems (web apps, mobile apps, data pipelines).
**Total Processing Time**: 25-50s (no handwriting) or 200-230s (with handwriting)
**Cost Per Document**: $0.015-0.08 depending on features
**Output Formats**: PDF, PNG, msgpack, ZIP archive
For questions or issues, see `api/README.md` or `TESTING.md`.