# Complete API Flow Documentation
## Overview
The DocGenie API provides three endpoints for synthetic document generation, implementing a 19-stage pipeline that transforms seed images and prompts into complete datasets with OCR, ground truth, and optional handwriting/visual elements.
**Base URL**: `http://localhost:8000` (development) or Railway deployment
**Documentation**: `/docs` (FastAPI auto-generated Swagger UI)
---
## API Endpoints
### 1. `/generate` - Legacy JSON Response (POST)
**Purpose**: Generate documents and return complete JSON metadata
**Response**: JSON with HTML, PDF (base64), bounding boxes, optional handwriting/visual elements
**Use Case**: Testing, development, full metadata inspection
**Pipeline Stages**: 1-19 (configurable via parameters)
### 2. `/generate/pdf` - Sync PDF+Dataset ZIP (POST)
**Purpose**: Generate documents and return a ZIP file with all artifacts
**Response**: ZIP file containing:
- `*.pdf` - Generated document PDFs
- `*_final.pdf` - PDFs with handwriting/visual elements (if enabled)
- `*.msgpack` - Dataset format (if export enabled)
- `metadata.json` - Complete generation metadata
- `handwriting/` - Individual handwriting images
- `visual_elements/` - Individual visual element images
**Use Case**: Production dataset generation, batch processing
**Pipeline Stages**: 1-19 (all features available)
### 3. `/generate/async` - Async Batch Processing (POST)
**Purpose**: Queue large batch jobs via a background worker (Redis Queue)
**Response**: Task ID for status polling
**Status Check**: `GET /generate/async/status/{task_id}`
**Result Download**: `GET /generate/async/result/{task_id}` (returns ZIP)
**Use Case**: Large-scale dataset generation (100+ documents)
**Pipeline Stages**: 1-19 (via worker.py)
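The async endpoint above can be driven by a small polling client. The sketch below is illustrative, not part of the codebase: the endpoint paths come from this document, while the helper names (`status_url`, `wait_for_result`) and the assumption that the status payload carries a `"status"` field with `"completed"`/`"failed"` values are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"

def status_url(base: str, task_id: str) -> str:
    """Build the polling URL documented as GET /generate/async/status/{task_id}."""
    return f"{base}/generate/async/status/{task_id}"

def result_url(base: str, task_id: str) -> str:
    """Build the download URL documented as GET /generate/async/result/{task_id}."""
    return f"{base}/generate/async/result/{task_id}"

def _get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def wait_for_result(task_id: str, poll_s: float = 5.0, max_polls: int = 120) -> bytes:
    """Poll the status endpoint until the job finishes, then download the ZIP bytes."""
    for _ in range(max_polls):
        status = _get_json(status_url(BASE_URL, task_id))
        if status.get("status") == "completed":
            with urllib.request.urlopen(result_url(BASE_URL, task_id), timeout=300) as resp:
                return resp.read()  # ZIP file contents
        if status.get("status") == "failed":
            raise RuntimeError(f"generation failed: {status}")
        time.sleep(poll_s)
    raise TimeoutError("job did not finish in time")
```

A caller would POST the request body shown under "Request Parameters" to `/generate/async`, take the returned task ID, and hand it to `wait_for_result`.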
---
## Request Parameters
```python
class GenerateDocumentRequest:
    seed_images: List[HttpUrl]        # 1-8 seed images from web URLs
    prompt_params: PromptParameters   # Generation configuration

class PromptParameters:
    # Core Parameters
    language: str = "english"         # Document language
    doc_type: str = "invoice"         # Document type (invoice, receipt, form, etc.)
    gt_type: str = "qa"               # Ground truth format (qa, kie)
    gt_format: str = "json"           # GT encoding (json, annotation)
    num_solutions: int = 1            # Documents per seed set

    # Feature Toggles (Stages 07-19)
    enable_handwriting: bool = False          # Stages 07-09, 12
    handwriting_ratio: float = 0.2            # Probabilistic filter (0.0-1.0)
    enable_visual_elements: bool = False      # Stages 08, 10, 13
    visual_element_types: List[str] = []      # Filter types: logo, photo, figure, barcode, etc.
    enable_ocr: bool = True                   # Stage 15
    enable_bbox_normalization: bool = True    # Stage 16
    enable_gt_verification: bool = False      # Stage 17
    enable_analysis: bool = False             # Stage 18
    enable_debug_visualization: bool = False  # Stage 19
    enable_dataset_export: bool = False       # Stage 19 (msgpack format)
    dataset_export_format: str = "msgpack"    # Currently only msgpack supported

    # Reproducibility
    seed: Optional[int] = None        # Random seed (null = random, int = reproducible)
```
---
## Pipeline Architecture: The 19 Stages
The API implements all 19 stages of the original batch pipeline in `docgenie/generation/`. Each stage is mapped to corresponding functions in `api/utils.py`.
### **Phase 1: Core Pipeline (Stages 01-06)**
Generate base documents from seed images and LLM prompts.
#### **Stage 01: Seed Selection & Download**
- **Original**: `pipeline_01_select_seeds.py`
- **API**: `download_seed_images()` in `api/utils.py:117-161`
- **Process**:
  1. Accept user-provided seed image URLs (1-8 images)
  2. Download with retry logic (3 attempts, exponential backoff)
  3. Handle transient HTTP errors (502, 503, 504, 429)
  4. Convert to base64 for LLM input
- **Error Handling**: Retry with 2s, 4s, 8s delays; raise HTTPException on failure
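The retry behavior described above (3 attempts, 2s/4s/8s backoff, retry only on transient statuses) can be sketched as follows. This is an illustrative reimplementation, not the actual `download_seed_images()`; the function name and the injectable `opener`/`sleep` parameters are assumptions made for testability.

```python
import base64
import time
import urllib.error
import urllib.request

TRANSIENT_STATUSES = {429, 502, 503, 504}  # transient HTTP errors listed above
BACKOFF_DELAYS = [2, 4, 8]                 # exponential backoff: 2s, 4s, 8s

def download_seed_image(url: str, opener=urllib.request.urlopen, sleep=time.sleep) -> str:
    """Download one seed image with retries; return it base64-encoded for LLM input."""
    last_err = None
    for delay in BACKOFF_DELAYS:
        try:
            with opener(url, timeout=30) as resp:
                return base64.b64encode(resp.read()).decode("ascii")
        except urllib.error.HTTPError as err:
            last_err = err
            if err.code not in TRANSIENT_STATUSES:
                raise  # permanent error: fail immediately, no retry
        except urllib.error.URLError as err:
            last_err = err  # network-level failure: retry
        sleep(delay)
    # In the real API this would be raised as an HTTPException
    raise RuntimeError(f"failed to download {url}: {last_err}")
```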
#### **Stage 02: Prompt LLM**
- **Original**: `pipeline_02_prompt_llm.py`
- **API**: `call_claude_api_direct()` in `api/utils.py:550-600`
- **Process**:
  1. Load prompt template: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
  2. Build prompt with parameters: language, doc_type, gt_type, num_solutions
  3. Call Claude API (Anthropic Messages API v1)
     - Model: `claude-3-5-sonnet-20241022` (configurable)
     - Max tokens: 16,000
     - Temperature: 1.0
     - Vision: Send base64-encoded seed images
  4. Receive HTML documents with embedded ground truth
- **LLM Output Format**: Multiple `<!DOCTYPE html>...</html>` blocks with:
  - CSS styling with page dimensions
  - HTML elements with semantic classes
  - Handwriting markers: `class="handwritten author1"` (author1, author2, etc.)
  - Visual element placeholders: `data-placeholder="logo"`, `data-content="company-logo"`
  - Ground truth: `<script id="GT">{...json...}</script>`
#### **Stage 03: Process Response & Extract HTML**
- **Original**: `pipeline_03_process_response.py`
- **API**: `extract_html_documents_from_response()` in `api/utils.py:605-635`
- **Process**:
  1. Parse the LLM response for `<!DOCTYPE html>...</html>` blocks (regex)
  2. Prettify HTML with BeautifulSoup
  3. Validate HTML structure
  4. Extract ground truth JSON from the `<script id="GT">` tag
  5. Remove the GT script tag, clean HTML for rendering
- **Validation**: Check for required elements, CSS, proper structure
#### **Stage 04: Render PDF & Extract Geometries**
- **Original**: `pipeline_04_render_pdf_and_extract_geos.py`
- **API**: `render_html_to_pdf()` in `api/utils.py:650-740`
- **Process**:
  1. Launch Playwright browser (Chromium)
  2. Set page dimensions from the CSS `@page` rule
  3. Render HTML to PDF via `page.pdf()`
  4. Extract element geometries:
     - Handwriting elements: `.handwritten` class → `{rect, text, classes, selectorTypes: ["handwriting"]}`
     - Visual elements: `[data-placeholder]` attribute → `{rect, dataPlaceholder, dataContent, selectorTypes: ["visual_element"]}`
  5. Save PDF and geometries JSON
- **Output**:
  - PDF at 72 DPI (PyMuPDF standard)
  - Geometries at 96 DPI (browser rendering)
  - Dimensions in mm
#### **Stage 05: Extract Bounding Boxes**
- **Original**: `pipeline_05_extract_bboxes_from_pdf.py`
- **API**: `extract_bboxes_from_rendered_pdf()` in `api/utils.py:750-825`
- **Process**:
  1. Open the PDF with PyMuPDF (fitz)
  2. Extract text at word level: `page.get_text("words")`
  3. Structure bboxes as:
     ```python
     {
         "text": "word",
         "x0": float,   # left
         "y0": float,   # top
         "x1": float,   # right (x2)
         "y1": float,   # bottom (y2)
         "block_no": int,
         "line_no": int,
         "word_no": int
     }
     ```
  4. Filter whitespace-only text
  5. Convert to OCRBox objects for processing
- **Coordinate System**: PDF points (72 DPI), origin top-left
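PyMuPDF's `page.get_text("words")` returns plain tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, so steps 3-4 amount to a small restructuring pass. A minimal sketch (the helper names are hypothetical, not the actual API functions):

```python
def word_tuple_to_bbox(word: tuple) -> dict:
    """Convert one PyMuPDF word tuple into the bbox dict shown above."""
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    return {
        "text": text,
        "x0": float(x0),  # left
        "y0": float(y0),  # top
        "x1": float(x1),  # right
        "y1": float(y1),  # bottom
        "block_no": block_no,
        "line_no": line_no,
        "word_no": word_no,
    }

def structure_words(words: list) -> list:
    """Drop whitespace-only entries (step 4) and structure the rest (step 3)."""
    return [word_tuple_to_bbox(w) for w in words if w[4].strip()]
```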
#### **Stage 06: Validation**
- **Original**: `pipeline_06_validation.py` (implicit)
- **API**: `validate_html_structure()`, `validate_pdf()`, `validate_bboxes()` in `api/utils.py:830-890`
- **Checks**:
  - HTML: Required DOCTYPE, head, body, CSS
  - PDF: File readable, page count = 1, has text
  - Bboxes: Minimum count (configurable), valid coordinates
---
### **Phase 2: Feature Synthesis (Stages 07-13)**
Add handwriting and visual elements to the base documents.
#### **Stage 07: Extract Handwriting Definitions**
- **Original**: `pipeline_07_extract_handwriting.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1150-1235`
- **Process**:
  1. Filter geometries: `"handwriting" in geo['selectorTypes']`
  2. Parse classes: Extract `author1`, `author2`, etc. from `class="handwritten author1"`
  3. **Probabilistic filtering** (handwriting_ratio):
     ```python
     if random.random() > handwriting_ratio:
         continue  # Skip this element
     ```
     - `ratio=0.0`: No handwriting (0%)
     - `ratio=0.5`: ~50% of marked elements
     - `ratio=1.0`: All marked elements (100%)
  4. Match geometries to word bboxes:
     - Convert browser coords (96 DPI) to PDF coords (72 DPI): `scale = 72/96 = 0.75`
     - Find consecutive word bboxes matching the geometry text
     - Check bboxes are within the geometry rect (threshold: 0.7)
     - Track taken bbox indices to avoid duplicates
  5. Build handwriting region definitions:
     ```python
     {
         "id": "hw0",
         "text": "Patient Name",
         "author_id": "author1",
         "is_signature": False,
         "rect": {x, y, width, height},  # in points
         "bboxes": ["0_0_0 Patient 10.0 20.0 50.0 35.0", ...]
     }
     ```
- **Reproducibility**: Use `seed + i` for each region to maintain ordering consistency
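Step 4's coordinate matching can be sketched as below. The 72/96 scale factor and the 0.7 containment threshold come from the text above; the function names, the rect/bbox dict shapes, and the use of an area-overlap ratio as the containment test are assumptions for illustration.

```python
BROWSER_TO_PDF = 72 / 96  # browser pixels (96 DPI) -> PDF points (72 DPI), i.e. 0.75

def css_px_to_pt(value: float) -> float:
    """Scale a browser-geometry coordinate into PDF points."""
    return value * BROWSER_TO_PDF

def containment_ratio(bbox: dict, rect: dict) -> float:
    """Fraction of the word bbox's area that lies inside the geometry rect."""
    ix0 = max(bbox["x0"], rect["x"])
    iy0 = max(bbox["y0"], rect["y"])
    ix1 = min(bbox["x1"], rect["x"] + rect["width"])
    iy1 = min(bbox["y1"], rect["y"] + rect["height"])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (bbox["x1"] - bbox["x0"]) * (bbox["y1"] - bbox["y0"])
    return inter / area if area else 0.0

def bbox_matches_rect(bbox: dict, rect: dict, threshold: float = 0.7) -> bool:
    """A word bbox 'belongs' to a geometry when at least 70% of it lies inside."""
    return containment_ratio(bbox, rect) >= threshold
```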
#### **Stage 08: Extract Visual Element Definitions**
- **Original**: `pipeline_08_extract_visual_element_definitions.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1237-1275`
- **Process**:
  1. Filter geometries: `"visual_element" in geo['selectorTypes']`
  2. Parse attributes:
     - `data-placeholder`: Element type (logo, photo, figure, chart, barcode, etc.)
     - `data-content`: Semantic description (e.g., "company-logo", "product-photo")
  3. Normalize types using synonyms:
     - "chart" → "figure"
     - "image" → "photo"
  4. Filter by the `visual_element_types` parameter (if specified)
  5. Convert coordinates: pixels (96 DPI) → mm
  6. Extract rotation from CSS `transform: rotate(Xdeg)`
  7. Build visual element definitions:
     ```python
     {
         "id": "ve0",
         "type": "logo",                 # normalized
         "content": "company-logo",
         "rect": {x, y, width, height},  # in mm
         "rotation": 0                   # degrees
     }
     ```
#### **Stage 09: Create Handwriting Images**
- **Original**: `pipeline_09_create_handwriting_images.py`
- **API**: `call_handwriting_service_batch()` in `api/utils.py:785-920`
- **Handwriting Service**: RunPod serverless endpoint hosting the WordStylist diffusion model
- **Service Implementation**: `handwriting_service/handler.py`, `handwriting_service/inference.py`
**Handwriting Service Integration Details:**
##### **Service Architecture**
- **Platform**: RunPod Serverless (GPU: NVIDIA A4000, cost: ~$0.00025/s active)
- **Model**: WordStylist (diffusion-based handwriting synthesis)
  - Architecture: UNet with conditional style embeddings
  - Input: Text (A-Z, a-z only, no spaces), writer style ID (0-656)
  - Output: PNG image with transparent background
  - Inference time: ~18s per text on an A4000
  - Weights: `handwriting_service/WordStylist/models/`
- **Endpoints**:
  - `/run` (async): Queue job, return ID, poll `/status/{id}` (10MB limit)
  - `/runsync` (sync): Wait for completion, return result (20MB limit, used by the API)
##### **Batch Processing (Cost Optimization)**
The API uses TRUE batch processing to minimize RunPod activation overhead:
```python
# ✅ NEW: Batch all texts in ONE request
runpod_request = {
    "input": {
        "texts": [
            {"text": "Hello", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
            {"text": "World", "author_id": 42, "hw_id": "hw0_b0_l0_w1"},
            # ... 10-100 texts
        ],
        "apply_blur": True
    }
}
# Result: 1 worker activation × (N × 18s) = ~40-60% cost savings
```
**Cost Comparison for 10 texts:**
- ❌ OLD (parallel): 10 workers × 18s = 180 worker-seconds + 10× activation fee
- ✅ NEW (batched): 1 worker × 190s = 190 worker-seconds + 1× activation fee
##### **API Processing Flow**
1. **Group by region and line**: Split handwriting regions into word-level requests
   ```python
   # Text: "Patient Name" → 2 word-level generations
   texts_to_generate = [
       {"text": "Patient", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
       {"text": "Name", "author_id": 42, "hw_id": "hw0_b0_l0_w1"}
   ]
   ```
2. **Map author IDs to numeric styles**:
   ```python
   # "author1" → WRITER_STYLES[1] = 42 (deterministic)
   # "author2" → WRITER_STYLES[2] = 137
   # 657 total writer styles available
   ```
3. **Sanitize text** (WordStylist constraint):
   ```python
   # Only A-Z, a-z allowed (no spaces, numbers, punctuation)
   # "Hello123!" → "Hello"
   # "first-name" → "firstname"
   ```
4. **Send batch request** to the RunPod `/runsync` endpoint:
   ```python
   POST https://api.runpod.ai/v2/{endpoint_id}/runsync
   Authorization: Bearer {RUNPOD_API_KEY}
   Content-Type: application/json
   {
       "input": {
           "texts": [...],
           "apply_blur": True  # Gaussian blur for realism
       }
   }
   ```
5. **Handle async responses**:
   - If `status: "IN_PROGRESS"`: Poll `/status/{job_id}` every 5-10s (max 30 polls)
   - If `status: "COMPLETED"`: Extract `output.images[]`
   - If `status: "FAILED"`: Raise exception (stops the entire generation)
6. **Response format**:
   ```python
   {
       "status": "COMPLETED",
       "output": {
           "images": [
               {
                   "image_base64": "iVBORw0KGgoAAAANSU...",
                   "width": 200,
                   "height": 64,
                   "text": "Patient",
                   "author_id": 42,
                   "hw_id": "hw0_b0_l0_w0"
               },
               ...
           ],
           "total_generated": 2
       }
   }
   ```
7. **Store generated images**: Map `hw_id → image_base64` for insertion
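Steps 2-3 above (author-ID mapping plus WordStylist's A-Z/a-z input constraint) can be sketched as follows. `WRITER_STYLES` here is a stand-in for the real 657-entry style table, reproducing only the two mappings quoted above (`author1 → 42`, `author2 → 137`); `to_batch_item` is a hypothetical helper, not the actual API function.

```python
import re

# Stand-in for the real 657-entry writer-style table (only the two
# documented mappings are reproduced here).
WRITER_STYLES = {"author1": 42, "author2": 137}

def sanitize(text: str) -> str:
    """Keep only A-Z and a-z, per the WordStylist input constraint."""
    return re.sub(r"[^A-Za-z]", "", text)

def to_batch_item(text: str, author: str, hw_id: str) -> dict:
    """Build one entry of the batched 'texts' list sent to the service."""
    return {"text": sanitize(text), "author_id": WRITER_STYLES[author], "hw_id": hw_id}
```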
##### **Error Handling**
- **Retry logic**: 3 attempts with exponential backoff (matching seed download)
- **Timeouts**: Dynamic, based on batch size: `20s × num_texts + 30s buffer`
- **Failure behavior**: **raise an exception** (since the session fix)
  - ❌ OLD: Silent continue → documents without handwriting
  - ✅ NEW: Raise an exception → generation fails when the user requested handwriting
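The dynamic timeout rule above is simple enough to state directly; the function and constant names are illustrative, not taken from the codebase.

```python
PER_TEXT_S = 20  # ~18s inference per text, rounded up
BUFFER_S = 30    # fixed buffer for queueing and transfer

def batch_timeout_seconds(num_texts: int) -> int:
    """Timeout for one RunPod batch request: 20s per text plus a 30s buffer."""
    return PER_TEXT_S * num_texts + BUFFER_S
```

For the 10-text cost-comparison example above this yields 230 seconds.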
##### **Service Code Structure**
**`handwriting_service/handler.py`** (RunPod handler):
```python
# Initialize the model ONCE at module level (not per request)
generator = HandwritingGenerator(
    model_dir="WordStylist",
    checkpoint_path="WordStylist/models",
    device="cuda"
)

def handler(job):
    """RunPod entry point - supports both /run and /runsync"""
    texts = job["input"]["texts"]  # Batch input
    results = generator.generate_batch(
        texts=[t["text"] for t in texts],
        author_ids=[t["author_id"] for t in texts],
        num_inference_steps=50,
        temperature=1.0,
        apply_blur=True
    )
    return {"images": results, "total_generated": len(results)}
```
**`handwriting_service/inference.py`** (WordStylist wrapper):
```python
class HandwritingGenerator:
    def generate_batch(self, texts, author_ids, ...):
        results = []
        for text, author_id in zip(texts, author_ids):
            # Load model checkpoint
            unet = Unet(...)
            unet.load_state_dict(checkpoint)
            # Prepare style condition
            style_id_tensor = torch.tensor([author_id])
            # Diffusion reverse process (50 steps)
            img = self.sample(unet, style_id_tensor, text_length=len(text))
            # Post-process: crop, resize, apply blur
            img_pil = postprocess_image(img)
            if apply_blur:
                img_pil = img_pil.filter(ImageFilter.GaussianBlur(1.2))
            # Encode to base64
            img_base64 = encode_pil_to_base64(img_pil)
            results.append({
                "image_base64": img_base64,
                "width": img_pil.width,
                "height": img_pil.height
            })
        return results
```
#### **Stage 10: Create Visual Element Images**
- **Original**: `pipeline_10_create_visual_elements.py`
- **API**: `generate_visual_element_images()` in `api/utils.py:925-1020`
- **Process**:
  1. Load prefab images from `data/visual_element_prefabs/{type}/`:
     - `logo/`: Company logos (50+ SVGs)
     - `photo/`: Stock photos (100+ JPGs)
     - `figure/`: Charts, graphs (30+ PNGs)
     - `barcode/`: Generated barcodes
     - `qr_code/`, `stamp/`, `signature/`, `checkbox/`, etc.
  2. **Random selection** (seed-based if provided):
     ```python
     if seed is not None:
         random.seed(seed)
     prefab_path = random.choice(list(prefab_dir.glob("*")))
     ```
  3. **Special handling**:
     - **Barcode**: Generate on-the-fly using the `python-barcode` library
       ```python
       # Generate a random EAN-13 barcode (12 digits + checksum)
       barcode_num = random.randint(100000000000, 999999999999)
       barcode = EAN13(str(barcode_num), writer=ImageWriter())
       ```
     - **QR Code**: Generate using the `qrcode` library
     - **Checkbox**: Render checked/unchecked SVG
  4. Load and convert to base64:
     ```python
     with open(prefab_path, 'rb') as f:
         img_bytes = f.read()
     img_base64 = base64.b64encode(img_bytes).decode('utf-8')
     ```
  5. Return mapping: `ve_id → image_base64`
#### **Stage 11: Make Text Transparent (Implicit)**
- **Original**: `pipeline_11_make_text_transparent.py`
- **API**: Implemented as a "whiteout" in `process_stage3_complete()` at `api/utils.py:1415-1427`
- **Process**:
  ```python
  # Draw white rectangles over the original text to hide it
  for hw_region in handwriting_regions:
      for bbox_str in hw_region['bboxes']:
          bbox = parse_bbox(bbox_str)
          rect = fitz.Rect(bbox.x0, bbox.y0, bbox.x2, bbox.y2)
          page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))  # White fill
  ```
- **Why not transparent?**: PyMuPDF doesn't support making existing text transparent, so white rectangles are used instead (same visual result)
#### **Stage 12: Insert Handwriting Images**
- **Original**: `pipeline_12_insert_handwriting_images.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1429-1520`
- **Process**:
  1. **Position calculation**:
     ```python
     # Get the word bbox from PDF extraction
     bbox_w = bbox.x2 - bbox.x0  # Width in points
     bbox_h = bbox.y2 - bbox.y0  # Height in points
     # Resize the handwriting image, preserving aspect ratio
     scale = min(bbox_w / img_width, bbox_h / img_height)
     new_w = int(img_width * scale * SCALE_UP_FACTOR)  # 3x upscale
     new_h = int(img_height * scale * SCALE_UP_FACTOR)
     # Add random offsets for natural variation
     offset_x = random.randint(-MAX_OFFSET_LEFT, MAX_OFFSET_RIGHT) + FIXED_OFFSET
     offset_y = random.randint(-MAX_OFFSET_UP, MAX_OFFSET_DOWN)
     # Position at bbox coordinates
     x0 = bbox.x0 + offset_x
     y0 = bbox.y0 + offset_y - y_padding
     ```
  2. **Insert into PDF**:
     ```python
     img_resized = img.resize((new_w, new_h), Image.LANCZOS).convert("RGBA")
     img_bytes = pil_to_bytes(img_resized)
     rect = fitz.Rect(x0, y0, x0 + bbox_w, y0 + bbox_h)
     page.insert_image(rect, stream=img_bytes)
     ```
  3. Save intermediate PDF: `{doc_id}_with_handwriting.pdf`
#### **Stage 13: Insert Visual Elements**
- **Original**: `pipeline_13_insert_visual_elements.py`
- **API**: `process_stage3_complete()` section in `api/utils.py:1523-1625`
- **Process**:
  1. Convert mm → points: `mm_to_pt = 72 / 25.4`
  2. Resize with aspect ratio preservation (same as handwriting)
  3. Center the image on a white background (maintains bbox size)
  4. Insert into the PDF at geometry coordinates
  5. Save final PDF: `{doc_id}_final.pdf` (includes both handwriting and visual elements)
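Step 1's unit conversion is worth pinning down: a PDF point is 1/72 inch and an inch is 25.4 mm, which gives the `72 / 25.4` factor quoted above. A minimal sketch (`rect_mm_to_pt` is a hypothetical helper, not a function from the codebase):

```python
MM_TO_PT = 72 / 25.4  # PDF points per millimetre (~2.8346)

def mm_to_pt(mm: float) -> float:
    """Convert a length in millimetres to PDF points."""
    return mm * MM_TO_PT

def rect_mm_to_pt(rect: dict) -> dict:
    """Convert an {x, y, width, height} rect from mm (Stage 08) to points."""
    return {k: mm_to_pt(v) for k, v in rect.items()}
```

For example, an A4 page width of 210 mm comes out at roughly 595.3 points.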
---
### **Phase 3: Image Finalization & OCR (Stages 14-15)**
Convert the final PDF to a high-resolution image and extract OCR data.
#### **Stage 14: Render Image**
- **Original**: `pipeline_14_render_image.py`
- **API**: `process_stage4_ocr()` in `api/utils.py:1899-1940`
- **Process**:
  ```python
  # Render the PDF page to a high-res PNG
  page = fitz.open(pdf_path)[0]
  pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))  # 3x scale = 216 DPI (3 × 72)
  img_bytes = pix.tobytes("png")
  img_base64 = base64.b64encode(img_bytes).decode('utf-8')
  ```
- **Output**: Base64-encoded PNG at 216 DPI (configurable via the scale factor)
#### **Stage 15: Perform OCR**
- **Original**: `pipeline_15_perform_ocr.py`
- **API**: `run_paddle_ocr()` in `api/utils.py:1950-2080`
- **OCR Engine**: PaddleOCR v4 (multilingual)
  - Models: `PP-OCRv4` detection + recognition
  - Languages: Supports 80+ languages
  - Accuracy: State-of-the-art open-source OCR
- **Process**:
  1. Render the PDF to an image via `pdf2image` at the specified DPI (default: 300)
  2. Initialize PaddleOCR with the language parameter
  3. Run detection + recognition:
     ```python
     ocr = PaddleOCR(lang=language, use_gpu=True)
     results = ocr.ocr(img_array, cls=True)
     ```
  4. Parse results into word-level bboxes:
     ```python
     {
         "text": "word",
         "bbox": {
             "x0": float,
             "y0": float,
             "x1": float,  # right
             "y1": float   # bottom
         },
         "confidence": 0.95
     }
     ```
- **Output**: Dictionary with a `words` list, image dimensions, and OCR engine info
---
### **Phase 4: Dataset Packaging (Stages 16-19)**
Normalize, verify, analyze, and export the final dataset.
#### **Stage 16: Normalize Bboxes**
- **Original**: `pipeline_16_normalize_bboxes.py`
- **API**: `normalize_bboxes()` in `api/utils.py:2100-2180`
- **Process**:
  1. Convert absolute pixel coordinates → normalized [0, 1] range:
     ```python
     norm_bbox = [
         bbox['x0'] / img_width,
         bbox['y0'] / img_height,
         bbox['x1'] / img_width,
         bbox['y1'] / img_height
     ]
     ```
  2. Clip to [0, 1]: `[max(0, min(1, x)) for x in norm_bbox]`
  3. Create word-level and segment-level bboxes
- **Output**: List of `{text, bbox: [x0, y0, x1, y1]}` where bbox is normalized
#### **Stage 17: Ground Truth Verification**
- **Original**: `pipeline_17_gt_preparation_verification.py`
- **API**: `verify_ground_truth()` in `api/utils.py:2185-2250`
- **Checks**:
  - GT structure: Valid JSON, required fields
  - Text matching: GT text exists in the OCR output
  - Bbox coverage: GT answers have corresponding bboxes
- **Output**: Verification report with pass/fail status
#### **Stage 18: Analyze**
- **Original**: `pipeline_18_analyze.py`
- **API**: `analyze_document()` in `api/utils.py:2255-2320`
- **Metrics**:
  - Word count, character count
  - Average word length
  - Handwriting regions count, coverage %
  - Visual elements count by type
  - OCR confidence statistics (mean, min, max)
- **Output**: Analysis dictionary with computed metrics
#### **Stage 19: Create Debug Data & Export**
- **Original**: `pipeline_19_create_debug_data.py`
- **API**: `export_to_msgpack()` in `api/utils.py:2350-2520`
- **Debug Visualization**:
  - Draw bboxes on the image in different colors:
    - Green: Word bboxes
    - Red: Handwriting regions
    - Blue: Visual elements
    - Yellow: Ground truth target regions
  - Save the annotated image
- **Dataset Export (msgpack)**:
  ```python
  dataset_entry = {
      "image": img_bytes,  # PNG bytes
      "words": ["hello", "world"],
      "word_bboxes": [[0.1, 0.2, 0.15, 0.25], ...],  # Normalized
      "segment_bboxes": [...],
      "ground_truth": {"question": "answer"},
      "metadata": {
          "document_id": "...",
          "has_handwriting": True,
          "num_visual_elements": 3
      }
  }
  msgpack.dump(dataset_entry, f)
  ```
- **Output**: `.msgpack` file compatible with PyTorch DataLoader
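The PyTorch-compatibility claim can be illustrated with a small reader. This sketch assumes the single-entry layout shown above and the third-party `msgpack` package; the `MsgpackDocDataset` class and `validate_entry` helper are hypothetical, not part of the codebase. A class with `__len__` and `__getitem__` is all a map-style `torch.utils.data.DataLoader` needs, so no torch import is required here.

```python
REQUIRED_KEYS = {"image", "words", "word_bboxes", "segment_bboxes",
                 "ground_truth", "metadata"}

def validate_entry(entry: dict) -> dict:
    """Check one dataset entry against the layout documented above."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"entry missing keys: {sorted(missing)}")
    if len(entry["words"]) != len(entry["word_bboxes"]):
        raise ValueError("words and word_bboxes must align")
    return entry

class MsgpackDocDataset:
    """Map-style dataset over one-entry-per-file .msgpack exports."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        import msgpack  # third-party; deferred so the pure parts work without it
        with open(self.paths[idx], "rb") as f:
            return validate_entry(msgpack.unpack(f, raw=False))
```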
| --- | |
| ## Pipeline Verification: API vs Original Implementation | |
| ### β **Stage-by-Stage Mapping** | |
| | Stage | Original File | API Function | Status | | |
| |-------|--------------|--------------|--------| | |
| | 01 | `pipeline_01_select_seeds.py` | `download_seed_images()` | β Mapped (with retry logic) | | |
| | 02 | `pipeline_02_prompt_llm.py` | `call_claude_api_direct()` | β Mapped (uses Messages API) | | |
| | 03 | `pipeline_03_process_response.py` | `extract_html_documents_from_response()` | β Mapped | | |
| | 04 | `pipeline_04_render_pdf_and_extract_geos.py` | `render_html_to_pdf()` | β Mapped (Playwright) | | |
| | 05 | `pipeline_05_extract_bboxes_from_pdf.py` | `extract_bboxes_from_rendered_pdf()` | β Mapped | | |
| | 06 | `pipeline_06_validation.py` | `validate_html_structure()`, `validate_pdf()` | β Mapped | | |
| | 07 | `pipeline_07_extract_handwriting.py` | `process_stage3_complete()` section | β Mapped (with ratio filter) | | |
| | 08 | `pipeline_08_extract_visual_element_definitions.py` | `process_stage3_complete()` section | β Mapped | | |
| | 09 | `pipeline_09_create_handwriting_images.py` | `call_handwriting_service_batch()` | β Mapped (RunPod integration) | | |
| | 10 | `pipeline_10_create_visual_elements.py` | `generate_visual_element_images()` | β Mapped | | |
| | 11 | `pipeline_11_make_text_transparent.py` | `process_stage3_complete()` (whiteout) | β Mapped (white rectangles) | | |
| | 12 | `pipeline_12_insert_handwriting_images.py` | `process_stage3_complete()` section | β Mapped | | |
| | 13 | `pipeline_13_insert_visual_elements.py` | `process_stage3_complete()` section | β Mapped | | |
| | 14 | `pipeline_14_render_image.py` | `process_stage4_ocr()` | β Mapped | | |
| | 15 | `pipeline_15_perform_ocr.py` | `run_paddle_ocr()` | β Mapped | | |
| | 16 | `pipeline_16_normalize_bboxes.py` | `normalize_bboxes()` | β Mapped | | |
| | 17 | `pipeline_17_gt_preparation_verification.py` | `verify_ground_truth()` | β Mapped | | |
| | 18 | `pipeline_18_analyze.py` | `analyze_document()` | β Mapped | | |
| | 19 | `pipeline_19_create_debug_data.py` | `export_to_msgpack()` | β Mapped | | |
| ### π **Key Differences: API vs Batch Pipeline** | |
| #### **Processing Model** | |
| - **Original**: Batch processing with file-based state management | |
| - Input: CSV of seed selections, prompt parameters in JSON | |
| - Output: Folder structure with intermediate files | |
| - State: JSON logs per document + message | |
| - Resumability: Can restart from any stage | |
| - **API**: Request/response with in-memory processing | |
| - Input: JSON request with seed URLs | |
| - Output: JSON response or ZIP file | |
| - State: Ephemeral (temporary directories) | |
| - Resumability: None (single-shot generation) | |
| #### **Handwriting Generation** | |
| - **Original**: Local GPU with WordStylist model loaded in-process | |
| - Location: `docgenie/generation/handwriting_diffusion/` | |
| - Execution: `generate_handwriting_diffusion_raw.py` | |
| - Cost: Free (local GPU) | |
| - **API**: Remote RunPod serverless endpoint | |
| - Location: `handwriting_service/` (deployed separately) | |
| - Execution: HTTP POST to RunPod API | |
| - Cost: ~$0.00025/s GPU time (pay-per-use) | |
| - Benefit: No local GPU required, scales automatically | |
| #### **Seed Selection** | |
| - **Original**: Pre-crawled dataset with systematic selection | |
| - Seeds stored in: `data/datasets/base_v2/` | |
| - Selection: Clustering algorithm β balanced subset | |
| - Tracking: CSV manifest with seed IDs | |
| - **API**: User-provided URLs | |
| - Seeds: Any publicly accessible image URL | |
| - Selection: User chooses 1-8 images per request | |
| - Tracking: URLs stored in request metadata | |
| #### **Prompt Templates** | |
| - **Original**: Multiple template versions in folders | |
| - Path: `data/prompt_templates/{version}/seed-based-json.txt` | |
| - Versioning: ClaudeRefined1 β ClaudeRefined12 | |
| - Selection: Configurable per dataset | |
| - **API**: Fixed template (latest version) | |
| - Path: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt` | |
| - Hardcoded in: `api/main.py:171` | |
| - **Future improvement**: Make template selectable via API parameter | |
| --- | |
| ## Complete Request Flow Example | |
| ### Example Request (Sync Endpoint) | |
| ```bash | |
| POST /generate/pdf HTTP/1.1 | |
| Content-Type: application/json | |
| { | |
| "seed_images": [ | |
| "https://example.com/seed1.jpg", | |
| "https://example.com/seed2.jpg" | |
| ], | |
| "prompt_params": { | |
| "language": "english", | |
| "doc_type": "medical_form", | |
| "gt_type": "kie", | |
| "gt_format": "json", | |
| "num_solutions": 2, | |
| "enable_handwriting": true, | |
| "handwriting_ratio": 0.3, | |
| "enable_visual_elements": true, | |
| "visual_element_types": ["logo", "signature"], | |
| "enable_ocr": true, | |
| "enable_dataset_export": true, | |
| "seed": 42 | |
| } | |
| } | |
| ``` | |
| ### Processing Flow (Stages Executed) | |
| **Phase 1: Core Document Generation (30-60s)** | |
| 1. β Download 2 seed images with retry β `[img1_b64, img2_b64]` | |
| 2. β Load prompt template β Build prompt for medical_form + KIE | |
| 3. β Call Claude API β LLM generates 2 HTML documents (~25s) | |
| 4. β Extract HTML + ground truth β 2 clean HTML files with GT JSON | |
| 5. β Render each HTML to PDF via Playwright β 2 PDFs + geometries | |
| 6. β Extract word bboxes from PDFs β ~200-500 words per document | |
| **Phase 2: Feature Synthesis (120-180s if handwriting enabled)** | |
| 7. β Parse geometries for handwriting markers | |
| - Found: 12 elements with `class="handwritten"` | |
| - Filtered by ratio: 12 Γ 0.3 = ~4 elements selected (probabilistic) | |
| - Matched to word bboxes: 4 regions with 15 total words | |
| 8. β Parse geometries for visual elements | |
| - Found: 3 elements (`data-placeholder="logo"`, `"signature"`, `"logo"`) | |
| - Filtered by types: Keep logo + signature, remove others | |
| - Result: 2 visual element definitions | |
| 9. β Generate handwriting images via RunPod | |
| - **Batch request**: 15 words in ONE API call | |
| - Map author IDs: `author1 β style 42`, `author2 β style 137` | |
| - RunPod processing: 1 worker Γ (15 Γ 18s) = ~270s | |
| - Result: 15 PNG images (base64-encoded) | |
| 10. β Generate visual element images | |
| - Logo: Random selection from `data/visual_element_prefabs/logo/` (seed=42) | |
| - Signature: Generate on-the-fly using signature prefab | |
| - Result: 2 PNG images | |
| 11. β Whiteout original text: Draw white rectangles over 15 word positions | |
| 12. β Insert handwriting: Place 15 generated images at word bboxes with offsets | |
| - Save: `doc1_with_handwriting.pdf`, `doc2_with_handwriting.pdf` | |
| 13. β Insert visual elements: Place logo + signature at geometry coords | |
| - Save: `doc1_final.pdf`, `doc2_final.pdf` | |
| **Phase 3: Image + OCR (5-10s)** | |
| 14. β Render each final PDF to 220 DPI image β 2 PNG files (base64) | |
| 15. β Run PaddleOCR on each image | |
| - Doc1: Detected 187 words, avg confidence 0.91 | |
| - Doc2: Detected 203 words, avg confidence 0.94 | |
| **Phase 4: Dataset Packaging (2-5s)** | |
| 16. β Normalize OCR bboxes: Convert pixels β [0,1] range | |
| 17. β Verify ground truth: Check GT fields match OCR output (enabled=false, skipped) | |
| 18. β Analyze documents: Compute metrics (enabled=false, skipped) | |
| 19. β Export to msgpack: | |
| - Doc1: Pack image + words + normalized bboxes + GT β `doc1.msgpack` | |
| - Doc2: Pack image + words + normalized bboxes + GT β `doc2.msgpack` | |
**Final Output: ZIP File Contents**
```
dataset.zip
├── doc1_uuid_0.pdf          # Original rendered PDF
├── doc1_uuid_0_final.pdf    # PDF with handwriting + visual elements
├── doc1_uuid_0.msgpack      # Dataset format
├── doc2_uuid_1.pdf
├── doc2_uuid_1_final.pdf
├── doc2_uuid_1.msgpack
├── metadata.json            # Complete generation metadata
└── handwriting/
    ├── hw0_b0_l0_w0.png     # Individual handwriting images
    ├── hw0_b0_l0_w1.png
    └── ... (13 more)
```
### Response (JSON Metadata)
```json
{
  "task_id": "uuid-here",
  "status": "completed",
  "num_documents": 2,
  "processing_time_seconds": 305.7,
  "stages_completed": [
    "seed_download", "llm_prompt", "html_extraction",
    "pdf_render", "bbox_extraction", "handwriting_extraction",
    "visual_element_extraction", "handwriting_generation",
    "visual_element_generation", "handwriting_insertion",
    "visual_element_insertion", "image_render", "ocr",
    "bbox_normalization", "dataset_export"
  ],
  "documents": [
    {
      "document_id": "doc1_uuid_0",
      "ground_truth": {"patient_name": "John Doe", "date": "2024-01-15"},
      "num_words": 187,
      "num_handwriting_regions": 2,
      "num_visual_elements": 2,
      "ocr_confidence_avg": 0.91
    },
    {
      "document_id": "doc2_uuid_1",
      "ground_truth": {"patient_name": "Jane Smith", "date": "2024-01-16"},
      "num_words": 203,
      "num_handwriting_regions": 2,
      "num_visual_elements": 2,
      "ocr_confidence_avg": 0.94
    }
  ],
  "download_url": "/download/dataset_uuid.zip"
}
```
---
## Configuration & Environment
### Required Environment Variables
```bash
# LLM API
ANTHROPIC_API_KEY=sk-ant-...            # Claude API key
CLAUDE_MODEL=claude-3-5-sonnet-20241022 # Default model

# Handwriting Service (RunPod)
HANDWRITING_SERVICE_ENABLED=true
HANDWRITING_SERVICE_URL=https://api.runpod.ai/v2/{endpoint_id}/runsync
RUNPOD_API_KEY=...                      # RunPod API key
HANDWRITING_APPLY_BLUR=true             # Gaussian blur for realism
HANDWRITING_SERVICE_MAX_RETRIES=3
HANDWRITING_SERVICE_TIMEOUT=600         # 10 minutes for large batches

# OCR Configuration
OCR_DPI=300                             # Image resolution for OCR
OCR_LANGUAGE=en                         # PaddleOCR language code

# File Paths
PROMPT_TEMPLATES_DIR=/path/to/data/prompt_templates
VISUAL_ELEMENT_PREFABS_DIR=/path/to/data/visual_element_prefabs
```
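Boolean and integer variables like the above can be parsed with small helpers. This is an illustrative sketch, not the service's actual configuration code:

```python
import os

def env_bool(name, default=False):
    """Parse a boolean env var the way shell-style configs spell it
    ("true", "1", "yes", "on"), falling back to the default when unset."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")

def env_int(name, default):
    """Parse an integer env var, falling back to the default when
    the variable is unset or empty."""
    val = os.environ.get(name)
    return int(val) if val else default
```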
### Docker Deployment (Railway)
```dockerfile
# Dockerfile (api service)
FROM python:3.11-slim
# chromium + chromium-driver: Playwright dependencies
# libgl1 + libglib2.0-0: PaddleOCR dependencies
RUN apt-get update && apt-get install -y \
    chromium chromium-driver \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
COPY api/ /app/api
COPY docgenie/ /app/docgenie
COPY data/ /app/data
WORKDIR /app/api
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
**Handwriting service**: see `handwriting_service/Dockerfile` (deployed separately to RunPod)
---
## Performance & Costs
### Timing Breakdown (Single Document)

| Stage | Time | Notes |
|-------|------|-------|
| Seed download | 0.5-2s | Depends on image size + network |
| LLM prompt | 20-40s | Claude API latency |
| PDF render | 1-3s | Playwright initialization |
| Handwriting (10 words) | 180s | RunPod: 1 worker × (10 × 18s) |
| Visual elements | 0.5-1s | Local file selection |
| OCR | 3-5s | PaddleOCR inference |
| Dataset export | 0.5-1s | msgpack serialization |
| **TOTAL (no handwriting)** | **25-50s** | |
| **TOTAL (with handwriting)** | **200-230s** | Batched |
### Cost Breakdown (Per Document)

| Component | Cost | Notes |
|-----------|------|-------|
| Claude API | $0.015-0.03 | ~5K input + 16K output tokens |
| RunPod GPU (10 words) | $0.045 | 180s × $0.00025/s |
| Storage | Negligible | Temporary files are deleted |
| **TOTAL (no handwriting)** | **$0.015-0.03** | |
| **TOTAL (with handwriting)** | **$0.06-0.08** | |

**Optimization**: batch multiple documents into ONE RunPod call to share the worker activation overhead.
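Using the table's own figures, the per-document cost is simple arithmetic. The helper below is illustrative; the rates are the published numbers above, not live pricing:

```python
def estimate_cost(num_words, llm_cost=0.02, gpu_rate=0.00025, secs_per_word=18):
    """Rough per-document cost: one LLM call plus RunPod GPU seconds.
    With 10 handwritten words: 0.02 + (10 * 18s) * $0.00025/s = ~$0.065,
    matching the $0.06-0.08 range in the table."""
    gpu_secs = num_words * secs_per_word
    return llm_cost + gpu_secs * gpu_rate
```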
---
## Error Handling & Reliability
### Retry Mechanisms
1. **Seed image download**: 3 attempts, exponential backoff (2s, 4s, 8s)
2. **Handwriting service**: 3 attempts, status polling up to 30 times
3. **LLM API**: built-in Anthropic SDK retries (rate limits, 529 errors)
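The seed-download policy (retries with doubling delays) can be sketched as follows. The helper is illustrative, not the API's actual code; the injectable `sleep` lets the delays be observed without waiting:

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponentially growing
    delays (2s, 4s, ...). Re-raises after the final failed attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise          # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))
```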
### Failure Modes

| Error Type | Behavior | User Impact |
|------------|----------|-------------|
| Seed download failure | Raise HTTP 400 | Request rejected immediately |
| LLM API error | Raise HTTP 500 | No charge, can retry |
| Handwriting service failure | **Raise exception** (NEW) | Generation fails, prevents invalid outputs |
| OCR failure | Log warning, continue | Document generated without OCR data |
| PDF render failure | Raise HTTP 500 | Request fails, no partial results |

### Session Fixes Applied
- ✅ **Handwriting service failure now raises an exception** (previously failed silently)
- ✅ **Seed parameter defaults to null** (previously 0)
- ✅ **Seed image download retry logic** (handles 503 timeout errors)
- ✅ **API docs show correct examples** (seed: null, not 0)
---
## Future Enhancements
### Short-term
1. **Configurable prompt templates** via API parameter
2. **Async endpoint progress tracking** (WebSocket or polling)
3. **Batch ZIP download** with multiple documents in one archive
4. **Cost estimation** before generation (preview mode)
### Long-term
1. **Custom visual element upload** (user-provided logos, signatures)
2. **Multi-page document support** (currently single-page only)
3. **Additional export formats** (COCO, YOLO, HuggingFace Datasets)
4. **Fine-tuned handwriting styles** (train on a user's handwriting samples)
5. **LLM caching** (reduce cost for similar prompts)
---
## Troubleshooting
### Common Issues
**Q: "Handwriting service not called, but enable_handwriting=true"**
- Check: LLM output contains `class="handwritten"` in the HTML
- Check: `handwriting_ratio` > 0 (default 0.2)
- Check: `HANDWRITING_SERVICE_ENABLED=true` in the environment
- Debug: look for "🔍 DEBUG - Handwriting Service Check" in the logs

**Q: "RunPod job stuck IN_PROGRESS"**
- Cause: large batch timing out
- Solution: increase `HANDWRITING_SERVICE_TIMEOUT` (default 600s)
- Or: reduce the batch size by lowering `handwriting_ratio`

**Q: "503 first byte timeout" on seed download**
- Cause: temporary CDN/storage provider unavailability
- Solution: the retry logic handles this automatically (3 attempts)
- If it persists: use a different image host (Imgur, Cloudinary)

**Q: "Seed parameter still shows 0 in API docs"**
- Fixed: added `examples=[None, 42]` to the Field definition
- Clear the browser cache if you still see the old docs
---
## Testing
### Unit Tests
```bash
# Test individual stages
pytest api/tests/test_utils.py::test_download_seed_images
pytest api/tests/test_utils.py::test_handwriting_service_batch
```
### Integration Tests
```bash
# Test the sync endpoint (included in the repo)
python api/test_sync_pdf_api.py
# Test the async endpoint
python api/test_async_api.py
```
### Manual Testing via Docs UI
1. Navigate to `http://localhost:8000/docs`
2. Expand the `/generate/pdf` endpoint
3. Click "Try it out"
4. Paste the example request JSON
5. Click "Execute"
6. Download the resulting ZIP file
### Example Test Request (Minimal)
```json
{
  "seed_images": [
    "https://i.imgur.com/example.jpg"
  ],
  "prompt_params": {
    "language": "english",
    "doc_type": "invoice",
    "num_solutions": 1,
    "enable_handwriting": false,
    "enable_visual_elements": false,
    "enable_ocr": true,
    "enable_dataset_export": true
  }
}
```
---
## Conclusion
The DocGenie API implements all 19 stages of the original batch pipeline in a request/response model suited to real-time generation. Key architectural differences:
1. **Handwriting generation**: offloaded to RunPod serverless (cost-efficient batching)
2. **Seed selection**: user-provided URLs instead of a pre-crawled dataset
3. **State management**: ephemeral in-memory processing instead of file-based state
4. **Scalability**: horizontal scaling via FastAPI workers + async processing

The API maintains feature parity with the batch pipeline while providing a simpler interface for integration with external systems (web apps, mobile apps, data pipelines).

**Total Processing Time**: 25-50s (no handwriting) or 200-230s (with handwriting)
**Cost Per Document**: $0.015-0.08 depending on features
**Output Formats**: PDF, PNG, msgpack, ZIP archive

For questions or issues, see `api/README.md` or `TESTING.md`.