Complete API Flow Documentation
Overview
The DocGenie API provides three endpoints for synthetic document generation, implementing a 19-stage pipeline that transforms seed images and prompts into complete datasets with OCR, ground truth, and optional handwriting/visual elements.
Base URL: http://localhost:8000 (development) or Railway deployment
Documentation: /docs (FastAPI auto-generated Swagger UI)
API Endpoints
1. /generate - Legacy JSON Response (POST)
Purpose: Generate documents and return complete JSON metadata
Response: JSON with HTML, PDF (base64), bounding boxes, optional handwriting/visual elements
Use Case: Testing, development, full metadata inspection
Pipeline Stages: 1-19 (configurable via parameters)
2. /generate/pdf - Sync PDF+Dataset ZIP (POST)
Purpose: Generate documents and return ZIP file with all artifacts
Response: ZIP file containing:
- `*.pdf` - Generated document PDFs
- `*_final.pdf` - PDFs with handwriting/visual elements (if enabled)
- `*.msgpack` - Dataset format (if export enabled)
- `metadata.json` - Complete generation metadata
- `handwriting/` - Individual handwriting images
- `visual_elements/` - Individual visual element images
Use Case: Production dataset generation, batch processing
Pipeline Stages: 1-19 (all features available)
3. /generate/async - Async Batch Processing (POST)
Purpose: Queue large batch jobs via background worker (Redis Queue)
Response: Task ID for status polling
Status Check: GET /generate/async/status/{task_id}
Result Download: GET /generate/async/result/{task_id} (returns ZIP)
Use Case: Large-scale dataset generation (100+ documents)
Pipeline Stages: 1-19 (via worker.py)
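A minimal client sketch for the async endpoints might look like the following. The URL helpers mirror the routes documented above; the status-fetching callable is injected so the polling loop can be exercised without a live server (the `completed`/`failed` status strings are an assumption about the worker's payload, not confirmed by this doc):

```python
import time

BASE_URL = "http://localhost:8000"  # development base URL from this doc

def status_url(task_id: str) -> str:
    """Build the polling URL for an async generation task."""
    return f"{BASE_URL}/generate/async/status/{task_id}"

def result_url(task_id: str) -> str:
    """Build the result-download URL for a completed task."""
    return f"{BASE_URL}/generate/async/result/{task_id}"

def poll_until_done(get_status, task_id, interval=5.0, max_polls=60):
    """Poll a status-fetching callable until the task leaves the queue.

    `get_status` is any callable taking a URL and returning a status
    string (e.g. a thin wrapper around requests.get(...).json()).
    """
    for _ in range(max_polls):
        status = get_status(status_url(task_id))
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} still pending after {max_polls} polls")
```

With `requests` installed, `get_status` could be `lambda url: requests.get(url).json()["status"]`, followed by a GET on `result_url(task_id)` to download the ZIP.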
Request Parameters
class GenerateDocumentRequest:
seed_images: List[HttpUrl] # 1-8 seed images from web URLs
prompt_params: PromptParameters # Generation configuration
class PromptParameters:
# Core Parameters
language: str = "english" # Document language
doc_type: str = "invoice" # Document type (invoice, receipt, form, etc.)
gt_type: str = "qa" # Ground truth format (qa, kie)
gt_format: str = "json" # GT encoding (json, annotation)
num_solutions: int = 1 # Documents per seed set
# Feature Toggles (Stages 07-19)
enable_handwriting: bool = False # Stage 07-09, 12
handwriting_ratio: float = 0.2 # Probabilistic filter (0.0-1.0)
enable_visual_elements: bool = False # Stage 08, 10, 13
visual_element_types: List[str] = [] # Filter types: logo, photo, figure, barcode, etc.
enable_ocr: bool = True # Stage 15
enable_bbox_normalization: bool = True # Stage 16
enable_gt_verification: bool = False # Stage 17
enable_analysis: bool = False # Stage 18
enable_debug_visualization: bool = False # Stage 19
enable_dataset_export: bool = False # Stage 19 (msgpack format)
dataset_export_format: str = "msgpack" # Currently only msgpack supported
# Reproducibility
seed: Optional[int] = None # Random seed (null = random, int = reproducible)
Pipeline Architecture: The 19 Stages
The API implements all 19 stages of the original batch pipeline in docgenie/generation/. Each stage is mapped to corresponding functions in api/utils.py.
Phase 1: Core Pipeline (Stages 01-06)
Generate base documents from seed images and LLM prompts.
Stage 01: Seed Selection & Download
- Original: `pipeline_01_select_seeds.py`
- API: `download_seed_images()` in `api/utils.py:117-161`
- Process:
- Accept user-provided seed image URLs (1-8 images)
- Download with retry logic (3 attempts, exponential backoff)
- Handle transient HTTP errors (502, 503, 504, 429)
- Convert to base64 for LLM input
- Error Handling: Retry with 2s, 4s, 8s delays; raise HTTPException on failure
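The retry behavior described above can be sketched as follows. The `fetch` callable is injected for testability and is a hypothetical stand-in for the real HTTP call; the status codes and 2s/4s/8s schedule come from this document:

```python
import time

RETRYABLE = {429, 502, 503, 504}  # transient HTTP errors listed above

def download_with_retry(fetch, url, attempts=3, base_delay=2.0):
    """Retry a download on transient failures with exponential backoff.

    `fetch` is a callable returning (status_code, body_bytes), e.g. a
    wrapper around requests.get. Delays are 2s, 4s, 8s by default.
    """
    for attempt in range(attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE:
            break  # permanent error: don't retry
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s
    raise RuntimeError(f"failed to download {url} (last status {status})")
```

In the API the final `RuntimeError` would be translated into an `HTTPException`, per the error-handling note above.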
Stage 02: Prompt LLM
- Original: `pipeline_02_prompt_llm.py`
- API: `call_claude_api_direct()` in `api/utils.py:550-600`
- Process:
  - Load prompt template: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
  - Build prompt with parameters: language, doc_type, gt_type, num_solutions
  - Call Claude API (Anthropic Messages API v1)
    - Model: `claude-3-5-sonnet-20241022` (configurable)
    - Max tokens: 16,000
    - Temperature: 1.0
    - Vision: Send base64-encoded seed images
  - Receive HTML documents with embedded ground truth
- LLM Output Format: Multiple `<!DOCTYPE html>...</html>` blocks with:
  - CSS styling with page dimensions
  - HTML elements with semantic classes
  - Handwriting markers: `class="handwritten author1"` (author1, author2, etc.)
  - Visual element placeholders: `data-placeholder="logo"`, `data-content="company-logo"`
  - Ground truth: `<script id="GT">{...json...}</script>`
Stage 03: Process Response & Extract HTML
- Original: `pipeline_03_process_response.py`
- API: `extract_html_documents_from_response()` in `api/utils.py:605-635`
- Process:
  - Parse LLM response for `<!DOCTYPE html>...</html>` blocks (regex)
  - Prettify HTML with BeautifulSoup
  - Validate HTML structure
  - Extract ground truth JSON from the `<script id="GT">` tag
  - Remove GT script tag, clean HTML for rendering
- Validation: Check for required elements, CSS, proper structure
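The regex extraction step can be sketched like this. The exact patterns used in `api/utils.py` are not shown in this doc, so treat these as illustrative assumptions:

```python
import re
from typing import List

def extract_html_documents(llm_response: str) -> List[str]:
    """Pull each <!DOCTYPE html>...</html> block out of the raw LLM text.

    Non-greedy match with DOTALL so a block can span many lines;
    IGNORECASE tolerates <!doctype html> variants.
    """
    pattern = re.compile(r"<!DOCTYPE html>.*?</html>", re.DOTALL | re.IGNORECASE)
    return pattern.findall(llm_response)

def extract_ground_truth(html: str) -> str:
    """Grab the raw JSON payload of the <script id="GT"> tag, if present."""
    m = re.search(r'<script id="GT">(.*?)</script>', html, re.DOTALL)
    return m.group(1) if m else ""
```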
Stage 04: Render PDF & Extract Geometries
- Original: `pipeline_04_render_pdf_and_extract_geos.py`
- API: `render_html_to_pdf()` in `api/utils.py:650-740`
- Process:
  - Launch Playwright browser (Chromium)
  - Set page dimensions from the CSS `@page` rule
  - Render HTML to PDF via `page.pdf()`
  - Extract element geometries:
    - Handwriting elements: `.handwritten` class → `{rect, text, classes, selectorTypes: ["handwriting"]}`
    - Visual elements: `[data-placeholder]` attribute → `{rect, dataPlaceholder, dataContent, selectorTypes: ["visual_element"]}`
  - Save PDF and geometries JSON
- Output:
- PDF at 72 DPI (PyMuPDF standard)
- Geometries at 96 DPI (browser rendering)
- Dimensions in mm
Stage 05: Extract Bounding Boxes
- Original: `pipeline_05_extract_bboxes_from_pdf.py`
- API: `extract_bboxes_from_rendered_pdf()` in `api/utils.py:750-825`
- Process:
  - Open PDF with PyMuPDF (fitz)
  - Extract text at word level: `page.get_text("words")`
  - Structure bboxes as:

        {
          "text": "word",
          "x0": float,  # left
          "y0": float,  # top
          "x1": float,  # right (x2)
          "y1": float,  # bottom (y2)
          "block_no": int,
          "line_no": int,
          "word_no": int
        }

  - Filter whitespace-only text
  - Convert to OCRBox objects for processing
- Coordinate System: PDF points (72 DPI), origin top-left
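PyMuPDF's `page.get_text("words")` returns 8-tuples `(x0, y0, x1, y1, text, block_no, line_no, word_no)`; converting them into the dicts above is a pure transformation that can be sketched (and tested) without opening a PDF:

```python
def words_to_bboxes(words):
    """Convert PyMuPDF `page.get_text("words")` tuples into bbox dicts.

    Coordinates are PDF points (72 DPI), origin top-left.
    Whitespace-only words are dropped, as described above.
    """
    bboxes = []
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        if not text.strip():
            continue  # filter whitespace-only text
        bboxes.append({
            "text": text,
            "x0": x0, "y0": y0,  # left, top
            "x1": x1, "y1": y1,  # right, bottom
            "block_no": block_no,
            "line_no": line_no,
            "word_no": word_no,
        })
    return bboxes
```

In the real code the input would come from `fitz.open(pdf_path)[0].get_text("words")`.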
Stage 06: Validation
- Original: `pipeline_06_validation.py` (implicit)
- API: `validate_html_structure()`, `validate_pdf()`, `validate_bboxes()` in `api/utils.py:830-890`
- Checks:
- HTML: Required DOCTYPE, head, body, CSS
- PDF: File readable, page count = 1, has text
- Bboxes: Minimum count (configurable), valid coordinates
Phase 2: Feature Synthesis (Stages 07-13)
Add handwriting and visual elements to base documents.
Stage 07: Extract Handwriting Definitions
- Original: `pipeline_07_extract_handwriting.py`
- API: `process_stage3_complete()` section in `api/utils.py:1150-1235`
- Process:
  - Filter geometries: `"handwriting" in geo['selectorTypes']`
  - Parse classes: Extract `author1`, `author2`, etc. from `class="handwritten author1"`
  - Probabilistic filtering (handwriting_ratio):
    - `if random.random() > handwriting_ratio: continue  # Skip this element`
    - `ratio=0.0`: No handwriting (0%)
    - `ratio=0.5`: ~50% of marked elements
    - `ratio=1.0`: All marked elements (100%)
  - Match geometries to word bboxes:
    - Convert browser coords (96 DPI) to PDF coords (72 DPI): `scale = 72/96 = 0.75`
    - Find consecutive word bboxes matching geometry text
    - Check bboxes are within geometry rect (threshold: 0.7)
    - Track taken bbox indices to avoid duplicates
  - Build handwriting region definitions:

        {
          "id": "hw0",
          "text": "Patient Name",
          "author_id": "author1",
          "is_signature": False,
          "rect": {x, y, width, height},  # in points
          "bboxes": ["0_0_0 Patient 10.0 20.0 50.0 35.0", ...]
        }

- Reproducibility: Use `seed + i` for each region to maintain order consistency
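The ratio filter, DPI conversion, and per-region seeding can be sketched together. The geometry shape and field names are taken from the stage descriptions above; the function name is a hypothetical illustration, not the actual `api/utils.py` code:

```python
import random

BROWSER_TO_PDF = 72 / 96  # 0.75: browser px (96 DPI) -> PDF points (72 DPI)

def select_handwriting_regions(geometries, handwriting_ratio, seed=None):
    """Probabilistically keep handwriting-marked geometries and convert
    their rects from browser pixels to PDF points.

    Seeding with `seed + i` per region index keeps the kept/skipped
    pattern reproducible across runs, as described above.
    """
    selected = []
    for i, geo in enumerate(geometries):
        if "handwriting" not in geo["selectorTypes"]:
            continue
        if seed is not None:
            random.seed(seed + i)  # reproducible decision per region
        if random.random() > handwriting_ratio:
            continue  # skipped by the ratio filter
        selected.append({
            "id": f"hw{len(selected)}",
            "text": geo["text"],
            "rect": {k: v * BROWSER_TO_PDF for k, v in geo["rect"].items()},
        })
    return selected
```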
Stage 08: Extract Visual Element Definitions
- Original: `pipeline_08_extract_visual_element_definitions.py`
- API: `process_stage3_complete()` section in `api/utils.py:1237-1275`
- Process:
  - Filter geometries: `"visual_element" in geo['selectorTypes']`
  - Parse attributes:
    - `data-placeholder`: Element type (logo, photo, figure, chart, barcode, etc.)
    - `data-content`: Semantic description (e.g., "company-logo", "product-photo")
  - Normalize types using synonyms:
    - "chart" → "figure"
    - "image" → "photo"
  - Filter by the `visual_element_types` parameter (if specified)
  - Convert coordinates: pixels (96 DPI) → mm
  - Extract rotation from CSS `transform: rotate(Xdeg)`
  - Build visual element definitions:

        {
          "id": "ve0",
          "type": "logo",  # normalized
          "content": "company-logo",
          "rect": {x, y, width, height},  # in mm
          "rotation": 0  # degrees
        }
Stage 09: Create Handwriting Images
- Original: `pipeline_09_create_handwriting_images.py`
- API: `call_handwriting_service_batch()` in `api/utils.py:785-920`
- Handwriting Service: RunPod serverless endpoint hosting the WordStylist diffusion model
- Service Implementation: `handwriting_service/handler.py`, `handwriting_service/inference.py`
Handwriting Service Integration Details
Service Architecture
- Platform: RunPod Serverless (GPU: NVIDIA A4000, Cost: ~$0.00025/s active)
- Model: WordStylist (Diffusion-based handwriting synthesis)
- Architecture: UNet with conditional style embeddings
- Input: Text (A-Z, a-z only, no spaces), Writer style ID (0-656)
- Output: PNG image with transparent background
- Inference time: ~18s per text on A4000
- Weights: `handwriting_service/WordStylist/models/`
- Endpoints:
  - `/run` (async): Queue job, return ID, poll `/status/{id}` (10MB limit)
  - `/runsync` (sync): Wait for completion, return result (20MB limit, used by API)
Batch Processing (Cost Optimization)
The API uses TRUE batch processing to minimize RunPod activation overhead:
# ✅ NEW: Batch all texts in ONE request
runpod_request = {
"input": {
"texts": [
{"text": "Hello", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
{"text": "World", "author_id": 42, "hw_id": "hw0_b0_l0_w1"},
# ... 10-100 texts
],
"apply_blur": True
}
}
# Result: 1 worker activation × (N × 18s) = ~40-60% cost savings
Cost Comparison for 10 texts:
- ❌ OLD (parallel): 10 workers × 18s = 180 worker-seconds + 10× activation fee
- ✅ NEW (batched): 1 worker × 190s = 190 worker-seconds + 1× activation fee
API Processing Flow
1. Group by region and line: Split handwriting regions into word-level requests

       # Text: "Patient Name" → 2 word-level generations
       texts_to_generate = [
           {"text": "Patient", "author_id": 42, "hw_id": "hw0_b0_l0_w0"},
           {"text": "Name", "author_id": 42, "hw_id": "hw0_b0_l0_w1"}
       ]

2. Map author IDs to numeric styles:

       # "author1" → WRITER_STYLES[1] = 42 (deterministic)
       # "author2" → WRITER_STYLES[2] = 137
       # 657 total writer styles available

3. Sanitize text (WordStylist constraint):

       # Only A-Z, a-z allowed (no spaces, numbers, punctuation)
       "Hello123!" → "Hello"
       "first-name" → "firstname"

4. Send batch request to the RunPod `/runsync` endpoint:

       POST https://api.runpod.ai/v2/{endpoint_id}/runsync
       Authorization: Bearer {RUNPOD_API_KEY}
       Content-Type: application/json

       {
         "input": {
           "texts": [...],
           "apply_blur": True  # Gaussian blur for realism
         }
       }

5. Handle async responses:
   - If `status: "IN_PROGRESS"`: Poll `/status/{job_id}` every 5-10s (max 30 polls)
   - If `status: "COMPLETED"`: Extract `output.images[]`
   - If `status: "FAILED"`: Raise exception (stops entire generation)

6. Response format:

       {
         "status": "COMPLETED",
         "output": {
           "images": [
             {
               "image_base64": "iVBORw0KGgoAAAANSU...",
               "width": 200,
               "height": 64,
               "text": "Patient",
               "author_id": 42,
               "hw_id": "hw0_b0_l0_w0"
             },
             ...
           ],
           "total_generated": 2
         }
       }

7. Store generated images: Map `hw_id → image_base64` for insertion
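The sanitization step above is a one-liner worth making explicit, since WordStylist rejects anything outside A-Z/a-z (the function name is illustrative):

```python
import re

def sanitize_for_wordstylist(text: str) -> str:
    """Strip everything WordStylist can't render: only A-Z and a-z
    survive (no spaces, digits, or punctuation)."""
    return re.sub(r"[^A-Za-z]", "", text)
```

Note that this can produce an empty string (e.g. for purely numeric input), so callers would likely skip such words rather than send them to the service.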
Error Handling
- Retry logic: 3 attempts with exponential backoff (matching seed download)
- Timeouts: Dynamic based on batch size: `20s × num_texts + 30s buffer`
- Failure behavior: RAISE EXCEPTION (since session fix)
  - ❌ OLD: Silent continue → Documents without handwriting
  - ✅ NEW: Raise exception → Generation fails when the user requested handwriting
Service Code Structure
handwriting_service/handler.py (RunPod handler):
# Initialize model ONCE at module level (not per request)
generator = HandwritingGenerator(
model_dir="WordStylist",
checkpoint_path="WordStylist/models",
device="cuda"
)
def handler(job):
"""RunPod entry point - supports both /run and /runsync"""
texts = job["input"]["texts"] # Batch input
results = generator.generate_batch(
texts=[t["text"] for t in texts],
author_ids=[t["author_id"] for t in texts],
num_inference_steps=50,
temperature=1.0,
apply_blur=True
)
return {"images": results, "total_generated": len(results)}
handwriting_service/inference.py (WordStylist wrapper):
class HandwritingGenerator:
def generate_batch(self, texts, author_ids, ...):
results = []
for text, author_id in zip(texts, author_ids):
# Load model checkpoint
unet = Unet(...)
unet.load_state_dict(checkpoint)
# Prepare style condition
style_id_tensor = torch.tensor([author_id])
# Diffusion reverse process (50 steps)
img = self.sample(unet, style_id_tensor, text_length=len(text))
# Post-process: crop, resize, apply blur
img_pil = postprocess_image(img)
if apply_blur:
img_pil = img_pil.filter(ImageFilter.GaussianBlur(1.2))
# Encode to base64
img_base64 = encode_pil_to_base64(img_pil)
results.append({
"image_base64": img_base64,
"width": img_pil.width,
"height": img_pil.height
})
return results
Stage 10: Create Visual Element Images
- Original: `pipeline_10_create_visual_elements.py`
- API: `generate_visual_element_images()` in `api/utils.py:925-1020`
- Process:
  - Load prefab images from `data/visual_element_prefabs/{type}/`:
    - `logo/`: Company logos (50+ SVGs)
    - `photo/`: Stock photos (100+ JPGs)
    - `figure/`: Charts, graphs (30+ PNGs)
    - `barcode/`: Generated barcodes
    - `qr_code/`, `stamp/`, `signature/`, `checkbox/`, etc.
  - Random selection (seed-based if provided):

        if seed is not None:
            random.seed(seed)
        prefab_path = random.choice(list(prefab_dir.glob("*")))

  - Special handling:
    - Barcode: Generate on-the-fly using the `python-barcode` library

          # Generate random EAN-13 barcode (12 digits + checksum)
          barcode_num = random.randint(100000000000, 999999999999)
          barcode = EAN13(str(barcode_num), writer=ImageWriter())

    - QR Code: Generate using the `qrcode` library
    - Checkbox: Render checked/unchecked SVG
  - Load and convert to base64:

        with open(prefab_path, 'rb') as f:
            img_bytes = f.read()
        img_base64 = base64.b64encode(img_bytes).decode('utf-8')

  - Return mapping: `ve_id → image_base64`
Stage 11: Make Text Transparent (Implicit)
- Original: `pipeline_11_make_text_transparent.py`
- API: Implemented as "whiteout" in `process_stage3_complete()` at `api/utils.py:1415-1427`
- Process:

      # Draw white rectangles over original text to hide it
      for hw_region in handwriting_regions:
          for bbox_str in hw_region['bboxes']:
              bbox = parse_bbox(bbox_str)
              rect = fitz.Rect(bbox.x0, bbox.y0, bbox.x2, bbox.y2)
              page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))  # White fill

- Why not transparent?: PyMuPDF doesn't support making existing text transparent, so white rectangles are used instead (same visual result)
Stage 12: Insert Handwriting Images
- Original: `pipeline_12_insert_handwriting_images.py`
- API: `process_stage3_complete()` section in `api/utils.py:1429-1520`
- Process:
  1. Position calculation:

         # Get word bbox from PDF extraction
         bbox_w = bbox.x2 - bbox.x0  # Width in points
         bbox_h = bbox.y2 - bbox.y0  # Height in points

         # Resize handwriting image with aspect ratio
         scale = min(bbox_w / img_width, bbox_h / img_height)
         new_w = int(img_width * scale * SCALE_UP_FACTOR)  # 3x upscale
         new_h = int(img_height * scale * SCALE_UP_FACTOR)

         # Add random offsets for natural variation
         offset_x = random.randint(-MAX_OFFSET_LEFT, MAX_OFFSET_RIGHT) + FIXED_OFFSET
         offset_y = random.randint(-MAX_OFFSET_UP, MAX_OFFSET_DOWN)

         # Position at bbox coordinates
         x0 = bbox.x0 + offset_x
         y0 = bbox.y0 + offset_y - y_padding

  2. Insert into PDF:

         img_resized = img.resize((new_w, new_h), Image.LANCZOS).convert("RGBA")
         img_bytes = pil_to_bytes(img_resized)
         rect = fitz.Rect(x0, y0, x0 + bbox_w, y0 + bbox_h)
         page.insert_image(rect, stream=img_bytes)

  3. Save intermediate PDF: `{doc_id}_with_handwriting.pdf`
Stage 13: Insert Visual Elements
- Original: `pipeline_13_insert_visual_elements.py`
- API: `process_stage3_complete()` section in `api/utils.py:1523-1625`
- Process:
  - Convert mm → points: `mm_to_pt = 72 / 25.4`
  - Resize with aspect ratio preservation (same as handwriting)
  - Center image on white background (maintains bbox size)
  - Insert into PDF at geometry coordinates
  - Save final PDF: `{doc_id}_final.pdf` (includes both handwriting + visual elements)
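The unit conversion above follows from 1 inch = 25.4 mm = 72 PDF points, and is small enough to sketch directly:

```python
MM_TO_PT = 72 / 25.4  # 1 inch = 25.4 mm = 72 PDF points

def mm_rect_to_points(rect_mm):
    """Convert a visual-element rect from millimetres (Stage 08 output)
    to PDF points for insertion with PyMuPDF."""
    return {k: v * MM_TO_PT for k, v in rect_mm.items()}
```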
Phase 3: Image Finalization & OCR (Stages 14-15)
Convert final PDF to high-resolution image and extract OCR data.
Stage 14: Render Image
- Original: `pipeline_14_render_image.py`
- API: `process_stage4_ocr()` in `api/utils.py:1899-1940`
- Process:

      # Render PDF page to high-res PNG
      page = fitz.open(pdf_path)[0]
      pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))  # 3x scale = ~220 DPI
      img_bytes = pix.tobytes("png")
      img_base64 = base64.b64encode(img_bytes).decode('utf-8')

- Output: Base64-encoded PNG at 220 DPI (configurable via scale factor)
Stage 15: Perform OCR
- Original: `pipeline_15_perform_ocr.py`
- API: `run_paddle_ocr()` in `api/utils.py:1950-2080`
- OCR Engine: PaddleOCR v4 (multilingual)
  - Models: PP-OCRv4 detection + recognition
  - Languages: Supports 80+ languages
  - Accuracy: State-of-the-art open-source OCR
- Process:
  - Render PDF to image via `pdf2image` at the specified DPI (default: 300)
  - Initialize PaddleOCR with the language parameter
  - Run detection + recognition:

        ocr = PaddleOCR(lang=language, use_gpu=True)
        results = ocr.ocr(img_array, cls=True)

  - Parse results into word-level bboxes:

        {
          "text": "word",
          "bbox": {
            "x0": float,
            "y0": float,
            "x1": float,  # right
            "y1": float   # bottom
          },
          "confidence": 0.95
        }

- Output: Dictionary with `words` list, image dimensions, OCR engine info
Phase 4: Dataset Packaging (Stages 16-19)
Normalize, verify, analyze, and export final dataset.
Stage 16: Normalize Bboxes
- Original: `pipeline_16_normalize_bboxes.py`
- API: `normalize_bboxes()` in `api/utils.py:2100-2180`
- Process:
  - Convert absolute pixel coordinates → normalized [0, 1] range:

        norm_bbox = [
            bbox['x0'] / img_width,
            bbox['y0'] / img_height,
            bbox['x1'] / img_width,
            bbox['y1'] / img_height
        ]

  - Clip to [0, 1]: `[max(0, min(1, x)) for x in norm_bbox]`
  - Create word-level and segment-level bboxes
- Output: List of `{text, bbox: [x0, y0, x1, y1]}` where bbox is normalized
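The scale-and-clip steps above combine into one small pure function, shown here as an illustrative sketch:

```python
def normalize_bbox(bbox, img_width, img_height):
    """Scale absolute pixel coordinates into [0, 1] and clip.

    Clipping guards against OCR boxes that slightly overshoot the
    image edges.
    """
    norm = [
        bbox["x0"] / img_width,
        bbox["y0"] / img_height,
        bbox["x1"] / img_width,
        bbox["y1"] / img_height,
    ]
    return [max(0.0, min(1.0, v)) for v in norm]
```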
Stage 17: Ground Truth Verification
- Original: `pipeline_17_gt_preparation_verification.py`
- API: `verify_ground_truth()` in `api/utils.py:2185-2250`
- Checks:
- GT structure: Valid JSON, required fields
- Text matching: GT text exists in OCR output
- Bbox coverage: GT answers have corresponding bboxes
- Output: Verification report with pass/fail status
Stage 18: Analyze
- Original: `pipeline_18_analyze.py`
- API: `analyze_document()` in `api/utils.py:2255-2320`
- Metrics:
- Word count, character count
- Average word length
- Handwriting regions count, coverage %
- Visual elements count by type
- OCR confidence statistics (mean, min, max)
- Output: Analysis dictionary with computed metrics
Stage 19: Create Debug Data & Export
- Original: `pipeline_19_create_debug_data.py`
- API: `export_to_msgpack()` in `api/utils.py:2350-2520`
- Debug Visualization:
- Draw bboxes on image with different colors:
- Green: Word bboxes
- Red: Handwriting regions
- Blue: Visual elements
- Yellow: Ground truth target regions
- Save annotated image
- Dataset Export (msgpack):

      dataset_entry = {
          "image": img_bytes,  # PNG bytes
          "words": ["hello", "world"],
          "word_bboxes": [[0.1, 0.2, 0.15, 0.25], ...],  # Normalized
          "segment_bboxes": [...],
          "ground_truth": {"question": "answer"},
          "metadata": {
              "document_id": "...",
              "has_handwriting": True,
              "num_visual_elements": 3
          }
      }
      msgpack.dump(dataset_entry, f)

- Output: `.msgpack` file compatible with PyTorch DataLoader
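A pack/unpack roundtrip for one entry might look like this (requires the `msgpack` package; the helper names are illustrative, and the field set mirrors the entry shown above):

```python
import msgpack

def pack_entry(image_bytes, words, word_bboxes, ground_truth, metadata):
    """Serialize one dataset entry to msgpack bytes.

    use_bin_type=True keeps raw PNG bytes distinct from UTF-8 strings.
    """
    entry = {
        "image": image_bytes,
        "words": words,
        "word_bboxes": word_bboxes,
        "ground_truth": ground_truth,
        "metadata": metadata,
    }
    return msgpack.packb(entry, use_bin_type=True)

def unpack_entry(blob):
    """Deserialize, e.g. inside a torch Dataset's __getitem__."""
    return msgpack.unpackb(blob, raw=False)
```

A PyTorch `Dataset` would typically call `unpack_entry` per item and decode the `image` bytes with PIL before transforming to tensors.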
Pipeline Verification: API vs Original Implementation
β Stage-by-Stage Mapping
| Stage | Original File | API Function | Status |
|---|---|---|---|
| 01 | `pipeline_01_select_seeds.py` | `download_seed_images()` | ✅ Mapped (with retry logic) |
| 02 | `pipeline_02_prompt_llm.py` | `call_claude_api_direct()` | ✅ Mapped (uses Messages API) |
| 03 | `pipeline_03_process_response.py` | `extract_html_documents_from_response()` | ✅ Mapped |
| 04 | `pipeline_04_render_pdf_and_extract_geos.py` | `render_html_to_pdf()` | ✅ Mapped (Playwright) |
| 05 | `pipeline_05_extract_bboxes_from_pdf.py` | `extract_bboxes_from_rendered_pdf()` | ✅ Mapped |
| 06 | `pipeline_06_validation.py` | `validate_html_structure()`, `validate_pdf()` | ✅ Mapped |
| 07 | `pipeline_07_extract_handwriting.py` | `process_stage3_complete()` section | ✅ Mapped (with ratio filter) |
| 08 | `pipeline_08_extract_visual_element_definitions.py` | `process_stage3_complete()` section | ✅ Mapped |
| 09 | `pipeline_09_create_handwriting_images.py` | `call_handwriting_service_batch()` | ✅ Mapped (RunPod integration) |
| 10 | `pipeline_10_create_visual_elements.py` | `generate_visual_element_images()` | ✅ Mapped |
| 11 | `pipeline_11_make_text_transparent.py` | `process_stage3_complete()` (whiteout) | ✅ Mapped (white rectangles) |
| 12 | `pipeline_12_insert_handwriting_images.py` | `process_stage3_complete()` section | ✅ Mapped |
| 13 | `pipeline_13_insert_visual_elements.py` | `process_stage3_complete()` section | ✅ Mapped |
| 14 | `pipeline_14_render_image.py` | `process_stage4_ocr()` | ✅ Mapped |
| 15 | `pipeline_15_perform_ocr.py` | `run_paddle_ocr()` | ✅ Mapped |
| 16 | `pipeline_16_normalize_bboxes.py` | `normalize_bboxes()` | ✅ Mapped |
| 17 | `pipeline_17_gt_preparation_verification.py` | `verify_ground_truth()` | ✅ Mapped |
| 18 | `pipeline_18_analyze.py` | `analyze_document()` | ✅ Mapped |
| 19 | `pipeline_19_create_debug_data.py` | `export_to_msgpack()` | ✅ Mapped |
Key Differences: API vs Batch Pipeline
Processing Model
Original: Batch processing with file-based state management
- Input: CSV of seed selections, prompt parameters in JSON
- Output: Folder structure with intermediate files
- State: JSON logs per document + message
- Resumability: Can restart from any stage
API: Request/response with in-memory processing
- Input: JSON request with seed URLs
- Output: JSON response or ZIP file
- State: Ephemeral (temporary directories)
- Resumability: None (single-shot generation)
Handwriting Generation
Original: Local GPU with WordStylist model loaded in-process
- Location: `docgenie/generation/handwriting_diffusion/`
- Execution: `generate_handwriting_diffusion_raw.py`
- Cost: Free (local GPU)

API: Remote RunPod serverless endpoint
- Location: `handwriting_service/` (deployed separately)
- Execution: HTTP POST to RunPod API
- Cost: ~$0.00025/s GPU time (pay-per-use)
- Benefit: No local GPU required, scales automatically
Seed Selection
Original: Pre-crawled dataset with systematic selection
- Seeds stored in: `data/datasets/base_v2/`
- Selection: Clustering algorithm → balanced subset
- Tracking: CSV manifest with seed IDs
API: User-provided URLs
- Seeds: Any publicly accessible image URL
- Selection: User chooses 1-8 images per request
- Tracking: URLs stored in request metadata
Prompt Templates
Original: Multiple template versions in folders
- Path: `data/prompt_templates/{version}/seed-based-json.txt`
- Versioning: ClaudeRefined1 → ClaudeRefined12
- Selection: Configurable per dataset
API: Fixed template (latest version)
- Path: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt`
- Hardcoded in: `api/main.py:171`
- Future improvement: Make the template selectable via an API parameter
Complete Request Flow Example
Example Request (Sync Endpoint)
POST /generate/pdf HTTP/1.1
Content-Type: application/json
{
"seed_images": [
"https://example.com/seed1.jpg",
"https://example.com/seed2.jpg"
],
"prompt_params": {
"language": "english",
"doc_type": "medical_form",
"gt_type": "kie",
"gt_format": "json",
"num_solutions": 2,
"enable_handwriting": true,
"handwriting_ratio": 0.3,
"enable_visual_elements": true,
"visual_element_types": ["logo", "signature"],
"enable_ocr": true,
"enable_dataset_export": true,
"seed": 42
}
}
Processing Flow (Stages Executed)
Phase 1: Core Document Generation (30-60s)
1. ✅ Download 2 seed images with retry → `[img1_b64, img2_b64]`
2. ✅ Load prompt template → Build prompt for medical_form + KIE
3. ✅ Call Claude API → LLM generates 2 HTML documents (~25s)
4. ✅ Extract HTML + ground truth → 2 clean HTML files with GT JSON
5. ✅ Render each HTML to PDF via Playwright → 2 PDFs + geometries
6. ✅ Extract word bboxes from PDFs → ~200-500 words per document
Phase 2: Feature Synthesis (120-180s if handwriting enabled)
7. ✅ Parse geometries for handwriting markers
   - Found: 12 elements with `class="handwritten"`
   - Filtered by ratio: 12 × 0.3 = ~4 elements selected (probabilistic)
   - Matched to word bboxes: 4 regions with 15 total words
8. ✅ Parse geometries for visual elements
   - Found: 3 elements (`data-placeholder="logo"`, `"signature"`, `"logo"`)
   - Filtered by types: Keep logo + signature, remove others
   - Result: 2 visual element definitions
9. ✅ Generate handwriting images via RunPod
   - Batch request: 15 words in ONE API call
   - Map author IDs: `author1 → style 42`, `author2 → style 137`
   - RunPod processing: 1 worker × (15 × 18s) = ~270s
   - Result: 15 PNG images (base64-encoded)
10. ✅ Generate visual element images
    - Logo: Random selection from `data/visual_element_prefabs/logo/` (seed=42)
    - Signature: Generate on-the-fly using signature prefab
    - Result: 2 PNG images
11. ✅ Whiteout original text: Draw white rectangles over 15 word positions
12. ✅ Insert handwriting: Place 15 generated images at word bboxes with offsets
    - Save: `doc1_with_handwriting.pdf`, `doc2_with_handwriting.pdf`
13. ✅ Insert visual elements: Place logo + signature at geometry coords
    - Save: `doc1_final.pdf`, `doc2_final.pdf`
Phase 3: Image + OCR (5-10s)
14. ✅ Render each final PDF to a 220 DPI image → 2 PNG files (base64)
15. ✅ Run PaddleOCR on each image
    - Doc1: Detected 187 words, avg confidence 0.91
    - Doc2: Detected 203 words, avg confidence 0.94
Phase 4: Dataset Packaging (2-5s)
16. ✅ Normalize OCR bboxes: Convert pixels → [0, 1] range
17. ✅ Verify ground truth: Check GT fields match OCR output (enabled=false, skipped)
18. ✅ Analyze documents: Compute metrics (enabled=false, skipped)
19. ✅ Export to msgpack:
    - Doc1: Pack image + words + normalized bboxes + GT → doc1.msgpack
    - Doc2: Pack image + words + normalized bboxes + GT → doc2.msgpack
Final Output: ZIP File Contents
dataset.zip
├── doc1_uuid_0.pdf           # Original rendered PDF
├── doc1_uuid_0_final.pdf     # PDF with handwriting + visual elements
├── doc1_uuid_0.msgpack       # Dataset format
├── doc2_uuid_1.pdf
├── doc2_uuid_1_final.pdf
├── doc2_uuid_1.msgpack
├── metadata.json             # Complete generation metadata
└── handwriting/
    ├── hw0_b0_l0_w0.png      # Individual handwriting images
    ├── hw0_b0_l0_w1.png
    └── ... (13 more)
Response (JSON Metadata)
{
"task_id": "uuid-here",
"status": "completed",
"num_documents": 2,
"processing_time_seconds": 305.7,
"stages_completed": [
"seed_download", "llm_prompt", "html_extraction",
"pdf_render", "bbox_extraction", "handwriting_extraction",
"visual_element_extraction", "handwriting_generation",
"visual_element_generation", "handwriting_insertion",
"visual_element_insertion", "image_render", "ocr",
"bbox_normalization", "dataset_export"
],
"documents": [
{
"document_id": "doc1_uuid_0",
"ground_truth": {"patient_name": "John Doe", "date": "2024-01-15"},
"num_words": 187,
"num_handwriting_regions": 2,
"num_visual_elements": 2,
"ocr_confidence_avg": 0.91
},
{
"document_id": "doc2_uuid_1",
"ground_truth": {"patient_name": "Jane Smith", "date": "2024-01-16"},
"num_words": 203,
"num_handwriting_regions": 2,
"num_visual_elements": 2,
"ocr_confidence_avg": 0.94
}
],
"download_url": "/download/dataset_uuid.zip"
}
Configuration & Environment
Required Environment Variables
# LLM API
ANTHROPIC_API_KEY=sk-ant-... # Claude API key
CLAUDE_MODEL=claude-3-5-sonnet-20241022 # Default model
# Handwriting Service (RunPod)
HANDWRITING_SERVICE_ENABLED=true
HANDWRITING_SERVICE_URL=https://api.runpod.ai/v2/{endpoint_id}/runsync
RUNPOD_API_KEY=... # RunPod API key
HANDWRITING_APPLY_BLUR=true # Gaussian blur for realism
HANDWRITING_SERVICE_MAX_RETRIES=3
HANDWRITING_SERVICE_TIMEOUT=600 # 10 minutes for large batches
# OCR Configuration
OCR_DPI=300 # Image resolution for OCR
OCR_LANGUAGE=en # PaddleOCR language code
# File Paths
PROMPT_TEMPLATES_DIR=/path/to/data/prompt_templates
VISUAL_ELEMENT_PREFABS_DIR=/path/to/data/visual_element_prefabs
Docker Deployment (Railway)
# Dockerfile (api service)
FROM python:3.11-slim
# chromium + chromium-driver: Playwright dependencies
# libgl1 + libglib2.0-0: PaddleOCR dependencies
RUN apt-get update && apt-get install -y \
    chromium chromium-driver \
    libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*
COPY api/ /app/api
COPY docgenie/ /app/docgenie
COPY data/ /app/data
WORKDIR /app/api
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Handwriting service: See handwriting_service/Dockerfile (deployed separately to RunPod)
Performance & Costs
Timing Breakdown (Single Document)
| Stage | Time | Notes |
|---|---|---|
| Seed download | 0.5-2s | Depends on image size + network |
| LLM prompt | 20-40s | Claude API latency |
| PDF render | 1-3s | Playwright initialization |
| Handwriting (10 words) | 180s | RunPod: 1 worker × (10 × 18s) |
| Visual elements | 0.5-1s | Local file selection |
| OCR | 3-5s | PaddleOCR inference |
| Dataset export | 0.5-1s | msgpack serialization |
| TOTAL (no handwriting) | 25-50s | |
| TOTAL (with handwriting) | 200-230s | Batched |
Cost Breakdown (Per Document)
| Component | Cost | Notes |
|---|---|---|
| Claude API | $0.015-0.03 | ~5K input + 16K output tokens |
| RunPod GPU (10 words) | $0.045 | 180s × $0.00025/s |
| Storage | Negligible | Temporary files deleted |
| TOTAL (no handwriting) | $0.015-0.03 | |
| TOTAL (with handwriting) | $0.06-0.08 | |
Optimization: Batch multiple documents in ONE RunPod call to share worker activation overhead.
Error Handling & Reliability
Retry Mechanisms
- Seed image download: 3 attempts, exponential backoff (2s, 4s, 8s)
- Handwriting service: 3 attempts, status polling up to 30 times
- LLM API: Built-in Anthropic SDK retries (rate limits, 529 errors)
Failure Modes
| Error Type | Behavior | User Impact |
|---|---|---|
| Seed download failure | Raise HTTP 400 | Request rejected immediately |
| LLM API error | Raise HTTP 500 | No charge, can retry |
| Handwriting service failure | Raise exception (NEW) | Generation fails, prevents invalid outputs |
| OCR failure | Log warning, continue | Document generated without OCR data |
| PDF render failure | Raise HTTP 500 | Request fails, no partial results |
Session Fixes Applied
- ✅ Handwriting service failure now raises an exception (previously silent)
- ✅ Seed parameter defaults to null (previously 0)
- ✅ Seed image download retry logic (handles 503 timeout errors)
- ✅ API docs show correct examples (seed: null, not 0)
Future Enhancements
Short-term
- Configurable prompt templates via API parameter
- Async endpoint progress tracking (websocket or polling)
- Batch ZIP download with multiple documents in one archive
- Cost estimation before generation (preview mode)
Long-term
- Custom visual element upload (user-provided logos, signatures)
- Multi-page document support (currently single-page only)
- Additional export formats (COCO, YOLO, HuggingFace Datasets)
- Fine-tuning handwriting styles (train on user's handwriting samples)
- LLM caching (reduce cost for similar prompts)
Troubleshooting
Common Issues
Q: "Handwriting service not called, but enable_handwriting=true"
- Check: LLM output contains `class="handwritten"` in the HTML
- Check: `handwriting_ratio` > 0 (default 0.2)
- Check: `HANDWRITING_SERVICE_ENABLED=true` in the environment
- Debug: Look for "DEBUG - Handwriting Service Check" in the logs
Q: "RunPod job stuck IN_PROGRESS"
- Cause: Large batch timing out
- Solution: Increase `HANDWRITING_SERVICE_TIMEOUT` (default 600s)
- Or: Reduce batch size by lowering `handwriting_ratio`
Q: "503 first byte timeout" on seed download
- Cause: CDN/storage provider temporary unavailability
- Solution: Retry logic automatically handles this (3 attempts)
- If persists: Use different image hosting (imgur, cloudinary)
Q: "Seed parameter still shows 0 in API docs"
- Fixed: Added `examples=[None, 42]` to the Field definition
- Clear browser cache if seeing old docs
Testing
Unit Tests
# Test individual stages
pytest api/tests/test_utils.py::test_download_seed_images
pytest api/tests/test_utils.py::test_handwriting_service_batch
Integration Tests
# Test sync endpoint (included in repo)
python api/test_sync_pdf_api.py
# Test async endpoint
python api/test_async_api.py
Manual Testing via Docs UI
- Navigate to `http://localhost:8000/docs`
- Expand the `/generate/pdf` endpoint
- Paste example request JSON
- Click "Execute"
- Download resulting ZIP file
Example Test Request (Minimal)
{
"seed_images": [
"https://i.imgur.com/example.jpg"
],
"prompt_params": {
"language": "english",
"doc_type": "invoice",
"num_solutions": 1,
"enable_handwriting": false,
"enable_visual_elements": false,
"enable_ocr": true,
"enable_dataset_export": true
}
}
Conclusion
The DocGenie API successfully implements all 19 stages of the original batch pipeline in a request/response model suitable for real-time generation. Key architectural differences:
- Handwriting generation: Offloaded to RunPod serverless (cost-efficient batching)
- Seed selection: User-provided URLs instead of pre-crawled dataset
- State management: Ephemeral in-memory processing vs file-based
- Scalability: Horizontal scaling via FastAPI workers + async processing
The API maintains feature parity with the batch pipeline while providing a simpler interface for integration with external systems (web apps, mobile apps, data pipelines).
Total Processing Time: 25-50s (no handwriting) or 200-230s (with handwriting)
Cost Per Document: $0.015-0.08 depending on features
Output Formats: PDF, PNG, msgpack, ZIP archive
For questions or issues, see api/README.md or TESTING.md.