PDF Atomic Parser

Atomically parse and understand complex PDF documents using claude-opus-4-6 (Anthropic).
Handles equations, graphs, algorithms, unique drawings, multi-column layouts, scanned pages, and 100+ page documents without hallucination.

Designed to be dropped into local agent pipelines as a callable module.

What Makes This Work

Claude processes PDFs natively through Anthropic's document API. Each page is sent as a base64-encoded PDF chunk (or rendered at 300 DPI in image mode) alongside a structured JSON extraction prompt. The model simultaneously sees:

The rasterized visual content (charts, graphs, drawings, handwriting)
The underlying text layer (searchable text, equations, captions)

This dual perception eliminates the need for separate OCR, layout parsers, or equation recognizers. The model returns fully structured JSON containing LaTeX equations, Markdown tables, verbatim algorithm code, and semantic figure descriptions per page.

Features

Feature	Description
Native PDF API	Sends PDF bytes directly; Claude sees both text and visuals
Image mode	Renders pages at 300 DPI via PyMuPDF for maximum fidelity
LaTeX equations	Every equation extracted as proper LaTeX
Table extraction	Tables as Markdown and list-of-dicts JSON
Algorithm extraction	Pseudocode and code blocks verbatim with language detection
Figure description	Semantic descriptions of charts, plots, diagrams, drawings
SQLite caching	Pages are cached; re-runs skip already-parsed pages
Chunked processing	Handles 100+ page documents by splitting into chunks
Multiple output formats	JSON, Markdown, plain text
Agent interface	`AgentPDFInterface` class for programmatic use
Batch processing	Process entire directories of PDFs

Requirements

Python 3.10 or higher
An Anthropic API key with access to claude-opus-4-6
No GPU required; all inference runs through the Anthropic API

External System Dependencies

PyMuPDF (installed via pip) requires no external system libraries on most platforms. On some Linux systems you may need:

sudo apt-get install -y libmupdf-dev

On macOS:

brew install mupdf

On Windows: PyMuPDF ships with pre-built wheels on PyPI; no additional steps needed.

Installation

git clone https://github.com/algorembrant/pdf-atomic-parser.git
cd pdf-atomic-parser

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install -r requirements.txt

Set your API key:

export ANTHROPIC_API_KEY="sk-ant-..."   # Linux / macOS
set  ANTHROPIC_API_KEY=sk-ant-...       # Windows CMD
$env:ANTHROPIC_API_KEY="sk-ant-..."     # Windows PowerShell

Quick Start

Parse a PDF

python pdf_atomic_parser.py parse document.pdf

Outputs document_parsed.json in the current directory.

Full Atomic Extraction (JSON + Markdown + Text)

python pdf_atomic_parser.py atomic document.pdf --output ./results/

Ask a Question

python pdf_atomic_parser.py query document.pdf "What is the main loss function?"

Extract Only Equations

python pdf_atomic_parser.py extract-equations document.pdf

Use in an Agent Pipeline

from pdf_atomic_parser import AgentPDFInterface

agent = AgentPDFInterface(model="opus")

# Full structured parse
result = agent.parse("paper.pdf")

# Just equations as list of dicts
equations = agent.get_equations("paper.pdf")
for eq in equations:
    print(f"Page {eq['page']}: {eq['latex']}")

# Just tables
tables = agent.get_tables("paper.pdf")

# Semantic query
answer = agent.ask("paper.pdf", "What datasets were used for evaluation?")
print(answer)

Usage Reference

Command Overview

Command	Purpose
`parse <pdf>`	Parse entire PDF to JSON/Markdown/text
`atomic <pdf>`	Full extraction to output directory (all formats)
`extract-equations <pdf>`	Extract LaTeX equations only
`extract-tables <pdf>`	Extract tables only
`extract-algorithms <pdf>`	Extract algorithms and code blocks only
`extract-figures <pdf>`	Extract figure descriptions only
`query <pdf> "<question>"`	Semantic question-answering over document
`batch <dir>`	Batch process all PDFs in a directory
`estimate <pdf>`	Estimate token count and cost before parsing
`cache-stats`	Show SQLite cache statistics
`list-cache`	List all cached documents
`clear-cache <pdf>`	Clear cached pages for a document

Global Options

Option	Default	Description
`--model`	`opus`	`opus`, `sonnet`, `haiku`, or full model string
`--mode`	`native`	`native` (PDF bytes) or `image` (300 DPI PNG per page)
`--chunk-size`	`20`	Number of pages per API call
`--verbose`	off	Enable debug logging

parse / atomic Options

Option	Default	Description
`--output / -o`	auto	Output file or directory path
`--format / -f`	`json`	`json`, `markdown`, or `text`
`--pages`	all	Page range, e.g. `1-50`

Output Schema

Each parsed document returns a DocumentResult with:

title, authors, abstract, document_summary
page_results: list of PageResult per page

Each PageResult contains:

{
  "page_number": 3,
  "raw_text": "Full verbatim text...",
  "summary": "This page describes...",
  "section_headers": ["Introduction", "Related Work"],
  "keywords": ["transformer", "attention", "BERT"],
  "equations": [
    {
      "index": 0,
      "latex": "\\mathcal{L} = -\\sum_{i} y_i \\log \\hat{y}_i",
      "description": "Cross-entropy loss function",
      "inline": false
    }
  ],
  "tables": [
    {
      "index": 0,
      "markdown": "| Model | Accuracy |\n|---|---|\n| BERT | 94.2 |",
      "json_data": [{"Model": "BERT", "Accuracy": "94.2"}],
      "caption": "Table 1: Benchmark results"
    }
  ],
  "algorithms": [
    {
      "index": 0,
      "name": "Algorithm 1: Backpropagation",
      "language": "pseudocode",
      "code": "for each layer l from L to 1:\n  ...",
      "description": "Gradient descent update rule"
    }
  ],
  "figures": [
    {
      "index": 0,
      "figure_type": "line_chart",
      "description": "Training loss over 100 epochs...",
      "data_summary": "Y-axis: loss 0-2.0, X-axis: epoch 0-100...",
      "caption": "Figure 2: Training curves"
    }
  ]
}

Choosing a Mode

Scenario	Recommended Mode	Reason
Standard digital PDF	`native` (default)	Fastest, uses both text and visual layers
Scanned / photographed PDF	`image`	Text layer absent; vision handles everything
PDF with complex math	`image`	300 DPI render ensures equation clarity
Very large file (>32 MB)	`image`	Native API has 32 MB size limit per chunk
Cost-sensitive workflow	`native`	Fewer tokens consumed

Cost Estimate

Rough estimates per 100-page academic paper:

Model	Est. Tokens	Est. Cost
claude-opus-4-6	~120,000	~$3.50
claude-sonnet-4-6	~120,000	~$0.60
claude-haiku-4-5	~120,000	~$0.10

Use python pdf_atomic_parser.py estimate document.pdf for a per-document estimate.

Caching

Parsed pages are stored in ~/.cache/pdf_atomic_parser/.pdf_parser_cache.db.
Re-running on the same document skips already-parsed pages automatically.
The cache key is (document_SHA256, page_number, model, mode).

Project Structure

pdf-atomic-parser/
  pdf_atomic_parser.py    Main tool (single file, no splitting needed)
  requirements.txt        Python dependencies
  README.md               This file
  model_card.yml          Hugging Face model card
  .gitignore
  .gitattributes

Author

algorembrant

License

MIT License. See LICENSE file.

Downloads last month: -; Downloads are not tracked for this model. How to track