---
language:
- en
license: mit
library_name: anthropic
tags:
- pdf
- document-parsing
- ocr
- multimodal
- equations
- table-extraction
- agent
- claude
- information-extraction
- scientific-documents
pipeline_tag: document-question-answering
model_name: PDF Atomic Parser
authors:
- algorembrant
sdk: other
sdk_version: "1.0.0"
app_file: pdf_atomic_parser.py
short_description: >
  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
  using Claude claude-opus-4-6 without hallucination. Agent-ready.
---

# PDF Atomic Parser
| | |
| |  |
| |  |
| |  |
| |  |
| |  |
| |  |

Atomically parse and understand complex PDF documents using **claude-opus-4-6** (Anthropic).
Handles equations, graphs, algorithms, unique drawings, multi-column layouts, scanned pages,
and 100+ page documents without hallucination.

Designed to be dropped into local agent pipelines as a callable module.
## What Makes This Work

Claude processes PDFs natively through Anthropic's document API. Each page is sent as a
base64-encoded PDF chunk (or rendered at 300 DPI in image mode) alongside a structured
JSON extraction prompt. The model simultaneously sees:

- The rasterized visual content (charts, graphs, drawings, handwriting)
- The underlying text layer (searchable text, equations, captions)

This dual perception eliminates the need for separate OCR, layout parsers, or equation
recognizers. The model returns fully structured JSON containing LaTeX equations, Markdown
tables, verbatim algorithm code, and semantic figure descriptions per page.
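
The request shape can be sketched as follows. This is a minimal illustration of the payload described above, not the tool's actual code: the `build_page_request` helper and the prompt text are hypothetical, and only the `document`/`base64` content-block layout follows Anthropic's documented Messages API.

```python
import base64

def build_page_request(pdf_bytes: bytes, prompt: str,
                       model: str = "claude-opus-4-6") -> dict:
    """Assemble a Messages API payload pairing a PDF chunk with an extraction prompt."""
    return {
        "model": model,
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                {   # the PDF chunk itself, base64-encoded
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.b64encode(pdf_bytes).decode("ascii"),
                    },
                },
                {   # the structured JSON extraction prompt
                    "type": "text",
                    "text": prompt,
                },
            ],
        }],
    }

# The dict would be passed to anthropic.Anthropic().messages.create(**request).
request = build_page_request(b"%PDF-1.7 ...", "Return page content as JSON.")
print(request["messages"][0]["content"][0]["type"])  # document
```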

---
## Features

| Feature | Description |
|---|---|
| Native PDF API | Sends PDF bytes directly; Claude sees both text and visuals |
| Image mode | Renders pages at 300 DPI via PyMuPDF for maximum fidelity |
| LaTeX equations | Every equation extracted as proper LaTeX |
| Table extraction | Tables as Markdown and list-of-dicts JSON |
| Algorithm extraction | Pseudocode and code blocks verbatim with language detection |
| Figure description | Semantic descriptions of charts, plots, diagrams, drawings |
| SQLite caching | Pages are cached; re-runs skip already-parsed pages |
| Chunked processing | Handles 100+ page documents by splitting into chunks |
| Multiple output formats | JSON, Markdown, plain text |
| Agent interface | `AgentPDFInterface` class for programmatic use |
| Batch processing | Process entire directories of PDFs |

---

## Requirements

- Python 3.10 or higher
- An Anthropic API key with access to `claude-opus-4-6`
- No GPU required; all inference runs through the Anthropic API

### External System Dependencies

PyMuPDF (installed via pip) requires no external system libraries on most platforms.
On some Linux systems you may need:

```bash
sudo apt-get install -y libmupdf-dev
```

On macOS:

```bash
brew install mupdf
```

On Windows, PyMuPDF ships pre-built wheels on PyPI; no additional steps are needed.

---

## Installation

```bash
git clone https://github.com/algorembrant/pdf-atomic-parser.git
cd pdf-atomic-parser

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

Set your API key:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."    # Linux / macOS
set ANTHROPIC_API_KEY=sk-ant-...         # Windows CMD
$env:ANTHROPIC_API_KEY="sk-ant-..."      # Windows PowerShell
```

---

## Quick Start

### Parse a PDF

```bash
python pdf_atomic_parser.py parse document.pdf
```

Outputs `document_parsed.json` in the current directory.

### Full Atomic Extraction (JSON + Markdown + Text)

```bash
python pdf_atomic_parser.py atomic document.pdf --output ./results/
```

### Ask a Question

```bash
python pdf_atomic_parser.py query document.pdf "What is the main loss function?"
```

### Extract Only Equations

```bash
python pdf_atomic_parser.py extract-equations document.pdf
```

### Use in an Agent Pipeline

```python
from pdf_atomic_parser import AgentPDFInterface

agent = AgentPDFInterface(model="opus")

# Full structured parse
result = agent.parse("paper.pdf")

# Just equations as a list of dicts
equations = agent.get_equations("paper.pdf")
for eq in equations:
    print(f"Page {eq['page']}: {eq['latex']}")

# Just tables
tables = agent.get_tables("paper.pdf")

# Semantic query
answer = agent.ask("paper.pdf", "What datasets were used for evaluation?")
print(answer)
```

---

## Usage Reference

### Command Overview

| Command | Purpose |
|---|---|
| `parse <pdf>` | Parse entire PDF to JSON/Markdown/text |
| `atomic <pdf>` | Full extraction to output directory (all formats) |
| `extract-equations <pdf>` | Extract LaTeX equations only |
| `extract-tables <pdf>` | Extract tables only |
| `extract-algorithms <pdf>` | Extract algorithms and code blocks only |
| `extract-figures <pdf>` | Extract figure descriptions only |
| `query <pdf> "<question>"` | Semantic question-answering over the document |
| `batch <dir>` | Batch-process all PDFs in a directory |
| `estimate <pdf>` | Estimate token count and cost before parsing |
| `cache-stats` | Show SQLite cache statistics |
| `list-cache` | List all cached documents |
| `clear-cache <pdf>` | Clear cached pages for a document |

### Global Options

| Option | Default | Description |
|---|---|---|
| `--model` | `opus` | `opus`, `sonnet`, `haiku`, or a full model string |
| `--mode` | `native` | `native` (PDF bytes) or `image` (300 DPI PNG per page) |
| `--chunk-size` | `20` | Number of pages per API call |
| `--verbose` | off | Enable debug logging |
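
The `--chunk-size` behavior can be sketched with a small helper (hypothetical, not the tool's internals): pages are grouped into consecutive runs of at most `chunk_size`, and each run becomes one API call.

```python
def chunk_pages(page_numbers: list[int], chunk_size: int = 20) -> list[list[int]]:
    """Split a page list into consecutive chunks of at most chunk_size pages."""
    return [page_numbers[i:i + chunk_size]
            for i in range(0, len(page_numbers), chunk_size)]

# A 45-page document with the default chunk size yields three API calls
chunks = chunk_pages(list(range(1, 46)), chunk_size=20)
print([(c[0], c[-1]) for c in chunks])  # [(1, 20), (21, 40), (41, 45)]
```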

### parse / atomic Options

| Option | Default | Description |
|---|---|---|
| `--output / -o` | auto | Output file or directory path |
| `--format / -f` | `json` | `json`, `markdown`, or `text` |
| `--pages` | all | Page range, e.g. `1-50` |

---
## Output Schema

Each parsed document returns a `DocumentResult` with:

- `title`, `authors`, `abstract`, `document_summary`
- `page_results`: a list of `PageResult`, one per page

Each `PageResult` contains:
```json
{
  "page_number": 3,
  "raw_text": "Full verbatim text...",
  "summary": "This page describes...",
  "section_headers": ["Introduction", "Related Work"],
  "keywords": ["transformer", "attention", "BERT"],
  "equations": [
    {
      "index": 0,
      "latex": "\\mathcal{L} = -\\sum_{i} y_i \\log \\hat{y}_i",
      "description": "Cross-entropy loss function",
      "inline": false
    }
  ],
  "tables": [
    {
      "index": 0,
      "markdown": "| Model | Accuracy |\n|---|---|\n| BERT | 94.2 |",
      "json_data": [{"Model": "BERT", "Accuracy": "94.2"}],
      "caption": "Table 1: Benchmark results"
    }
  ],
  "algorithms": [
    {
      "index": 0,
      "name": "Algorithm 1: Backpropagation",
      "language": "pseudocode",
      "code": "for each layer l from L to 1:\n    ...",
      "description": "Gradient descent update rule"
    }
  ],
  "figures": [
    {
      "index": 0,
      "figure_type": "line_chart",
      "description": "Training loss over 100 epochs...",
      "data_summary": "Y-axis: loss 0-2.0, X-axis: epoch 0-100...",
      "caption": "Figure 2: Training curves"
    }
  ]
}
```
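
Downstream code can walk this schema with plain dict access. A minimal sketch, assuming a parsed result already loaded from the JSON output (the `result` literal below is abbreviated from the example above):

```python
result = {
    "page_results": [
        {
            "page_number": 3,
            "equations": [{"index": 0,
                           "latex": "\\mathcal{L} = -\\sum_{i} y_i \\log \\hat{y}_i",
                           "inline": False}],
            "tables": [{"index": 0,
                        "json_data": [{"Model": "BERT", "Accuracy": "94.2"}]}],
        }
    ]
}

# Collect every display (non-inline) equation with its page number
display_eqs = [
    (page["page_number"], eq["latex"])
    for page in result["page_results"]
    for eq in page["equations"]
    if not eq["inline"]
]
print(display_eqs[0][0])  # 3
```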

---

## Choosing a Mode

| Scenario | Recommended Mode | Reason |
|---|---|---|
| Standard digital PDF | `native` (default) | Fastest; uses both text and visual layers |
| Scanned / photographed PDF | `image` | Text layer absent; vision handles everything |
| PDF with complex math | `image` | 300 DPI render ensures equation clarity |
| Very large file (>32 MB) | `image` | Native API has a 32 MB size limit per chunk |
| Cost-sensitive workflow | `native` | Fewer tokens consumed |

---

## Cost Estimate

Rough estimates per 100-page academic paper:

| Model | Est. Tokens | Est. Cost |
|---|---|---|
| claude-opus-4-6 | ~120,000 | ~$3.50 |
| claude-sonnet-4-6 | ~120,000 | ~$0.60 |
| claude-haiku-4-5 | ~120,000 | ~$0.10 |

Use `python pdf_atomic_parser.py estimate document.pdf` for a per-document estimate.
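
The arithmetic behind the table is simply token count times a per-token rate. A sketch (the blended rate below is a hypothetical number back-solved from the opus row above, not Anthropic's published pricing; real costs split input and output tokens):

```python
def estimate_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Blended cost estimate: tokens / 1e6 * per-million-token rate (USD)."""
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical blended rate chosen to reproduce the table's opus row
print(round(estimate_cost(120_000, 29.17), 2))  # 3.5
```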

---

## Caching

Parsed pages are stored in `~/.cache/pdf_atomic_parser/.pdf_parser_cache.db`.
Re-running on the same document skips already-parsed pages automatically.
The cache key is `(document_SHA256, page_number, model, mode)`.
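
The lookup can be sketched like this (hypothetical table layout; the actual schema inside `.pdf_parser_cache.db` may differ). Because the key combines the document's SHA-256 with the page number, model, and mode, changing any of the four re-parses the page:

```python
import hashlib
import sqlite3

def document_sha256(pdf_bytes: bytes) -> str:
    """Content-addressed document identity, so renamed files still hit the cache."""
    return hashlib.sha256(pdf_bytes).hexdigest()

conn = sqlite3.connect(":memory:")  # the real tool uses a file under ~/.cache
conn.execute("""CREATE TABLE pages (
    doc_sha256 TEXT, page_number INTEGER, model TEXT, mode TEXT, result_json TEXT,
    PRIMARY KEY (doc_sha256, page_number, model, mode))""")

key = (document_sha256(b"%PDF-1.7 ..."), 1, "opus", "native")
conn.execute("INSERT INTO pages VALUES (?, ?, ?, ?, ?)",
             key + ('{"summary": "..."}',))

# A re-run looks up the same key and skips the API call on a hit
row = conn.execute(
    "SELECT result_json FROM pages"
    " WHERE doc_sha256=? AND page_number=? AND model=? AND mode=?",
    key).fetchone()
print(row is not None)  # True
```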

---

## Project Structure

```
pdf-atomic-parser/
  pdf_atomic_parser.py    Main tool (single file, no splitting needed)
  requirements.txt        Python dependencies
  README.md               This file
  model_card.yml          Hugging Face model card
  .gitignore
  .gitattributes
```

---

## Author

**algorembrant**

---

## License

MIT License. See the LICENSE file.