PDF text extraction for Claude Code agents — what to install, ranked by accuracy

Answer

If your Claude Code agent needs to parse PDFs and photo-captured documents reliably, install Surya (pip install surya-ocr + pin transformers<5.0.0). It led our 10-document real-world corpus on word accuracy (76.9%), layout preservation (7.0/10), and token F1 (93.4%) — while costing zero dollars per page, running entirely local, and handling PDFs natively. For latency-critical workflows where throughput matters more than layout, Tesseract 5 + por.traineddata runs 14× faster (1.6s p50) and installs in one brew command, trading 1.5 percentage points of word accuracy for dramatic speed and the cleanest install path. Google Document AI costs ~$0.002 per page after a 1000-page/month free tier and wins on mobile-captured receipts (94.6% vs 93.1% for Surya on a phone-photo Pix receipt) — but it places third on overall word accuracy on this corpus, diverges from top-to-bottom reading order, and carries an enterprise-auth install flow. For Brazilian corporate filings specifically, local models match or beat the paid vendor API.

Install

python -m venv .venv && source .venv/bin/activate
pip install surya-ocr 'transformers<5.0.0'
surya_ocr path/to/doc.pdf --output_dir out/

The transformers pin is required as of April 2026: pip install surya-ocr alone fails at runtime with 'SuryaDecoderConfig' object has no attribute 'pad_token_id' (issue #484).
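Once `surya_ocr` has run, the agent usually just wants one string per document. A minimal sketch for flattening Surya's JSON output — assuming the `{doc_name: [pages]}` outer shape with `text_lines[].text` per page; check your `results.json` and adjust the keys if your Surya version differs:

```python
def surya_to_text(results: dict) -> str:
    """Flatten Surya's per-page text_lines into one string.

    Assumes results.json looks like
    {doc_name: [{"text_lines": [{"text": ...}, ...]}, ...]}.
    """
    pages = []
    for doc_pages in results.values():
        for page in doc_pages:
            pages.append("\n".join(line["text"] for line in page.get("text_lines", [])))
    return "\n\n".join(pages)

# Toy payload standing in for out/<doc>/results.json
sample = {"doc": [{"text_lines": [{"text": "NOTA FISCAL"}, {"text": "Total: R$ 120,00"}]}]}
print(surya_to_text(sample))
```

In practice you would `json.load` the file Surya wrote under `--output_dir` and pass the parsed dict in.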

Tesseract 5

brew install tesseract tesseract-lang   # or apt-get equivalent
pdftoppm -r 300 doc.pdf page && tesseract page-1.ppm - -l por

Tesseract accepts only images, so PDFs need the pdftoppm render step first (300 DPI is a good default).

Google Document AI

gcloud auth login && gcloud auth application-default login
gcloud services enable documentai.googleapis.com --project YOUR_PROJECT
# Create processor at https://console.cloud.google.com/ai/document-ai → Document OCR → US
export DOCUMENT_AI_PROCESSOR_ID=<copied-id>

Requires billing account linked to the GCP project (free tier covers first 1,000 pages/month per processor).
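The synchronous call is a single REST POST. A sketch that builds the request URL and body per the public Document AI REST docs — the project, location, and processor ID here are placeholders; send the request with any HTTP client, attaching a bearer token from `gcloud auth print-access-token`:

```python
import base64
import json

def build_process_request(project: str, location: str, processor_id: str,
                          pdf_bytes: bytes) -> tuple[str, str]:
    """Build the synchronous :process call (URL + JSON body) for Document AI."""
    url = (f"https://{location}-documentai.googleapis.com/v1/"
           f"projects/{project}/locations/{location}/processors/{processor_id}:process")
    body = json.dumps({
        "rawDocument": {
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        }
    })
    return url, body

# Placeholder project/processor values for illustration only
url, body = build_process_request("my-project", "us", "abc123", b"%PDF-1.4 ...")
print(url)
```

The response JSON carries the extracted text at `document.text`.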

What it does

All three tools take PDF files or document images (JPG/PNG/TIFF/etc.) and return extracted text. Surya and Google Document AI additionally return bounding-box and layout metadata; Tesseract returns plain text only by default. The commonly needed capability for an agent is: “give me the textual content of this PDF as a string I can reason over.” All three deliver this at different accuracy/latency/cost points.

Tools / entry points

| Tool | Input | Output | Notes |
|---|---|---|---|
| surya_ocr <path> | PDF, image, or folder | JSON with text_lines[].text per page | Python CLI; --output_dir controls where JSON lands |
| tesseract <image> - -l por | Image (render PDFs with pdftoppm first) | Plain text | -l por selects the Portuguese language pack |
| Document AI REST :process | PDF or image, base64-encoded | JSON with document.text + layout | Sync limit 15 pages; split or use async batch for larger |
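The 15-page sync limit means larger PDFs must be split before calling :process. A minimal sketch of the page-range arithmetic — the actual splitting of PDF bytes would need a PDF library (e.g. pypdf), which is not shown here:

```python
def chunk_pages(total_pages: int, limit: int = 15) -> list[tuple[int, int]]:
    """Split a document into inclusive 1-based page ranges that fit the sync limit."""
    return [(start, min(start + limit - 1, total_pages))
            for start in range(1, total_pages + 1, limit)]

print(chunk_pages(38))  # → [(1, 15), (16, 30), (31, 38)]
```

Each range becomes one :process call; for very large documents, Document AI's async batch endpoint avoids the chunking entirely.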

Eval

Corpus

10 real-world documents spanning OCR stress dimensions.

Ground truth is human-reviewed transcription (LLM-drafted, human-finalized). Documents are git-ignored (contain real business information).

Scorecard

| Candidate | Word accuracy | Token F1 | Layout (1-10) | p50 latency | Install friction (1-10) | Cost / 10 docs |
|---|---|---|---|---|---|---|
| Surya | 0.769 | 0.934 | 7.0 | 22.1 s | 7 | $0 |
| Tesseract | 0.754 | 0.914 | 5.0 | 1.6 s | 3 | $0 |
| Google Document AI | 0.697 | 0.934 | 5.7 | 3.8 s | 7 | $0.069 |

Method

Reproducible command

git clone https://github.com/<your-fork>/auxiliar.ai
cd auxiliar.ai
# supply your own 10-doc corpus at tests/fixtures/ocr-corpus/sources/
# write ground truth at tests/fixtures/ocr-corpus/ground-truth/
bash scripts/ocr-walkthrough/run-tesseract.sh
bash scripts/ocr-walkthrough/run-surya.sh
DOCUMENT_AI_PROCESSOR_ID=xxx python3 scripts/ocr-walkthrough/run-google-documentai.py
python3 scripts/ocr-walkthrough/score-candidates.py
# the layout dimension requires a Claude API key:
ANTHROPIC_API_KEY=... python3 scripts/ocr-walkthrough/score-layout.py

Fit by agent

| Agent | Tesseract | Surya | Google Doc AI |
|---|---|---|---|
| Claude Code | ✓ (Bash tool) | ✓ (Bash tool, needs .venv) | ✓ (Bash + gcloud auth) |
| Claude Desktop | | | |
| Cursor | | | |
| OpenClaw | | | |

All three are agent-agnostic — the agent shells out to a local binary (Tesseract/Surya) or a REST API (Document AI). Choice doesn’t depend on which underlying LLM powers the agent; it depends on the accuracy/latency/cost trade-off.

Alternatives considered but dropped

FAQ

Q: Why is Surya slow compared to Google Document AI? A: Surya runs locally on PyTorch. First invocation downloads ~150 MB of model weights. Steady-state inference on CPU averages ~20 s/doc on a multi-page scanned PDF. On GPU (MPS on Apple Silicon, CUDA on NVIDIA), expect 3-5× speedup. Google Document AI is a remote server farm.

Q: Does “word accuracy 76.9%” mean Surya gets only 76.9% of words right? A: No. WER is an order-sensitive metric; it penalizes insertions, deletions, AND reordering. Token F1 (93.4%) is the order-insensitive accuracy — meaning Surya captures 93.4% of the correct words, but in a sequence that differs from the ground-truth order in enough places to drag WER down. For downstream agent use, token F1 is usually the relevant metric: did the OCR see the content at all? Surya, Google Document AI, and Tesseract all score 0.91-0.94 on token F1.
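The order-insensitive metric is easy to compute yourself. A sketch of token F1 as bag-of-words overlap — this is the standard definition; the walkthrough's score-candidates.py may normalize tokens differently:

```python
from collections import Counter

def token_f1(pred: str, truth: str) -> float:
    """Order-insensitive token F1: overlap of word multisets, ignoring sequence."""
    p, t = Counter(pred.lower().split()), Counter(truth.lower().split())
    overlap = sum((p & t).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)

# Same words, scrambled order: token F1 stays 1.0 while WER would be heavily penalized.
print(token_f1("total r$ 120,00 nota fiscal", "nota fiscal total r$ 120,00"))  # → 1.0
```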

Q: When should I pay for Document AI over the free local options? A: Three cases. (1) Your workflow depends on structured output (form fields, tables, bounding boxes) — Document AI’s JSON is richer than plain text. (2) You’re processing phone photos — Document AI was marginally best on the two phone-photo receipts in our corpus. (3) You’re already in GCP and want auditable enterprise auth. Otherwise: local models match or beat Document AI on word accuracy, at $0.

Q: Why do all three candidates score 0 on word accuracy for slot 07 (the boleto)? A: The ground-truth transcription for the boleto is conservative — it excludes visible-decoration text like “Aponte a câmera do seu celular para este QRCode…” All three OCR engines correctly capture that text, inflating the candidate output relative to ground truth, and WER explodes on insertions. The token F1 scores for slot 07 (0.575-0.646) are more representative of actual capture quality. If your downstream use is agent-driven extraction of boleto fields specifically, all three are usable; if you need output that matches the ground-truth shape exactly, consider post-processing to filter known-noise phrases.
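That post-processing can be a simple line filter. A sketch with one hypothetical noise phrase taken from the boleto example above — extend NOISE_PHRASES for your own corpus:

```python
NOISE_PHRASES = [
    "aponte a câmera do seu celular",  # QR-code instruction seen on the boleto
]

def strip_noise(ocr_text: str, phrases=NOISE_PHRASES) -> str:
    """Drop lines containing known decoration phrases before scoring or extraction."""
    kept = [line for line in ocr_text.splitlines()
            if not any(p in line.lower() for p in phrases)]
    return "\n".join(kept)

text = ("Beneficiário: ACME LTDA\n"
        "Aponte a câmera do seu celular para este QRCode\n"
        "Valor: R$ 250,00")
print(strip_noise(text))
```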

Q: Is Surya’s GPL-3.0 license a problem for my startup? A: Probably not for internal tooling, production services where outputs are consumed internally, or SaaS backends. Note: model weights are under a separate AI Pubs Open Rail-M license, with a <$2M funding/revenue clause for free use. For anything distributed to end users or embedded in shipped software, consult counsel.

JSON-LD

{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "Extract text from PDFs in a Claude Code agent",
  "totalTime": "PT5M",
  "step": [
    {"@type": "HowToStep", "name": "Install", "text": "pip install surya-ocr 'transformers<5.0.0'"},
    {"@type": "HowToStep", "name": "Run", "text": "surya_ocr doc.pdf --output_dir out/"}
  ]
}

Methodological caveats (honest)

Update cadence

Re-run this walkthrough when: (a) any candidate ships a major version, (b) new OCR MCPs or ClawHub skills emerge that might outrank the top 3, (c) 90 days after first publish (2026-07-20), (d) Google Document AI pricing changes.