PDF text extraction for Claude Code agents — what to install, ranked by accuracy

Answer

If your Claude Code agent needs to parse PDFs and photo-captured documents reliably, install Surya (pip install surya-ocr + pin transformers<5.0.0). It led our 10-document real-world corpus on word accuracy (76.9%), layout preservation (7.0/10), and token F1 (93.4%) — while costing zero dollars per page, running entirely local, and handling PDFs natively. For latency-critical workflows where throughput matters more than layout, Tesseract 5 + por.traineddata runs 14× faster (1.6s p50) and installs in one brew command, trading 1.5 percentage points of word accuracy for dramatic speed and the cleanest install path. Google Document AI costs ~$0.002 per page after a 1000-page/month free tier and wins on mobile-captured receipts (94.6% vs 93.1% for Surya on a phone-photo Pix receipt) — but it places third on overall word accuracy on this corpus, diverges from top-to-bottom reading order, and carries an enterprise-auth install flow. For Brazilian corporate filings specifically, local models match or beat the paid vendor API.

Install

python -m venv .venv && source .venv/bin/activate
pip install surya-ocr 'transformers<5.0.0'
surya_ocr path/to/doc.pdf --output_dir out/

The transformers pin is required as of April 2026: pip install surya-ocr alone fails at runtime with 'SuryaDecoderConfig' object has no attribute 'pad_token_id' (issue #484).
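Once `surya_ocr` has run, the agent usually just wants one string per document. A minimal sketch for flattening Surya's JSON output — assuming the `{doc_name: [pages]}` outer shape with `text_lines[].text` per page; check your `results.json` and adjust the keys if your Surya version differs:

```python
def surya_to_text(results: dict) -> str:
    """Flatten Surya's per-page text_lines into one string.

    Assumes results.json looks like
    {doc_name: [{"text_lines": [{"text": ...}, ...]}, ...]}.
    """
    pages = []
    for doc_pages in results.values():
        for page in doc_pages:
            pages.append("\n".join(line["text"] for line in page.get("text_lines", [])))
    return "\n\n".join(pages)

# Toy payload standing in for out/<doc>/results.json
sample = {"doc": [{"text_lines": [{"text": "NOTA FISCAL"}, {"text": "Total: R$ 120,00"}]}]}
print(surya_to_text(sample))
```

In practice you would `json.load` the file Surya wrote under `--output_dir` and pass the parsed dict in.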

Tesseract 5

brew install tesseract tesseract-lang   # or apt-get equivalent
pdftoppm -r 300 doc.pdf page && tesseract page-1.ppm - -l por

Tesseract accepts only images, so PDFs need the pdftoppm render step first (300 DPI is a good default).

Google Document AI

gcloud auth login && gcloud auth application-default login
gcloud services enable documentai.googleapis.com --project YOUR_PROJECT
# Create processor at https://console.cloud.google.com/ai/document-ai → Document OCR → US
export DOCUMENT_AI_PROCESSOR_ID=<copied-id>

Requires billing account linked to the GCP project (free tier covers first 1,000 pages/month per processor).
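The synchronous call is a single REST POST. A sketch that builds the request URL and body per the public Document AI REST docs — the project, location, and processor ID here are placeholders; send the request with any HTTP client, attaching a bearer token from `gcloud auth print-access-token`:

```python
import base64
import json

def build_process_request(project: str, location: str, processor_id: str,
                          pdf_bytes: bytes) -> tuple[str, str]:
    """Build the synchronous :process call (URL + JSON body) for Document AI."""
    url = (f"https://{location}-documentai.googleapis.com/v1/"
           f"projects/{project}/locations/{location}/processors/{processor_id}:process")
    body = json.dumps({
        "rawDocument": {
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        }
    })
    return url, body

# Placeholder project/processor values for illustration only
url, body = build_process_request("my-project", "us", "abc123", b"%PDF-1.4 ...")
print(url)
```

The response JSON carries the extracted text at `document.text`.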

What it does

All three tools take PDF files or document images (JPG/PNG/TIFF/etc.) and return extracted text. Surya and Google Document AI additionally return bounding-box and layout metadata; Tesseract returns plain text only by default. The commonly needed capability for an agent is: “give me the textual content of this PDF as a string I can reason over.” All three deliver this at different accuracy/latency/cost points.

Tools / entry points

| Tool | Input | Output | Notes |
|---|---|---|---|
| surya_ocr <path> | PDF, image, or folder | JSON with text_lines[].text per page | Python CLI; --output_dir controls where JSON lands |
| tesseract <image> - -l por | Image (render PDFs with pdftoppm first) | Plain text | -l por selects the Portuguese language pack |
| Document AI REST :process | PDF or image, base64-encoded | JSON with document.text + layout | Sync limit 15 pages; split or use async batch for larger |
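The 15-page sync limit means larger PDFs must be split before calling :process. A minimal sketch of the page-range arithmetic — the actual splitting of PDF bytes would need a PDF library (e.g. pypdf), which is not shown here:

```python
def chunk_pages(total_pages: int, limit: int = 15) -> list[tuple[int, int]]:
    """Split a document into inclusive 1-based page ranges that fit the sync limit."""
    return [(start, min(start + limit - 1, total_pages))
            for start in range(1, total_pages + 1, limit)]

print(chunk_pages(38))  # → [(1, 15), (16, 30), (31, 38)]
```

Each range becomes one :process call; for very large documents, Document AI's async batch endpoint avoids the chunking entirely.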

Eval

Corpus

10 real-world documents spanning OCR stress dimensions.

Ground truth is human-reviewed transcription (LLM-drafted, human-finalized). Documents are git-ignored (contain real business information).

Scorecard

| Candidate | Word accuracy | Token F1 | Layout (1-10) | p50 latency | Install friction (1-10) | Cost / 10 docs |
|---|---|---|---|---|---|---|
| Surya | 0.769 | 0.934 | 7.0 | 22.1 s | 7 | $0 |
| Tesseract | 0.754 | 0.914 | 5.0 | 1.6 s | 3 | $0 |
| Google Document AI | 0.697 | 0.934 | 5.7 | 3.8 s | 7 | $0.069 |

Method

Reproducible command

git clone https://github.com/<your-fork>/auxiliar.ai
cd auxiliar.ai
# supply your own 10-doc corpus at tests/fixtures/ocr-corpus/sources/
# write ground truth at tests/fixtures/ocr-corpus/ground-truth/
bash scripts/ocr-walkthrough/run-tesseract.sh
bash scripts/ocr-walkthrough/run-surya.sh
DOCUMENT_AI_PROCESSOR_ID=xxx python3 scripts/ocr-walkthrough/run-google-documentai.py
python3 scripts/ocr-walkthrough/score-candidates.py
# the layout dimension requires a Claude API key:
ANTHROPIC_API_KEY=... python3 scripts/ocr-walkthrough/score-layout.py

Fit by agent

| Agent | Tesseract | Surya | Google Doc AI |
|---|---|---|---|
| Claude Code | ✓ (Bash tool) | ✓ (Bash tool, needs .venv) | ✓ (Bash + gcloud auth) |
| Claude Desktop | | | |
| Cursor | | | |
| OpenClaw | | | |

All three are agent-agnostic — the agent shells out to a local binary (Tesseract/Surya) or a REST API (Document AI). Choice doesn’t depend on which underlying LLM powers the agent; it depends on the accuracy/latency/cost trade-off.

Alternatives considered but dropped

FAQ

Q: Why is Surya slow compared to Google Document AI? A: Surya runs locally on PyTorch. First invocation downloads ~150 MB of model weights. Steady-state inference on CPU averages ~20 s/doc on a multi-page scanned PDF. On GPU (MPS on Apple Silicon, CUDA on NVIDIA), expect 3-5× speedup. Google Document AI is a remote server farm.

Q: Does “word accuracy 76.9%” mean Surya gets only 76.9% of words right? A: No. WER is an order-sensitive metric; it penalizes insertions, deletions, AND reordering. Token F1 (93.4%) is the order-insensitive accuracy — meaning Surya captures 93.4% of the correct words, but in a sequence that differs from the ground-truth order in enough places to drag WER down. For downstream agent use, token F1 is usually the relevant metric: did the OCR see the content at all? Surya, Google Document AI, and Tesseract all score 0.91-0.94 on token F1.
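The order-insensitive metric is easy to compute yourself. A sketch of token F1 as bag-of-words overlap — this is the standard definition; the walkthrough's score-candidates.py may normalize tokens differently:

```python
from collections import Counter

def token_f1(pred: str, truth: str) -> float:
    """Order-insensitive token F1: overlap of word multisets, ignoring sequence."""
    p, t = Counter(pred.lower().split()), Counter(truth.lower().split())
    overlap = sum((p & t).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)

# Same words, scrambled order: token F1 stays 1.0 while WER would be heavily penalized.
print(token_f1("total r$ 120,00 nota fiscal", "nota fiscal total r$ 120,00"))  # → 1.0
```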

Q: When should I pay for Document AI over the free local options? A: Three cases. (1) Your workflow depends on structured output (form fields, tables, bounding boxes) — Document AI’s JSON is richer than plain text. (2) You’re processing phone photos — Document AI was marginally best on the two phone-photo receipts in our corpus. (3) You’re already in GCP and want auditable enterprise auth. Otherwise: local models match or beat Document AI on word accuracy, at $0.

Q: Why do all three candidates score 0 on word accuracy for slot 07 (the boleto)? A: The ground-truth transcription for the boleto is conservative — it excludes visible-decoration text like “Aponte a câmera do seu celular para este QRCode…” All three OCR engines correctly capture that text, inflating the candidate output relative to ground truth, and WER explodes on insertions. The token F1 scores for slot 07 (0.575-0.646) are more representative of actual capture quality. If your downstream use is agent-driven extraction of boleto fields specifically, all three are usable; if you need output that matches the ground-truth shape exactly, consider post-processing to filter known-noise phrases.
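That post-processing can be a simple line filter. A sketch with one hypothetical noise phrase taken from the boleto example above — extend NOISE_PHRASES for your own corpus:

```python
NOISE_PHRASES = [
    "aponte a câmera do seu celular",  # QR-code instruction seen on the boleto
]

def strip_noise(ocr_text: str, phrases=NOISE_PHRASES) -> str:
    """Drop lines containing known decoration phrases before scoring or extraction."""
    kept = [line for line in ocr_text.splitlines()
            if not any(p in line.lower() for p in phrases)]
    return "\n".join(kept)

text = ("Beneficiário: ACME LTDA\n"
        "Aponte a câmera do seu celular para este QRCode\n"
        "Valor: R$ 250,00")
print(strip_noise(text))
```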

Q: Is Surya’s GPL-3.0 license a problem for my startup? A: Probably not for internal tooling, production services where outputs are consumed internally, or SaaS backends. Note: model weights are under a separate AI Pubs Open Rail-M license, with a <$2M funding/revenue clause for free use. For anything distributed to end users or embedded in shipped software, consult counsel.

JSON-LD

{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "Extract text from PDFs in a Claude Code agent",
  "totalTime": "PT5M",
  "step": [
    {"@type": "HowToStep", "name": "Install", "text": "pip install surya-ocr 'transformers<5.0.0'"},
    {"@type": "HowToStep", "name": "Run", "text": "surya_ocr doc.pdf --output_dir out/"}
  ]
}

Methodological caveats (honest)

Update cadence

Re-run this walkthrough when: (a) any candidate ships a major version, (b) new OCR MCPs or ClawHub skills emerge that might outrank the top 3, (c) 90 days after first publish (2026-07-20), (d) Google Document AI pricing changes.