NFS-e field extraction for agents — ranked by field accuracy on Brazilian São Paulo invoices
Answer
When your Claude Code / OpenClaw agent is asked to extract structured fields from a Brazilian NFS-e (Nota Fiscal de Serviços Eletrônica) — for bookkeeping, reimbursement batching, accountant handoff, or tax reconciliation — install auxiliar-nfs-e paired with Surya OCR. On our 2-doc São Paulo NFS-e corpus it achieved 100% field accuracy (41/41 fields): numero_nota, codigo_verificacao, data_emissao, chave_acesso, prestador CNPJ/IM/nome/endereço, tomador CNPJ/IM/nome, valor_servico, codigo_servico, discriminação, RPS reference, and more. Surya’s OCR preserves line-level field ordering cleanly, which our parser’s label-based extractor relies on. For budget-sensitive workflows, Google Document AI pairs with the same parser at 88% field accuracy and a ~$0.002/page cost after the 1,000-page/month free tier. Tesseract 5 is the fastest option but drops to 63% field accuracy because its default output reorders the retention/ISS table. None of the raw OCR tools solve NFS-e extraction on their own — you always need a parser on top.
Install
Primary path — auxiliar-nfs-e + Surya (recommended)
# 1. OCR engine
python -m venv .venv && source .venv/bin/activate
pip install surya-ocr 'transformers<5.0.0'
# 2. The parser (from the auxiliar.ai repo — PyPI publish pending)
git clone https://github.com/Tlalvarez/Auxiliar-ai.git
cd Auxiliar-ai/scripts/walkthroughs/nfs-e-extraction
# 3. Extract fields
surya_ocr path/to/nfse.pdf --output_dir /tmp/ocr/
python -c "
import json
from parser import parse
with open('/tmp/ocr/nfse/nfse.txt') as f:
text = f.read()
result = parse(text)
print(json.dumps(result.to_dict(), ensure_ascii=False, indent=2))
"
Alternative path — auxiliar-nfs-e + Google Document AI
gcloud auth application-default login
gcloud services enable documentai.googleapis.com --project YOUR_PROJECT
export DOCUMENT_AI_PROCESSOR_ID=<copied-id>
# Run Document AI to get text, then feed text into auxiliar-nfs-e parser
Alternative path — auxiliar-nfs-e + Tesseract (fast but lower accuracy)
brew install tesseract tesseract-lang poppler
pdftoppm -r 300 nfse.pdf page
tesseract page-1.ppm - -l por > text.txt
# Feed text.txt to auxiliar-nfs-e parser
What it does
The parser takes the text output of any OCR engine and extracts São Paulo NFS-e fields into a typed Python dataclass (which serializes to JSON). Covered fields:
| Section | Fields |
|---|---|
| Header | numero_nota, codigo_verificacao, data_emissao, hora_emissao, municipio_emissor, chave_acesso |
| RPS reference | rps_numero, rps_serie, rps_data (when applicable) |
| Prestador (service provider) | cpf_cnpj, inscricao_municipal, nome, endereco, cep, municipio, uf, email |
| Tomador (service recipient) | cpf_cnpj, inscricao_municipal, nome, endereco, cep, municipio, uf, email |
| Intermediário | Same fields as prestador/tomador |
| Serviço | discriminacao, valor_servico, codigo_servico, descricao_servico |
| Retenções federais | INSS, IRRF, CSLL, COFINS, PIS/PASEP, IPI |
| ISS municipal | valor_deducoes, base_calculo, aliquota, valor_iss, credito_nfp |
| Footer | outras_informacoes, missing_fields (for audit), warnings |
The parser validates CNPJs with the standard check-digit algorithm (exposed as validate_cnpj(cnpj)).
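The check-digit math itself is small enough to sketch. This is the public Receita Federal algorithm, not necessarily the repo's exact code (the all-same-digit rejection is a common convention, since e.g. 00000000000000 passes the arithmetic):

```python
import re

def validate_cnpj(cnpj: str) -> bool:
    """Standard CNPJ check-digit validation (digits 13 and 14)."""
    digits = re.sub(r"\D", "", cnpj)  # accept formatted or digits-only input
    if len(digits) != 14 or digits == digits[0] * 14:
        return False  # wrong length, or trivial all-same-digit case
    nums = [int(d) for d in digits]
    weights = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    for idx in (12, 13):  # positions of the two check digits
        total = sum(n * w for n, w in zip(nums, weights))
        check = 11 - (total % 11)
        if check >= 10:
            check = 0
        if nums[idx] != check:
            return False
        weights = [6] + weights  # second pass weighs the first 13 digits
    return True
```

A single OCR-flipped digit fails the check, which is what makes this useful as a pre-ledger gate.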
Tools / entry points
| Entry point | Input | Output |
|---|---|---|
| parser.parse(text: str) -> NfseResult | OCR’d NFS-e text | Typed dataclass with 40+ fields |
| parser.validate_cnpj(cnpj: str) -> bool | CNPJ string (formatted or digits-only) | True if check digits valid |
| evaluate.py | — | Runs parser on 2-doc corpus, writes eval-results.json |
Eval
Method: auxiliar-nfs-e-field-accuracy-v1. Ran parser on Surya, Tesseract, and Google Document AI OCR output for both NFS-e corpus documents. Field accuracy = (correctly-extracted fields) / (total expected fields). Expected values derived from ground-truth (the source PDF’s embedded text layer).
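The metric reduces to a one-liner. A minimal sketch of what evaluate.py computes per document (hypothetical helper name; exact-string match, as stated):

```python
def field_accuracy(extracted: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    correct = sum(1 for key, value in expected.items()
                  if extracted.get(key) == value)
    return correct / len(expected)
```

Missing fields and wrong values score identically: both fail the exact-string comparison against ground truth.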
Corpus: 2 São Paulo NFS-e invoices from a private business archive — gitignored at source (real company data). Aggregate metrics only are published below. Doc shapes:
- 03-nfse-second-invoice.pdf — services invoice, Simples Nacional prestador, all-zero retentions
- 08-nfse-structured-invoice.pdf — services invoice, includes RPS reference (RPS N° emitido em…)
Scorecard
| Candidate | Doc 03 | Doc 08 | Combined | Notes |
|---|---|---|---|---|
| Surya + auxiliar-nfs-e | 19/19 (100%) | 22/22 (100%) | 41/41 (100%) | Line ordering preserved; retention table parsed cleanly |
| Google Doc AI + auxiliar-nfs-e | 18/19 (94.7%) | 18/22 (81.8%) | 36/41 (87.8%) | Lost valor_servico on doc 03; RPS fields on doc 08 |
| Tesseract + auxiliar-nfs-e | 12/19 (63.2%) | 14/22 (63.6%) | 26/41 (63.4%) | Retention table reorders; ISS fields off by position |
Reproducible command
cd scripts/walkthroughs/nfs-e-extraction
python3 evaluate.py
Writes full per-field results to eval-results.json. Fixtures (real business PDFs) are gitignored; ground-truth files and the parser itself are committed.
Fit by agent
| Agent | Surya + parser | Google Doc AI + parser | Tesseract + parser |
|---|---|---|---|
| Claude Code | ✓ | ✓ | ✓ |
| Claude Desktop | ✓ | ✓ | ✓ |
| Cursor | ✓ | ✓ | ✓ |
| OpenClaw | ✓ | ✓ | ✓ |
All three pipelines are plain shell + Python, so any agent with command execution can drive them. OpenClaw agents can install the parser via git clone plus pip install surya-ocr locally, or pair with Google Document AI through a service account.
Alternatives considered
| Alternative | Why dropped |
|---|---|
| Pure OCR without a parser (Surya, Tesseract, Google Doc AI alone) | Returns raw text; agents then have to reimplement NFS-e field regex logic per project. The parser is the value. |
| LLM field extraction (prompt Claude/GPT to extract fields from NFS-e text) | Non-deterministic, slower, more expensive per page, and requires additional verification step. For a regulated document with fixed structure, regex + position-based extraction is correct. |
| Generic invoice extractors (pdf-reader-mcp, openocr-skill, opendataloader-pdf on ClawHub) | None handle NFS-e’s specific structure (SP retention table, chave de acesso format, RPS reference). They solve “read PDF text”; they don’t solve “extract CNPJ do prestador”. |
| PyPI nfce-xml / nfepy packages | These parse the official NFS-e XML format (when you have API access). They don’t handle PDF-first workflows, which is what agents receive from users. |
| Mistral OCR 3 (via everaldo/mcp-mistral-ocr) | Strong on paper (88.9% handwriting benchmark); deferred because no MISTRAL_API_KEY was available during this eval. |
FAQ
Q: Does this work for NFS-e from municipalities other than São Paulo?
A: Not yet. Each Brazilian municipality has a slightly different NFS-e layout (field labels, section headers, retention table format). The v0.1 parser is hand-tuned for São Paulo’s form based on the 2-doc corpus. For other municipalities (Rio, Curitiba, Belo Horizonte, etc.), the parser needs an additional layout adapter — contributions welcome. Until then, agents can still extract generic fields (CNPJ, dates, values via regex) but won’t get the structured ISS/retention fields.
Q: Why is Tesseract so much worse at field extraction than at raw text extraction?
A: Tesseract outputs text in a top-to-bottom reading order that doesn’t preserve the NFS-e form’s two-column retention table structure. Labels end up separated from values. Our parser’s label-based extractor falls back to positional heuristics for retention fields, which Tesseract’s reordering breaks. Surya and Google Document AI preserve the label-value proximity, so our parser hits 100% and 88% respectively.
Q: How does this compare to hitting the São Paulo Prefeitura XML API directly?
A: The XML API is authoritative but requires: (a) the tomador or prestador’s credentials, (b) the invoice’s chave de acesso or number, (c) a non-trivial auth flow. When agents receive a PDF attachment in a bookkeeping workflow, the XML API isn’t usable — you’d have to re-request the XML per invoice. Our PDF-first parser lets agents work from the document the user actually shared.
Q: Does the parser validate the CNPJ check digits?
A: Yes. parser.validate_cnpj(cnpj) runs the standard Receita Federal CNPJ check-digit algorithm. Useful for flagging OCR errors (typo’d digits) before writing to a ledger.
Q: Can I use this inside OpenClaw’s Skill system?
A: Yes. A published ClawHub skill, nfs-e-parser, directs agents to install this parser plus Surya and call parse(). Install it via openclaw skills install tlalvarez/nfs-e-parser.
Methodological caveats
- Corpus is 2 documents from the same issuer municipality (São Paulo). Field-accuracy claims apply to São Paulo NFS-e specifically; extrapolation to other municipalities requires testing against their layouts.
- Ground truth is the PDF’s embedded text layer (pdftotext), which is authoritative for native-text NFS-e but wouldn’t apply to scanned images of printed NFS-e.
- Field accuracy metric counts exact-string match per field. Fuzzy matches (e.g., minor whitespace differences in descricao_servico) would inflate accuracy slightly; we use exact-match for zero-error bookkeeping reliability.
- Retention values (all zeros in our corpus because both prestadores are Simples Nacional) are extracted by position. Non-zero retentions haven’t been end-to-end tested against real documents; untested edge cases may include parsed-value overlap.
- CNPJ validation uses the standard check-digit algorithm but doesn’t query Receita Federal for active-status; a valid check-digit CNPJ can still be an inactive company.
Update cadence
Re-run this walkthrough when: (a) any of the three OCR candidates ships a major version, (b) the São Paulo Prefeitura changes the NFS-e form layout (watched via the scanner module’s BR government feeds), (c) 90 days after first publish (2026-07-23), (d) new NFS-e parser skills emerge on ClawHub / PyPI / npm that might outrank auxiliar-nfs-e.