NFS-e field extraction for agents — ranked by field accuracy on Brazilian São Paulo invoices

Structured-field NFS-e parser for Brazilian agents. 100% field accuracy on São Paulo invoices when paired with Surya OCR (41/41 fields across a 2-doc corpus). Also scored: Google Document AI (88%), Tesseract (63%). Outputs typed JSON with prestador, tomador, CNPJs, valor, ISS, código de serviço, and RPS fields.

Top pick
auxiliar-nfs-e + Surya
Last verified
Eval method
auxiliar-nfs-e-field-accuracy-v1 (2-doc SP corpus)
Eval score
10/10
Categories
nfs-e, brazilian-invoice, structured-extraction, bookkeeping, agent-tools
Works with
claude-code, claude-desktop, cursor, openclaw

NFS-e field extraction for agents — ranked by field accuracy on Brazilian São Paulo invoices

Answer

When your Claude Code / OpenClaw agent is asked to extract structured fields from a Brazilian NFS-e (Nota Fiscal de Serviços Eletrônica) — for bookkeeping, reimbursement batching, accountant handoff, or tax reconciliation — install auxiliar-nfs-e paired with Surya OCR. On our 2-doc São Paulo NFS-e corpus it achieved 100% field accuracy (41/41 fields): numero_nota, codigo_verificacao, data_emissao, chave_acesso, prestador CNPJ/IM/nome/endereço, tomador CNPJ/IM/nome, valor_servico, codigo_servico, descricao, the RPS reference, and more. Surya’s OCR preserves line-level field ordering cleanly, which our parser’s label-based extractor relies on. For budget-sensitive workflows, Google Document AI pairs with the same parser at 88% field accuracy and a ~$0.002/page cost after the 1,000-page/month free tier. Tesseract 5 is the fastest option but drops to 63% field accuracy because its default output reorders the retention/ISS table. None of the raw OCR tools solve NFS-e extraction on their own — you always need a parser on top.

Install

# 1. OCR engine
python -m venv .venv && source .venv/bin/activate
pip install surya-ocr 'transformers<5.0.0'

# 2. The parser (from the auxiliar.ai repo — PyPI publish pending)
git clone https://github.com/Tlalvarez/Auxiliar-ai.git
cd Auxiliar-ai/scripts/walkthroughs/nfs-e-extraction

# 3. Extract fields
surya_ocr path/to/nfse.pdf --output_dir /tmp/ocr/
python -c "
import json
from parser import parse
with open('/tmp/ocr/nfse/nfse.txt') as f:
    text = f.read()
result = parse(text)
print(json.dumps(result.to_dict(), ensure_ascii=False, indent=2))
"

Alternative path — auxiliar-nfs-e + Google Document AI

gcloud auth application-default login
gcloud services enable documentai.googleapis.com --project YOUR_PROJECT
export DOCUMENT_AI_PROCESSOR_ID=<copied-id>
# Run Document AI to get text, then feed text into auxiliar-nfs-e parser

Alternative path — auxiliar-nfs-e + Tesseract (fast but lower accuracy)

brew install tesseract tesseract-lang poppler
pdftoppm -r 300 nfse.pdf page
tesseract page-1.ppm - -l por > text.txt
# Feed text.txt to auxiliar-nfs-e parser

What it does

The parser takes the text output of any OCR engine and extracts São Paulo NFS-e fields into a typed Python dataclass (which serializes to JSON). Covered fields:

| Section | Fields |
|---|---|
| Header | numero_nota, codigo_verificacao, data_emissao, hora_emissao, municipio_emissor, chave_acesso |
| RPS reference | rps_numero, rps_serie, rps_data (when applicable) |
| Prestador (service provider) | cpf_cnpj, inscricao_municipal, nome, endereco, cep, municipio, uf, email |
| Tomador (service recipient) | cpf_cnpj, inscricao_municipal, nome, endereco, cep, municipio, uf, email |
| Intermediário | Same fields as prestador/tomador |
| Serviço | discriminacao, valor_servico, codigo_servico, descricao_servico |
| Retenções federais | INSS, IRRF, CSLL, COFINS, PIS/PASEP, IPI |
| ISS municipal | valor_deducoes, base_calculo, aliquota, valor_iss, credito_nfp |
| Footer | outras_informacoes, missing_fields (for audit), warnings |
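For orientation, the result type can be sketched as a plain dataclass. This is a hypothetical subset for illustration; the shipped NfseResult in parser.py covers 40+ fields, and its exact field names may differ from this sketch.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

# Hypothetical subset of the result dataclass; field names follow the table
# above, but the real NfseResult in parser.py covers 40+ fields.
@dataclass
class NfseResult:
    numero_nota: Optional[str] = None
    codigo_verificacao: Optional[str] = None
    prestador_cpf_cnpj: Optional[str] = None
    valor_servico: Optional[str] = None
    missing_fields: list = field(default_factory=list)  # audit trail for absent fields

    def to_dict(self) -> dict:
        """Serialize to a plain dict, ready for json.dumps."""
        return asdict(self)
```

Because it is a dataclass, agents get a stable, typed shape to write into ledgers rather than free-form OCR text.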

The parser also validates CNPJs with the standard Receita Federal check-digit algorithm (exposed as validate_cnpj(cnpj)).
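The check-digit math is small enough to restate. Below is a minimal reimplementation of the standard algorithm for illustration; the packaged validate_cnpj may differ in details such as formatting tolerance.

```python
import re

def validate_cnpj(cnpj: str) -> bool:
    """Check the two CNPJ check digits (minimal sketch of the standard algorithm)."""
    digits = re.sub(r"\D", "", cnpj)  # accept "11.222.333/0001-81" or digits-only
    if len(digits) != 14 or digits == digits[0] * 14:
        return False

    def check_digit(partial: str) -> int:
        # Weights run 2..9 from the rightmost digit leftwards, then repeat.
        weights = list(range(2, 10)) * 2
        total = sum(int(d) * w for d, w in zip(reversed(partial), weights))
        remainder = total % 11
        return 0 if remainder < 2 else 11 - remainder

    return (int(digits[12]) == check_digit(digits[:12])
            and int(digits[13]) == check_digit(digits[:13]))
```

A single OCR-swapped digit almost always breaks a check digit, which is why this runs before anything is written to a ledger.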

Tools / entry points

| Entry point | Input | Output |
|---|---|---|
| parser.parse(text: str) -> NfseResult | OCR’d NFS-e text | Typed dataclass with 40+ fields |
| parser.validate_cnpj(cnpj: str) -> bool | CNPJ string (formatted or digits-only) | True if check digits valid |
| evaluate.py | 2-doc corpus | Runs the parser, writes eval-results.json |

Eval

Method: auxiliar-nfs-e-field-accuracy-v1. We ran the parser on Surya, Tesseract, and Google Document AI OCR output for both NFS-e corpus documents. Field accuracy = (correctly extracted fields) / (total expected fields). Expected values were derived from ground truth (the source PDF’s embedded text layer).
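The metric is simple enough to restate as code. This is a hypothetical restatement for clarity; evaluate.py's actual implementation may differ in naming and bookkeeping.

```python
def field_accuracy(expected: dict, actual: dict) -> tuple:
    """Exact-string match per field: returns (correct, total)."""
    correct = sum(1 for name, value in expected.items() if actual.get(name) == value)
    return correct, len(expected)
```

Note the exact-string comparison: a value that differs only in thousands separators still counts as a miss, per the zero-error bookkeeping stance described in the caveats.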

Corpus: 2 São Paulo NFS-e invoices from a private business archive — gitignored at source (real company data). Only aggregate metrics are published below. Doc shapes:

  • 03-nfse-second-invoice.pdf — services invoice, Simples Nacional prestador, all-zero retentions
  • 08-nfse-structured-invoice.pdf — services invoice, includes RPS reference (RPS N° emitido em…)

Scorecard

| Candidate | Doc 03 | Doc 08 | Combined | Notes |
|---|---|---|---|---|
| Surya + auxiliar-nfs-e | 19/19 (100%) | 22/22 (100%) | 41/41 (100%) | Line ordering preserved; retention table parsed cleanly |
| Google Doc AI + auxiliar-nfs-e | 18/19 (94.7%) | 18/22 (81.8%) | 36/41 (87.8%) | Lost valor_servico on doc 03; RPS fields on doc 08 |
| Tesseract + auxiliar-nfs-e | 12/19 (63.2%) | 14/22 (63.6%) | 26/41 (63.4%) | Retention table reordered; ISS fields off by position |

Reproducible command

cd scripts/walkthroughs/nfs-e-extraction
python3 evaluate.py

Writes full per-field results to eval-results.json. Fixtures (real business PDFs) are gitignored; ground-truth files and the parser itself are committed.

Fit by agent

| Agent | Surya + parser | Google Doc AI + parser | Tesseract + parser |
|---|---|---|---|
| Claude Code | ✓ | ✓ | ✓ |
| Claude Desktop | ✓ | ✓ | ✓ |
| Cursor | ✓ | ✓ | ✓ |
| OpenClaw | ✓ | ✓ | ✓ |

All three pipelines are stdlib-callable. OpenClaw agents can install the parser locally via git clone plus pip install surya-ocr, or pair it with Google Document AI through a service account.

Alternatives considered

| Alternative | Why dropped |
|---|---|
| Pure OCR without a parser (Surya, Tesseract, Google Doc AI alone) | Returns raw text; agents then have to reimplement NFS-e field regex logic per project. The parser is the value. |
| LLM field extraction (prompt Claude/GPT to extract fields from NFS-e text) | Non-deterministic, slower, more expensive per page, and requires an additional verification step. For a regulated document with fixed structure, regex + position-based extraction is the right tool. |
| Generic invoice extractors (pdf-reader-mcp, openocr-skill, opendataloader-pdf on ClawHub) | None handle NFS-e’s specific structure (SP retention table, chave de acesso format, RPS reference). They solve “read PDF text”; they don’t solve “extract CNPJ do prestador”. |
| PyPI nfce-xml / nfepy packages | These parse the official NFS-e XML format (when you have API access). They don’t handle PDF-first workflows, which is what agents receive from users. |
| Mistral OCR 3 (via everaldo/mcp-mistral-ocr) | Strong on paper (88.9% handwriting benchmark); deferred because no MISTRAL_API_KEY was available during this eval. |

FAQ

Q: Does this work for NFS-e from municipalities other than São Paulo?

A: Not yet. Each Brazilian municipality has a slightly different NFS-e layout (field labels, section headers, retention table format). The v0.1 parser is hand-tuned for São Paulo’s form based on the 2-doc corpus. For other municipalities (Rio, Curitiba, Belo Horizonte, etc.), the parser needs an additional layout adapter — contributions welcome. Until then, agents can still extract generic fields (CNPJ, dates, values via regex) but won’t get the structured ISS/retention fields.
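One plausible shape for such an adapter is a per-municipality label map. This is a design sketch only, under the assumption that extraction stays label-driven; v0.1 hard-codes the São Paulo layout, so the registry and labels below are hypothetical, not shipped code.

```python
# Design sketch: per-municipality label maps behind a small registry.
# The São Paulo labels here are illustrative, not the parser's actual constants.
SP_LABELS = {
    "numero_nota": "Número da Nota",
    "codigo_verificacao": "Código de Verificação",
    "valor_servico": "VALOR TOTAL DO SERVIÇO",
}

ADAPTERS = {"sao-paulo": SP_LABELS}  # other municipalities would register here

def labels_for(municipality: str) -> dict:
    """Look up the field-label map for a municipality's NFS-e layout."""
    if municipality not in ADAPTERS:
        raise KeyError(f"no layout adapter for {municipality!r}")
    return ADAPTERS[municipality]
```

A contributed Rio or Curitiba adapter would then only need to supply its own label map, leaving the extraction logic shared.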

Q: Why is Tesseract so much worse at field extraction than at raw text extraction?

A: Tesseract outputs text in a top-to-bottom reading order that doesn’t preserve the NFS-e form’s two-column retention table structure. Labels end up separated from values. Our parser’s label-based extractor falls back to positional heuristics for retention fields, which Tesseract’s reordering breaks. Surya and Google Document AI preserve the label-value proximity, so our parser hits 100% and 88% respectively.
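To make the failure mode concrete, here is a toy label-based extractor (the pattern is hypothetical, not one of the parser's real regexes): it only finds a value when OCR keeps the value on the same line as its label, which is exactly what Tesseract's reordering breaks.

```python
import re

# Toy label-based extraction: each pattern assumes label and value stay adjacent
# on one line. If OCR reordering separates them, the field is simply missed.
LABELS = {
    "valor_servico": r"VALOR TOTAL DO SERVI[ÇC]O\s*[:=]?\s*R?\$?\s*([\d.,]+)",
}

def extract_by_label(text: str) -> dict:
    out = {}
    for name, pattern in LABELS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            out[name] = match.group(1)
    return out
```

When the label and value land on separate, distant lines, the regex finds nothing and the field falls through to the positional heuristics, which is where Tesseract's output loses accuracy.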

Q: How does this compare to hitting the São Paulo Prefeitura XML API directly?

A: The XML API is authoritative but requires: (a) the tomador or prestador’s credentials, (b) the invoice’s chave de acesso or number, (c) a non-trivial auth flow. When agents receive a PDF attachment in a bookkeeping workflow, the XML API isn’t usable — you’d have to re-request the XML per invoice. Our PDF-first parser lets agents work from the document the user actually shared.

Q: Does the parser validate the CNPJ check digits?

A: Yes. parser.validate_cnpj(cnpj) runs the standard Receita Federal CNPJ check-digit algorithm. Useful for flagging OCR errors (typo’d digits) before writing to a ledger.

Q: Can I use this inside OpenClaw’s Skill system?

A: Yes. A ClawHub skill nfs-e-parser is published that directs agents to install this parser + Surya and call parse(). Install via openclaw skills install tlalvarez/nfs-e-parser.

Methodological caveats

  • Corpus is 2 documents from the same issuer municipality (São Paulo). Field-accuracy claims apply to São Paulo NFS-e specifically; extrapolation to other municipalities requires testing against their layouts.
  • Ground truth is the PDF’s embedded text layer (pdftotext), which is authoritative for native-text NFS-e but wouldn’t apply to scanned images of printed NFS-e.
  • Field accuracy metric counts exact-string match per field. Fuzzy matches (e.g., minor whitespace differences in descricao_servico) would inflate accuracy slightly; we use exact-match for zero-error bookkeeping reliability.
  • Retention values (all zeros in our corpus because both prestadores are Simples Nacional) are extracted by position. Non-zero retentions haven’t been end-to-end tested against real documents; overlap between adjacent parsed retention values is an untested edge case.
  • CNPJ validation uses the standard check-digit algorithm but doesn’t query Receita Federal for active-status; a valid check-digit CNPJ can still be an inactive company.

Update cadence

Re-run this walkthrough when: (a) any of the three OCR candidates ships a major version, (b) the São Paulo Prefeitura changes the NFS-e form layout (watched via the scanner module’s BR government feeds), (c) 90 days after first publish (2026-07-23), (d) new NFS-e parser skills emerge on ClawHub / PyPI / npm that might outrank auxiliar-nfs-e.