Insurance Certificate Data Extraction
A document-intelligence pipeline that reads ACORD 25 certificates of insurance and turns them into structured, verifiable compliance data.
- Client: Insurance-tech client (US)
- Industry: Insurance
- Year: 2026
- Engagement: 6 weeks, solo delivery
- 98% field-level extraction accuracy on production documents
- 4,000+ certificates processed per month
- 85% of certificates need no human touch
The problem
Certificates of insurance (COIs) arrive as scanned PDFs, faxes, and photos, and every carrier formats the ACORD 25 form slightly differently. Teams that need to verify coverage (policy numbers, limits, effective dates, additional insureds) end up keying this data by hand. It's slow, error-prone, and the errors are expensive: a missed expiration date or a coverage gap surfaces exactly when a claim happens.
The client's compliance team was keying data from thousands of certificates a month, at roughly 12 minutes per document. Throughput capped how many vendors they could onboard, and transcription errors kept surfacing in downstream compliance checks.
The approach
Extraction built for messy documents, not clean ones. The pipeline combines OCR with LLM-based structured extraction, tuned to the ACORD 25 layout but tolerant of carrier-specific variations, scan artifacts, and handwritten fields.
Precision where it matters most. Fields are extracted with per-field confidence scores. High-confidence data flows straight through; low-confidence fields are routed to a human review queue — so the system reduces manual work without silently introducing errors.
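The routing idea can be sketched in a few lines. This is a minimal illustration, not the production code: the `ExtractedField` type, the `route_fields` helper, the field names, and the 0.90 threshold are all assumptions made for the example, since the actual cutoffs are not stated here.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the real threshold is not given in this case study.
REVIEW_THRESHOLD = 0.90


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction step


def route_fields(fields, threshold=REVIEW_THRESHOLD):
    """Split fields into auto-accepted values and ones needing human review."""
    accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review


# Illustrative values only; the last field mimics a smudged, low-confidence read.
fields = [
    ExtractedField("policy_number", "GL-1234567", 0.99),
    ExtractedField("expiration_date", "2027-01-15", 0.97),
    ExtractedField("additional_insured", "Acme Propertes LLC", 0.62),
]
accepted, needs_review = route_fields(fields)
print([f.name for f in accepted])
print([f.name for f in needs_review])
```

Under this scheme a certificate would count as "no human touch" only when `needs_review` comes back empty; anything else lands in the review queue with just the doubtful fields flagged.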
Validation, not just extraction. Extracted values are checked against business rules — expiration dates in the future, limits meeting contract requirements, named insureds matching vendor records — turning raw document data into a compliance decision.
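As a rough sketch of what such rules might look like, the example below checks the three conditions named above. The `Certificate` type, the `validate` function, and the exact-match name comparison are assumptions for illustration; a real system would likely need fuzzier name matching and per-contract limit tables.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Certificate:
    named_insured: str
    expiration_date: date
    general_liability_limit: int  # per-occurrence limit in USD


def validate(cert, vendor_name, required_limit, today):
    """Return a list of rule failures; an empty list means compliant."""
    failures = []
    if cert.expiration_date <= today:
        failures.append("policy expired")
    if cert.general_liability_limit < required_limit:
        failures.append("limit below contract requirement")
    # Naive comparison: case- and whitespace-insensitive exact match.
    if cert.named_insured.strip().lower() != vendor_name.strip().lower():
        failures.append("named insured does not match vendor record")
    return failures


# Illustrative data: an expired policy with a limit below the required amount.
cert = Certificate("Acme Roofing LLC", date(2026, 3, 1), 1_000_000)
failures = validate(cert, "acme roofing llc", 2_000_000, today=date(2026, 6, 1))
print(failures)
```

Returning the full list of failures, rather than stopping at the first one, lets a reviewer see every reason a certificate was rejected in a single pass.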
The structured output feeds the client's vendor-management system over a simple API, so compliance status updates the moment a certificate lands — no re-keying, no spreadsheet handoffs.
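To make the handoff concrete, here is one possible shape for the update sent to the vendor-management system. Everything in it is hypothetical: the `ComplianceUpdate` fields, the status values, and the identifiers are invented for the example, and the sketch sticks to the standard library rather than showing the actual FastAPI service.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ComplianceUpdate:
    vendor_id: str
    certificate_id: str
    status: str  # hypothetical values: "compliant", "non_compliant", "needs_review"
    failures: list = field(default_factory=list)


def to_payload(update):
    """Serialize an update to the JSON body an API client would send."""
    return json.dumps(asdict(update), sort_keys=True)


# Illustrative identifiers only.
update = ComplianceUpdate(
    vendor_id="vendor-001",
    certificate_id="coi-0042",
    status="non_compliant",
    failures=["policy expired"],
)
payload = to_payload(update)
print(payload)
```

Carrying the failure reasons alongside the status means the receiving system can show why a vendor is out of compliance without going back to the source document.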
The outcome
Time spent per certificate dropped from roughly 12 minutes of manual keying to under a minute. 85% of certificates now process with no human touch, at 98% field-level accuracy, and the compliance team reviews only the low-confidence exceptions the system routes to them, so accuracy went up while manual work fell away. The pipeline now handles 4,000+ certificates a month without added headcount.
Technologies
- Python
- FastAPI
- LangGraph
- LLM extraction
- OCR
Facing something similar?
A 20-minute call is enough to tell you whether this approach fits your problem — and what it would take.