Vinícius Bispo

Internal system names, providers, and exact numbers have been abstracted or generalized for confidentiality — the architecture patterns and trade-offs described are accurate.

Context

The credit pipeline depends on extracting structured data from regulatory PDFs, XMLs, and free-form documents that come from dozens of government and registry sources. The schemas are inconsistent across providers, the language is dense, and the same document type can have multiple legitimate layouts.

The original approach, handwritten parsers per document type, broke whenever a provider changed a header, missed edge cases on the long tail of documents, and required engineering time for every new document type.

A purely LLM-driven approach would have its own problem: free-form text is unreliable downstream. The credit decisioning system needs structured facts ({"valid_until": "2026-12-31", "issuer": "...", "status": "..."}), not paragraphs.

The extraction service was built around a single principle: the LLM proposes, a deterministic validator accepts. The model is constrained by a JSON schema specific to the document type; outputs that don't conform are rejected at the boundary rather than passed downstream as wrong-but-confident facts.
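The "LLM proposes, validator accepts" boundary can be sketched in a few lines. This is a minimal illustration, not the service's actual code: the `accept` function, the `CHECKS` table, and the simplified field-to-format schema are all hypothetical stand-ins for a full JSON Schema validator, and the field names are taken from the example fact above.

```python
import json
from datetime import date


def _is_iso_date(value) -> bool:
    """True if value is a string holding an ISO calendar date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Hypothetical format checkers; a real service would likely use a full
# JSON Schema validator instead of this hand-rolled table.
CHECKS = {
    "string": lambda v: isinstance(v, str) and v != "",
    "date": _is_iso_date,
}

# Hypothetical schema for one document type: field name -> expected format.
CERTIFICATE_SCHEMA = {"valid_until": "date", "issuer": "string", "status": "string"}


def accept(raw_output: str, schema: dict) -> dict:
    """Return the typed fact if raw_output conforms to schema, else raise ValueError."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    # Missing and unexpected fields are both rejected at the boundary.
    if set(data) != set(schema):
        raise ValueError(f"expected fields {sorted(schema)}, got {sorted(data)}")
    for name, fmt in schema.items():
        if not CHECKS[fmt](data[name]):
            raise ValueError(f"field {name!r} is not a valid {fmt}")
    return data
```

Free-form model text fails at the first step (it is not JSON), and a well-formed object with an implausible value such as "valid_until": "soon" fails at the format check, so neither reaches the decisioning system.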

Architecture

caller ──► Extraction API (v3) ──► LLM call (schema-constrained)
              │                            │
              │                            ▼
              │                    Output validator
              │                    (per-type schema)
              │                            │
              │                ┌───────────┴────────────┐
              │                ▼                        ▼
              │           Structured fact          Fail-fast +
              │           (typed, validated)       human review
              │                │
              ▼                ▼
        Status manager   Caller / downstream
        (workflow step)

Per-type, per-department configuration

Each document type has a JSON schema describing the expected output (fields, types, formats). The schema lives in the database, not in code, and supports overrides per department — the same regulatory document might be parsed differently depending on which credit operation department is consuming it.

The prompt is also per-type and per-department. Engineering writes the base; the operating team can adjust prompts and schemas without a deploy.
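One way to picture the lookup is a two-level resolution: try the department's override first, then fall back to the base configuration. The sketch below uses an in-memory dictionary as a stand-in for the database table. The key layout, the `resolve_config` name, the department names, and the assumption that an override replaces the base wholesale (rather than merging with it) are all illustrative, not details from the real system.

```python
from typing import Optional

# Stand-in for the configuration table; keys are (document_type, department).
# department=None marks the base configuration written by engineering.
CONFIGS = {
    ("tax_certificate", None): {
        "schema_json": '{"required": ["valid_until", "issuer", "status"]}',
        "prompt": "Extract the validity window, issuer, and status.",
    },
    ("tax_certificate", "department_a"): {
        "schema_json": '{"required": ["valid_until"]}',
        "prompt": "Extract only the validity window.",
    },
}


def resolve_config(document_type: str, department: Optional[str]) -> dict:
    """Return the department override if one exists, else the base config."""
    override = CONFIGS.get((document_type, department))
    if override is not None:
        return override
    base = CONFIGS.get((document_type, None))
    if base is None:
        raise LookupError(f"no configuration for document type {document_type!r}")
    return base
```

Under these assumptions, `resolve_config("tax_certificate", "department_a")` returns the narrow override, while any department without an override gets the base schema and prompt.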

Failure semantics

The validator does one thing: it accepts structured output that matches the schema, or it rejects it. Rejection is loud — the document goes to a manual queue with the rejection reason attached. The pipeline never propagates "the model said something" as a structured fact downstream.
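The accept-or-reject routing might look like the following sketch. The `ManualQueue` class, the `route` function, and the toy `require_status` validator are hypothetical names introduced for illustration. The point is only that a rejection always lands in the queue together with its reason, and nothing is returned to the caller as a fact.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ManualQueue:
    """In-memory stand-in for the manual review queue."""
    items: list = field(default_factory=list)

    def enqueue(self, document_id: str, reason: str) -> None:
        self.items.append({"document_id": document_id, "reason": reason})


def route(
    document_id: str,
    candidate: dict,
    validator: Callable[[dict], None],
    queue: ManualQueue,
) -> Optional[dict]:
    """Return the candidate as a fact if it validates; otherwise queue it and return None."""
    try:
        validator(candidate)
    except ValueError as exc:
        # Rejection is loud: the reason travels with the document.
        queue.enqueue(document_id, str(exc))
        return None
    return candidate


def require_status(candidate: dict) -> None:
    """Toy validator (assumption): status must be one of two known values."""
    if candidate.get("status") not in {"active", "expired"}:
        raise ValueError(
            f"status must be 'active' or 'expired', got {candidate.get('status')!r}"
        )
```

A caller that receives `None` knows the document is waiting for a human, and the queue entry says why.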

Schema parsing also fails fast: if the configured schema_json for a document type is empty or malformed, the request returns a 4xx immediately rather than calling the LLM with no constraints.
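A sketch of that guard, assuming the schema is stored as a JSON string: the schema is parsed before anything else happens, and a configuration error short-circuits the request. The `SchemaConfigError` class, the `handle_request` shape, the `call_llm` callable, and the choice of 422 as the specific status code are assumptions; the source only says the response is a 4xx.

```python
import json
from typing import Callable, Optional


class SchemaConfigError(Exception):
    """Raised when a document type has no usable schema."""
    status_code = 422  # assumed; the text above only says "4xx"


def load_schema(schema_json: Optional[str]) -> dict:
    """Parse schema_json, refusing empty or malformed configuration."""
    if schema_json is None or not schema_json.strip():
        raise SchemaConfigError("schema_json is empty")
    try:
        schema = json.loads(schema_json)
    except json.JSONDecodeError as exc:
        raise SchemaConfigError(f"schema_json is not valid JSON: {exc}") from exc
    if not isinstance(schema, dict) or not schema:
        raise SchemaConfigError("schema_json must be a non-empty JSON object")
    return schema


def handle_request(
    schema_json: Optional[str],
    document: bytes,
    call_llm: Callable[[bytes, dict], dict],
) -> dict:
    """Validate configuration first; only then spend an LLM call."""
    try:
        schema = load_schema(schema_json)
    except SchemaConfigError as exc:
        return {"status": exc.status_code, "error": str(exc)}
    return {"status": 200, "fact": call_llm(document, schema)}
```

Because `load_schema` runs before `call_llm`, a broken configuration costs no model call and produces an error message that names the configuration problem directly.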

Input flexibility

The v3 API accepts both file uploads and inline base64 payloads with an explicit mimeType. That gave two benefits: the same endpoint backs interactive flows (frontend uploads) and machine-to-machine flows (other services posting bytes), and neither side has to stage files in S3 just for the call.
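Both input styles can be normalized to one internal shape before extraction starts, as in the sketch below. Only `mimeType` comes from the description above; the `data` field name, the `DocumentInput` type, and the two helper functions are hypothetical.

```python
import base64
import binascii
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentInput:
    """Single internal representation, whatever the transport was."""
    content: bytes
    mime_type: str


def from_upload(file_bytes: bytes, content_type: str) -> DocumentInput:
    """Interactive flow: the frontend uploads a file."""
    return DocumentInput(content=file_bytes, mime_type=content_type)


def from_inline(payload: dict) -> DocumentInput:
    """Machine-to-machine flow: another service posts base64 bytes inline."""
    mime_type = payload.get("mimeType")
    if not mime_type:
        raise ValueError("inline payloads must declare mimeType")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        content = base64.b64decode(payload["data"], validate=True)
    except (KeyError, TypeError, binascii.Error) as exc:
        raise ValueError("inline payload needs valid base64 in 'data'") from exc
    return DocumentInput(content=content, mime_type=mime_type)
```

Everything after this point in the pipeline sees a `DocumentInput` and does not need to know which flow produced it.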

Integration with the workflow

Extraction is an explicit step inside the broader credit-operation status manager — not a side effect of upload. Each document acquires extraction state alongside its existing state machine. When extraction fails, the workflow stays inspectable: ops can see "this document type doesn't have a working schema yet" instead of a silent regression.
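A rough model of per-document extraction state, with illustrative names throughout: the four states, the transition table, and the `DocumentRecord` class are assumptions, and the real status manager presumably tracks more than this. The sketch shows the two properties described above: failure is a visible state with a note attached, and illegal jumps between states are refused.

```python
from dataclasses import dataclass
from enum import Enum


class ExtractionState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    EXTRACTED = "extracted"
    MANUAL_REVIEW = "manual_review"


# Allowed transitions (assumed). MANUAL_REVIEW -> RUNNING models a retry
# after someone fixes the schema or prompt.
ALLOWED = {
    ExtractionState.PENDING: {ExtractionState.RUNNING},
    ExtractionState.RUNNING: {ExtractionState.EXTRACTED, ExtractionState.MANUAL_REVIEW},
    ExtractionState.MANUAL_REVIEW: {ExtractionState.RUNNING},
    ExtractionState.EXTRACTED: set(),
}


@dataclass
class DocumentRecord:
    document_id: str
    extraction_state: ExtractionState = ExtractionState.PENDING
    extraction_note: str = ""

    def transition(self, new_state: ExtractionState, note: str = "") -> None:
        """Move to new_state if the transition is allowed, recording why."""
        if new_state not in ALLOWED[self.extraction_state]:
            raise ValueError(
                f"cannot go from {self.extraction_state.value} to {new_state.value}"
            )
        self.extraction_state = new_state
        self.extraction_note = note
```

With a note such as "no working schema for this document type" stored on the record, ops can read the cause straight from the document's state instead of inferring it from missing data.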

Trade-offs

Structured-output validation, not free-form text. Documents whose layout doesn't yet have a working schema fall back to manual review — extraction throughput is gated by schema coverage. The benefit is that downstream consumers never receive a hallucinated field as if it were valid; the pipeline either has a typed fact or it admits it doesn't.

Schemas in the database, not in code. Adding a new document type goes through admin instead of through a deploy. That increases the operational surface (someone needs to author/own each schema). The benefit is that the people closest to the regulatory domain can iterate without engineering becoming a bottleneck.

Per-department overrides, not a single global schema. A document like a tax certificate can mean different things to different credit operations; one department might need only the validity window, another might need the full debt breakdown. Overrides keep each department's contract small instead of forcing a union schema that everyone has to opt out of. The cost is that prompts and schemas can drift across departments, so owners have to keep them aligned.

LLM as an extraction step in a workflow, not a magic function. The model runs inside the same status machine as deterministic steps, with the same observability and replay semantics. That made the LLM step boring on purpose — it's an ordinary stage that can be retried, inspected, and shut off independently.

Outcome

Coverage. New document types ship through schema + prompt configuration, not through new parser code.
Trust. Downstream credit-decision steps consume typed, validated facts. The pipeline either has structured data or it routes to manual review — there's no "model said yes, it must be true" path.
Auditability. Every extracted fact carries its schema version, the document it came from, and the workflow step that produced it.
Inspectable failures. When a schema is wrong, the pipeline says so immediately rather than silently dropping a field. Misconfigurations surface in seconds, not weeks.
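For the auditability point, a fact that carries its own provenance could be modeled roughly as follows. The `ExtractedFact` type, its field names, the sample values, and the JSON audit format are assumptions made for illustration; the source states only that each fact carries its schema version, source document, and producing workflow step.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractedFact:
    """A validated fact plus the provenance needed to audit it later."""
    document_id: str      # the document the fact came from
    schema_version: int   # version of the schema that accepted it
    workflow_step: str    # the step that produced it
    values: dict          # the typed, validated fields


def audit_record(fact: ExtractedFact) -> str:
    """Serialize the fact and its provenance as one stable JSON line."""
    return json.dumps(asdict(fact), sort_keys=True)


fact = ExtractedFact(
    document_id="doc-123",
    schema_version=3,
    workflow_step="extraction",
    values={"valid_until": "2026-12-31", "status": "active"},
)
```

Keeping provenance on the fact itself, rather than in a separate log, means a downstream consumer can answer "which schema accepted this?" without joining against another system.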