Vinícius Bispo

Internal system names, providers, and exact numbers have been abstracted or generalized for confidentiality — the architecture patterns and trade-offs described are accurate.

Context

Operations teams routinely upload spreadsheets that reference Brazilian public-registry offices ("cartórios") by name and city. The platform needs to bind each row to the canonical record of that office in the internal catalog — but the input is the messy reality of operator-typed prose, and the catalog itself has its own messiness:

The same office name can exist in multiple regions
Sub-regions (neighborhoods, districts) appear in operator input but are stored under parent regions
Some catalog entries are inactive duplicates of currently-active offices — same identity, different row
Names drift across spelling, abbreviation, casing, accent normalization

The original lookup ("get_by_name") was a single direct query. When it missed, the row was rejected and the operator got a generic "not found" error. In practice, the lookup missed often enough that operators developed workarounds (renaming entries to match the catalog, splitting batches by region) — pushing the messiness back onto humans rather than handling it in the resolver.

The work was to rewrite the resolver to handle the realistic input shape: alternative names, sub-regions, inactive twins, ambiguous ties.

Architecture

Spreadsheet row ──► Resolver
                        │
                        ├─► Match by canonical name (active)
                        ├─► Match including sub-regions of the cited city
                        ├─► Twin lookup: inactive entry → active counterpart
                        ├─► Cross-region dedup
                        │
                        ▼
                Single canonical record (or operator-actionable failure)

Sub-region capture

The catalog stores cartórios under the official city they belong to, but operators type the neighborhood. The resolver, when matching by city, also considers sub-regions of that city. An office in a neighborhood of São Paulo resolves when the spreadsheet says "São Paulo" or says the neighborhood — same canonical record either way.

Twin matching

Some catalog entries are inactive duplicates of active offices — the catalog kept the inactive row for historical traceability, but the active row is the one credit operations should bind to. The resolver detects this case: if the only name match is on an inactive entry that has an active twin, return the active twin. The operator's spreadsheet, which still references the old name, keeps working.

Cross-region dedup

The same office name can appear in multiple cities. The resolver dedups across regions when narrowing on a candidate set — a duplicate match in city A and city B becomes the match in the operator's cited region, not an ambiguous failure.

Deterministic tiebreaker

When all heuristics still leave more than one candidate, the resolver falls back to a deterministic tiebreaker (ORDER BY primary key). That sounds banal, but the older fallback relied on database-default ordering, which is undefined in PostgreSQL. Two runs of the same batch could produce different bindings — a silent non-determinism that made bugs unreproducible. The explicit ORDER BY pk collapsed that.

Test coverage as failure-mode documentation

Every edge case got a regression test: sub-region with active match, sub-region with no match, inactive twin with active sibling, inactive twin without sibling, cross-region match in cited region, cross-region tie. The test names read like a spec for what the resolver promises — and what it doesn't.

Trade-offs

Wider candidate set, then narrow deterministically — not a single perfect query. The resolver runs multiple search passes (canonical, sub-region, twin, cross-region dedup) and then narrows. That's slower than a single index lookup. The benefit is that the operator's input rarely needs to be reshaped to fit the catalog — the resolver does the reconciliation work instead of the human.

Sub-region match is a "best effort," not a guarantee. A neighborhood whose parent city isn't represented in the cited region won't resolve. The benefit is that the resolver's behavior is predictable: it only walks up the geographic hierarchy when there's a defensible path. Wider fuzzy matching would have looked smarter and would have bound rows to the wrong offices.

Twin-match prefers the active record, not the cited one. An operator who typed the inactive entry's exact name will get bound to the active twin, not their literal string. That's surprising in isolation but correct in practice — credit operations should always run against the active record. The resolver logs which rows were rewritten so the operator can audit if needed.

Explicit tiebreaker, not "good enough" ordering. The change from undefined-order to ORDER BY pk is invisible to operators who don't hit ties. The benefit is non-flakiness — bugs that depended on which row "won" the tie became reproducible.

Outcome

Higher resolution rate. Spreadsheets that previously had 5-15% rejection rates routinely now bind end-to-end.
Operator audit trail. Twin rewrites and sub-region matches are logged with the canonical record they resolved to.
No more flaky bindings. Deterministic tiebreaker means two runs of the same batch produce the same result.
Edge-case regression net. Every shape of input the resolver now handles has a test that fails loudly when behavior regresses.