Internal system names, providers, and exact numbers have been abstracted or generalized for confidentiality — the architecture patterns and trade-offs described are accurate.
Context
Pull requests in a busy Django monolith touch many orthogonal concerns: database migrations, the internal admin frontend, CI workflows, regulatory copy, model logic. A single human reviewer is rarely strong in all of them, and the concerns that go unreviewed surface as problems in production.
Generic "AI reviews this PR" tools surface noise: hundreds of style-level comments, no awareness of the codebase's conventions, no way to express "this kind of change deserves more scrutiny than that one." The team had tried one and stopped using it.
The goal was different: turn the LLM into a reviewer that runs a panel of area-specialist graders, each with explicit pass/fail criteria, instead of a single freeform review.
The system was first prototyped on a sibling backend and then ported into the main platform — most of the recent iteration is about adapting it to the platform's conventions.
Architecture
PR event ──► GitHub Action ──► Agent orchestrator
                                      │
                                      ├─► Grader: Migrations (graph)
                                      ├─► Grader: Frontend
                                      ├─► Grader: CI / GitHub Actions
                                      ├─► Grader: Documentation
                                      ├─► Grader: Migrations (content)
                                      └─► ...
                                      │
                                      ▼
                   Aggregated verdict + targeted comments
                                      │
                                      ▼
                          Approval gate (adaptive)
Graders, not a monolithic review
Each grader is a specialized prompt + structured-output schema, scoped to a single concern. The Migrations graders, for example, are split in two: one inspects the Django migration dependency graph (catches missing parents, accidental linearization, twin nodes that need re-numbering); the other inspects the migration content itself (data migrations, irreversible operations, schema drift).
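To make the graph-level checks concrete, here is a minimal sketch of the kind of facts such a grader looks for: missing parents, multiple leaf nodes, and twin migration numbers. It is an illustration only. The write-up does not say whether these facts are computed deterministically or left to the model, and `check_migration_graph` and the sample migration names are hypothetical. Accidental linearization is not covered here.

```python
from collections import defaultdict


def check_migration_graph(deps: dict[str, list[str]]) -> list[str]:
    """Return findings for one app's migration graph.

    `deps` maps each migration name to the names of its parents in the
    same app. This is a hypothetical, simplified input format.
    """
    findings = []

    # Missing parents: a dependency that points at a migration not in the app.
    for name, parents in sorted(deps.items()):
        for parent in parents:
            if parent not in deps:
                findings.append(f"{name}: missing parent {parent}")

    # Multiple leaves: more than one migration that nothing else depends on.
    referenced = {p for parents in deps.values() for p in parents}
    leaves = sorted(n for n in deps if n not in referenced)
    if len(leaves) > 1:
        findings.append("multiple leaf nodes: " + ", ".join(leaves))

    # Twin numbers: two migrations sharing the same numeric prefix.
    by_number = defaultdict(list)
    for name in deps:
        by_number[name.split("_", 1)[0]].append(name)
    for number, names in sorted(by_number.items()):
        if len(names) > 1:
            findings.append(f"twin prefix {number}: " + ", ".join(sorted(names)))

    return findings


# Two branches both added an 0003 migration on top of 0002.
example = {
    "0001_initial": [],
    "0002_add_status": ["0001_initial"],
    "0003_add_index": ["0002_add_status"],
    "0003_backfill": ["0002_add_status"],
}
findings = check_migration_graph(example)
```

Facts like these could be passed to the grader prompt as context, or used to cross-check what the model reports. Which of the two the real system does is not stated in the source.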
A grader can ignore a PR entirely if no files in its area changed. That keeps the noise floor low.
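One simple way to implement this scoping is to match changed paths against per-area glob patterns before any model call is made. The sketch below assumes that approach. The area names follow the article, but `AREA_PATTERNS`, the patterns themselves, and `areas_in_scope` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from eval area to the path patterns it owns.
# Note that fnmatch-style "*" also matches "/" characters.
AREA_PATTERNS = {
    "migrations": ["*/migrations/*.py"],
    "ci": [".github/workflows/*.yml", ".github/workflows/*.yaml"],
    "documentation": ["docs/*", "*.md"],
}


def areas_in_scope(changed_files, area_patterns=AREA_PATTERNS):
    """Return the areas that have at least one changed file, in registry order."""
    in_scope = []
    for area, patterns in area_patterns.items():
        if any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in patterns
        ):
            in_scope.append(area)
    return in_scope


# Only the migrations grader needs to run for this change set.
scoped = areas_in_scope(
    ["billing/migrations/0003_add_index.py", "billing/models.py"]
)
```

A PR that touches none of the patterns yields an empty list, so no grader runs and no comments are posted.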
Structured output, not prose
Graders return structured verdicts — pass/fail/needs-changes with anchored line-level findings — instead of paragraphs. Comments only get posted for findings that pass a confidence threshold. The narrative summary is built deterministically from the structured verdicts.
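A minimal sketch of what such a verdict could look like, with the confidence filter and the deterministic summary built on top of it. The field names, the `0.8` threshold, and the summary format are assumptions for illustration. The source does not give the actual schema.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    message: str
    confidence: float  # 0.0 to 1.0, as reported by the grader (assumed scale)
    blocking: bool = False


@dataclass(frozen=True)
class Verdict:
    area: str
    status: Literal["pass", "fail", "needs-changes"]
    findings: tuple[Finding, ...] = ()


CONFIDENCE_THRESHOLD = 0.8  # hypothetical value


def comments_to_post(verdicts, threshold=CONFIDENCE_THRESHOLD):
    """Keep only the findings confident enough to become PR comments."""
    return [
        (v.area, f)
        for v in verdicts
        for f in v.findings
        if f.confidence >= threshold
    ]


def summarize(verdicts):
    """Build the summary from verdict fields only, with no model-written text."""
    ordered = sorted(verdicts, key=lambda v: v.area)
    return "\n".join(
        f"{v.area}: {v.status} (findings={len(v.findings)})" for v in ordered
    )


verdicts = [
    Verdict(
        "migrations",
        "needs-changes",
        (
            Finding("billing/migrations/0003_backfill.py", 12,
                    "RunPython has no reverse function", 0.93, blocking=True),
            Finding("billing/migrations/0003_backfill.py", 30,
                    "Consider batching the update", 0.41),
        ),
    ),
    Verdict("ci", "pass"),
]
posted = comments_to_post(verdicts)
summary = summarize(verdicts)
```

Because the summary is a pure function of the verdicts, the same verdicts always produce the same text, which keeps the bot's output stable across reruns.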
Adaptive approval gate
The gate started at "all graders must pass" and was relaxed to "at least one grader explicitly approves, with no blocking findings." That cleared a backlog of hotfix PRs held up by unrelated graders flagging cosmetic issues.
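The two gate policies can be sketched as a pair of predicates over grader results. This reading assumes "no blocking findings" means none from any grader, and that a grader with no files in scope reports a `skipped` status. Both are interpretations, and `GraderResult`, `strict_gate`, and `relaxed_gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderResult:
    area: str
    status: str  # "pass", "fail", "needs-changes", or "skipped" (assumed)
    blocking_findings: int = 0


def strict_gate(results):
    """Original policy: every grader that ran must pass."""
    ran = [r for r in results if r.status != "skipped"]
    return bool(ran) and all(r.status == "pass" for r in ran)


def relaxed_gate(results):
    """Relaxed policy: at least one explicit approval and no blocking findings."""
    any_approval = any(r.status == "pass" for r in results)
    any_blocker = any(r.blocking_findings > 0 for r in results)
    return any_approval and not any_blocker


# A hotfix where the frontend grader raises a cosmetic, non-blocking issue.
hotfix = [
    GraderResult("migrations", "pass"),
    GraderResult("frontend", "needs-changes", blocking_findings=0),
    GraderResult("ci", "skipped"),
]
strict_ok = strict_gate(hotfix)    # the strict policy holds this PR
relaxed_ok = relaxed_gate(hotfix)  # the relaxed policy lets it through
```

Under this reading, a grader that fails without marking any finding as blocking no longer stops the merge.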
Hotfix PRs that target main directly (instead of a release branch) are covered by the same workflow — that required an explicit fix to the fetch step so the action sees the right base ref.
Eval areas as a registry
Adding a new area is a config change, not a new pipeline: once a grader file with its prompt + schema is registered, the workflow picks it up automatically. Documentation, Frontend, CI/GitHub Actions, and Migrations were each shipped without touching the orchestrator.
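A registry of this kind can be sketched in a few lines. The orchestrator only iterates over registered entries, so a new area never changes its code. `GraderSpec`, `REGISTRY`, `register`, `run_all`, and the `call_model` callable are all hypothetical names, and the prompt and schema contents are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderSpec:
    area: str
    prompt: str
    schema: dict  # structured-output schema the model response must satisfy


REGISTRY: dict[str, GraderSpec] = {}


def register(spec: GraderSpec) -> GraderSpec:
    """Add a grader to the registry, refusing duplicate area names."""
    if spec.area in REGISTRY:
        raise ValueError(f"grader already registered: {spec.area}")
    REGISTRY[spec.area] = spec
    return spec


def run_all(changed_files, call_model):
    """Orchestrator loop: knows nothing about individual areas.

    `call_model` is a stand-in for whatever sends a prompt and schema
    to the LLM and returns a parsed verdict.
    """
    return {
        area: call_model(spec, changed_files)
        for area, spec in REGISTRY.items()
    }


# Shipping a new area is one registration call in the grader's own file.
register(
    GraderSpec(
        area="documentation",
        prompt="Review the changed docs for accuracy and broken links.",
        schema={"type": "object", "required": ["status", "findings"]},
    )
)
```

Rejecting duplicate area names at registration time is a design choice in this sketch, so that two grader files cannot silently overwrite each other.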
Trade-offs
Many narrow graders, not one general reviewer. Higher prompt-engineering and maintenance cost — every area is its own prompt with its own failure modes. The benefit is each grader stays inside a problem it can actually reason about, and false positives in one area don't drown true positives in another. Adding scope is also bounded: a new grader has a small surface, not a sprawling system prompt.
Structured output gated by confidence, not raw model text. Some findings get dropped because they fail schema validation or fall below the confidence threshold. The benefit is that the comments developers see are signal: the team stopped ignoring the bot after a couple of iterations.
Adaptive gate, not strict. Relaxing the gate to one approval risked letting through PRs that one grader genuinely wanted to block. The benefit was unblocking the team during the noisy stabilization period, and once the noise drops, the gate can be tightened back per area.
Port-and-adapt from a sibling repo, not a from-scratch rewrite. The agent core was already in production on another repo. Re-implementing would have been cleaner; porting meant carrying over assumptions that didn't all fit. The benefit was weeks of velocity — the platform got the reviewer running end-to-end before iterating on graders specific to its codebase.
Outcome
- Coverage. Areas the human team historically under-reviewed (the migration graph, CI workflows, internal documentation) now get a consistent first pass on every PR.
- Signal over noise. Targeted, schema-validated comments instead of paragraph dumps. Developers started reading the bot's output instead of dismissing it.
- Bounded blast radius for new areas. Adding a grader is local: a prompt + a schema + a registry entry. New scope ships in hours rather than weeks.
- Hotfix coverage. Direct-to-main PRs (the highest-risk path) go through the same review as release PRs.