Code review for a PR that touches a prompt is not the same shape of work as code review for a PR that touches a function. A prompt PR can pass every traditional review check (types, tests, naming, style) and still degrade production silently because the failure modes live outside the surface a normal review inspects. The 10-check standard below is the review template I use at SFAI Labs for any PR that modifies a prompt, an eval case, or a model configuration. Each check exists because of a production incident I have either watched happen or shipped in error myself, and the standard is enforced because the cost of getting it wrong is asymmetric: a regular code bug is caught by tests; a prompt regression is caught by users.
The standard differs from regular code review in three structural ways. First, the unit of correctness is not “the function does what it says” but “the system clears its eval threshold.” Second, every prompt-bearing PR introduces a third dimension of dependency, the model, that traditional review cannot see. Third, the cost of a regression is non-uniform: a prompt change can degrade quality in a small fraction of inputs and pass through review unnoticed because the diff looks small. The 10 checks below close those gaps.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why prompt PRs need their own standard
- Check 1: eval test added
- Check 2: eval-pass threshold listed
- Check 3: prompt-version pinned
- Check 4: model and temperature explicit
- Check 5: cost delta noted
- Check 6: trace link included
- Check 7: regression risk flagged
- Check 8: kill switch present
- Check 9: on-call paged on merge
- Check 10: post-merge eval scheduled
- How this standard differs from regular code review
- FAQ
Why prompt PRs need their own standard
A normal code review verifies three things: the change does what it says, the change does not break what already works, and the change is maintainable. For deterministic code, those three checks are sufficient because tests catch regressions and the input/output relationship is stable. For prompt-bearing code, none of the three is sufficient on its own. A prompt change “does what it says” only on the inputs the reviewer remembered to test; it does not break what already works only if the eval suite is comprehensive; and it is maintainable only if the prompt registry, the model config, and the eval rubric all moved together.
The standard below is not aspirational. Every check on it has a production incident behind it. The eval-test-added check exists because a prompt change once shipped that improved the demo case and degraded 18% of the long tail. The kill-switch check exists because a prompt change once shipped that broke at provider rate-limit boundaries and there was no path to revert without a deploy. The cost-delta check exists because a prompt change once added 3,200 tokens of context per request and tripled the monthly bill before anyone noticed.
For a related treatment of how PR discipline anchors a healthy AI engagement, see the AI agency quality system.
Check 1: eval test added
Every PR that modifies a prompt, an eval case, or a model configuration must add at least one eval case to the suite. The case represents the input pattern the PR is designed to handle better, with the expected output and pass criterion explicit. If the PR is a refactor that should not change behavior, the added case is a regression-anchor case: an input that previously passed and must continue to pass.
The check is mechanical. The PR description has a section titled “Eval cases added” with a count and a link to the new test file. Reviewers reject PRs that say “no new cases needed” without a written justification. The justification must name a specific reason: a pure typing change, a comment-only change, a config field rename with no behavioral effect. Anything ambiguous defaults to requiring a case.
The reason this check is non-negotiable is that the eval suite is the only durable record of what the system was intended to do. PR descriptions are read once; eval cases are run on every PR forever. A prompt change that does not deposit a case into the suite is a change that will be silently undone by the next prompt change.
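As an illustration only, an eval case with an explicit input, expected output, and pass criterion might be sketched as below. The `EvalCase` shape, the case ID, the sample input, and the `fake_model` stand-in are all hypothetical; a real suite would call the actual model and likely use a richer pass criterion.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval case: the input, the expected output, and an explicit pass criterion."""
    case_id: str
    input_text: str
    expected: str
    passes: Callable[[str], bool]  # pass criterion applied to the model output
    kind: str = "behavior"         # or "regression-anchor" for refactor PRs


def run_case(case: EvalCase, model_fn: Callable[[str], str]) -> bool:
    """Run one case against any callable that maps input text to output text."""
    return case.passes(model_fn(case.input_text))


# Hypothetical regression-anchor case: it passed before the PR and must keep passing.
refund_case = EvalCase(
    case_id="refund-policy-001",
    input_text="Can I return an opened item after 45 days?",
    expected="no",
    passes=lambda output: output.strip().lower().startswith("no"),
    kind="regression-anchor",
)


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call, used only to make the sketch runnable."""
    return "No, returns close after 30 days."


print(run_case(refund_case, fake_model))
```

Keeping the pass criterion as a function on the case itself means the suite records not just what the expected answer was but how it was judged.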
Check 2: eval-pass threshold listed
The PR description must include the eval delta against the suite. Format: baseline 0.78, this PR 0.81, threshold 0.80, gap +0.01. The threshold is the contractual threshold from the success criteria; the baseline is the pre-PR suite score; this-PR is the post-PR suite score. If the gap is negative, the PR does not merge without a written justification and an executive sign-off.
The check is also a forcing function on CI. The eval suite must run automatically on every PR; manual eval runs are too easy to skip. CI posts the eval delta as a comment on the PR within the same review cycle as the unit test results. For the observability stack that supports this, see the AI agency observability stack we install on day one.
A PR that merges without a written eval delta is invisible to the eval discipline. The cost of that invisibility is that two months later when production has drifted, no one can answer “which PR caused the drift” because the deltas are not in the history. Eval deltas in PR descriptions are the source of truth for that question.
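A minimal sketch of how CI could render the delta line and apply the negative-gap rule, assuming the gap is defined as the PR score minus the threshold (which matches the example figures above). The function names are hypothetical.

```python
def eval_delta_line(baseline: float, this_pr: float, threshold: float) -> str:
    """Format the eval delta in the PR-description format described above."""
    gap = round(this_pr - threshold, 2)  # gap is measured against the threshold
    return (
        f"baseline {baseline:.2f}, this PR {this_pr:.2f}, "
        f"threshold {threshold:.2f}, gap {gap:+.2f}"
    )


def merge_allowed(this_pr: float, threshold: float, signed_off: bool = False) -> bool:
    """A negative gap blocks the merge unless a written sign-off has been recorded."""
    gap = round(this_pr - threshold, 4)  # rounding avoids float noise at the boundary
    return gap >= 0 or signed_off


print(eval_delta_line(0.78, 0.81, 0.80))
# prints: baseline 0.78, this PR 0.81, threshold 0.80, gap +0.01
```

The sign-off is modeled as a single boolean for brevity; in practice it would be a recorded approval rather than a flag the author can set.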
Check 3: prompt-version pinned
Every prompt referenced by the PR must have an explicit version identifier in the prompt registry. The PR diff shows the old version pointer and the new version pointer; the prompt registry shows the old prompt body and the new prompt body, both immutable. “The prompt was updated” is not an acceptable diff; “prompt v0.7.2 → v0.7.3, body diff in registry” is the required diff.
The check exists because prompts change in three places: the registry body, the inline use in code, and the eval cases. Without a version pointer, those three can drift apart. With a version pointer, the diff is auditable and the rollback is single-line.
The prompt registry should be a first-class component of the system, not a folder of markdown files that is updated by hand. For the day-one components, see the AI agency observability stack we install on day one.
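To make the "immutable bodies, single-line pointer" idea concrete, here is a toy in-memory registry. It is a sketch under stated assumptions: the `PromptRegistry` class, the `support-triage` prompt name, and the prompt bodies are hypothetical, and a real registry would be backed by durable storage.

```python
from typing import Dict, Tuple


class PromptRegistry:
    """Toy registry: a published (name, version) body can never be overwritten."""

    def __init__(self) -> None:
        self._bodies: Dict[Tuple[str, str], str] = {}

    def publish(self, name: str, version: str, body: str) -> None:
        key = (name, version)
        if key in self._bodies:
            # Immutability: changing a body requires a new version identifier.
            raise ValueError(f"{name} {version} already published; bump the version")
        self._bodies[key] = body

    def get(self, name: str, version: str) -> str:
        return self._bodies[(name, version)]


registry = PromptRegistry()
registry.publish("support-triage", "v0.7.2", "Classify the ticket by urgency.")
registry.publish("support-triage", "v0.7.3", "Classify the ticket by urgency and product area.")

# The code holds only a pointer, so the PR diff (and any rollback) is this one line.
SUPPORT_TRIAGE_VERSION = "v0.7.3"

print(registry.get("support-triage", SUPPORT_TRIAGE_VERSION))
```

Because the calling code references a version rather than a body, reverting to v0.7.2 is a one-line change and the old body is guaranteed to still exist.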
Check 4: model and temperature explicit
Every model invocation in the PR must name the model and the temperature explicitly: model="claude-4-opus", not model=DEFAULT_MODEL; temperature=0.0, not omitted. Magic values that pull from a config layer are acceptable only if the config layer is itself versioned and the version is in the PR description.
The check exists because model defaults change. A provider can ship a new flagship under the same name; a wrapper library can change its default model in a minor version bump. A PR that says model=DEFAULT_MODEL is taking a silent dependency on a configuration the reviewer cannot see, and that dependency will eventually shift. Explicit model and temperature in the diff make the dependency visible.
The same check applies to other generation parameters that affect determinism or cost: max_tokens, top_p, top_k, stop_sequences, system_prompt_caching flags. If the PR adds or modifies any of these, they must be explicit in the diff and noted in the PR description.
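One way to make omission impossible, sketched here as an assumption rather than a prescribed design, is a config object with no default values, so that leaving out any parameter fails loudly at construction time. `GenerationConfig` and `build_request` are hypothetical names, and the request dict is a generic shape rather than any provider's SDK payload.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationConfig:
    """No field has a default, so every parameter must appear in the diff."""
    model: str
    temperature: float
    max_tokens: int
    top_p: float


def build_request(config: GenerationConfig, prompt: str) -> dict:
    """Assemble a generic request dict (hypothetical shape, not a provider API)."""
    return {**asdict(config), "prompt": prompt}


config = GenerationConfig(
    model="claude-4-opus",  # model name taken from the example above
    temperature=0.0,
    max_tokens=512,
    top_p=1.0,
)
print(build_request(config, "Summarize the ticket."))
```

Constructing `GenerationConfig(model="claude-4-opus")` alone raises a `TypeError`, which turns an implicit default into a visible review finding.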
Check 5: cost delta noted
The PR description must include the cost-per-request delta measured against the eval suite. Format: baseline $0.043, this PR $0.061, delta +$0.018, monthly impact +$1,440 at 80k req/mo. If the delta is positive (more expensive), the PR includes a justification, typically a quality improvement that the eval delta confirms.
The check exists because prompt changes silently affect cost. Adding context windows, switching models, increasing reasoning tokens, enabling caching: all of these change the per-request cost in ways the diff alone does not show. CI should compute the cost delta automatically by re-running the eval suite under the old and new code paths and emitting the cost from the trace store.
For engagements with a contractual cost ceiling, the check is a hard gate: a PR that pushes cost-per-request above the ceiling does not merge regardless of eval improvement. The ceiling is a constraint, not an aspiration.
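The arithmetic behind the cost line is simple enough to sketch: the monthly impact is the per-request delta multiplied by monthly request volume ($0.018 × 80,000 = $1,440). The function names and the ceiling value in the usage are hypothetical.

```python
def cost_delta_line(baseline: float, this_pr: float, monthly_requests: int) -> str:
    """Format the per-request cost delta and the projected monthly impact."""
    delta = this_pr - baseline
    monthly = delta * monthly_requests
    sign = "+" if delta >= 0 else "-"
    return (
        f"baseline ${baseline:.3f}, this PR ${this_pr:.3f}, "
        f"delta {sign}${abs(delta):.3f}, "
        f"monthly impact {sign}${abs(monthly):,.0f} at {monthly_requests // 1000}k req/mo"
    )


def exceeds_ceiling(this_pr: float, ceiling: float) -> bool:
    """Hard gate: cost-per-request above the contractual ceiling blocks the merge."""
    return this_pr > ceiling


print(cost_delta_line(0.043, 0.061, 80_000))
# prints: baseline $0.043, this PR $0.061, delta +$0.018, monthly impact +$1,440 at 80k req/mo
```

Note that the ceiling check takes no eval score as input, which mirrors the rule that quality improvement cannot buy a way past the ceiling.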
Check 6: trace link included
Every prompt-bearing PR must include a trace link from the eval run, pointing into the trace store (Langfuse, Helicone, OpenTelemetry-GenAI backend, or equivalent). The trace shows, for at least one representative case, the full request, the full response, the token counts, the latency, and any tool calls. The reviewer can click the link and see the actual model interaction, not just the diff.
The check exists because prompt changes can interact with retrieval, tool routing, or system prompt assembly in ways that are invisible in the diff. Two prompt versions that look textually similar can produce wildly different traces. The trace is the ground truth.
The trace link is also the artifact the on-call uses if the PR causes a production incident. “Here is the eval trace for this case from PR #437” is the fastest debug path; the trace links accumulated in PR descriptions are the corpus of debuggable history.
Check 7: regression risk flagged
The PR description includes a “Regression risk” section with one of three values: low (covered by existing eval cases, no change to model/temperature/retrieval), medium (touches behavior covered by some eval cases but with possible long-tail effects), or high (changes model, retrieval, or major prompt structure). High-risk PRs require two reviewers, one of whom is the eval owner.
The check exists because prompt changes have non-uniform regression profiles. A typo fix in a system prompt is low-risk; a model swap is high-risk. Treating them with the same review weight produces either over-review on low-risk PRs (slowing velocity) or under-review on high-risk PRs (causing incidents). Naming the risk in the PR forces the reviewer-allocation decision to be explicit.
The eval owner role, named during the kickoff stakeholder cartography, exists exactly for the high-risk path. They are the single person responsible for confirming the eval suite covers the regression surface for the PR.
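The three risk levels and the reviewer rule can be sketched as a small classifier. This is one possible reading of the definitions above, with assumptions labeled: the boolean inputs are hypothetical, a temperature-only change is treated as medium (the text rules it out of low but does not list it under high), and the single-reviewer default for low and medium is an assumption.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def classify_risk(
    changes_model: bool,
    changes_retrieval: bool,
    major_prompt_restructure: bool,
    changes_temperature: bool,
    covered_by_existing_evals: bool,
) -> Risk:
    """Map the properties of a PR onto the three risk values described above."""
    if changes_model or changes_retrieval or major_prompt_restructure:
        return Risk.HIGH
    if covered_by_existing_evals and not changes_temperature:
        return Risk.LOW
    return Risk.MEDIUM


def review_requirements(risk: Risk) -> dict:
    """High risk needs two reviewers, one of whom is the eval owner."""
    if risk is Risk.HIGH:
        return {"reviewers": 2, "eval_owner_required": True}
    # Assumption: one reviewer for low and medium; the source does not specify.
    return {"reviewers": 1, "eval_owner_required": False}


typo_fix = classify_risk(False, False, False, False, True)
model_swap = classify_risk(True, False, False, False, True)
print(typo_fix, review_requirements(model_swap))
```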
Check 8: kill switch present
Every prompt-bearing PR that ships a behavioral change must include a kill switch: a feature flag or config toggle that allows the change to be reverted without a deploy. The kill switch is named in the PR description, named in the runbook, and named in the on-call paging text.
The check exists because prompt regressions are detected in production in minutes, not seconds, and a deploy-to-revert can take 20 minutes. A kill switch reverts in seconds. The asymmetry is decisive: a 60-second revert prevents an incident; a 20-minute revert is the incident.
Kill switches also force the PR author to think about the rollback shape before the rollback is needed. The PR that does not have a clean rollback path is a PR that has architectural coupling the diff did not surface, and that coupling is itself a review finding.
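A minimal sketch of the pattern, assuming flags are read from some runtime store on every request (represented here by a plain dict). The flag name and the version pair are hypothetical; the essential properties are that the flag is checked at request time and that a missing flag falls back to the known-good version.

```python
PROMPT_VERSIONS = {"stable": "v0.7.2", "candidate": "v0.7.3"}

# Hypothetical flag name; it would also appear in the PR description and runbook.
KILL_SWITCH = "support_triage_prompt_candidate"


def active_prompt_version(flags: dict) -> str:
    """Pick the prompt version per request so a flag flip needs no deploy.

    Turning the flag off (or losing it entirely) reverts to the stable version.
    """
    if flags.get(KILL_SWITCH, False):
        return PROMPT_VERSIONS["candidate"]
    return PROMPT_VERSIONS["stable"]


flags = {KILL_SWITCH: True}
print(active_prompt_version(flags))  # candidate version is live

flags[KILL_SWITCH] = False           # the revert: one flag flip, no deploy
print(active_prompt_version(flags))  # back on the stable version
```

Defaulting to the stable version when the flag is absent is a deliberate choice in this sketch: a flag-store outage then degrades to the old behavior rather than the new one.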
Check 9: on-call paged on merge
For high-risk PRs (per check 7), the merge triggers an automatic page to the on-call engineer with the PR link, the eval delta, and the kill switch name. The on-call has a 30-minute window during which they actively monitor production telemetry (error rate, latency P95, eval drift on canary traffic) and have the kill switch one keystroke away.
The check exists because high-risk merges have a non-trivial probability of producing immediate production effect. The on-call window is a structural acknowledgment that the merge is not “done” until the system has held for 30 minutes. Paging is automatic so it cannot be forgotten under merge-day pressure.
For low and medium-risk PRs, the on-call is informed but not actively paged. The cost of paging on every PR is alert fatigue; the cost of not paging on high-risk PRs is incidents. Naming the risk explicitly in check 7 makes the paging decision automatic.
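The paging rule reduces to a single branch on the declared risk, which a merge hook could implement roughly as follows. The function name and payload fields are hypothetical, and the sketch only builds the page contents; delivering it to a paging system is out of scope.

```python
from typing import Optional


def page_for_merge(
    risk: str, pr_link: str, eval_delta: str, kill_switch: str
) -> Optional[dict]:
    """Return the page payload for a high-risk merge, or None for low/medium."""
    if risk != "high":
        return None  # low and medium: on-call is informed, not paged
    return {
        "summary": f"High-risk prompt PR merged: {pr_link}",
        "eval_delta": eval_delta,
        "kill_switch": kill_switch,
        "watch_minutes": 30,  # the active monitoring window described above
    }


page = page_for_merge(
    risk="high",
    pr_link="PR #437",
    eval_delta="baseline 0.78, this PR 0.81, threshold 0.80, gap +0.01",
    kill_switch="support_triage_prompt_candidate",  # hypothetical flag name
)
print(page)
```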
Check 10: post-merge eval scheduled
The PR description names a post-merge eval window (typically 24 hours) during which the production traffic is sampled and re-evaluated against the eval suite. The result is committed back to the repo as a follow-up note: eval at +24h: 0.82 (baseline 0.81, +0.01). The 24-hour eval is the verification that the eval-time delta translated into a production delta.
The check exists because eval-suite distribution and production-traffic distribution are not identical. A PR that improves the eval suite by 0.04 may improve production by 0.01, by 0.0, or sometimes regress. The 24-hour eval makes that gap visible. If the production delta is materially worse than the eval delta, the eval suite has a coverage gap that goes onto the next sprint’s backlog.
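A sketch of the follow-up note and the coverage-gap test, under one stated assumption: "materially worse" is modeled as the production delta trailing the eval delta by more than a tolerance, and the 0.02 default is a hypothetical placeholder that each team would set for itself.

```python
def post_merge_note(hours: int, production_score: float, baseline: float) -> str:
    """Format the follow-up note committed back to the repo after the eval window."""
    delta = round(production_score - baseline, 2)
    return f"eval at +{hours}h: {production_score:.2f} (baseline {baseline:.2f}, {delta:+.2f})"


def has_coverage_gap(
    eval_delta: float, production_delta: float, tolerance: float = 0.02
) -> bool:
    """Flag a gap when production improves materially less than the suite predicted.

    The tolerance is an assumed threshold, not a value from the standard.
    """
    return (eval_delta - production_delta) > tolerance


print(post_merge_note(24, 0.82, 0.81))
# prints: eval at +24h: 0.82 (baseline 0.81, +0.01)

# Suite improved by 0.04 but production by only 0.01: backlog a coverage review.
print(has_coverage_gap(eval_delta=0.04, production_delta=0.01))
```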
For the structural cadence of how these checks accumulate into a healthy engagement, see the AI agency quality system.
How this standard differs from regular code review
A regular code review checks correctness against the diff. A prompt-bearing review checks correctness against three layers: the diff, the model behavior as measured by the eval suite, and the production response. The 10 checks above are organized along those three layers: checks 3, 4, and 7 cover the diff; checks 1, 2, 5, 6, and 10 cover the model behavior; checks 8 and 9 cover the production response.
The standard also differs in cadence. Regular code review is single-pass: review, address comments, merge. Prompt-bearing review is two-pass: pre-merge review against the eval delta, post-merge verification against the production delta. The post-merge pass is enforced by check 10 and is where prompt review catches the regressions that pure pre-merge review structurally cannot.
Adoption is incremental. A team that adopts the 10-check standard cold, shipping every check in week one, will collapse under review weight. The recommended sequence: weeks 1–2 install checks 1, 2, 3, 4 (eval added, threshold listed, prompt pinned, model explicit); weeks 3–4 add checks 5, 6, 7 (cost delta, trace link, risk flag); weeks 5–6 add checks 8, 9, 10 (kill switch, on-call page, post-merge eval). By week six the team is running the full standard.
FAQ
The 10-check standard is heavier than regular code review and lighter than the alternative: production incidents, silent regressions, untraceable cost spikes, prompt registries that drift away from the code that uses them. Every check exists because of an incident; running them adds 10 minutes per PR and prevents days of incident response. Teams that adopt the standard typically report higher PR throughput within four weeks, not lower, because the rollback friction goes from “deploy to revert” to “kill switch in seconds” and the team becomes willing to merge faster.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has codified the prompt-bearing PR review standard across more than 20 client engagements.
Frequently Asked Questions
Why do prompt-bearing PRs need a different code review standard?
A prompt PR can pass every traditional review check (types, tests, naming, style) and still degrade production silently because the failure modes live outside the surface a normal review inspects. A normal review verifies that the change does what it says, does not break what already works, and is maintainable. For prompt-bearing code, none of those is sufficient on its own: a prompt change does what it says only on inputs the reviewer remembered to test, breaks nothing only if the eval suite is comprehensive, and is maintainable only if the prompt registry, model config, and eval rubric all moved together.
What does ‘eval test added’ as a review check mean in practice?
Every PR that modifies a prompt, an eval case, or a model configuration must add at least one eval case to the suite, with input, expected output, pass criterion, and failure category explicit. The PR description has an ‘Eval cases added’ section with a count and a link to the new test file. Reviewers reject PRs that say ‘no new cases needed’ without a written justification; the justification must name a specific reason like a pure typing change or a comment-only change. Anything ambiguous defaults to requiring a case.
How is the eval delta reported on a PR?
The PR description includes the eval delta in the format ‘baseline 0.78, this PR 0.81, threshold 0.80, gap +0.01.’ The threshold is the contractual threshold from the success criteria; the baseline is the pre-PR suite score; this-PR is the post-PR suite score. If the gap is negative, the PR does not merge without a written justification and an executive sign-off. CI must run the eval suite automatically and post the delta as a comment on the PR within the same review cycle as the unit tests.
Why does the model and temperature need to be explicit in the diff?
Because model defaults change. A provider can ship a new flagship under the same name; a wrapper library can change its default model in a minor version bump. A PR that says model=DEFAULT_MODEL is taking a silent dependency on a configuration the reviewer cannot see, and that dependency will eventually shift. Explicit model and temperature in the diff make the dependency visible. The same check applies to other generation parameters that affect determinism or cost: max_tokens, top_p, top_k, stop_sequences, and any prompt-caching flags.
What is a kill switch in the context of a prompt-bearing PR?
A feature flag or config toggle that allows the change to be reverted without a deploy. The kill switch is named in the PR description, named in the runbook, and named in the on-call paging text. It exists because prompt regressions are detected in production in minutes, not seconds, and a deploy-to-revert can take 20 minutes; a kill switch reverts in seconds. The asymmetry is decisive: a 60-second revert prevents an incident; a 20-minute revert is the incident.
Why is a trace link required on prompt-bearing PRs?
Because prompt changes can interact with retrieval, tool routing, or system prompt assembly in ways that are invisible in the diff. Two prompt versions that look textually similar can produce wildly different traces. The trace link points into the trace store (Langfuse, Helicone, OpenTelemetry-GenAI backend, or equivalent) and shows for at least one representative case the full request, response, token counts, latency, and any tool calls. The trace is also the artifact the on-call uses if the PR causes a production incident.
How is regression risk classified on a prompt PR?
The PR description includes a ‘Regression risk’ section with one of three values: low (covered by existing eval cases, no change to model/temperature/retrieval), medium (touches behavior covered by some eval cases but with possible long-tail effects), or high (changes model, retrieval, or major prompt structure). High-risk PRs require two reviewers, one of whom is the eval owner. Naming the risk in the PR forces the reviewer-allocation decision to be explicit.
What is the post-merge eval check?
The PR description names a post-merge eval window (typically 24 hours) during which production traffic is sampled and re-evaluated against the eval suite. The result is committed back to the repo as a follow-up note like ‘eval at +24h: 0.82 (baseline 0.81, +0.01).’ The 24-hour eval verifies that the eval-time delta translated into a production delta. If the production delta is materially worse than the eval delta, the eval suite has a coverage gap that goes onto the next sprint’s backlog.
How should a team adopt this 10-check standard incrementally?
A team cold-shipping every check in week one will collapse under review weight. The recommended sequence: weeks 1–2 install checks 1, 2, 3, 4 (eval added, threshold listed, prompt pinned, model explicit); weeks 3–4 add checks 5, 6, 7 (cost delta, trace link, risk flag); weeks 5–6 add checks 8, 9, 10 (kill switch, on-call page, post-merge eval). By week six the team is running the full standard. Teams that adopt incrementally typically report higher PR throughput within four weeks, not lower, because rollback friction goes from ‘deploy to revert’ to ‘kill switch in seconds.’
Does this standard slow down velocity?
Not after the first month. Each check adds roughly one minute of review overhead per PR for a total of about 10 minutes; the kill switch and post-merge eval checks effectively pay back that overhead by allowing the team to merge with confidence at higher cadence. The alternative (incidents, silent regressions, untraceable cost spikes, prompt registries that drift away from the code that uses them) costs days of incident response per quarter. The standard is heavier than regular code review and significantly lighter than the failure mode it prevents.