Per-eval pricing is the contract structure where an AI agency is paid per eval threshold passed; recall@10 > 0.85, faithfulness > 0.9, latency P95 < 800ms; rather than per feature delivered, per hour worked, or per milestone hit. The agency and buyer co-author an eval rubric at kickoff, agree on threshold tiers (must-pass, should-pass, stretch), and payment is released per tier as evals lock green on a buyer-readable harness. The case for this model is that it is the only AI pricing structure that makes vendor incentive and buyer outcome point at the same number. Feature-list scoping rewards artifacts that may not perform; T&M rewards hours regardless of outcome; eval-threshold billing rewards the quality bar the buyer signed up for. This piece lays out why per-eval pricing should be the default for serious AI engagements in 2026, the mechanics of writing it into a contract, and the edge cases; eval ownership, threshold drift, model-upgrade re-eval; that decide whether it works in practice.
It is a spoke under the AI project economics manifesto, which argues that AI economics has shifted from feature cost to evaluation cost, and that contracts that do not encode the shift produce predictable misalignment.
What per-eval pricing is
Per-eval pricing is a contract structure where the agency invoices against measured eval performance rather than artifact delivery. The unit of billing is the eval threshold; a named, locked, contractually agreed performance level on a buyer-readable test set, scored by a co-owned rubric, on a CI-integrated harness.
A typical contract surface reads: the agency is paid $X when the system reaches recall@10 > 0.85 on the locked retrieval test set, paid $Y when faithfulness > 0.9 on the locked grounding test set, paid $Z when latency P95 < 800ms on the locked load profile, and paid a stretch bonus when all three exceed the should-pass tier. The agency is not paid for “implementing the retrieval system.” The agency is paid for the retrieval system performing.
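A minimal sketch of that contract surface expressed as data; the metric names, bars, and dollar amounts below are illustrative placeholders, not figures from a real engagement:

```python
# Illustrative contract surface: one locked threshold per row, one payment per row.
# Metric names, bars, and amounts are hypothetical.
CONTRACT_SURFACE = [
    {"metric": "recall_at_10",   "bar": 0.85, "direction": "min", "payment_usd": 40_000},
    {"metric": "faithfulness",   "bar": 0.90, "direction": "min", "payment_usd": 35_000},
    {"metric": "p95_latency_ms", "bar": 800,  "direction": "max", "payment_usd": 15_000},
]

def invoiceable(scores: dict) -> int:
    """Sum the payments whose locked bar the measured eval scores clear."""
    total = 0
    for row in CONTRACT_SURFACE:
        score = scores[row["metric"]]
        passed = score >= row["bar"] if row["direction"] == "min" else score <= row["bar"]
        if passed:
            total += row["payment_usd"]
    return total
```

The shape matters more than the numbers: each threshold is independently invoiceable, which is what makes each eval its own milestone.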
The model operates on the same primitive that ranked second-best in AI project pricing models, ranked by alignment with outcomes: the eval bar as the contract surface. Per-eval pricing is the granular version: one threshold, one payment, one named test set. Each eval is its own milestone.
The eval suite is co-authored. The buyer brings domain examples and acceptance criteria; the agency brings test-set engineering and scoring infrastructure. The rubric is locked at signing. The thresholds are contractually fixed; buyer cannot move them up to avoid payment, agency cannot move them down to claim it.
Why feature-list scoping breaks for AI
Feature-list scoping is the inheritance from deterministic-software contracts. The SOW lists features; the agency builds them; the buyer accepts them; payment is released per delivered feature. This worked when “feature delivered” was a well-defined predicate.
For AI systems, “feature delivered” is structurally underspecified. A RAG pipeline can be delivered; code shipped, integration green; and still produce hallucinated answers 20 percent of the time. An agent can be wired and still fail recovery on edge cases that are 5 percent of production traffic but 80 percent of customer churn. A classifier can be deployed and still miss the long-tail labels the buyer cared most about.
The feature was delivered. The system did not perform. The agency invoiced. The buyer paid for an artifact that needed another six weeks of eval-driven iteration to be useful, and that iteration was a change order. This is the core dysfunction laid out in stop scoping AI projects in features, scope them in evaluations and stop budgeting AI projects in story points, budget them in eval runs.
Feature-list scoping pays for the easy part; integration, wiring, demo; and not for the hard part: the eval-driven iteration that makes the system performant. Per-eval pricing inverts this. Integration is the cost of doing business; the agency is paid when the eval bar lands.
The three incentive alignments per-eval pricing creates
Per-eval pricing creates three structural alignments no other pricing model produces simultaneously.
Quality alignment. Under feature-list scoping, the agency is rewarded for shipping artifacts. Under T&M, for hours. Under milestone payment, for hitting milestone definitions. Under per-eval pricing, the agency is rewarded for the system performing at a contractually locked level on a co-owned test set. The agency’s revenue function is the buyer’s quality function. No other AI pricing model produces this alignment as cleanly.
Early surfacing of under-evaluation. A common failure mode is that the eval suite is built late, by the agency alone, against examples the agency selected, and the buyer first sees the results at acceptance. By that point the engagement is weeks from launch and the buyer has no leverage to reject a weak rubric. Per-eval pricing forces the eval suite to be built first, co-owned, and locked, because the contract surface depends on it. Eval discipline is pulled from the end of the project to the beginning, which is where it belongs.
Replacing feature-list scoping. Once payment is tied to eval thresholds, the SOW migrates from “build features X, Y, Z” to “achieve eval performance A, B, C.” Features become implementation choices rather than contract surfaces. The agency picks the architecture; the buyer locks the bar. This is the structural shift detailed in the AI project economics manifesto.
Mechanics: how to write per-eval pricing into a contract
A per-eval pricing contract has five mandatory components.
One: the eval suite definition. A named, versioned test set with documented examples and scoring rubric. The buyer co-authors at least 50 percent of the examples; the agency engineers the test infrastructure. The suite is locked at signing; additions require a change order, the existing suite cannot be silently shrunk.
Two: the threshold table. Named thresholds (recall, faithfulness, latency, accuracy, calibration; domain-dependent) with locked numerical bars per tier. The numerical bar is what the buyer is willing to pay for; the agency must believe the bar is achievable. Negotiation at signing is exactly this number.
Three: the payment schedule per threshold. Each threshold has a named payment. Must-pass payments cover the agency’s expected delivery cost plus margin floor; should-pass payments fund the iteration premium; stretch payments fund the upside bonus.
Four: the eval harness. A CI-integrated runner producing deterministic, reproducible scores against the locked test set. Buyer-readable from kickoff. Shared infrastructure; neither side can change scoring without notification; both sides see the same dashboards. A sketch of the suite lock and the CI gate follows the five components.
Five: the change-order clause. Eval suites drift as new failure modes surface in production. The contract must specify how new test cases get added, who funds them, and how thresholds re-baseline. Without this clause, the agency is exposed to unbounded test-set expansion and the buyer is exposed to thresholds that no longer reflect production reality.
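A minimal sketch, under illustrative assumptions, of how components one and four can be made concrete: the suite is locked by hashing its contents at signing, and the harness gates every run against the locked threshold table. Field names, metrics, and bars are hypothetical, not drawn from a real contract.

```python
import hashlib
import json

# Locking at signing: hashing the suite makes "the existing suite cannot be
# silently shrunk" verifiable by either side. Test-case shape is illustrative.
def suite_version(cases: list[dict]) -> str:
    blob = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

# Locked threshold table; metric names, bars, and tiers are placeholders.
LOCKED_THRESHOLDS = {
    "recall_at_10":   {"bar": 0.85, "direction": "min", "tier": "must_pass"},
    "faithfulness":   {"bar": 0.90, "direction": "min", "tier": "should_pass"},
    "p95_latency_ms": {"bar": 800,  "direction": "max", "tier": "must_pass"},
}

def ci_gate(run_scores: dict, expected_suite_hash: str, suite_hash: str) -> bool:
    """Fail the run if the suite drifted or any must-pass bar is missed."""
    if suite_hash != expected_suite_hash:
        print("Suite differs from the version locked at signing; run rejected.")
        return False
    ok = True
    for metric, spec in LOCKED_THRESHOLDS.items():
        score = run_scores[metric]
        passed = score >= spec["bar"] if spec["direction"] == "min" else score <= spec["bar"]
        print(f"{metric}: {score} vs {spec['bar']} ({spec['tier']}) -> {'PASS' if passed else 'FAIL'}")
        if not passed and spec["tier"] == "must_pass":
            ok = False
    return ok
```

The printed lines stand in for the shared dashboard: both sides read the same pass/fail output against the same locked numbers.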
Threshold tiers: must-pass, should-pass, stretch
The three-tier structure is the practical version of per-eval pricing. Pure binary thresholds produce knife-edge contracts that turn every release into a high-stakes eval run. Tiered thresholds smooth the payment curve and let the agency invoice against incremental progress.
Must-pass thresholds are the bar for shippability. Below this bar, the system is not deployable. Payments cover the agency’s full delivery cost; the agency must be willing to invoice only the must-pass tier and still operate profitably.
Should-pass thresholds are the bar for the buyer’s full quality expectation. Most engagements aim here. The agency invests the iteration premium; additional eval loops, prompt refinement, retrieval tuning, model selection; to reach this tier. Payments fund the iteration premium and the agency’s target margin.
Stretch thresholds are the bar for materially exceeding expectation. Stretch is rare; some engagements hit it, most do not. Payments are bonus; high upside for the agency, high signal for the buyer. Stretch tier produces the case-study eval scores worth publishing per why AI agency case studies should publish raw eval scores.
A typical contract prices must-pass at 60 percent of total budget, should-pass at 30 percent, stretch at 10 percent. Exact split depends on engagement profile.
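A worked example of that split on a hypothetical $200,000 build budget (the budget figure is an assumption for illustration):

```python
# Illustrative 60/30/10 split on a hypothetical $200,000 build budget.
TOTAL_BUDGET = 200_000
TIER_SPLIT = {"must_pass": 0.60, "should_pass": 0.30, "stretch": 0.10}

payments = {tier: round(TOTAL_BUDGET * share) for tier, share in TIER_SPLIT.items()}
# -> {"must_pass": 120000, "should_pass": 60000, "stretch": 20000}
# Must-pass alone keeps the agency solvent; should-pass funds the iteration
# premium; stretch is pure bonus.
```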
Edge cases: eval ownership, drift, and model upgrades
Three edge cases decide whether per-eval pricing works in practice.
Eval ownership. The unsafe answer is “the agency builds it”; the failure mode where the agency builds a suite they can pass and the buyer first sees it at acceptance. The unsafe alternative is “the buyer builds it”; most buyers lack the eval-engineering capacity. The right answer is split ownership: the buyer brings production examples and acceptance criteria; the agency brings eval infrastructure, scoring rubric design, and held-out set engineering. Both sides are on the cap table of the eval suite.
Threshold drift. The eval suite locked at signing rarely covers every production failure mode. The contract must specify how the suite gets extended. Default rule: failure modes inside the originally agreed scope are the agency’s responsibility against the existing fee; failure modes outside the originally agreed scope are change orders. The line between in-scope and out-of-scope is where the contract lawyering happens, and it is worth lawyering at signing rather than at the first production fire.
Model upgrades. When the agency upgrades the underlying model; Sonnet to Opus, GPT-5 to GPT-5.5; the eval suite must be re-run. Some thresholds improve automatically; some degrade (better model, worse calibration on the agency’s prompt template). The contract must specify a re-eval clause: the agency re-runs the full eval suite within N business days of any underlying model change, with regressions remediated under the existing fee. This protects the buyer from silent model swaps that quietly degrade quality for margin.
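A minimal sketch of the regression check such a clause implies; the score dictionaries and threshold format are assumptions carried over from the earlier illustrative sketches:

```python
# Re-run the locked suite after a model swap and flag metrics that passed the
# locked bar before the upgrade but fail it after. Threshold format matches the
# illustrative LOCKED_THRESHOLDS table above.
def threshold_regressions(baseline: dict, after_upgrade: dict, thresholds: dict) -> list[str]:
    def passes(score: float, spec: dict) -> bool:
        return score >= spec["bar"] if spec["direction"] == "min" else score <= spec["bar"]

    return [
        metric
        for metric, spec in thresholds.items()
        if passes(baseline[metric], spec) and not passes(after_upgrade[metric], spec)
    ]

# Any returned metric is remediated under the existing fee per the re-eval clause.
```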
A contract that covers signing-time eval mechanics but not production-time eval governance produces alignment for the build phase and misalignment for everything after.
When per-eval pricing fails
Per-eval pricing is not universal. Three project profiles are bad fits.
Buyer cannot co-own the eval suite. If the buyer lacks domain expertise to author production examples, lacks engineering capacity to read eval dashboards, or lacks discipline to lock thresholds against internal pressure, per-eval pricing collapses to “agency builds the suite alone”; the failure mode the structure was designed to prevent. T&M with cap is the right model until eval discipline matures.
Quality is structurally unmeasurable. Some AI work; open-ended creative generation, ambient-experience features, deeply subjective UX; has no clean eval rubric. Faithfulness, recall, calibration, and latency do not capture what the buyer cares about. Forcing per-eval pricing produces a rubric that measures the wrong thing precisely. Outcome-based fees on a downstream business metric usually fit better.
Eval bar unknowable at signing. Some engagements need a discovery phase to determine what the bar should be; which failure modes matter, what production mix the system will face. Discovery on T&M with cap, build on per-eval pricing tied to a bar locked at the end of discovery, is the mature 2026 structure here.
The argument is not “per-eval pricing usually wins.” The argument is that per-eval pricing should be the default for the build phase of any AI engagement where the eval bar is knowable.
Frequently asked questions
What is per-eval pricing exactly?
Per-eval pricing is a contract structure where an AI agency is paid per eval threshold passed; recall@10 > 0.85, faithfulness > 0.9, latency P95 < 800ms, and so on; rather than per feature delivered, per hour worked, or per milestone hit. The eval suite is co-authored at signing, the rubric is locked, and payment is released per tier as evals lock green on a buyer-readable harness.
Why is per-eval pricing a better fit for AI than feature-list scoping?
Because “feature delivered” is structurally underspecified for AI. A RAG system can ship and still hallucinate; an agent can be wired and still fail edge cases; a classifier can be deployed and still miss the long tail. Feature-list scoping pays for the easy part; integration and wiring; and not for the hard part, the eval-driven iteration that makes the system perform. Per-eval pricing inverts this. Integration is the cost of doing business; payment lands when the eval bar lands.
How are threshold tiers structured?
Three tiers: must-pass (the bar for shippability, payments cover the agency’s full delivery cost), should-pass (the bar for full quality expectation, payments fund the iteration premium), and stretch (the bar for materially exceeding expectation, payments are bonus). A typical split is 60 / 30 / 10 percent of total budget.
Who builds the eval test set?
Split ownership. The buyer brings production examples and acceptance criteria; the agency brings eval infrastructure, rubric design, and held-out set engineering. Buyer-only suites are too narrow or too vague; agency-only suites produce the failure mode where the buyer first sees the rubric at acceptance.
How do you handle threshold drift when production reveals new failure modes?
The contract must specify a change-order clause. Default rule: failure modes inside the originally agreed scope are the agency’s responsibility against the existing fee; failure modes outside scope are change orders. The line is worth lawyering at signing rather than at the first production fire.
What happens when the underlying model is upgraded?
A model-upgrade clause requires the agency to re-run the full eval suite within N business days of any model change, with threshold regressions remediated under the existing fee. This protects the buyer from silent model swaps that quietly degrade quality for margin.
When does per-eval pricing fail?
Three conditions: the buyer cannot co-own the eval suite (use T&M with cap until eval discipline matures); quality is structurally unmeasurable (use outcome-based fees on a downstream business metric); the eval bar is unknowable at signing (use T&M for discovery, then per-eval for build).
How is per-eval pricing different from milestone payment?
Milestone payment ties payment to artifact delivery; “system deployed to staging,” “documentation handed off.” These are gameable through definition. Per-eval pricing ties payment to measured performance against a locked test set with a co-owned rubric. The eval threshold itself is the milestone, not the existence of the artifact. The gameability vector shrinks from milestone-definition to test-set-overfitting, which held-out evaluation catches.
Key takeaways
- Per-eval pricing bills the agency per eval threshold passed rather than per feature, per hour, or per milestone. The eval suite is co-authored, the rubric is locked, the harness is buyer-readable from kickoff.
- The case rests on three alignments produced simultaneously: vendor incentive on quality, early surfacing of under-evaluation, and structural replacement of feature-list scoping.
- Mechanics require five contract components: the eval suite definition, the threshold table, the payment schedule per threshold, the eval harness, and the change-order clause.
- Three tiers; must-pass, should-pass, stretch; smooth the payment curve. A typical split is 60 / 30 / 10 percent of total budget.
- Three edge cases decide enforceability: split eval ownership, threshold-drift handling, and model-upgrade re-eval cadence.
- Per-eval pricing fails when the buyer cannot co-own the eval suite, when quality is structurally unmeasurable, or when the eval bar is unknowable at signing. Outside these conditions, it should be the default.
The shift from feature cost to evaluation cost is the underlying economic move. Per-eval pricing is the contract structure that encodes the shift.
Arthur Wandzel