Enterprise Software 14 min read

Stop paying AI agencies for documentation. Pay them for evals.

Documentation as the primary deliverable from an AI agency is a 2019 practice that AI systems break on day one. Eval suites are the living artifact buyers should be paying for, and the unit of payment should be eval thresholds passed, not documents delivered. Architecture decks describe a system that the next model release will silently invalidate. Written runbooks document a prompt that the next agent refactor will rewrite. ADRs capture decisions that the next eval-failure root cause analysis will reverse. None of those documents survive contact with a system whose underlying components (models, prompts, retrieval indexes, tool schemas) change on a quarterly cadence.

Eval suites do survive. An eval suite is executable, version-controlled, and CI-integrated. It is the only deliverable in an AI engagement whose value increases monotonically over time. Every threshold the suite enforces is a frozen behavioral contract; every regression it catches is a postmortem-grade signal; every new case the team adds extends the surface of what is guaranteed. This piece argues a prescriptive position: AI agencies should be paid per eval threshold cleared, and documentation should be a derived artifact, generated from the eval suite and the code rather than authored separately. The framing extends the AI agency manifesto’s commitment to evals as the contract into the procurement model itself.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why documentation goes stale on day one for AI systems

Documentation as a primary deliverable assumes that the system being documented changes slowly. That assumption was approximately correct for traditional software work. A Java microservice’s architecture deck remained valid for a year because the underlying language, runtime, and dependencies moved on multi-year cycles. The architecture deck for a 2026 LLM-backed product is wrong by the time it ships, because at least three of its underlying components (the foundation model, the model SDK, and the agent framework) will move within the quarter the deck is delivered.

A specific example. An AI agency in Q4 2025 delivered a 47-page architecture deck for a customer-support agent built on Claude 3.5 Sonnet, the LangGraph 0.2.x agent framework, and a Pinecone retrieval index with the OpenAI text-embedding-3-large model. By Q2 2026, Claude 4.6 had shipped (different reasoning behavior, different tool-use semantics), LangGraph had a 0.4 release with breaking changes to the persistence layer, and the embedding-model market had shifted toward Voyage AI and Cohere v4 for retrieval workloads. The deck described none of this. The eval suite, by contrast, still ran, and it surfaced the regressions caused by each migration as quantified threshold failures the team could prioritize.

The structural problem is that AI systems are composed of components with fundamentally different change cadences. The application code might be stable for a year. The prompt might be re-engineered every two weeks. The model is silently updated by the provider every few months. The retrieval index is re-indexed whenever the source data changes. The tool schemas evolve with the agent’s capabilities. Documentation that aspires to describe the whole system has to choose a snapshot date and accept that the snapshot is wrong by month two. A document that describes a moving target accurately is a document being rewritten continuously, at which point the documentation itself has become the agency’s main expense, and the agency is being paid to describe a system rather than improve it.

What an eval suite delivers that an architecture deck cannot

An eval suite is a deliverable that gets more valuable over time, not less. Every case the team adds is a permanently captured behavioral contract: the system, on this input, must produce output that meets this threshold. The contract is checkable on every commit, by any engineer, without anyone having to read or re-read a 47-page deck. The eval suite is also self-explanatory, in the sense that an engineer joining the project in month nine can read the eval cases and understand more about what the system does than they could reconstruct from any document the system was launched with.

Eval suites have six properties that documentation does not.

  • Executable. The eval suite runs. A document does not. The suite either passes or fails on demand, on every commit, against the live system.
  • Version-controlled and diffable. Every threshold change, every new case, every removed case is a Git diff with an author, a timestamp, and a PR. The history is the audit log.
  • Regression-detecting. When the foundation model is silently updated and the system’s behavior shifts, the eval suite tells you within minutes. The architecture deck tells you nothing; it described the system on the day it was written.
  • Composable. Eval suites can be combined: a regression suite, a correctness suite, a safety suite, a latency suite, a cost-per-call suite. Each can have its own threshold and its own owner. Documents do not compose.
  • Re-runnable across model migrations. When the team migrates from Claude 4.5 to Claude 4.7, the eval suite is the migration’s gate. Pass at threshold or do not migrate.
  • Generative of documentation. A well-instrumented eval suite produces, by side-effect, the documentation that buyers need: the threshold table, the regression history, the cost-per-call trend, the latency p95 by feature. Documentation can be generated from the suite. The suite cannot be generated from documentation.
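To make "executable" concrete, here is a minimal sketch of an eval suite acting as a pass-or-fail gate. Every name in it (EvalCase, keyword_score, run_suite, stub_system) is hypothetical, the stub stands in for the deployed agent, and the keyword-overlap scorer is a toy substitute for whatever faithfulness or correctness metric a real engagement would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One behavioral contract: on this prompt, the output must mention these facts."""
    case_id: str
    prompt: str
    required_keywords: tuple[str, ...]


def keyword_score(output: str, case: EvalCase) -> float:
    """Toy scorer: fraction of required keywords present in the output."""
    text = output.lower()
    hits = sum(1 for kw in case.required_keywords if kw.lower() in text)
    return hits / len(case.required_keywords)


def run_suite(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case against the system and return the mean score."""
    scores = [keyword_score(system(c.prompt), c) for c in cases]
    return sum(scores) / len(scores)


def stub_system(prompt: str) -> str:
    """Stand-in for the deployed agent (assumption: a real model call goes here)."""
    if "refund" in prompt.lower():
        return "Refunds are issued within 14 days to the original payment method."
    return "Please contact support."


CASES = [
    EvalCase("refund-window", "How long do refunds take?", ("14 days",)),
    EvalCase("refund-method", "Where does my refund go?", ("original payment method",)),
    EvalCase("reset-password", "How do I reset my password?", ("reset link", "email")),
]

THRESHOLD = 0.85  # illustrative; the real value comes from the signed charter

suite_score = run_suite(stub_system, CASES)
gate_passed = suite_score >= THRESHOLD
```

In CI, the last step would typically turn gate_passed into the process exit code so that a failing suite blocks the merge. Here the stub answers the two refund cases and misses the password case, so the gate fails, which is the kind of signal a deck cannot give.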

The single most useful sentence from “Stop scoping AI projects in features; scope them in evaluations” is that the eval threshold is the smallest unit of work in an AI engagement that can be scoped, priced, and verified. That sentence is the procurement consequence of the engineering observation that the eval suite is the most durable artifact in the system.

The pricing model: pay per eval threshold cleared

If eval thresholds are the durable unit of value, the pricing model that aligns with that observation is per-threshold-cleared milestone payment. The structure is simple. The Sprint Charter enumerates a set of eval thresholds the buyer wants the system to meet, for example: “≥0.85 faithfulness on the customer-support eval suite (200 cases), ≥0.92 answer relevance, ≤4s p95 latency, ≤$0.012 cost-per-call at the 95th percentile of input length.” Each threshold has a dollar value attached. Each threshold’s payment is contingent on the eval suite passing the threshold on the buyer’s CI infrastructure, with the run ID and timestamp recorded in the eval log.
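The charter described above can be sketched as data plus one check per CI run. The bounds below mirror the example thresholds in the text; the metric names, the dollar values, the run ID, and the measured numbers are all hypothetical, chosen only to show the shape of the record that would land in the eval log.

```python
import operator
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Threshold:
    metric: str
    compare: Callable[[float, float], bool]  # operator.ge for floors, operator.le for ceilings
    bound: float
    payment_usd: int  # hypothetical dollar value, not taken from the article


# Bounds follow the example charter in the text; payments are made up.
CHARTER = [
    Threshold("faithfulness", operator.ge, 0.85, 30_000),
    Threshold("answer_relevance", operator.ge, 0.92, 30_000),
    Threshold("latency_p95_s", operator.le, 4.0, 20_000),
    Threshold("cost_per_call_usd", operator.le, 0.012, 20_000),
]


def evaluate_run(run_id: str, measured: dict[str, float]) -> list[dict]:
    """Return one timestamped pass/fail record per threshold for a single CI run."""
    recorded_at = datetime.now(timezone.utc).isoformat()
    records = []
    for t in CHARTER:
        value = measured[t.metric]
        records.append({
            "run_id": run_id,
            "recorded_at": recorded_at,
            "metric": t.metric,
            "value": value,
            "bound": t.bound,
            "passed": t.compare(value, t.bound),
            "payment_usd": t.payment_usd,
        })
    return records


# Hypothetical measurements from one run on the buyer's CI.
records = evaluate_run("run-0001", {
    "faithfulness": 0.88,
    "answer_relevance": 0.90,
    "latency_p95_s": 3.2,
    "cost_per_call_usd": 0.011,
})
earned = sum(r["payment_usd"] for r in records if r["passed"])
```

With these made-up numbers, answer relevance misses its floor and the other three thresholds pass, so only their payments are earned. Storing the comparison direction with each threshold keeps floors (quality) and ceilings (latency, cost) in one list.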

The model has three properties that traditional milestone billing lacks.

  • Verifiable. The pass-or-fail event is binary, timestamped, and reproducible. The buyer’s CFO and the buyer’s engineering team can both look at the same eval log and reach the same conclusion.
  • Self-correcting. If the agency delivers a system that passes the threshold on Tuesday and fails it on Friday because the model provider rolled out a silent update, the engagement does not pay out for a system that does not meet the threshold continuously.
  • Composable. The buyer can mix and match thresholds across milestones (pay 30% on faithfulness, 30% on relevance, 20% on latency, 20% on cost-per-call) to express exactly what the buyer values. The agency’s incentives align with the buyer’s preference structure rather than with the agency’s internal engineering preferences.
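The self-correcting property implies that the payout check looks at more than one run. A minimal sketch follows, assuming a hypothetical eval log and a hypothetical rule that a threshold counts as cleared only if every run inside an agreed observation window passed. The window rule is an assumption for illustration; the contract would have to define the real one.

```python
from datetime import date

# Hypothetical eval log: (run date, metric, passed). A real log would also carry run IDs.
EVAL_LOG = [
    (date(2026, 6, 2), "faithfulness", True),   # Tuesday: passes
    (date(2026, 6, 3), "faithfulness", True),
    (date(2026, 6, 5), "faithfulness", False),  # Friday: fails after a provider update
    (date(2026, 6, 2), "latency_p95", True),
    (date(2026, 6, 5), "latency_p95", True),
]


def cleared_continuously(log, metric: str, start: date, end: date) -> bool:
    """True only if the metric has at least one run in [start, end] and none of them failed."""
    in_window = [passed for (day, m, passed) in log if m == metric and start <= day <= end]
    return bool(in_window) and all(in_window)


window = (date(2026, 6, 1), date(2026, 6, 7))
faithfulness_cleared = cleared_continuously(EVAL_LOG, "faithfulness", *window)
latency_cleared = cleared_continuously(EVAL_LOG, "latency_p95", *window)
```

Requiring at least one run in the window avoids paying out on an empty log. Under this rule the Tuesday pass does not survive the Friday failure, so faithfulness is not cleared for the week while latency is.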

What the pricing model does not do is replace fixed labor cost. Senior engineers still cost money per day, and the agency still bills daily rates against named engineers; see “The AI agency invoice you should rarely pay” for the line-item discipline. The eval-threshold model layers on top: labor is billed continuously, and milestone payments are gated by threshold-pass events. If the engagement runs for 12 weeks and the agency ships a system that clears 6 of the 8 enumerated thresholds, the engagement pays for 6/8 of the milestone budget, plus the labor that was provided. The agency is incentivized to clear the thresholds, not to bill more days.
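The 6-of-8 arithmetic can be written down directly. This sketch assumes equally weighted thresholds, which is what "6/8 of the milestone budget" implies; the day count, day rate, and milestone budget are hypothetical figures chosen only to make the numbers concrete.

```python
def engagement_total(labor_days: int, day_rate: int, milestone_budget: int,
                     thresholds_cleared: int, thresholds_total: int) -> int:
    """Labor is paid in full; the milestone budget is paid pro rata by thresholds cleared."""
    if thresholds_total <= 0 or not 0 <= thresholds_cleared <= thresholds_total:
        raise ValueError("total must be positive and cleared must lie between 0 and total")
    labor = labor_days * day_rate
    # Integer division keeps the result in whole dollars (rounds down).
    milestones = milestone_budget * thresholds_cleared // thresholds_total
    return labor + milestones


# Hypothetical engagement: 60 billed days at $1,500, an $80,000 milestone budget, 6 of 8 cleared.
total_usd = engagement_total(60, 1_500, 80_000, 6, 8)
```

Here labor comes to $90,000 and the milestone portion to $60,000. The two missed thresholds leave $20,000 unpaid, which is the part of the fee the agency can only earn by clearing them.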

What documentation should look like in an AI engagement

Documentation does not disappear in an eval-as-deliverable engagement. It changes shape. The documents the buyer needs are the ones generated as side-effects of the eval suite and the code, not authored separately as standalone artifacts. There are five.

  • The threshold table. A single page, generated from the eval suite manifest, listing every threshold the system is contractually held to, the current pass rate against each threshold, the historical trend, and the cost-per-pass. This document is a query against the eval log, not a Word document. It is regenerated on every commit.
  • The regression log. A list of every threshold-failure event in the engagement’s history, with the root cause, the fix, and the eval cases added to prevent the regression from recurring. Generated from postmortem markdown files in the repository.
  • The cost dashboard. Generated from the model-provider usage exports, showing cost-per-call trend by feature, by model, by week.
  • The model-version registry. A JSON file in the repository listing the production model and fallback model for every feature, with the date of the last migration and the eval-suite run ID that gated the migration.
  • The runbook. Generated from the eval suite: when threshold X fails, the runbook is the ordered list of diagnostic queries the on-call engineer runs, derived mechanically from the eval cases that compose the threshold.
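As an illustration of "a query against the eval log", the threshold table can be rendered from run records in a few lines. The record fields and the sample log below are hypothetical; a real log would live wherever the buyer's CI stores eval results, and a fuller table would add the historical trend and cost-per-pass columns.

```python
def threshold_table(eval_log: list[dict]) -> str:
    """Render a plain-text table: metric, bound, latest value, pass rate over all runs."""
    by_metric: dict[str, list[dict]] = {}
    for rec in eval_log:  # the log is assumed to be in chronological order
        by_metric.setdefault(rec["metric"], []).append(rec)

    lines = [f"{'metric':<20}{'bound':>8}{'latest':>8}{'pass rate':>11}"]
    for metric, recs in by_metric.items():
        rate = sum(r["passed"] for r in recs) / len(recs)
        latest = recs[-1]
        lines.append(f"{metric:<20}{latest['bound']:>8}{latest['value']:>8}{rate:>11.0%}")
    return "\n".join(lines)


# Hypothetical eval log with two runs per metric.
EVAL_LOG = [
    {"run_id": "run-0001", "metric": "faithfulness", "bound": 0.85, "value": 0.88, "passed": True},
    {"run_id": "run-0002", "metric": "faithfulness", "bound": 0.85, "value": 0.83, "passed": False},
    {"run_id": "run-0001", "metric": "latency_p95_s", "bound": 4.0, "value": 3.2, "passed": True},
    {"run_id": "run-0002", "metric": "latency_p95_s", "bound": 4.0, "value": 3.4, "passed": True},
]

table = threshold_table(EVAL_LOG)
```

Because the table is a pure function of the log, regenerating it on every commit costs nothing, and the same records could feed a PDF for a board or compliance reader.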

What the buyer should not pay for, ever, is a 47-page architecture deck authored from scratch and delivered as a milestone. That deliverable will be wrong by month two and will not be re-authored. The labor that would have been spent authoring it should instead be spent extending the eval suite and adding the side-effect-generated documents above.

Objections and responses

Objection: “We need a deck to show the board.” Generate one from the threshold table, the cost dashboard, and the regression log. The board cares about three things: does the system meet its quality bar, what does it cost, and what failures has it survived. All three are queries against the eval log. The board does not care about the architecture diagram, and even if they did, the diagram from month one will not match the system at month six.

Objection: “Our compliance team requires written documentation.” Eval suites are written documentation. The threshold table is the system’s behavioral contract in a more rigorous form than any compliance memo could express. The eval cases themselves are the unit tests of the policy. If the compliance team needs a PDF, generate one from the suite.

Objection: “What about onboarding new engineers?” The fastest onboarding ramp for a new engineer on an AI system is to read the eval suite and the failure cases. They learn what the system does (eval cases), what it must not do (negative eval cases), and what it has historically broken on (regression log). This is more useful than any architecture deck because it is grounded in actual system behavior rather than the engineering team’s intent at the moment the deck was written.

Objection: “But agencies need to differentiate themselves with thoughtful written work.” Agencies should differentiate themselves with eval suites that are sharper than what an in-house team could assemble in the same timeframe. The expertise lives in the case selection, the threshold-setting, the failure-mode coverage, and the cross-engagement library of edge cases the agency has accumulated. That is harder to commodify and more valuable than any deck.

The buyer migration path: from doc-as-deliverable to eval-as-deliverable

A buyer that has historically paid AI agencies per document does not change the model overnight. The migration is incremental.

  • Engagement N + 1. Add an eval-suite deliverable line item to the SOW alongside the existing documentation deliverables. Tie a small fraction (10–20%) of the total fee to the eval suite passing a named threshold.
  • Engagement N + 2. Increase the eval-suite percentage to 40–50%. Replace the architecture deck deliverable with a “threshold table + regression log + cost dashboard” deliverable, generated from the eval suite. Keep ADRs and the runbook as written deliverables but require them to be generated from the suite.
  • Engagement N + 3. Eval suite is the primary deliverable. Documentation is fully derived. Milestone payments are gated by threshold passes. The agency’s invoice itemizes labor and threshold-clears, with no document line items.

Three engagements is roughly nine months of work. By the end of it, the buyer’s procurement team has internalized the new model and the buyer’s engineering team has the eval-suite discipline to enforce it. The agencies that cannot make the transition will not survive the buyer’s next vendor review; the agencies that already operate this way will be the only ones the buyer’s engineering team is willing to re-engage with.

The 2026 AI engagement is not about what the agency wrote down. It is about what the agency proved, in CI, on the buyer’s infrastructure, against thresholds the buyer signed off on. Pay for the proof. Stop paying for the prose.

Frequently asked questions

Why are documentation deliverables stale on day one for AI systems?

Because AI systems are composed of components with different change cadences: foundation models update every few months, prompts every few weeks, agent frameworks on breaking releases. Documentation that snapshots the system on a date is wrong by the next quarterly model migration.

What makes an eval suite a better deliverable than a document?

An eval suite is executable, version-controlled, regression-detecting, composable, re-runnable across model migrations, and generative of the documentation buyers need. Documents have none of those properties.

How does eval-threshold-based pricing work?

The Sprint Charter enumerates the thresholds the system must meet (e.g., faithfulness ≥0.85, latency p95 ≤4s, cost-per-call ≤$0.012). Each threshold has a dollar value. Payments are gated by the eval suite passing the threshold on the buyer’s CI, with the run ID logged.

Does an eval-as-deliverable engagement still need any documentation?

Yes, but the documents are generated as side-effects of the eval suite and the code: threshold table, regression log, cost dashboard, model-version registry, runbook. They are queries against the eval log, not separately authored Word documents.

What about onboarding new engineers without an architecture deck?

The fastest onboarding ramp is reading the eval suite. Engineers learn what the system does (positive cases), what it must not do (negative cases), and what it has historically broken on (regression log). This is more grounded than any deck.

What if my compliance team requires written documentation?

Eval suites are written documentation. The threshold table is the system’s behavioral contract in more rigorous form than any compliance memo. Generate a PDF from the suite if needed.

How do I migrate from a doc-as-deliverable to an eval-as-deliverable model?

Incrementally over three engagements. Engagement N+1, tie 10–20% of the fee to a named eval threshold. Engagement N+2, raise to 40–50% and replace decks with side-effect-generated documents. Engagement N+3, eval suite is the primary deliverable.

What is the relationship between this model and traditional labor billing?

Labor is still billed by named engineers at daily rates. Eval-threshold milestones layer on top; labor pays for time, milestones pay for proof. If the agency clears 6 of 8 thresholds, the engagement pays 6/8 of the milestone budget plus the labor.

Where does this model break down?

In engagements where the system has no measurable behavior: pure research, or exploratory data work without a production target. There, time-and-materials remains appropriate. The eval-threshold model applies wherever the system has a behavioral contract the buyer cares about.

Why is the eval suite the most durable AI engagement artifact?

Because it is the only deliverable whose value increases monotonically over time. Every case added is a permanent behavioral contract. Every regression caught is a postmortem-grade signal. Documentation depreciates; eval suites compound.

Last Updated: Jun 3, 2026

Arthur Wandzel
