The case against fixed-price AI development contracts

Fixed-price AI development contracts are the procurement default that has not survived contact with how AI software is built in 2026. Procurement teams love them because they look like risk transfer; agencies sell them because that is what the RFP asks for; and almost every large fixed-price AI engagement I have audited in the last two years has either silently failed its eval bar, blown through change orders, or been quietly converted into time-and-materials by month four. The procurement instinct that worked for a CRM rollout in 2018 is the wrong instinct for a retrieval-augmented agent in 2026. The structure of AI work (eval-driven discovery, variable inference cost, in-flight model upgrades) is incompatible with a price fixed before the work begins. This is the case against fixed-price for AI development, and for the alternatives that distribute risk honestly.

The argument is not that fixed-price is bad in general. Fixed-price works fine for marketing site builds, mobile app v1s, and well-scoped data migrations: anywhere the unknowns are bounded and scope can be specified before kickoff. AI development is different in a way procurement frameworks have not caught up to, and that gap is where most failed engagements live. The alternatives (time-and-materials with a cap, eval-milestone billing, fixed-fee discovery plus variable production, and capped retainer) are not exotic. They are how shipping firms work. Fixed-price is how deck-producing firms work. For the broader frame on what an AI dev partner should be, see the AI agency manifesto.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why fixed-price was the default, and why it stopped working

The fixed-price contract is a procurement artifact from a world where software scope was knowable in advance. You wrote a spec, you bid the spec, you delivered the spec, you got paid. The buyer transferred risk to the vendor; the vendor priced the risk into the bid; both sides agreed on the artifact at the end. That model required two things to be true: the work had to be specifiable up front, and the cost of doing the work had to be predictable. For deterministic software in 2018, both were roughly true. For AI software in 2026, neither is.

A modern AI engagement does not produce a deterministic artifact. It produces a system whose behavior is measured against an eval suite, where the eval suite itself is co-discovered with the system, where the model underneath the system is being upgraded by a third party every six to twelve weeks, where every request has a variable inference cost paid in tokens, and where the failure modes are emergent rather than enumerable in advance. None of those properties is compatible with a price fixed at kickoff. They are compatible with a billing model that prices the work as it is discovered, and a contract structure that distributes the irreducible variance honestly between buyer and vendor. The 2018 procurement reflex of “we want a fixed number” is, in 2026, a request for a fictional number.

Reason 1: eval-driven discovery rewrites the scope mid-project

The single largest reason fixed-price contracts fail in AI work is that the scope that was bid on day one is not the scope that ships on day 60. This is not because the agency is bad at scoping; it is because the eval suite is the real specification, and the eval suite is co-discovered with the system. On day one, the buyer says “we want a customer-support agent.” On day 12, the agency runs the first eval pass and discovers that 18 percent of real tickets have multi-account state that was never mentioned in the kickoff, that 6 percent involve a regulated workflow nobody flagged, and that the “happy path” the buyer described represents only 41 percent of actual ticket volume. The scope did not change because anyone moved the goalposts. The scope changed because the data revealed itself.

In a fixed-price world, this is a change order, and change orders poison engagements. Either the agency absorbs the cost, at which point they are losing money on the engagement and start cutting corners (see reason 4), or the buyer absorbs it through a renegotiated bid, in which case the fixed price was a fiction. In a time-and-materials or milestone world, this is just Tuesday: the eval delta moves, the team adjusts the next two-week increment, the buyer sees the new failure modes, and the contract structure absorbs the change without a renegotiation. Eval-driven discovery is not a defect of AI engagements; it is the central feature of doing the work properly. A pricing model that punishes it is a pricing model that punishes good work.

Reason 2: inference cost is variable and unknowable upfront

A fixed-price contract has to price the cost of running the system, not just the cost of building it. For deterministic software, that cost is the agency’s labor: predictable, bid-able, padded for risk. For AI software, that cost is the agency’s labor plus the inference bill, and the inference bill is variable along multiple axes nobody can fix in advance. Token-per-request depends on the prompts that emerge from the design, which depend on the eval results, which depend on the data. The blended cost-per-request depends on the routing decisions across providers, which depend on which providers are cheapest at production time. Per-call cost depends on whether the team uses prompt caching, batching, or a smaller model on the easy half of the traffic, and those choices are made in week 4, not at the bid.
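To make the moving parts concrete, here is a minimal Python sketch of a blended cost-per-request calculation. Everything in it is an assumption for illustration: the `Route` fields, the token counts, the per-million-token prices, and the cache discount are hypothetical planning inputs, not any provider's actual price list. The point is only that the same workload produces very different numbers depending on decisions made after the bid.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """One model tier in the routing layer. All values are hypothetical."""
    share: float                 # fraction of traffic sent to this route
    input_tokens: int            # average prompt tokens per request
    output_tokens: int           # average completion tokens per request
    usd_per_m_input: float       # assumed price per million input tokens
    usd_per_m_output: float      # assumed price per million output tokens
    cache_hit_rate: float = 0.0  # fraction of input tokens served from a prompt cache
    cache_discount: float = 0.0  # fractional price reduction on cached input tokens


def cost_per_request(route: Route) -> float:
    """Average dollar cost of one request on a single route."""
    cached = route.input_tokens * route.cache_hit_rate
    uncached = route.input_tokens - cached
    billable_input = uncached + cached * (1 - route.cache_discount)
    input_cost = billable_input * route.usd_per_m_input / 1_000_000
    output_cost = route.output_tokens * route.usd_per_m_output / 1_000_000
    return input_cost + output_cost


def blended_cost(routes: list[Route]) -> float:
    """Traffic-weighted cost per request across all routes."""
    if abs(sum(r.share for r in routes) - 1.0) > 1e-9:
        raise ValueError("route shares must sum to 1")
    return sum(r.share * cost_per_request(r) for r in routes)


# Bid-time guess: everything goes to one large model, no caching.
bid_time = [Route(1.0, 4000, 500, 10.0, 30.0)]

# Week-4 design: half the traffic on a small model, the rest cached on the large one.
week_4 = [
    Route(0.5, 2000, 300, 1.0, 3.0),
    Route(0.5, 4000, 500, 10.0, 30.0, cache_hit_rate=0.5, cache_discount=0.5),
]

print(f"bid-time estimate: ${blended_cost(bid_time):.5f} per request")
print(f"week-4 design:     ${blended_cost(week_4):.5f} per request")
```

With these made-up inputs the week-4 design costs less than half the bid-time guess per request. A different set of eval findings could just as easily push the number the other way, which is why a figure fixed at the bid is a guess in either direction.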

The 2024 reflex was for agencies to either hold the API keys and mark up the bill (a quiet form of token arbitrage that buyers should refuse; see token arbitrage in AI agency engagements) or to bake a worst-case inference budget into the fixed price. The first creates a misalignment where the agency profits from inefficiency. The second produces a number that is either a massive overcharge if the system ends up cheap or a loss-leader if it ends up expensive, and either way, the buyer is paying for the agency’s risk pricing rather than the actual cost of inference. Pass-through inference billing on the buyer’s keys, with the agency optimizing toward a target cost-per-request, is the only structure that aligns the incentive correctly. Fixed-price collapses that alignment back into a guess.

Reason 3: model upgrades reset the baseline mid-contract

A 90-day fixed-price AI engagement signed in February will, almost without exception, see at least one major model upgrade from a frontier provider before the engagement ends. The upgrade is not a courtesy. It changes prompt behavior, reasoning characteristics, refusal patterns, latency, cost, and tool-call accuracy. The eval delta from the baseline shifts overnight, sometimes upward, sometimes downward, usually in ways the team has to characterize and adapt to. In the last 18 months, every engagement I have run has had at least one model in the routing layer change underneath us mid-build.

Under fixed-price, this is a contractual nightmare. Did the buyer pay for a system on the old model or the new one? If the new model makes the original architecture obsolete (and it sometimes does), does the agency rebuild for free or charge a change order? If the new model fails on a workflow the old one handled, is that a defect under the contract? None of these questions have clean answers under fixed-price; many of them have clean answers under a structure where the agency is paid to keep the system at or above an eval threshold over time, regardless of which model is underneath it. The work in 2026 is not “build a system in March and walk away”; it is “operate a system that survives the model upgrade in April.” Fixed-price assumes the wrong artifact is being purchased.

Reason 4: fixed-price incentivizes cutting corners on evals

This is the reason that gets least discussed in procurement and matters most in production. A fixed-price contract pays the agency the same whether they build a thorough eval suite or a thin one. A thorough eval suite catches more failures, surfaces more rework, and erodes the agency’s margin. A thin eval suite (8 happy-path examples, no adversarial cases, no regression tests, no PII boundary checks) passes the contract and ships on time. The agency does not cut corners because they are dishonest; they cut corners because the contract structure punishes the rigorous version. Every hour spent on adversarial eval cases is an hour lost on the agency’s margin. The system passes the eval gate because the gate was designed thin. It then fails in production six weeks later, when the buyer’s own data hits a failure mode no eval covered.

Eval-milestone billing inverts this. If the agency is paid per eval threshold passed (not per hour spent or per fixed deliverable, but per validated quality bar), then a thicker eval suite is an asset, not a cost. The agency wants to surface failure modes early because each captured failure mode is part of the contract structure. The buyer wants the eval suite to be thorough because thoroughness is what they are buying. Fixed-price aligns nobody on quality. Eval-milestone billing aligns everyone. The eval-driven AI development guide covers what a real eval suite looks like in practice, and why it has to be the contract artifact rather than the deck.

Reason 5: buyers pay for the agency’s risk premium without the agency absorbing the risk

The polite case for fixed-price is that the buyer pays a premium and the agency absorbs the variance. In AI engagements, this is not how it plays out. The agency does charge the premium (usually 30 to 60 percent over a time-and-materials estimate, sometimes more), but they do not absorb the variance. When the variance bites, one of three things happens. The agency negotiates a change order, which means the buyer is paying the variance directly on top of the premium they already paid. The agency cuts corners on evals or scope, which means the buyer is paying the variance in the form of lower-quality software. Or the agency disengages early on a “we delivered the contracted scope” technicality, which means the buyer is paying the variance in the form of an unfinished system that nobody has a contractual obligation to finish.

In all three patterns, the buyer pays the variance anyway, on top of the risk premium they already paid for not paying it. This is the worst possible structure for a buyer: a premium for protection that does not protect them. Time-and-materials with a cap gives the buyer a bounded worst case while letting actual hours land where the work demands.
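The change-order pattern is the easiest of the three to put numbers on. The sketch below is illustrative arithmetic only: the 200,000 estimate, the 50,000 overrun, and the 260,000 cap are invented, and the 40 percent premium is simply a point inside the 30 to 60 percent range mentioned above. The function names are hypothetical.

```python
def fixed_price_bid(tm_estimate: float, premium: float) -> float:
    """Fixed-price bid: the time-and-materials estimate padded by a risk premium."""
    return tm_estimate * (1 + premium)


def buyer_cost_fixed_price(tm_estimate: float, premium: float, overrun: float) -> float:
    """Change-order pattern: the buyer pays the padded bid, then the overrun on top."""
    return fixed_price_bid(tm_estimate, premium) + overrun


def buyer_cost_tm_capped(tm_estimate: float, overrun: float, cap: float) -> float:
    """T&M with a cap: the buyer pays actual cost, never more than the cap.

    Hitting the cap is a renegotiation point, not the end of the work; this
    returns only what has been billed up to that point.
    """
    return min(tm_estimate + overrun, cap)


# Hypothetical engagement: 200k estimate, 40% premium, 50k of discovered work.
estimate, premium, overrun, cap = 200_000, 0.40, 50_000, 260_000

print(f"fixed-price + change order: {buyer_cost_fixed_price(estimate, premium, overrun):,.0f}")
print(f"T&M with cap:               {buyer_cost_tm_capped(estimate, overrun, cap):,.0f}")
```

Under these assumptions the fixed-price buyer pays 200,000 × 1.4 plus the 50,000 change order, while the capped buyer pays the 250,000 the work actually cost. The other two patterns (thinner evals, early disengagement) do not show up in a calculation like this because the buyer pays them in quality rather than dollars.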

What to use instead

The argument against fixed-price is only useful paired with a workable alternative. There are four pricing structures that distribute AI engagement risk honestly, and the right one depends on the engagement shape.

Time-and-materials with a cap. The default for any AI engagement under 90 days. The agency bills hourly against an agreed rate card; the buyer caps the engagement at a number that represents their risk tolerance; if the work hits the cap before completion, the contract triggers a renegotiation point rather than a quiet failure. This structure assumes the buyer has technical leadership capable of reviewing eval deltas and PR throughput weekly; without that, the cap is just a budget line nobody monitors. For shops that have run this well, see the discussion in fixed-price vs hourly AI development.
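A cap only works if someone watches the burn against it. As a rough sketch of the weekly check, assuming the buyer has hours billed per week, an agreed hourly rate, and the cap (all values and the function name below are hypothetical), the runway at the current burn rate can be projected like this:

```python
def weeks_until_cap(weekly_hours: list[float], hourly_rate: float, cap: float) -> float:
    """Project how many more weeks fit under the cap at the average burn so far.

    Returns 0.0 once the cap is reached (the renegotiation point) and
    float("inf") if nothing has been billed yet.
    """
    billed = sum(weekly_hours) * hourly_rate
    remaining = cap - billed
    if remaining <= 0:
        return 0.0
    if billed == 0:
        return float("inf")
    avg_weekly_burn = billed / len(weekly_hours)
    return remaining / avg_weekly_burn


# Hypothetical: three weeks billed at a 200/hour rate against a 150,000 cap.
hours_so_far = [120, 140, 100]
print(f"weeks of runway left: {weeks_until_cap(hours_so_far, 200, 150_000):.2f}")
```

A straight-line average is a deliberately crude projection. The useful part is having the number in front of the technical buyer every week, next to the eval delta, so the renegotiation conversation starts before the cap is hit rather than after.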

Eval-milestone billing. For engagements where the success criteria can be expressed as a sequence of eval thresholds, the agency is paid per threshold passed. The contract names the baseline eval, the target eval, and three to five intermediate milestones, each tied to a payment tranche. This structure aligns the agency around the eval suite as the artifact, which is exactly what AI engagements should be aligned around. It works best for production replacement projects where the existing system already produces a measurable baseline the new one must beat.
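The payment logic for this structure is simple enough to sketch. The example below assumes a single aggregate eval score between 0 and 1, measured on an eval suite both sides have agreed to; the thresholds, tranche amounts, and the `Milestone` and `amount_due` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    threshold: float  # eval score that must be reached
    tranche: float    # payment released when the threshold is reached


def amount_due(milestones: list[Milestone], eval_score: float,
               already_paid: float = 0.0) -> float:
    """Payment owed now: every tranche whose threshold is met, minus prior payments.

    Assumption: a later regression does not claw back tranches already paid,
    so the result is never negative.
    """
    earned = sum(m.tranche for m in milestones if eval_score >= m.threshold)
    return max(earned - already_paid, 0.0)


# Hypothetical plan: baseline eval 0.62, three intermediate milestones, target 0.90.
PLAN = [
    Milestone(0.70, 40_000.0),
    Milestone(0.78, 40_000.0),
    Milestone(0.85, 50_000.0),
    Milestone(0.90, 70_000.0),  # target eval
]

print(amount_due(PLAN, eval_score=0.80))                       # first two tranches
print(amount_due(PLAN, eval_score=0.80, already_paid=40_000))  # only the second is new
```

The code is the easy part. What the contract has to pin down is which version of the eval suite the score is measured on and who runs it, because a payment schedule keyed to a suite the agency can quietly thin out reproduces the reason 4 problem.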

Fixed-fee discovery, variable production. For larger engagements, a hybrid: a fixed-fee, two-to-four-week discovery phase that produces the eval baseline, the architecture decision record, the data audit, and the cost model; and then a separate variable-billing phase for the production build, scoped against the artifacts produced in discovery. This is the structure that procurement teams find most palatable because it preserves a fixed number for the up-front phase, where it is appropriate, and shifts to a variable model for the build phase, where it is necessary. The discovery deliverables are concrete enough to bid; the build phase is uncertain enough that bidding it is dishonest.

Capped retainer. For engagements that are operational rather than build-once (agentic systems in production, eval-driven iteration over months, model-upgrade absorption), use a monthly retainer with a cap on hours and a defined eval-and-uptime SLA. The buyer pays a predictable monthly number; the agency commits to keeping the system above an eval threshold; the cap protects the buyer from a runaway month. This is the structure I recommend for any AI engagement past month three, and the AI retainer vs project pricing comparison covers the trade-offs in detail.

What to do at the procurement table

If your default contract template is fixed-price, the conversion to one of the four alternatives above is not a contract problem; it is an organizational one. Procurement has been measured for years on how aggressively they fix prices, so a time-and-materials engagement reads internally as procurement losing the negotiation. The conversion requires the technical buyer (the CTO, the head of AI, the director of engineering) to own the engagement structure and explain why the variance the fixed-price was hiding is real, irreducible, and better handled by a different structure than a higher premium.

The framing that works: a fixed-price AI contract is a number someone at the agency made up, and the buyer is paying for that fiction. The variance does not disappear under fixed-price; it gets repriced and hidden in the bid. Once procurement sees that frame, they usually convert.

The conclusion procurement does not want to hear

The procurement preference for fixed-price is, at its core, a request for the agency to lie convincingly about a future cost. The agency that will lie convincingly is the agency that has padded the bid the most aggressively, scoped the work the most narrowly, and designed the eval suite the most thinly. The agency that will not lie is the agency that says: “Here is our hourly rate, here is our cap, here is the eval threshold we will commit to, here is how the inference bill will pass through, here is how we will handle a model upgrade mid-engagement.” The first agency is easier to procure. The second agency is the one whose software works in production six months later. Fixed-price selects for the first kind of agency, which is why it is the wrong default for AI work in 2026.


Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run AI engagements under all four of the alternative pricing structures described above, across more than two dozen client engagements in the last 18 months.

Frequently Asked Questions

Why are fixed-price contracts a bad fit for AI development in 2026?

Fixed-price assumes the work is specifiable up front and the cost of doing it is predictable. Neither holds for AI software. Eval-driven discovery rewrites scope mid-project as the data reveals itself. Inference cost is variable and unknowable until prompts, routing, and caching are designed. Model upgrades from frontier providers reset the baseline mid-contract. Fixed-price also incentivizes thin eval suites (which pass the gate but fail in production) and lets the agency charge a risk premium without absorbing the risk. The variance does not disappear; it gets repriced and hidden in the bid.

What should buyers use instead of fixed-price for AI engagements?

Four structures distribute the risk honestly. (1) Time-and-materials with a cap is the default for engagements under 90 days. (2) Eval-milestone billing pays the agency per validated quality threshold, aligning everyone around the eval suite as the artifact. (3) Fixed-fee discovery plus variable production preserves a fixed number for the up-front phase and shifts to variable for the build. (4) Capped retainer is the right structure for operational engagements past month three, with a defined eval and uptime SLA and a cap on monthly hours.

How does eval-driven discovery break fixed-price scoping?

On day one the buyer describes a scope. On day 12 the first eval pass against real data reveals failure modes nobody flagged: multi-account state, regulated workflows, ticket categories nobody mentioned. The scope did not change because anyone moved the goalposts. The scope changed because the data revealed itself. Under fixed-price this is a change order and change orders poison engagements. Under time-and-materials or milestone billing it is just Tuesday: the eval delta moves, the team adjusts the next two-week increment, and the contract structure absorbs the change.

Why is inference cost incompatible with fixed-price contracts?

A fixed-price contract has to price both labor and the cost of running the system. The inference bill varies with token-per-request (driven by prompt design), blended cost-per-request (driven by routing decisions), and per-call cost (driven by caching, batching, and model-tier choices). Many of those are decided in week 4, not at the bid. Agencies either hold the keys and mark up the bill (token arbitrage) or bake a worst-case inference budget into the price, producing either an overcharge or a loss-leader. Pass-through inference billing on the buyer’s keys is the only structure that aligns the incentive correctly.

What happens to a fixed-price AI contract when a frontier model is upgraded mid-engagement?

A 90-day fixed-price engagement signed in February will see at least one major model upgrade before it ends. The upgrade changes prompt behavior, reasoning characteristics, refusal patterns, latency, cost, and tool-call accuracy. The eval delta shifts overnight. Under fixed-price the questions are messy: did the buyer pay for a system on the old model or the new one? If the new model makes the architecture obsolete, who pays for the rebuild? None of those questions have clean answers under fixed-price; many of them have clean answers under a structure that pays the agency to keep the system at or above an eval threshold over time.

How does fixed-price incentivize thin eval suites?

A fixed-price contract pays the agency the same whether they build a thorough eval suite or a thin one. A thorough suite catches more failures, surfaces more rework, and erodes margin. A thin suite (8 happy-path examples, no adversarial cases, no regression tests) passes the gate and ships on time. The agency is not dishonest; the structure punishes rigor. Eval-milestone billing inverts this: paying per threshold passed makes a thicker eval suite an asset, not a cost. The agency wants to surface failure modes early because each captured failure is part of the contract.

Why does the agency’s risk premium under fixed-price fail to protect the buyer?

Fixed-price agencies typically charge 30 to 60 percent over a time-and-materials estimate as a risk premium. When variance bites, one of three things happens: the agency negotiates a change order (so the buyer pays the variance directly on top of the premium), the agency cuts corners on evals or scope (so the buyer pays the variance in lower-quality software), or the agency disengages on a ‘we delivered the contracted scope’ technicality (so the buyer pays the variance as an unfinished system). In all three patterns the buyer pays the variance anyway, on top of a premium for protection that did not protect them.

When is fixed-price still the right pricing model?

Fixed-price works fine when the unknowns are bounded and scope can be specified before kickoff: marketing site builds, mobile app v1s, and well-scoped data migrations. It also works for small AI engagements with a single, narrow eval (a one-off classifier with a fixed dataset, for example). The rule of thumb is that fixed-price is appropriate when the artifact at the end is deterministic and specifiable. AI engagements that include retrieval, agentic workflows, multi-model routing, or production traffic against drifting data are not specifiable in advance, and fixed-price is the wrong default for them.

How do you convince a procurement team to abandon fixed-price for AI work?

The conversion is organizational rather than contractual. Procurement has been measured for years on aggressively fixing prices, so a time-and-materials engagement reads internally as procurement losing the negotiation. The technical buyer (CTO, head of AI, director of engineering) has to take ownership of the engagement structure and explain that the variance the fixed-price was hiding is real, irreducible, and best handled by a different structure than a higher premium. The framing that works: a fixed-price AI contract is a number someone at the agency made up, and the buyer is paying for that fiction.

What is eval-milestone billing and how does it work in practice?

Eval-milestone billing pays the agency in tranches tied to validated quality thresholds rather than hours or fixed deliverables. The contract names the baseline eval (where the system starts), the target eval (where it must end), and three to five intermediate milestones, each tied to a payment tranche. The agency is paid as the eval delta moves. This works best for production replacement projects where the existing system already produces a measurable baseline the new one must beat. It aligns the agency around the eval suite as the artifact, which is exactly what AI engagements should be aligned around.

Last Updated: May 23, 2026
