The AI agency milestone trap and how to escape it

The milestone payment schedule is the single most damaging artifact carried over from the 2018 software-services playbook into the 2026 AI engagement. It looks responsible. It feels like a procurement win. It is, in practice, the contract structure that quietly breaks more AI partnerships than any other clause, because milestones lock scope before the team has discovered what it is building, invite gaming through narrow evaluation criteria, get reset every time a frontier model ships an upgrade, and turn the milestone gate into an artifact that is more important than the system being built. This is the trap. This piece names it and prescribes what replaces it.

For the broader argument on what an AI dev partner should look like in 2026, see the AI agency manifesto. For the legacy milestone-and-payment pattern this piece is in tension with, see the BoFu reference on AI development milestones and payment schedules. For the day-by-day shape of what the alternative looks like in practice, see the anatomy of the first 14 days.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why milestones look responsible
Trap one: milestones lock scope before discovery
Trap two: the narrow-eval gaming problem
Trap three: mid-stream model upgrades reset the baseline
Trap four: the milestone gate becomes more important than the system
Trap five: the agency optimizes payments, not outcomes
The escape: eval thresholds replace milestones
The escape: weekly working-software demos replace deck deliverables
The escape: a kill clause every 30 days
What a 2026-shaped contract looks like

Why milestones look responsible

The milestone schedule is a procurement artifact that does real work in non-AI software services. It allocates risk, paces cash flow, gives both sides a check-in cadence, and gives the buyer a visible structure for “what am I getting and when.” For a fixed-spec build (a CRM migration, an integration, a known-shape mobile app), the milestone schedule is fine. The work is well-understood, the unknowns are bounded, and the deliverable on each milestone date is a reasonable approximation of what was scoped at signing.

AI engagements are not those engagements. The work is not well-understood at signing. The model that exists at signing is not the model that exists at delivery. The eval suite that defines “done” does not exist on day one; it has to be built during the engagement. The user behavior that determines whether the system is good cannot be predicted from the kickoff deck. Treating an AI engagement as a fixed-milestone procurement exercise is treating a moving system as a static one, and the contract structure carries the cost of the mismatch.

Every trap below derives from a single root cause: milestones are a commitment device for known work, and AI work in 2026 is not known at the moment commitments are made.

Trap one: milestones lock scope before discovery

The first milestone payment is typically tied to “completed discovery” or “approved spec.” This sounds prudent. In practice it forces the buyer and the agency to agree, in week 2 or week 3, on a specification that neither of them can yet write honestly, because the things that determine the right spec (what the eval set will reveal about user behavior, what the model can and cannot reliably do on the actual data, what the latency floor looks like, what the cost ceiling implies for the architecture) cannot be known until the team has shipped working code against real inputs.

The result is a spec that gets signed, locks in the second milestone, and then has to be amended three times before delivery, each amendment a small political fight rather than a routine engineering decision. The milestone gate has converted what should be a continuous discovery process into a series of commit-and-renegotiate cycles. The agency, which is paid against the original spec, has a financial incentive to resist amendments. The buyer, who sees a spec they signed, has a political incentive to enforce it. Both sides start optimizing against the document instead of against the system.

This is the original sin of the milestone schedule applied to AI: it asks for a specification at the moment when specification is least possible, and then it punishes both sides for updating that specification when reality intrudes.

Trap two: the narrow-eval gaming problem

The second milestone is typically tied to a measurable quality target: accuracy on a held-out set, F1 above a threshold, latency below a number, citation rate above a percentage. This is also a reasonable-sounding clause. The problem is that the eval set against which the threshold is measured is, in almost every milestone-shaped contract, a set chosen at signing rather than evolved against production-shaped inputs.

When a milestone payment is gated on a number against a fixed eval set, the team (both sides, often unconsciously) optimizes for that number against that set. Prompts are tuned. Examples are added. Edge cases that did not appear in the eval set are deferred. The number gets hit. The milestone is paid. The system, when it meets real-distribution traffic two weeks later, fails on inputs that were never in the eval set in the first place.
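The gap between a fixed eval set and production-shaped traffic can be illustrated with a toy sketch. Everything here is hypothetical: the `overfit_system` function stands in for a prompt tuned to specific eval examples, and the two case lists are synthetic placeholders, not real measurements.

```python
from typing import Callable

Case = tuple[str, str]  # (input query, expected answer)


def accuracy(system: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the system returns the expected answer."""
    correct = sum(1 for query, expected in cases if system(query) == expected)
    return correct / len(cases)


def make_overfit_system(tuned_on: list[Case]) -> Callable[[str], str]:
    """Toy stand-in for a system tuned to the eval examples:
    it answers correctly only for inputs it was tuned on."""
    memorized = dict(tuned_on)
    return lambda query: memorized.get(query, "unknown")


# Synthetic, illustrative data: the fixed set was chosen at signing,
# the production-shaped set only partly overlaps with it.
fixed_eval = [(f"q{i}", f"a{i}") for i in range(50)]
production_shaped = [(f"q{i}", f"a{i}") for i in range(40, 90)]

system = make_overfit_system(fixed_eval)
fixed_score = accuracy(system, fixed_eval)
production_score = accuracy(system, production_shaped)

print(f"fixed eval set: {fixed_score:.2f}")
print(f"production-shaped set: {production_score:.2f}")
```

In this toy setup the milestone number is hit on the fixed set while most production-shaped inputs fail, which is the pattern the contract structure hides.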

This is not malice; it is the predictable outcome of a contract that asks “did you hit the number?” rather than “did the eval set itself improve?” The agency is rationally optimizing for the contracted artifact. The buyer is rationally checking the contracted artifact. Both sides, behaving honorably, produce a system that is brittle in ways the contract structure prevented them from noticing.

The eval suite, in 2026, is not a deliverable; it is a living artifact that must grow during the engagement. A milestone gate tied to a fixed eval set freezes the artifact at the moment it most needs to be evolving.

Trap three: mid-stream model upgrades reset the baseline

In the time between contract signing and final milestone, at least one major frontier model release will land. Often two. Sometimes three. Each release shifts the cost-quality frontier of the system being built, sometimes by a factor that makes the original architecture obsolete, sometimes by a margin that simply rewards a swap of the underlying provider.

The milestone schedule has no opinion about this. The contract was written against a model and a price point that may no longer be the right ones by mid-engagement. If the agency adopts the new model, the eval baseline shifts and the milestone threshold must be reset, which usually means a renegotiation. If the agency does not adopt it, the buyer ends up with a system that is provably worse than what could be built today, and the agency is contractually constrained from doing the obvious thing.

Either branch is bad. The agency is penalized for adopting the better tool, and the buyer is penalized for the agency’s loyalty to the contracted spec. The milestone structure was designed for a world in which the underlying technology did not move during the engagement. AI engagements live in a world where it moves every quarter at minimum, and the contract has to be built around that motion rather than against it.

Trap four: the milestone gate becomes more important than the system

Once a milestone gate is in the contract, organizational gravity bends around it. The agency’s project manager schedules the demo to maximize the milestone-acceptance probability. The buyer’s procurement team treats milestone sign-off as the primary internal artifact. Engineering decisions get made to clear the gate rather than to make the system better: refactors get deferred, cost instrumentation gets postponed, a brittle prompt that scores well on the eval set survives because re-tuning it would risk the milestone date.

The milestone gate, in other words, becomes the deliverable. The system being built becomes the means by which the milestone is cleared. This inversion is subtle and devastating. It produces engagements where every milestone is signed off, every payment is released, and at the end the buyer holds a system they cannot operate, cannot extend, and cannot trust under real load, because none of those properties were what the milestone gates measured.

The healthiest engagements are the ones in which the contract structure makes the system itself the visible object, not the milestone artifact that nominally represents it. Milestone-shaped contracts make this almost impossible.

Trap five: the agency optimizes payments, not outcomes

The fifth and most structural trap is the one that follows from all four above: in a milestone-shaped contract, the agency’s cash flow depends on milestone clearance, not on outcome quality. The agency that wants to stay solvent must optimize for milestone clearance, even when (in its professional judgment) clearing the milestone is not the right next move for the system.

This is not corruption. It is the predictable consequence of how the contract pays. A team paid on milestones will surface milestone-clearing work first; a team paid on outcomes will surface outcome-improving work first. These are different priority orders, and over a 12-week engagement the difference compounds into a different system at the end.

For a longer treatment of the partnership-break categories that this trap connects to, see where AI agency partnerships break; the milestone schedule is upstream of at least three of the eight failure modes documented there.

The escape: eval thresholds replace milestones

The first piece of the escape is to remove the milestone gate as a payment trigger entirely, and replace it with an eval threshold tied to a production-shaped suite. The contract names a quality threshold (accuracy, F1, win rate against a baseline, citation correctness, p99 latency, cost per request) measured against an eval suite that grows during the engagement. Payment is released when the threshold is met against the current suite, not against a snapshot of the suite from signing day.
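As a rough illustration of the payment rule, the sketch below checks a set of named thresholds against results from the current suite. The `Threshold` type, the metric names, and the bounds are assumptions made up for the example; a real contract would name its own metrics and numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str             # hypothetical metric name, e.g. "f1"
    bound: float            # contracted bound for that metric
    higher_is_better: bool  # True for accuracy/F1, False for latency/cost


def threshold_met(results: dict[str, float], thresholds: list[Threshold]) -> bool:
    """True only if every contracted metric clears its bound on the
    results measured against the current eval suite."""
    for t in thresholds:
        value = results.get(t.metric)
        if value is None:
            return False  # a missing measurement never releases payment
        ok = value >= t.bound if t.higher_is_better else value <= t.bound
        if not ok:
            return False
    return True


# Illustrative numbers only.
contract_thresholds = [
    Threshold("f1", 0.85, higher_is_better=True),
    Threshold("p99_latency_ms", 1200.0, higher_is_better=False),
    Threshold("cost_per_request_usd", 0.02, higher_is_better=False),
]
current_suite_results = {
    "f1": 0.88,
    "p99_latency_ms": 950.0,
    "cost_per_request_usd": 0.015,
}

release_payment = threshold_met(current_suite_results, contract_thresholds)
print("release payment:", release_payment)
```

The design choice worth noting is that `results` always comes from the suite as it stands today; nothing in the check refers to a signing-day snapshot.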

The mechanics: the eval suite lives in a client-controlled repository. New cases are added every week from production-shaped logs (or from a synthetic shadow stream during pre-launch). The threshold is re-baselined whenever the suite materially changes, which is to say, when the suite has grown to a point where the previous threshold no longer represents the same quality bar. Both sides agree to the re-baseline in writing. The CI pipeline rejects merges that move the eval delta in the wrong direction. Quality becomes a continuous artifact rather than a milestone gate.
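The CI gate and the re-baseline trigger might look something like the sketch below. It assumes a single higher-is-better eval score, and it treats "materially changes" as a growth percentage in the number of cases; that definition, the `growth_trigger` value, and all function names are assumptions for illustration, since the actual trigger is whatever both sides agree to in writing.

```python
def eval_delta(baseline_score: float, candidate_score: float) -> float:
    """Change in eval score that the candidate merge would introduce."""
    return candidate_score - baseline_score


def merge_allowed(baseline_score: float, candidate_score: float,
                  tolerance: float = 0.0) -> bool:
    """Reject merges that move the eval delta in the wrong direction.
    `tolerance` is an assumed allowance for run-to-run noise."""
    return eval_delta(baseline_score, candidate_score) >= -tolerance


def needs_rebaseline(cases_at_last_baseline: int, cases_now: int,
                     growth_trigger: float = 0.25) -> bool:
    """Assumed proxy for 'the suite materially changed': the case count
    grew by at least `growth_trigger` since the last agreed baseline."""
    if cases_at_last_baseline <= 0:
        return True
    growth = (cases_now - cases_at_last_baseline) / cases_at_last_baseline
    return growth >= growth_trigger


def ci_exit_code(baseline_score: float, candidate_score: float) -> int:
    """Exit code a CI step could return: 0 passes, 1 blocks the merge."""
    return 0 if merge_allowed(baseline_score, candidate_score) else 1


print(ci_exit_code(0.87, 0.90))    # improvement passes
print(ci_exit_code(0.87, 0.85))    # regression is blocked
print(needs_rebaseline(200, 260))  # suite grew 30 percent since the baseline
```

The re-baseline check only flags that a conversation is due; the new threshold itself is still a written agreement between both sides, not something the pipeline decides.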

This converts the contract from a “did you ship by date” structure into a “did the system clear the bar against the current world” structure, which is the only structure that survives mid-stream model upgrades and shifting input distributions intact.

The escape: weekly working-software demos replace deck deliverables

The second piece is to make the unit of progress a weekly working-software demo against real data, not a milestone deck against a charter. Every Friday (or whatever day the cadence lands), the agency demos the system as it stands, running against the client’s actual data, in the client’s actual environment, on the client’s actual key. The demo is not a slide. It is a recording or a live session of the system answering real-shaped queries, with the eval delta from the prior week, the cost-per-request from the prior week, and a written one-page summary of what shipped and what did not.
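The numbers that accompany each demo can be generated rather than hand-written. The sketch below is one possible shape for that summary; the `WeekSnapshot` fields, the sample values, and the output layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeekSnapshot:
    week: int
    eval_score: float        # score on the current eval suite
    total_cost_usd: float    # spend on the client's key for the week
    requests: int            # requests served during the week
    shipped: tuple[str, ...]
    not_shipped: tuple[str, ...]

    @property
    def cost_per_request(self) -> float:
        return self.total_cost_usd / self.requests if self.requests else 0.0


def weekly_summary(prev: WeekSnapshot, curr: WeekSnapshot) -> str:
    """Plain-text summary: eval delta, cost per request, shipped and not shipped."""
    eval_delta = curr.eval_score - prev.eval_score
    cost_delta = curr.cost_per_request - prev.cost_per_request
    lines = [
        f"Week {curr.week} demo summary",
        f"Eval score: {curr.eval_score:.3f} ({eval_delta:+.3f} vs week {prev.week})",
        f"Cost per request: ${curr.cost_per_request:.4f} ({cost_delta:+.4f})",
        "Shipped: " + (", ".join(curr.shipped) or "nothing"),
        "Not shipped: " + (", ".join(curr.not_shipped) or "nothing"),
    ]
    return "\n".join(lines)


# Illustrative values only.
week3 = WeekSnapshot(3, 0.81, 50.0, 2000, ("retrieval v1",), ())
week4 = WeekSnapshot(4, 0.84, 40.0, 2000, ("citation checker",), ("cost dashboard",))
summary = weekly_summary(week3, week4)
print(summary)
```

Keeping the "not shipped" line in the same artifact as the eval delta is deliberate: it makes the miss visible in the week it happens.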

The weekly demo replaces three artifacts: the milestone-acceptance ceremony, the project-status deck, and the renegotiation conversation that follows when the milestone is missed. None of those are needed if the working software is the visible object every seven days. The buyer sees what they have. The agency sees what the buyer is reacting to. Course-correction happens in the rhythm of the engagement rather than in the discontinuity of a milestone gate.

This cadence is the operational discipline that makes the eval-threshold contract enforceable in practice. Without it, the eval threshold is just a number in a contract; with it, the threshold is a number that moves visibly every week and that both sides can see moving.

The escape: a kill clause every 30 days

The third piece is the structural one: every 30 days, either side can terminate the engagement, with payment due only for the work shipped to date and a written handoff of every artifact the agency holds. No exit penalty. No long tail of contractual obligation. No “but you signed a six-month commitment.”

The 30-day kill clause changes the agency’s behavior more than any other contract structure available. Under a 30-day kill, the agency knows that the only thing keeping the engagement alive is the buyer’s continued belief that the work is worth more than the next 30 days of fees. The agency cannot coast on a milestone schedule that locked in three months of payments. The agency cannot defer hard work to the next milestone gate, because there is no next milestone gate; there is only the next 30-day window in which they must demonstrate that continuing is the right call.

This sounds aggressive to procurement teams who are used to long-tenure contracts. In practice it is the structure that aligns incentives most cleanly. The agency that is genuinely good gets renewed every 30 days for as long as the buyer wants to keep them. The agency that is coasting gets terminated cheaply. Neither side is locked into a relationship that has stopped compounding.

For the contract-clauses pattern that the kill clause sits inside, see the seven commitments every AI dev agency should make in writing.

What a 2026-shaped contract looks like

The 2026-shaped AI engagement contract has three financial structures rather than a milestone schedule:

A weekly retainer that covers the named team’s time, paid on a recurring cycle, with no per-milestone gates. The retainer is the cost of having the team available. It is not contingent on deliverable acceptance.

An eval-threshold bonus that is released when the system meets a named quality threshold against the current eval suite. The threshold is re-baselined when the suite materially changes, in writing, with both sides’ sign-off. The bonus is the alignment device; it pays the agency for outcomes rather than for hours.

A 30-day kill clause that lets either side end the engagement at the next 30-day boundary, with payment due only for retainer time accrued. No exit penalty. No deliverable obligations beyond the artifacts already shipped to the client repository.
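The kill-clause arithmetic can be sketched in a few lines. The assumptions here are mine, not contract language: boundaries are counted in calendar days from the engagement start date, a notice takes effect at the first boundary strictly after the notice date, and the retainer is tracked in whole weeks.

```python
from datetime import date, timedelta

KILL_WINDOW_DAYS = 30  # the 30-day window named in the contract structure


def next_kill_boundary(start: date, notice_date: date) -> date:
    """First 30-day boundary strictly after the notice date (assumed convention)."""
    elapsed_days = (notice_date - start).days
    if elapsed_days < 0:
        raise ValueError("notice date is before the engagement start")
    windows_completed = elapsed_days // KILL_WINDOW_DAYS
    return start + timedelta(days=(windows_completed + 1) * KILL_WINDOW_DAYS)


def retainer_due_at_exit(weekly_retainer_usd: float, weeks_accrued: int,
                         weeks_already_paid: int) -> float:
    """Payment due on termination: accrued but unpaid retainer time only.
    No exit penalty and no further deliverable obligations are modeled."""
    unpaid_weeks = max(weeks_accrued - weeks_already_paid, 0)
    return unpaid_weeks * weekly_retainer_usd


# Illustrative values only.
start = date(2026, 1, 5)
notice = date(2026, 2, 10)
boundary = next_kill_boundary(start, notice)
due = retainer_due_at_exit(12000.0, weeks_accrued=8, weeks_already_paid=6)

print("engagement ends on:", boundary.isoformat())
print("retainer due at exit (USD):", due)
```

The eval-threshold bonus is left out on purpose, since the clause as described limits the exit payment to retainer time accrued.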

This is not how most AI engagements are contracted today. It is how the engagements that compound for two or three years are contracted, and it is the structure that the better operators in the category have already moved to. The milestone schedule will persist for some time in the procurement playbooks of buyers who have not yet processed the mismatch, but the agencies that propose milestone schedules in 2026 are advertising, in the contract structure itself, that they are still operating against the 2018 model of how software services work.

The deeper point is that contract structures are themselves a forecast about which kind of engagement the agency expects to have. A milestone-shaped contract forecasts a fixed-deliverable, low-discovery, slow-moving engagement. An eval-threshold-with-kill-clause contract forecasts a high-discovery, fast-moving, outcome-centric engagement. The world has moved decisively toward the second forecast. The contract should follow.

If your current engagement is structured around milestone payments, the question is not whether the structure is creating the problems above; it is which of the five traps is most active right now, and whether the cost of restructuring the contract mid-flight is lower than the cost of running the rest of the engagement under it. In our experience, the restructure is almost always cheaper. The conversation is uncomfortable for one week and then the engagement compounds for the rest of its life.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. The contract structures described above reflect SFAI Labs’ current engagement template; specific commercial terms vary by engagement and are negotiated in writing.

Frequently Asked Questions

What is the AI agency milestone trap?

The milestone trap is the contract pattern in which AI engagements are paid against fixed milestone gates inherited from the 2018 software-services playbook. The structure quietly fails because it locks scope before discovery is possible, invites gaming through narrow eval criteria, ignores mid-stream model upgrades, makes the milestone gate more important than the system being built, and incentivizes the agency to optimize payments rather than outcomes. The escape is to replace milestones with eval thresholds, weekly working-software demos, and a 30-day kill clause.

Why don’t milestone payment schedules work for AI engagements?

Milestone schedules are commitment devices for known work. AI work in 2026 is not known at the moment commitments are made: the eval suite, the model, the cost ceiling, the user-behavior reality, and the architectural constraints all reveal themselves during the engagement, not at signing. A milestone contract asks for a specification at the moment when specification is least possible, then punishes both sides for updating that specification when reality intrudes. The mismatch is structural, not a matter of better milestone definitions.

What is the narrow-eval gaming problem in milestone contracts?

When a milestone payment is gated on a quality number against an eval set chosen at signing, both sides (usually unconsciously) optimize for that number against that fixed set. Prompts get tuned to the eval examples. Edge cases that did not appear in the eval get deferred. The number gets hit, the milestone is paid, and the system fails on real-distribution traffic two weeks later. The contract structure rewards optimizing the artifact rather than evolving it. The fix is to make the eval suite itself a living artifact that grows during the engagement.

How do mid-stream frontier model upgrades break milestone contracts?

Between contract signing and final delivery, at least one major frontier model release will land. Each release shifts the cost-quality frontier of the system being built. If the agency adopts the new model, the eval baseline shifts and the milestone threshold must be renegotiated. If the agency does not adopt it, the buyer ends up with a provably worse system. Either branch is bad. The milestone structure was designed for a world in which the underlying technology did not move during the engagement, and AI engagements live in a world where it moves every quarter.

What replaces milestone payments in a 2026 AI agency contract?

Three structures replace the milestone schedule. A weekly retainer covers the named team’s time on a recurring cycle, with no per-milestone gates. An eval-threshold bonus is released when the system meets a named quality threshold against the current eval suite, re-baselined in writing whenever the suite materially changes. A 30-day kill clause lets either side end the engagement at the next 30-day boundary, with payment due only for retainer time accrued and a written handoff of all artifacts. The retainer pays for availability, the bonus pays for outcomes, the kill clause aligns incentives.

What does an eval threshold contract look like in practice?

The eval suite lives in a client-controlled repository. New cases are added every week from production-shaped logs or a synthetic shadow stream. The threshold (accuracy, F1, win rate against a baseline, citation correctness, p99 latency, cost per request) is named in the contract and re-baselined whenever the suite materially changes, with both sides signing off in writing. The CI pipeline rejects merges that move the eval delta in the wrong direction. Quality becomes a continuous artifact rather than a milestone gate, and the contract pays against that continuous artifact.

Why do weekly working-software demos beat milestone deliverable decks?

A weekly working-software demo is a recording or live session of the actual system answering real-shaped queries against the client’s data, in the client’s environment, on the client’s key, accompanied by the eval delta and cost-per-request from the prior week and a one-page written summary. It replaces the milestone-acceptance ceremony, the project-status deck, and the renegotiation conversation that follows when the milestone is missed. None of those artifacts are needed if the working software is the visible object every seven days, which collapses the discontinuity of milestone gates into the rhythm of the engagement.

How does a 30-day kill clause change agency behavior?

Under a 30-day kill clause, the agency knows the only thing keeping the engagement alive is the buyer’s continued belief that the work is worth more than the next 30 days of fees. The agency cannot coast on a milestone schedule that locked in three months of payments. There is no next milestone gate to defer hard work to, only the next 30-day window in which they must demonstrate continuing is the right call. The agency that is genuinely good gets renewed every 30 days for as long as the buyer wants. The agency that is coasting gets terminated cheaply. Neither side is locked into a relationship that has stopped compounding.

Is a 30-day kill clause too aggressive for enterprise procurement?

It sounds aggressive to procurement teams used to long-tenure contracts, but it is the structure that aligns incentives most cleanly. The buyer is not exposed to any longer commitment than they have currently chosen to renew; the agency is paid for every 30 days they remain on the engagement; both sides are protected from the dead-weight middle of an engagement that has stopped compounding. The buyers who balk at it are usually buyers whose procurement function is calibrated for fixed-deliverable software work. Once the procurement function understands that AI work is high-discovery and fast-moving, the kill clause stops looking aggressive and starts looking obvious.

Can you restructure a milestone contract mid-engagement?

Yes, and in our experience it is almost always cheaper than running the rest of the engagement under the milestone structure. The mechanics: convert the remaining milestone payments into a weekly retainer for the named team, define an eval-threshold bonus against a suite that has been growing during the engagement so far, and add a 30-day kill clause that takes effect at the next 30-day boundary. The conversation is uncomfortable for one week. After that, every trap above stops applying, and the engagement compounds against the system rather than against the milestone schedule.
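One way to run the conversion arithmetic is sketched below. It assumes the remaining contract value stays constant and that some share of it is carved out as the eval-threshold bonus; the `bonus_share` split and the sample figures are placeholders for illustration, not recommended terms.

```python
def restructure_milestones(remaining_milestones_usd: list[float],
                           remaining_weeks: int,
                           bonus_share: float) -> tuple[float, float]:
    """Convert unpaid milestone payments into (weekly retainer, bonus pool).

    `bonus_share` is the assumed fraction of remaining value reserved for
    the eval-threshold bonus; the rest is spread evenly across the weeks.
    """
    if remaining_weeks <= 0:
        raise ValueError("remaining_weeks must be positive")
    if not 0.0 <= bonus_share < 1.0:
        raise ValueError("bonus_share must be in [0, 1)")
    total = sum(remaining_milestones_usd)
    bonus_pool = total * bonus_share
    weekly_retainer = (total - bonus_pool) / remaining_weeks
    return weekly_retainer, bonus_pool


# Illustrative values only: two unpaid milestones, eight weeks left.
weekly_retainer, bonus_pool = restructure_milestones(
    [40000.0, 60000.0], remaining_weeks=8, bonus_share=0.25
)
print("weekly retainer (USD):", weekly_retainer)
print("eval-threshold bonus pool (USD):", bonus_pool)
```

Holding the total constant is only one negotiating position; the point of the sketch is that the conversion is a short calculation, not a rewrite of the commercial relationship.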
