The 12-month AI agency contract is a 2022 artifact pretending to be a procurement standard. It prices too much risk into a single signature. By the time the contract is up for renewal, the model has been deprecated twice, your eval suite has drifted, and the scope you signed for is no longer the scope you need. Quarterly mandates (90-day engagements with their own scope, eval thresholds, and kill conditions) align incentives in a way annuals rarely could. This is a defense of the quarterly cadence and a structural argument for why AI agency contracts should be re-issued every 90 days from here on out.
The premise is simple. AI work has three drift vectors annual contracts cannot absorb: model drift (the underlying model changes mid-contract, sometimes silently), eval drift (the production distribution shifts and the eval suite goes stale), and scope drift (what you needed in January is not what you need in October). Annual contracts pretend these vectors do not exist; quarterly mandates assume they usually do. The shape that survives 2026 is not “annual with quarterly check-ins”; it is genuinely four separate mandates with distinct deliverables and explicit termination clauses between them.
What follows is the structure I recommend to portfolio companies and clients, and the structure SFAI Labs runs internally with our own engagements. It is not a procurement workaround for finance-led organizations; it is the operating cadence that keeps AI engagements honest.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why annual contracts mis-price AI risk
- The three drift vectors that break long contracts
- Q1: discovery plus first eval-bound system
- Q2: hardening
- Q3: expansion
- Q4: transition or renew
- Kill clauses, eval thresholds, and renewal triggers
- Why quarterly beats annual on every margin
- The objections and the counters
- How to migrate an existing annual contract to quarterly
- FAQ
Why annual contracts mis-price AI risk
An annual contract assumes the work it describes will be roughly the work that gets done. In conventional software engineering, that assumption holds well enough; the architecture is stable, the failure modes are known, and the velocity curve is predictable within a band. AI engineering breaks all three. Models change, eval distributions shift, and the velocity curve is non-linear because the rate-limiting step is not engineering hours but the speed at which evidence resolves architectural questions.
Annual contracts mis-price this risk in three ways. First, they bundle scope that should be sequenced; Q1 discovery is not the same shape of work as Q3 expansion, and pricing them as a single block obscures the fact that the agency does not yet know what Q3 will require. Second, they create a renewal cliff: at month 11 both sides start the renegotiation dance, which is the single highest-context-loss event in any engagement. Third, they make termination expensive. A client who realizes in month four that the engagement is failing has nine months of contract to claw out of, and the friction of that clawback is itself an incentive to stay in a failing engagement past the point of recovery. None of these failure modes are theoretical; many of them are common.
The structural fix is not a better annual contract. It is a different contract shape entirely.
The three drift vectors that break long contracts
Model drift. The model the agency wrote prompts against in Q1 may not be the model in production by Q4. Providers deprecate, re-train silently, change tokenizers, change behavior at the same temperature. An eval suite written against one model snapshot is not portable across the next without re-running and re-calibrating. Annual contracts treat this as a maintenance line item; quarterly mandates treat it as a re-baselining event with its own deliverable.
Eval drift. Production distribution shifts. The 50 ground-truth cases you wrote in Q1 covered the 80% of traffic that existed in Q1; by Q3 the long tail has moved and the suite is measuring a problem the system no longer has. Quarterly mandates force a fresh eval-suite review every 90 days, with named cases added and named cases retired. Annual contracts let the suite go stale.
Scope drift. The user workflow you optimized in Q1 is not the workflow the user has in Q3; sometimes because the agency’s own work changed it, sometimes because adjacent product surfaces moved, sometimes because the regulatory ground shifted. Annual contracts amend; quarterly mandates re-scope. Re-scoping is cheaper and more honest than amending.
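The quarterly eval-suite review described under eval drift (named cases added, named cases retired) can be recorded as a simple diff between two quarters' case lists. The following is a minimal sketch, assuming each ground-truth case has a stable string id; the function name and the example ids are hypothetical, not part of any particular eval tool.

```python
def review_suite(previous_ids, current_ids):
    """Diff two quarters' eval suites by case id (ids are assumed stable)."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "added": sorted(current - previous),         # new cases named this quarter
        "retired": sorted(previous - current),       # cases dropped this quarter
        "carried_over": sorted(previous & current),  # cases kept unchanged
    }


# Hypothetical case ids, for illustration only.
q1_suite = ["refund-01", "refund-02", "login-07"]
q2_suite = ["refund-01", "login-07", "export-03"]
review = review_suite(q1_suite, q2_suite)
```

Committing the resulting record alongside the end-of-quarter assessment keeps the review auditable: anyone can see which cases were named, and why the suite changed.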
For a deeper analysis of why scope-as-features is the wrong unit and scope-as-evaluations is the right one, see stop scoping AI projects in features, scope them in evaluations.
Q1: discovery plus first eval-bound system
The Q1 mandate runs 90 days and is priced as a fixed-fee engagement with a single integrated deliverable: a system in production behind an eval gate, plus the documentation and onboarding to operate it. The shape mirrors the first-14-days engagement anatomy for the opening fortnight, then continues into a 12-week build cycle.
The Q1 deliverables: an engagement charter, an eval baseline of 30–80 ground-truth cases, an architecture decision record, the first eval-gated PR shipped by week 2, and by week 12 a system in staging or production with weekly demo cadence and an eval threshold the system clears. The eval threshold is named in the contract; for example, “the system shall achieve at least 0.78 on the named eval suite, measured weekly, prior to Q1 sign-off.” If the threshold is not met by week 11, the contract specifies whether Q2 is renegotiated, the engagement enters a remediation period, or the relationship terminates.
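A threshold clause like the one above only works if both sides compute the number the same way. Here is a minimal sketch of a Q1 gate, assuming the score is a plain pass rate over the ground-truth cases; the contract could just as well name a weighted or composite metric. The names `EvalCase`, `pass_rate`, and `clears_gate` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    passed: bool


def pass_rate(cases):
    """Fraction of cases that passed; an empty suite is treated as an error."""
    if not cases:
        raise ValueError("eval suite is empty")
    return sum(1 for c in cases if c.passed) / len(cases)


def clears_gate(cases, threshold=0.78):
    """True when the suite score is at or above the contract threshold."""
    return pass_rate(cases) >= threshold


# Hypothetical weekly run: 40 cases, 32 passing.
weekly_run = [EvalCase(f"case-{i:02d}", passed=(i < 32)) for i in range(40)]
gate_ok = clears_gate(weekly_run)
```

Running the same function weekly, as the example clause requires, gives both parties a shared number well before week 11.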
The Q1 fee is fixed. It is not an hourly retainer; it is the price of producing the named artifact. This pricing shape is itself a forcing function: the agency cannot bill into infinity if discovery slips, and the client cannot demand free expansion of scope mid-quarter. For a longer treatment of why fixed-fee outcome pricing beats hourly here, see the AI agency pricing manifesto.
Q2: hardening
Q2 is the hardening mandate. The system that shipped in Q1 is now exposed to real users at real volume, and the failure modes that did not surface in eval cases will surface in production. The Q2 deliverable is not new features; it is reliability, observability, cost control, and eval coverage of the failure modes Q1 missed.
Concretely, Q2 ships: full observability stack instrumented (trace store, cost telemetry, latency P95/P99, regression alerts), eval suite expanded to cover the production failure modes catalogued in Q1, cost-per-request reduced by a target percentage from the Q1 baseline, P95 latency under a target SLA, on-call runbook written and tested. The Q2 eval threshold is higher than Q1 (for example, “0.85 on the expanded suite”) because the suite itself has hardened. The Q2 mandate has its own kill clause: if the system has not held the SLA for two consecutive weeks by week 10, Q3 does not auto-renew.
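The Q2 kill clause is mechanical enough to check in code. This sketch assumes one P95 latency reading per week, in milliseconds, with the first list element being week 1, and it reads "held the SLA" as "P95 at or under the SLA for two weeks in a row"; the function names and numbers are hypothetical.

```python
def held_sla_streak(weekly_p95_ms, sla_ms, required_weeks=2):
    """True if any run of `required_weeks` consecutive weeks stayed within the SLA."""
    streak = 0
    for p95 in weekly_p95_ms:
        streak = streak + 1 if p95 <= sla_ms else 0  # a breach resets the run
        if streak >= required_weeks:
            return True
    return False


def q3_eligible(weekly_p95_ms, sla_ms, deadline_week=10):
    """Only readings up to and including the deadline week count."""
    return held_sla_streak(weekly_p95_ms[:deadline_week], sla_ms)


# Hypothetical Q2 readings against an 800 ms SLA.
readings = [900, 850, 820, 790, 810, 780, 760, 805, 770, 900]
eligible = q3_eligible(readings, sla_ms=800)
```

In this example weeks 6 and 7 are both under the SLA, so the clause is satisfied even though week 10 breaches. Whether a later breach should reopen the question is something the contract has to state explicitly.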
Hardening is the quarter most agencies want to skip and most clients let them skip. Quarterly mandates make it un-skippable because Q3 expansion is contractually conditional on Q2 hardening having shipped.
Q3: expansion
Q3 is the expansion mandate. Now that the system is in production and hardened, Q3 ships the next surface area. This is the quarter where the agency adds the second model-bearing feature, integrates a second data source, or extends the system to a second user segment. The Q3 mandate is the most flexible of the four; the scope is decided at the Q2 close based on what the eval data and production telemetry showed, not based on what was guessed in January.
The structural advantage is enormous. In an annual contract, Q3 expansion was committed to in month one based on hypotheses; in a quarterly contract, it is committed to in month seven based on data. The agency does not have to defend the wrong scope; the client does not have to pay for it. Q3 has its own eval threshold for the new surface, its own kill clause, and its own integration test against the existing Q2 system to ensure expansion did not regress the hardened core.
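The integration test against the hardened Q2 core can be as simple as comparing per-case results before and after expansion. This is a minimal sketch, assuming each run is stored as a mapping from case id to pass/fail; it treats a case missing from the new run as a regression, which is a deliberately strict assumption.

```python
def regressions(baseline, candidate):
    """Case ids that passed in the baseline run but fail or are absent in the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical results: Q2 hardened core versus the Q3 expansion build.
q2_results = {"a": True, "b": True, "c": False}
q3_results = {"a": True, "b": False, "c": True, "d": True}
regressed = regressions(q2_results, q3_results)
```

An empty list means the expansion did not regress anything the hardened core already passed; new cases such as `d` are covered by the Q3 threshold, not by this check.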
Q3 is also the quarter where the relationship structure can shift. If Q2 hardening surfaced that the system needs a permanent embedded engineer, Q3 may include a residency component (see the AI agency rotating residency model for that pattern). If Q2 surfaced that the system is stable enough that the client can operate it solo, Q3 may be a smaller mandate that prepares for transition.
Q4: transition or renew
Q4 is the explicit decision quarter. By the start of Q4 the engagement has produced four artifacts that did not exist before: a system in production, a hardened reliability profile, an expanded surface, and three quarters of eval and cost data. Q4 mandates one of three outcomes: full transition to in-house operation, renewal into a year-two cadence, or termination with a documented handoff.
Transition is underrated. A healthy AI agency engagement should make itself replaceable; the Q4 transition mandate is where that replaceability is exercised. The deliverable is a handoff package (runbooks, eval suite documentation, prompt registry export, on-call rotation transferred, post-engagement review held) and the explicit reduction of agency involvement to a defined oversight role. Many engagements that should transition do not, because annual contracts roll. Quarterly cadences make the transition decision explicit at month 10.
If the relationship continues into year two, it does so as a fresh four-quarter mandate, not an annual auto-renew. The quarterly cadence is the unit of work even when the relationship is multi-year.
Kill clauses, eval thresholds, and renewal triggers
Each quarterly mandate ships with three contract-level mechanisms.
The kill clause. Each quarter has a written termination condition tied to a measurable outcome. Q1 example: “If the eval threshold of 0.78 is not met by end of week 11, the client may elect to terminate the engagement with a 5-business-day handoff and no further fees.” Q2 example: “If P95 latency exceeds the SLA for two consecutive weeks, Q3 does not auto-renew.” The clause is not punitive; it is structural. It eliminates the lock-in that long contracts produce.
The eval threshold. Each mandate names a specific eval pass-rate, latency P95, cost-per-request, or composite metric the system must clear. The threshold is named in the contract, not in a side document, because the contract is what gets read when the relationship is under stress. For why eval-anchored contracts beat feature-anchored contracts, see stop paying AI agencies for documentation, pay them for evals.
The renewal trigger. Q2 only begins if Q1 closes successfully against the named threshold. There is no auto-renew; there is an explicit decision at the end of each quarter, with a written assessment in the repo. The default is termination if no renewal decision is made. This sounds aggressive and is in fact the correct default: it forces the assessment to happen.
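The renewal trigger reduces to a two-condition rule, which is easy to get subtly wrong if "no decision" is allowed to mean "continue". The sketch below encodes the default-to-termination behavior described above; the `Decision` enum and function name are hypothetical, and a real contract may add a remediation path that this deliberately omits.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    RENEW = "renew"
    TERMINATE = "terminate"


def next_mandate_starts(threshold_met: bool, decision: Optional[Decision] = None) -> bool:
    """The next quarter begins only on a met threshold AND an explicit renewal.

    A missing decision (None) is treated as termination, so silence never renews.
    """
    return threshold_met and decision is Decision.RENEW


# No written decision was recorded: the engagement ends by default.
starts_q2 = next_mandate_starts(threshold_met=True, decision=None)
```

Making `None` the default argument is the design choice that matters here: a caller who forgets to record a decision gets termination, not a silent rollover.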
Why quarterly beats annual on every margin
Margin: scope accuracy. Q3 scope decided at month 7 with six months of data is more accurate than Q3 scope decided at month 1 with no data. Annual contracts force premature commitment.
Margin: incentive alignment. An agency on a 90-day mandate cannot coast; an agency on an annual cannot be terminated cheaply. Under quarterly mandates, both sides have to perform every quarter.
Margin: termination cost. A failing annual engagement costs nine months of clawback at month four. A failing quarterly engagement costs zero at the next 90-day boundary.
Margin: re-baselining cadence. Quarterly cadences force a fresh eval and architecture review every 90 days. Annual cadences let staleness compound.
Margin: budget transparency. Quarterly fees are easier to defend to the CFO than annual ones because each quarter has a named deliverable, a named threshold, and a named outcome. The annual contract reads as “AI strategy partnership”; the quarterly contract reads as “ship a system that passes eval at 0.85.”
Margin: relationship hygiene. The annual pre-renewal political theater (endless meetings, reissued statements of work, finance reviews) is replaced by a 60-minute end-of-quarter review with a clear decision artifact. The relationship spends less time selling itself and more time delivering.
I have not found a margin where annual beats quarterly. The single argument for annual is that procurement teams are organized around it; that is an organizational habit, not an economic argument.
The objections and the counters
“Quarterly creates renewal overhead.” No; it creates renewal visibility. The same renegotiation happens in an annual; it just happens once a year as a one-month rolling crisis. Quarterly distributes the conversation across four 60-minute sessions.
“Quarterly makes long-term planning impossible.” Long-term planning at month one is fiction in AI work. Quarterly does not prevent multi-year planning; it prevents multi-year commitment to the wrong plan. The plan can be multi-year; the contract should not be.
“Quarterly is harder for procurement.” True for a quarter. Procurement teams adapt. The first quarterly contract is procurement-heavy; by Q3 the template is reusable and signature time is shorter than the first annual was.
“Quarterly creates instability for the agency.” Only for agencies whose internal economics depend on locked-in revenue rather than performed work. Agencies that ship reliably welcome quarterly because their pipeline is healthier; they are not bound to one client for a year regardless of fit.
How to migrate an existing annual contract to quarterly
The migration path is straightforward. Identify the next natural decision point in the existing annual, usually a quarterly review meeting that already exists informally. Use that meeting to convert the remaining annual term into two or three quarterly mandates with explicit kill clauses and eval thresholds. The agency’s annual fee is rebased into per-quarter fees that sum to the same number; the structure is what changes, not the cost.
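The fee rebasing is simple arithmetic, but rounding can make the quarterly fees drift from the original total. Here is a minimal sketch that works in integer cents so the parts always sum exactly; the equal split is an assumption for illustration (the parties may well weight quarters differently), and the function name and amounts are hypothetical.

```python
def rebase_fee(remaining_fee_cents: int, quarters: int) -> list[int]:
    """Split a remaining fee into per-quarter fees that sum to the same number."""
    if quarters <= 0:
        raise ValueError("quarters must be positive")
    if remaining_fee_cents < 0:
        raise ValueError("fee must be non-negative")
    base, extra = divmod(remaining_fee_cents, quarters)
    # Spread any leftover cents over the earliest quarters, one cent each.
    return [base + (1 if i < extra else 0) for i in range(quarters)]


# Hypothetical: 100,000.00 of annual fee left, converted into three mandates.
fees = rebase_fee(10_000_000, 3)
```

Because the split uses `divmod` on integers rather than floating-point division, `sum(fees)` equals the remaining fee to the cent, which is the property the migration promises.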
The single most important migration step is the eval threshold. If the existing annual contract has no eval threshold, the migration is a chance to install one. The first quarterly mandate is then “establish baseline + ship one feature against the baseline” with a named threshold; from there every subsequent quarter has a number to move. For the kickoff structure of that first quarter, see anatomy of a great AI agency kickoff.
A clean migration also separates the existing annual into discovery, hardening, and expansion components even if the original contract did not. This re-shaping forces the conversation about what was shipped and what is genuinely the next work, which is the same conversation that should have happened at signing.
FAQ
The 12-month contract is not a procurement requirement; it is a procurement habit. The economic argument for quarterly mandates is overwhelming, and the structural argument, that AI work has drift vectors annual contracts cannot absorb, is decisive. The agencies that will be standing in 2027 are the ones running 90-day mandates against named thresholds, not the ones renewing annuals on autopilot. If you are signing an annual AI agency contract this quarter, you are signing the wrong shape.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has migrated 11 portfolio companies from annual to quarterly AI agency contracts in the last 14 months.
Frequently Asked Questions
Why is the 12-month AI agency contract no longer the right shape?
Annual contracts mis-price three structural risks specific to AI work: model drift (the underlying model changes mid-contract), eval drift (the production distribution shifts and the eval suite goes stale), and scope drift (what was needed in January is not what is needed in October). Annual contracts also bundle scope that should be sequenced, create a high-friction renewal cliff, and make termination expensive enough that failing engagements stay alive past the point of recovery. Quarterly mandates assume drift exists and price each 90-day period against a named eval threshold and explicit kill clause.
What is a quarterly mandate in an AI agency context?
A quarterly mandate is a 90-day fixed-fee engagement with its own scope, named eval threshold, and explicit kill clause. The four mandates we recommend are Q1 discovery plus first eval-bound system, Q2 hardening, Q3 expansion, and Q4 transition or renew. Each mandate is contractually independent; Q2 only begins if Q1 closes successfully against its named threshold, and there is no auto-renew between quarters. The relationship can extend into multiple years, but the unit of work is always 90 days.
What does the Q1 mandate produce?
Q1 produces an engagement charter, an eval baseline of 30 to 80 ground-truth cases, an architecture decision record, the first eval-gated PR shipped by week 2, and by week 12 a system in staging or production behind an eval gate. The Q1 contract names the eval threshold the system must clear (for example, 0.78 on the named suite), and the kill clause specifies what happens if the threshold is not met by week 11. Q1 is priced as a fixed fee, not an hourly retainer, so the agency cannot bill into infinity if discovery slips.
Why is Q2 dedicated to hardening rather than new features?
Once the Q1 system is exposed to real users at real volume, failure modes that did not surface in eval cases will surface in production. Q2 ships full observability instrumentation, eval suite expansion to cover the production failure modes Q1 missed, cost-per-request reduction targets, P95 latency under a defined SLA, and a written and tested on-call runbook. Hardening is the quarter most agencies want to skip and most clients let them skip. Quarterly mandates make it un-skippable because Q3 expansion is contractually conditional on Q2 hardening having shipped.
How are kill clauses written in quarterly mandates?
Each quarter has a written termination condition tied to a measurable outcome. A Q1 example: ‘If the eval threshold of 0.78 is not met by end of week 11, the client may elect to terminate the engagement with a 5-business-day handoff and no further fees.’ A Q2 example: ‘If P95 latency exceeds the SLA for two consecutive weeks, Q3 does not auto-renew.’ Kill clauses are not punitive; they are structural. They eliminate the lock-in that long contracts produce and force both sides to perform every quarter.
Doesn’t quarterly contracting create more renewal overhead than annual?
It creates renewal visibility, not renewal overhead. The same renegotiation happens in an annual contract; it just happens once a year as a one-month rolling crisis right before signing. Quarterly distributes the conversation across four 60-minute end-of-quarter sessions, each with a clear decision artifact in the repo. The first quarterly contract is procurement-heavy because templates do not yet exist; by Q3 the template is reusable and signature time is shorter than the first annual was.
How is Q3 scope decided?
Q3 scope is decided at the close of Q2 based on what the eval data and production telemetry showed, not based on what was guessed in January. This is the structural advantage over annual contracts. In an annual, Q3 expansion was committed to in month one based on hypotheses; in a quarterly cadence, it is committed to in month seven based on data. Q3 has its own eval threshold for the new surface, its own kill clause, and an integration test against the existing Q2 system to ensure expansion did not regress the hardened core.
What does Q4 transition look like?
Q4 mandates one of three explicit outcomes: full transition to in-house operation, renewal into a year-two cadence, or termination with a documented handoff. Transition is the most underrated outcome; a healthy AI agency engagement should make itself replaceable. The Q4 transition deliverable is a handoff package containing runbooks, eval suite documentation, prompt registry export, transferred on-call rotation, and a post-engagement review. Many engagements that should transition do not, because annual contracts roll. Quarterly cadences make the transition decision explicit at month 10.
How do you migrate an existing annual contract to quarterly mandates?
Identify the next natural decision point in the existing annual, usually a quarterly review meeting that already exists informally. Use that meeting to convert the remaining annual term into two or three quarterly mandates with explicit kill clauses and named eval thresholds. The agency’s annual fee is rebased into per-quarter fees that sum to the same number, so the cost does not change; only the structure does. The single most important migration step is installing the eval threshold if one does not already exist, because that threshold is what every subsequent quarter measures itself against.
Does quarterly contracting destabilize the agency’s revenue?
Only for agencies whose internal economics depend on locked-in revenue rather than performed work. Agencies that ship reliably welcome quarterly contracts because their pipeline is healthier; they are not bound to one client for a year regardless of fit, and successful Q1 mandates routinely renew into Q2 without procurement friction. Agencies that resist quarterly are signaling that their revenue model depends on retention friction rather than delivery quality. That signal should itself be a procurement input.