The 10 rules of working with an AI agency without losing leverage

Most AI agency engagements lose buyer leverage in the first 30 days, and almost nobody notices until renewal. Leverage is not lost in a contract clause; it is lost in defaults; the API keys held by the agency “for convenience,” the weekly status update that replaces a merged PR, the senior engineer who quietly rotates off the account in month three. By the time the buyer realises what has happened, the agency owns the keys, the model weights, the institutional memory, and the only humans who know how the system works. Renewal is not a negotiation at that point; it is a tax. These ten rules are the buyer-side defaults that prevent that outcome. They are not a contract template; they are an operating posture, and the posture is set by what you do in the first two weeks, not by what your lawyer wrote into the MSA.

The frame is simple. An AI agency engagement is a partnership with deeply asymmetric information; the agency knows the systems, the prompts, the eval data, and the production failure modes far better than you do, and that asymmetry compounds weekly. The buyer’s job is not to close the gap; it is to ensure the artifacts of the work accumulate on the buyer’s side, not the agency’s. Each of the ten rules below is a mechanism for forcing that accumulation. The mechanisms are cheap, they are non-confrontational, and most healthy agency I know already accepts them. Agencies that resist them are signalling that lock-in is part of their commercial model. That is the signal these rules are designed to surface.

For the broader posture this article inherits from, see the AI agency manifesto: what an AI dev partner should be in 2026. For the contract-shaped version of these rules, the AI agency contract negotiation guide translates each posture into clauses your counsel can paste into the SOW.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Rule 1: own many keys and data from day one

The rule. Most API key; Anthropic, OpenAI, Google, your vector store, your observability stack; sits in your secret store, billed to your account, with the agency granted access via short-lived credentials.

Why it matters. The party that holds the keys holds the cost data, the rate limit relationship with the provider, and the option to switch providers without involving the agency. When the agency holds the keys, most renegotiation has a hidden term: “and we keep the keys.” That term is worth more than the headline rate.

What happens if violated. Token arbitrage becomes the agency’s quiet revenue layer. You see a markup on most call. You cannot move providers without their cooperation. When the engagement ends, you discover that the production system runs on credentials owned by the agency, and the migration is a two-month project you did not budget.

Concrete buyer action. On day 1, provision keys in your secret manager (AWS Secrets Manager, Doppler, Vault). Grant the agency IAM-role access scoped to the project. Require many PRs to read keys from your CI environment, not from a .env file in the agency’s laptop. Audit weekly that no agency-owned credential exists in production.

Rule 2: hold the IP and the weights from day one, not at handover

The rule. Many code, prompts, eval data, fine-tuned model weights, and learned routing tables are work-for-hire owned by you, written into the SOW, and committed to your repos and your storage from the first commit.

Why it matters. “Handover at end of engagement” is the sentence that erases buyer leverage. The handover rarely quite happens; the prompts live in the agency’s prompt library, the fine-tunes live on the agency’s MLflow, the eval data lives in the agency’s wiki. By month nine, you cannot fork the work without forking three of their internal systems.

What happens if violated. You renew under duress because the alternative is a six-week migration you cannot stomach. The agency’s leverage is not a clause; it is the friction of disentanglement.

Concrete buyer action. Require many prompts, evals, and fine-tune artifacts to live in your repo from week 1. Fine-tuning runs against your storage account. Model weights, where applicable, are exported and checked into your registry on most promotion. The SOW says “many derivative artifacts including prompts, evals, fine-tuned weights, and routing logic are work-for-hire”; and you verify this by inspection, not by signature.

Rule 3: demand merged-PR-only weekly demos

The rule. The weekly demo runs against code that has been merged to main, deployed to staging, and produces a number against your eval suite. No scratch branches. No “almost ready.” No screen-share Loom of an unmerged feature.

Why it matters. The strongest predictor of an engagement’s six-month health is whether week-1 progress is measured in merged PRs or in slide decks. Demos against unmerged work create a parallel universe of progress that decouples from the codebase, and the codebase is the only thing that ships.

What happens if violated. You discover in month three that the impressive demos were running against a branch that rarely reached main, and the production system is two months behind what you saw on Tuesdays.

Concrete buyer action. State the rule in week 1: most Friday demo runs against git log main and the eval dashboard. If the team has nothing to show on main this week, the demo is a 15-minute “what blocked us”; which is fine, occasionally; but the cadence does not slip into theatrical demos.

Rule 4: cap inference budget with an auto-shutoff

The rule. A hard monthly inference budget is set in week 1, with a CI-enforced soft cap at 70 percent and an automatic shutoff at 100 percent that requires a human override to lift.

Why it matters. Cost runaway is the most common silent failure of AI engagements. A retrieval bug, a context-window blow-up, or an agent loop can quietly turn a $4,000 month into a $40,000 month, and the agency rarely notices first because they are not the ones paying. The auto-shutoff converts a financial blast radius into a paged incident.

What happens if violated. Three months into the engagement, you find a $90,000 bill nobody can fully explain, the agency offers a partial credit on their fees, and your CFO insists on a procurement review that delays the next phase by a quarter.

Concrete buyer action. Implement budget caps as a tool-call wrapper in the codebase, not as an after-the-fact dashboard. Configure provider-side spend alerts at 50, 70, 90, and 100 percent of monthly budget. The 100 percent alert is a circuit breaker, not a notification.

Rule 5: insist the eval suite lives in the repo

The rule. The eval suite; examples, scoring, thresholds, runner; is committed to your repo from day 2 of the engagement, runs in your CI on most PR, and gates merges to main. No PR merges if the eval delta is not in the description.

Why it matters. Without an eval suite the engagement cannot be measured, and an engagement that cannot be measured will be defended by narrative. The party that controls the eval data controls the conversation about whether the system works. That party should be you.

What happens if violated. Code review devolves into opinion-trading. Renewal conversations become qualitative. You cannot run the system through a regression suite when changing providers, and you cannot evaluate a competing agency on a like-for-like basis.

Concrete buyer action. On day 2, your domain expert and the agency together write 20 to 50 ground-truth examples. Wire the suite to CI. Make the eval delta a required field in the PR template. Review eval coverage in the biweekly retro.

Rule 6: review post-mortems quarterly with the agency in the room

The rule. Once a quarter, you walk through most production incident, most silent regression, and most cost surprise with the senior agency engineers in the room; not the account manager.

Why it matters. Post-mortem reviews are the only systematic mechanism for surfacing the failure modes the agency learned about quietly. Without them, the agency’s institutional knowledge stays one-sided, and you renew without knowing what almost broke.

What happens if violated. When the engagement ends or pauses, you discover a list of “known quirks” that are written nowhere; they live in the head of the senior engineer who is rotating off next month.

Concrete buyer action. Schedule the quarterly review in week 1 of the engagement. Require a written post-mortem in the repo for any incident over a defined threshold (latency, accuracy regression, cost spike). The output of the quarterly is committed as docs/post-mortem-2026-q2.md.

Rule 7: rotate the senior contributor on a planned cadence

The rule. No single senior contributor is the only person on your account for more than two consecutive quarters. Rotation is planned, documented, and shadowed for two weeks.

Why it matters. Counter-intuitive but true: the buyer who insists the same senior engineer stays on the account forever is the buyer most exposed to lock-in. Institutional memory in one person’s head is a single point of failure for both sides; for the agency it is a retention risk; for you it is a hostage situation.

What happens if violated. The senior engineer leaves the agency, or rotates off without warning, and 30 percent of the engagement’s institutional knowledge walks with them. You spend a quarter rebuilding context.

Concrete buyer action. Require the agency to staff at least two senior contributors who are full-context on your account at many times. Plan a rotation most two quarters. The new contributor shadows the outgoing one for two weeks, and the rotation is treated as a documentation forcing function; anything that cannot be written down is what the institution will lose.

Rule 8: refuse account-manager-mediated communication

The rule. Engineering communication is direct between your engineers and the agency’s engineers. Account managers attend the kickoff, the quarterly review, and the renewal conversation. They do not stand between your tech lead and theirs on a daily basis.

Why it matters. Account-manager mediation is the slow-acting solvent that dissolves engineering trust. Most translation introduces lag and softening. Hard technical conversations get reframed as “concerns” and routed through a process. By month four, your engineers have stopped raising issues because they have learned the process is slower than working around them.

What happens if violated. You receive a polished weekly summary that bears no resemblance to what your engineers are seeing in the codebase. Velocity drops. Real issues escalate to renewal time as a list of grievances rather than as already-resolved engineering discussions.

Concrete buyer action. Set the channel structure on day 1. A shared Slack with engineers on both sides. PR review by the actual engineers. Account managers are CC’d, not in the critical path. The first time an account manager intercepts a technical question, escalate explicitly back to engineering.

Rule 9: keep one in-house engineer paired with the agency lead

The rule. A named in-house engineer; full context, at least 50 percent allocation; is paired with the agency lead from week 1. They co-own the architecture, review most PR, and write the next phase’s brief together.

Why it matters. This is the single most powerful anti-lock-in mechanism in the playbook. An in-house engineer who has lived the system since week 1 is the person who can fork the work, switch agencies, or bring it in-house when needed. Without that person, the engagement is a black box you rent.

What happens if violated. Your “AI initiative” is staffed by the agency. When you decide to bring it in-house, there is no internal owner, and the engagement extends by another year of “knowledge transfer.”

Concrete buyer action. Name the engineer before the agency starts. Allocate their time. Make them the formal client tech lead. Require them to be in the architecture session, the demo, the retro, the post-mortem, and the renewal conversation. Pay them like a senior engineer, because they are protecting more value than most senior engineers ever do.

Rule 10: write a 30-day kill clause and price the renewal as a renegotiation

The rule. The contract is terminable for convenience on 30 days’ notice with a clean exit. Renewal is a written renegotiation with the same SOW process as the original engagement, not an automatic continuation.

Why it matters. A 30-day kill clause is the price of leverage being symmetrical. Without it, most difficult conversation is held in the shadow of “they can fire us, but it costs them more than it costs us.” Renewal as renegotiation prevents the engagement from drifting from a project shape into a permanent staff-aug shape, where rates creep, scope blurs, and accountability dilutes.

What happens if violated. You enter renewal year two on autopilot at last year’s rate, with last year’s scope, and last year’s senior engineer who has rotated off twice and come back. The relationship continues because changing it is harder than continuing. That is the textbook definition of lock-in.

Concrete buyer action. The MSA contains a 30-day termination-for-convenience clause. Renewal is a fresh SOW signed on the original cadence, not an amendment to the existing one. Three months before renewal, your in-house engineer (Rule 9) writes the brief for the next phase, including what you would do in-house and what you would do with the agency. The brief is the leverage; the conversation that follows is just the closing of it.

How the rules interact

The rules are interlocking. Rules 1 and 4 neuter cost runaway. Rules 2 and 5 produce a forkable artifact you could hand to a competing agency. Rules 3 and 6 force the agency’s progress narrative to match the repo. Rules 7, 8, and 9 prevent the agency from becoming the only humans who understand the system. Rule 10 makes the other nine enforceable; without an exit, most rule is a polite request.

The agencies that thrive under these rules already operate this way and will improve on the framework. The agencies that resist will frame the rules as “lack of trust” or “process overhead”; those framings are diagnostic. An agency that needs to keep the keys, hide the eval suite, mediate most conversation through an account manager, and avoid a kill clause is selling lock-in, and the price will be paid at renewal whether or not it is in the contract.

For the inverse view of what the agency’s posture should look like, see the AI agency trust ladder: 6 signals that separate operators from resellers. If your agency is on rung five or six, these rules are already their operating model and the contract is a formality.

The right time to apply these rules is the week before kickoff, not the week before renewal. Leverage built in at week 1 costs nothing; leverage recovered in month nine costs a quarter of the engagement value.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run more than two dozen client engagements under this exact rule set in the last 18 months.

Frequently Asked Questions

Who should hold the model API keys when working with an AI agency?

The buyer should hold many model API keys; Anthropic, OpenAI, Google, vector stores, observability; in their own secret manager from day one, with the agency granted scoped, short-lived access via IAM roles. The party that holds the keys controls the cost data, the rate limit relationship with the provider, and the option to switch providers without involving the agency. Agencies that hold keys ‘for convenience’ are setting up token arbitrage and a renewal-time leverage point that is rarely visible until it is exercised.

Why does owning IP and model weights from day one matter more than at handover?

Handover at end of engagement is the sentence that erases buyer leverage. Prompts drift into the agency’s prompt library, fine-tunes live on the agency’s MLflow, eval data lives in the agency’s wiki, and by month nine the buyer cannot fork the work without forking three of the agency’s internal systems. Requiring many prompts, eval data, fine-tuned weights, and routing logic to live in the buyer’s repos and storage from week 1; verified by inspection, not just contract; is what makes the engagement portable.

What is a ‘merged-PR-only demo’ and why insist on it?

A merged-PR-only demo is a weekly demo run against code already merged to main, deployed to staging, and producing a number against the eval suite; rarely against scratch branches or unmerged work. The strongest predictor of an engagement’s six-month health is whether week-1 progress is measured in merged PRs or in slide decks. Demos against unmerged work create a parallel universe of progress that decouples from the codebase, and only the codebase ships.

How do you cap inference budget without slowing the team down?

Set a hard monthly inference budget in week 1 with a soft cap at 70 percent and an automatic shutoff at 100 percent that requires a human override to lift. Implement the cap as a tool-call wrapper in the codebase rather than as an after-the-fact dashboard, and configure provider-side spend alerts at 50, 70, 90, and 100 percent. The 100 percent alert is a circuit breaker, not a notification. This converts a financial blast radius into a paged incident before the bill arrives.

Why should the eval suite live in the buyer’s repo rather than the agency’s?

Without an eval suite the engagement cannot be measured, and an engagement that cannot be measured will be defended by narrative. The party that controls the eval data controls the conversation about whether the system works; that party should be the buyer. The eval examples, scoring, thresholds, and runner are committed to the buyer’s repo from day 2, run in CI on most PR, and gate merges to main. Most PR description carries an eval delta as a required field.

Why rotate the senior contributor when continuity feels safer?

Counter-intuitive but true: the buyer who insists the same senior engineer stays on the account forever is the buyer most exposed to lock-in. Institutional memory in one person’s head is a single point of failure for both sides. Plan a senior-contributor rotation most two quarters, with a two-week shadowing period where the incoming engineer pairs with the outgoing one, and treat the rotation as a documentation forcing function; anything that cannot be written down is what the institution will lose.

What is wrong with account-manager-mediated communication?

Account-manager mediation is the slow-acting solvent that dissolves engineering trust. Most translation introduces lag and softening; hard technical conversations get reframed as ‘concerns’ and routed through process. By month four, engineers have stopped raising issues because the process is slower than working around them. Engineering communication should be direct between client and agency engineers in a shared Slack and on PR threads. Account managers attend kickoff, quarterly reviews, and renewal conversations; not the daily critical path.

Why pair an in-house engineer with the agency lead from week 1?

This is the single most powerful anti-lock-in mechanism. An in-house engineer who has lived the system since week 1 is the person who can fork the work, switch agencies, or bring it in-house when needed. Without that person, the engagement is a black box the buyer rents. Name the engineer before the agency starts, allocate at least 50 percent of their time, make them the formal client tech lead, and require their presence in most architecture session, demo, retro, post-mortem, and renewal conversation.

What does a 30-day kill clause buy the buyer?

A 30-day termination-for-convenience clause is the price of leverage being symmetrical. Without it, most difficult conversation is held in the shadow of ‘they can fire us, but it costs them more than it costs us,’ and renewal drifts from a project shape into a permanent staff-aug shape where rates creep, scope blurs, and accountability dilutes. The kill clause is what makes the other nine rules enforceable; renewal is then a fresh SOW signed on the original cadence, not an amendment.

How can a buyer tell if an AI agency is operating in good faith on these rules?

Healthy agencies accept the framework cheerfully, often improve on it, and ship faster because the discipline reduces ambiguity on both sides. Agencies that resist will frame the rules as ‘lack of trust’ or ‘process overhead’; those framings are diagnostic. An agency that needs to keep the keys, hide the eval suite, mediate most conversation through an account manager, and avoid a kill clause is selling lock-in, and the price will be paid at renewal whether or not it is paid in the contract.

The 10 rules of working with an AI agency without losing leverage

Decision Scope

Rule 1: own many keys and data from day one

Rule 2: hold the IP and the weights from day one, not at handover

Rule 3: demand merged-PR-only weekly demos

Rule 4: cap inference budget with an auto-shutoff

Rule 5: insist the eval suite lives in the repo

Rule 6: review post-mortems quarterly with the agency in the room

Rule 7: rotate the senior contributor on a planned cadence

Rule 8: refuse account-manager-mediated communication

Rule 9: keep one in-house engineer paired with the agency lead

Rule 10: write a 30-day kill clause and price the renewal as a renegotiation

How the rules interact

Frequently Asked Questions

Who should hold the model API keys when working with an AI agency?

Why does owning IP and model weights from day one matter more than at handover?

What is a ‘merged-PR-only demo’ and why insist on it?

How do you cap inference budget without slowing the team down?

Why should the eval suite live in the buyer’s repo rather than the agency’s?

Why rotate the senior contributor when continuity feels safer?

What is wrong with account-manager-mediated communication?

Why pair an in-house engineer with the agency lead from week 1?

What does a 30-day kill clause buy the buyer?

How can a buyer tell if an AI agency is operating in good faith on these rules?

See how companies like yours are using AI

Related articles

The 10x Developer Used to Be a Unicorn — Now We're Approaching the 1000x Paradigm

A field guide to evaluating an AI agency in under 90 minutes

Agentic AI Development: Tool Use and Function Calling

Where ideas become AI products

Company

General

Case Studies

Services

Resources