Most AI agency contracts in 2026 still read like 2018 web-development MSAs. They cover scope, milestones, and IP assignment — and almost nothing that actually fails on a modern AI engagement. Eval datasets walk out the door. Inference costs spike 4x at launch. Prompt libraries stay in agency-only repos. Fine-tuned weights never get handed over because nobody wrote the clause.
The fix is a short, AI-specific set of commitments your agency should put in writing — before the SOW is signed, not after the wheels come off. Below: seven commitments, each with clause language for a redline, a compliance test, and the failure mode when skipped.
This piece is a spoke under the AI agency manifesto. The manifesto argues for a new operating model; this piece is the contractual surface area where that model has to be enforceable.
Why these seven and not the usual SOW checklist
Generic SOW templates were written for deterministic software. AI engagements break those assumptions on four dimensions:
- The deliverable is not just the code. It is the code, the prompts, the eval set, the fine-tuned weights, and the runtime cost profile.
- The bug surface is wider. Prompt injection, training-data leakage, and jailbreak resilience are not in any 2018 MSA.
- The cost curve is different. Token spend at scale dwarfs build cost in many production agents — and is often passed through opaquely.
- The exit is harder. Without the eval set and prompt registry at offboarding, the buyer holds an artifact they cannot maintain.
Each commitment below targets one of those gaps. They map to practitioner pain across r/MachineLearning and HN, and the BCG 2024 Build for the Future survey finding that only 28% of AI deployments measurably hit business goals — partly because contracts did not specify what hitting them looked like. Cross-references to OWASP LLM Top 10, NIST AI RMF, and EU AI Act Article 28 anchor the framework in standards bodies.
Commitment 1: Eval transparency
The commitment. The agency hands over the full evaluation suite — datasets, rubric weights, judge prompts, pass/fail thresholds — at every milestone, not just final delivery. The buyer can re-run any eval against any checkpoint at any time and get the same number.
Clause to redline.
Evaluation Artifacts. Agency shall deliver, at each Milestone, the complete Evaluation Artifacts comprising: (a) all evaluation datasets including held-out test sets; (b) all evaluation rubrics and scoring functions; (c) all judge-model prompts and version identifiers; (d) all pass/fail thresholds and the rationale for each threshold; (e) reproducible execution scripts. Client shall have an unrestricted, perpetual, royalty-free license to use the Evaluation Artifacts, including the right to re-run them against any successor system.
Test of compliance. Before milestone payment, the buyer’s engineer pulls the eval suite, runs it locally against the latest checkpoint, and reproduces the score within rounding error. If they cannot, the milestone is not complete.
Failure mode. The buyer cannot tell whether a future change is a regression. Vendor lock-in becomes total: only the agency knows whether the system still works. Engagements break here when an agency loses a senior engineer and the eval logic walks out the door with them.
Commitment 2: Model-weight ownership
The commitment. If the agency fine-tunes a model on the buyer’s data, the resulting weights are the buyer’s property — delivered, not licensed. Same for adapters, LoRAs, distilled student models, and embeddings derived from buyer data.
Clause to redline.
Trained Artifacts. Any model weights, adapter weights, low-rank adaptations, distilled models, fine-tuned embeddings, or other parameters trained, derived, or computed using Client Data shall be Work Product owned exclusively by Client upon creation. Agency shall deliver, within five business days of Milestone acceptance, all such Trained Artifacts in their native checkpoint format together with training-run configuration, hyperparameters, dataset manifest, and logs sufficient to reproduce the artifact.
Test of compliance. Pre-signature: ask the agency to commit to a delivery format (Hugging Face checkpoint, Modal volume, S3 bucket — name it). Post-milestone: download, load, run an inference. If you cannot, you do not own it.
Failure mode. The system is not portable. When the relationship ends, the weights stay on agency infrastructure — and the buyer pays re-training costs to migrate. This is the most common form of soft lock-in in 2026 AI engagements.
Commitment 3: Prompt registry handoff
The commitment. Every prompt, system message, tool-use schema, retrieval template, and routing rule that the production system relies on is delivered as version-controlled artifacts inside the buyer’s source tree — not in an agency-private notebook, Cursor workspace, or shared Notion page.
Clause to redline.
Prompt and Configuration Artifacts. Agency shall maintain all prompts, system messages, tool-use schemas, retrieval templates, function-calling specifications, routing logic, and runtime configurations (“Prompt Artifacts”) in a version-controlled repository owned and accessible to Client. No Prompt Artifact required for production operation may exist solely in agency-controlled tooling, notebooks, or third-party SaaS. At each Milestone, Agency shall verify that the production system can be redeployed from Client’s repository without reference to any agency-private resource.
Test of compliance. Run the redeployment drill: clone the buyer’s repo on a clean machine with only the buyer’s API keys. Build, deploy, exercise. If the system needs a prompt living anywhere else, the test fails.
Failure mode. Prompts are now what source code was in 2010 — the actual asset. Prompt libraries kept as “agency templates” and withheld at sunset turn a six-month prompt-engineering cycle into a six-month re-do.
Commitment 4: Inference-cost visibility
The commitment. The buyer sees raw token spend — provider invoices, by model, by route, by day — with at most a stated, fixed markup. A monthly cap is set; overruns require written approval before they are incurred.
Clause to redline.
Inference Cost Pass-Through. Agency shall pass through provider inference costs (including but not limited to OpenAI, Anthropic, AWS Bedrock, Google Vertex AI) at cost plus a stated markup not to exceed [X]%. Agency shall provide Client with monthly invoicing including (a) the underlying provider invoice, redacted only as required by Agency-provider confidentiality, (b) per-model and per-route token usage, and (c) reconciliation against Client’s internal usage logs where available. A monthly Inference Spend Cap is set in the SOW; spend exceeding the Cap requires written Client approval before incurrence.
Test of compliance. First post-launch month: reconcile the agency invoice against the provider’s own usage dashboard (OpenAI usage console, Anthropic Workbench, Bedrock CloudWatch). Numbers should match within 1–2% drift.
Failure mode. Inference is the largest variable cost in many production AI features. Passthrough invoices arrive at 3–4x the buyer’s internal estimate with no visibility on which routes are responsible. By the time finance flags it, six figures are gone.
Commitment 5: Data-residency disclosure
The commitment. The agency tells the buyer, in writing, where every byte of buyer data — prompts, completions, embeddings, training data, eval data — is processed and stored. Provider, region, sub-processor list, retention windows. Updated when it changes.
Clause to redline.
Data Residency and Sub-Processors. Agency shall maintain and provide to Client a current Sub-Processor and Data Residency Schedule listing, for each category of Client Data: the model providers and infrastructure providers that process it, the geographic regions of processing and storage, the retention period, and the legal basis for any cross-border transfer. Agency shall notify Client at least thirty (30) days before adding a new sub-processor or changing a region of processing for production data, and Client shall have the right to object. This Schedule shall be incorporated into the Data Processing Agreement (DPA).
Test of compliance. Read the schedule. Cross-check against the buyer’s compliance posture (GDPR Article 28, HIPAA BAA scope, sectoral residency rules). If the schedule is shorter than the system’s actual call graph, it is wrong.
Failure mode. Quiet GDPR or sectoral violations the buyer discovers in audit, not in development. EU AI Act Article 28 obligations on providers of high-risk AI systems also flow through to deployers — buyers inherit the obligation regardless of who built the system.
Commitment 6: Security-finding response SLA
The commitment. When a security finding is reported — whether a classical web-app issue or an AI-specific one (prompt injection, training-data leakage, jailbreak, model exfiltration) — the agency commits to a triage and patch SLA, with severity definitions referenced to OWASP LLM Top 10.
Clause to redline.
Security Response Service Levels. Agency shall acknowledge security findings within four (4) business hours and provide an initial severity classification within twenty-four (24) hours. Severity is determined with reference to OWASP LLM Top 10 (LLM01 Prompt Injection, LLM02 Insecure Output Handling, LLM03 Training Data Poisoning, LLM06 Sensitive Information Disclosure, etc.) and CVSS where applicable. Critical findings shall be remediated or have a documented mitigation in place within seven (7) calendar days; High findings within thirty (30) days. AI-specific findings shall not be downgraded for lacking a CVSS score.
Test of compliance. File a synthetic finding (e.g., a prompt-injection vector via a non-privileged input) during the engagement. Measure the response. The drill is the test.
Failure mode. AI-specific issues land in the “we don’t have a process for that” pile. OWASP’s LLM Top 10 exists precisely because traditional security programs miss these classes. Without an SLA that covers them explicitly, prompt-injection findings get triaged as P3 and stay there.
Commitment 7: Sunset rights
The commitment. When the engagement ends — for any reason — the buyer gets a defined offboarding package within a defined window. No artifact required to keep the system running may be withheld. The agency commits to a hand-over period at agreed rates.
Clause to redline.
Sunset and Transition Assistance. Upon expiration or termination of this Agreement for any reason, Agency shall, within fifteen (15) business days, deliver to Client the Sunset Package comprising: (i) all Work Product including Trained Artifacts, Prompt Artifacts, and Evaluation Artifacts; (ii) infrastructure-as-code and deployment scripts; (iii) credentials and access transitions for any Client-paid third-party services; (iv) a written runbook covering operation, monitoring, and known issues; (v) up to thirty (30) hours of Transition Assistance billed at the standard rate. Agency shall not condition delivery of the Sunset Package on payment of any disputed invoice.
Test of compliance. At any milestone, ask: “If we ended the engagement today, what would land in our hands within fifteen days?” The agency should answer in five sentences. If the answer requires a meeting, the package does not exist yet.
Failure mode. Hostage offboarding — the buyer cannot end the relationship without paying a re-implementation cost. This is what turns a missed-fit engagement into a multi-quarter migration.
How to use this framework as a buyer
You do not need to win all seven points to have a workable contract. You need to know which ones you are giving up, and why.
A useful pattern: send the seven commitments to the agency before the SOW arrives. Ask them to mark each as agreed, agreed with redline, or declined with rationale. Three reactions tell you something:
- Agreed across the board, no questions. Disciplined or not reading carefully. Probe a couple of clauses to find out which.
- Selective redlines on Commitments 4 and 7. Reasonable — markup percentages and offboarding windows are negotiable. Not declining the commitments themselves is the signal.
- Decline on Commitments 1, 2, or 3. Walk away. These are not policy preferences — they are the mechanism by which AI work product becomes yours rather than theirs.
For a deeper pre-signature checklist, see the CTO vetting checklist. The companion piece on contract negotiation covers MSA / SOW / DPA / NDA layering.
How to use this framework as an agency
If your shop cannot make these commitments, your work product is partly extracted rent. The fix is operational, not contractual:
- Eval transparency → eval-as-code from project day one, not retrofitted at handoff.
- Prompt registry → prompts live in the buyer’s repo from the first PR.
- Inference-cost visibility → metering and tagging are part of the deploy pipeline, not a finance afterthought.
Agencies that operate this way close enterprise contracts faster because legal and procurement stop fighting over IP, residency, and offboarding. The seven commitments are a sales asset, not a concession.
Frequently asked questions
Do these commitments apply to fixed-price engagements or only retainer/T&M?
Equally. The commercial structure is independent of the work-product commitments. Some agencies argue that fixed-price exempts them from prompt-registry handoff — it does not. If the buyer pays for the system, the buyer owns the artifacts it runs on. Pricing is a separate negotiation about who carries scope risk.
Are these commitments compatible with using OpenAI, Anthropic, or Bedrock under the hood?
Yes. None require self-hosted models. Commitment 5 (data residency) just makes existing reality visible. Commitment 4 (inference cost visibility) is mechanically simple when the agency is a passthrough provider customer. Commitment 2 (model weights) only triggers if you actually fine-tune; if you do not, the clause has nothing to act on.
What about confidential agency tooling — internal frameworks and helper libraries?
Carve them out explicitly. The standard pattern is a Pre-Existing IP schedule attached to the MSA, listing agency-owned components, with the buyer granted a perpetual royalty-free license to use them in the delivered system but not extract them. Preserves agency control over reusable IP while satisfying Commitments 1 and 3.
How do these commitments interact with the EU AI Act?
Article 28 of the EU AI Act assigns provider obligations to whoever places a high-risk AI system on the market. When the agency builds and the buyer deploys, the buyer typically becomes the provider — inheriting documentation, conformity-assessment, and post-market-monitoring obligations. Without Commitments 1, 5, and 7, the buyer cannot meet those obligations because the underlying artifacts and data flows live with the agency. The framework is partly an EU-AI-Act-readiness instrument.
Is “agency owns the prompts” ever defensible?
Rarely, and only narrowly. An agency can defensibly own a generic prompt template (e.g., a router pattern reused across clients). It cannot defensibly own a prompt iterated against the buyer’s domain data, eval set, and user feedback — that is buyer-specific work product regardless of who typed it. The line is “abstracted pattern” vs. “client-tuned artifact.” Commitment 3 covers the second category.
What is a reasonable markup on inference cost passthrough?
2026 norms: 0% (true passthrough on agency infrastructure for accounting convenience), 10–15% (covers payment-processing and treasury overhead), or no markup when the buyer brings their own provider account. Above 25% requires explicit justification — at that point the agency is monetizing inference rather than building. State the number; do not bury it in a blended hourly rate.
How do these commitments differ from a standard MSA’s IP clause?
A standard work-product clause assumes deterministic source code: the developer writes the file, the file is delivered, IP transfers. AI engagements produce three new artifact classes — eval suites, prompt registries, trained weights — that a generic IP clause either omits or treats ambiguously. Commitments 1, 2, and 3 are AI-specific extensions that sit alongside the IP clause.
What if the agency refuses to commit to a sunset package?
That is the single strongest signal in vendor evaluation. An agency that cannot describe the offboarding artifact set in five sentences is either not operationally ready for enterprise buyers or building lock-in on purpose. Either way, any future migration becomes a multi-quarter project the buyer pays for. Commitment 7 is the cheapest to make and the most diagnostic when refused.
Key takeaways
- Generic SOW templates miss the three artifact classes that matter most in AI work: eval suites, prompt registries, and trained weights. Commitments 1, 2, and 3 close those gaps.
- Inference cost visibility (Commitment 4) is the financial control most often missing from AI contracts and the source of the largest post-launch surprises.
- Sunset rights (Commitment 7) are the cheapest commitment to make and the most diagnostic of agency operating maturity when refused.
- The seven commitments are a procurement instrument first and a relationship instrument second. Send them before the SOW arrives.
- Cross-reference real standards: OWASP LLM Top 10 for security severity, NIST AI RMF for governance posture, EU AI Act Article 28 for provider-obligation flow-through.
Related reading
- The AI Agency Manifesto: What an AI Dev Partner Should Actually Be in 2026 — the pillar these commitments operationalize.
- AI Agency Contract Negotiation: Key Terms to Include — broader contractual structure (MSA / SOW / DPA / NDA).
- AI Agency Vetting Checklist for CTOs — pre-signature diligence pairing with this framework.
- How to Choose an AI Development Agency — selection criteria upstream of contracting.
- Verify an AI Agency’s Technical Expertise — the technical interview that surfaces whether they can deliver against Commitment 1.
Arthur Wandzel