Enterprise Software 17 min read

The 6 anti-patterns we see in every failed AI agency engagement

Failed AI agency engagements rhyme. They fail in the same six ways, on the same timeline, almost always because the buyer did not see the anti-pattern when it first appeared. By the time the project is unsalvageable, the symptoms are loud (runaway costs, missed milestones, a system nobody wants to touch), but the original sins were quiet, and many of them were visible in the first three weeks.

What follows is a field taxonomy of the six anti-patterns we see across failed AI engagements. Each has a name, a tell, what it looks like in practice, and a prevention tactic. None are exotic. Many are avoidable. Most stem from the gap between what an AI dev partner should be in 2026 and what most agencies still default to.

Two of these and you have a problem. Three and you have a project that will not ship. Four and you have a write-off in motion; the only question is how much of the budget you save by ending it now.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Anti-pattern 1: the deck-first kickoff

The kickoff is a 60-slide deck. Roadmap, mission statement, “why now” hockey-stick, agency org chart, three-phase Gantt that ends on quarter boundaries. No repository. No environment. No eval suite. No code by the end of week one.

Why it fails. The deck-first kickoff is theatrical, not technical. It reassures the buyer that adults are in the room, but produces zero artifacts that can be tested, run, or merged. By week two the deck is stale, the Gantt is fictional, and nobody on either side has touched a model. The first meaningful technical decisions get made in week four under deadline pressure, by people who have not yet built shared context.

What it looks like in practice. A two-hour kickoff with an agenda labeled “alignment and roadmap.” The deliverable is the deck itself. When the buyer asks when code will start, the answer is “after we lock the spec,” and the spec is another deck. Real engineering teams treat the kickoff as a working session: a repo gets created, a “hello world” model call gets shipped, a placeholder eval suite gets committed, and the buyer’s domain expert gets added as a reviewer on day one.

Prevention. Insist on artifacts in the first 72 hours. By end of day three the engagement should have a private repo, the buyer added as collaborator, a working “hello world” model call in staging, an empty evals/ directory with a README, and a decisions.md log with at least one entry. The first 14 days of an AI agency engagement walks through what that working kickoff looks like day by day.
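The repo-side part of that 72-hour checklist can be verified mechanically. The following is a minimal sketch, not a standard tool: the file names `evals/README.md` and `decisions.md` follow the checklist above, while the function name and the rule for what counts as a log "entry" are assumptions made for illustration.

```python
from pathlib import Path

# File layout assumed from the checklist; adjust the names to your own repo.
REQUIRED_FILES = ["evals/README.md", "decisions.md"]


def missing_kickoff_artifacts(repo_root: str) -> list[str]:
    """Return the checklist items that are absent from a local checkout."""
    root = Path(repo_root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]

    # Assumption: an "entry" is any non-blank line that is not a heading.
    log = root / "decisions.md"
    if log.is_file():
        entries = [
            line
            for line in log.read_text(encoding="utf-8").splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]
        if not entries:
            missing.append("decisions.md has no entries")
    return missing
```

An empty list on day three means the repo-side artifacts exist. The staging model call and collaborator access still need a human check.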

Anti-pattern 2: eval-as-afterthought

Evals appear in the proposal as a bullet under “quality assurance” and in the project plan as a one-week task in phase three, sandwiched between “polish” and “launch.” They do not appear in the repo until week six, when QA starts asking how to know if the model is working. By the time anyone writes the first eval, the architecture is locked, the prompts are calcified, and retrofitting evals means rewriting half the system.

Why it fails. Evals are not a QA artifact. They are the spec. An AI system without evals is a system without a definition of working. If you cannot measure whether the model is doing the right thing, you cannot tell when it stops, and it will: every time a model gets updated, every time a prompt gets edited, every time retrieval shifts. Eval-as-afterthought engagements ship something that looks fine on the demo and breaks two weeks after launch in ways nobody can diagnose.

What it looks like in practice. The agency mentions evals only when the buyer brings them up. The suite, when it arrives, is six happy-path test cases written by a junior in a notebook: no adversarial inputs, no edge cases, no production-derived failures, and LLM-as-judge scoring with no ground truth. No threshold. No CI gate. The phrase “we will iterate on evals after launch” gets said unironically.

Prevention. Make the eval suite a week-one deliverable. Insist the first PR include a test case derived from a real example the buyer provides, a measurable threshold tied to a business outcome, and a CI step that runs the suite on every subsequent PR. The argument for scoping AI projects in evaluations rather than features is exactly this: if you cannot write the eval, you do not understand the feature.
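As a rough illustration of how small a week-one eval gate can be, here is a sketch under stated assumptions. The cases, labels, and 90% threshold are hypothetical placeholders, and `keyword_stub` stands in for the real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical cases; in a real engagement the buyer supplies these examples.
CASES = [
    {"input": "Refund request for order 1182", "expected": "refund"},
    {"input": "Where is my package?", "expected": "shipping"},
]
THRESHOLD = 0.90  # assumed pass rate; tie the real number to a business outcome


def pass_rate(predict: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of cases where the prediction equals the expected label."""
    hits = sum(1 for case in cases if predict(case["input"]) == case["expected"])
    return hits / len(cases)


def gate(predict: Callable[[str], str], cases=CASES, threshold=THRESHOLD) -> int:
    """Return a process exit code: 0 if the suite clears the threshold, else 1."""
    rate = pass_rate(predict, cases)
    print(f"pass rate {rate:.2%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


def keyword_stub(text: str) -> str:
    """Stand-in for the real model call."""
    return "refund" if "refund" in text.lower() else "shipping"
```

A CI job could pass the return value of `gate(...)` to `sys.exit`, so that a PR dropping the pass rate below the threshold fails the build.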

Anti-pattern 3: the milestone-as-payment-gate

The contract has four milestones, each unlocking 25% of the fee: “Phase 1: discovery & spec,” “Phase 2: prototype,” “Phase 3: integration,” “Phase 4: launch.” Quarterly. Lumpy. The buyer’s leverage is concentrated entirely at sign-off; the agency’s incentive is to get sign-off, not to ship something durable.

Why it fails. Milestone gates optimize for the appearance of completion at four discrete moments and ignore the seven weeks in between. The agency sprints to the milestone, banks the payment, and decompresses for two weeks. In those decompression weeks, technical debt accumulates, evals do not get added, observability does not get built, and post-mortems do not get written; none of it shows up in the milestone definition. By milestone four, the system is held together by the prompt-of-the-week and a nervous senior engineer.

What it looks like in practice. The agency is responsive in the two weeks before sign-off and quiet in the four weeks after. PRs slow. Standups shrink. The “polish phase” keeps growing. By milestone three, the buyer is signing off on something that does not work because saying no triggers a contract dispute and a 12-week reset. We unpack the dynamics in the AI agency milestone trap; escape requires changing the payment cadence, not the milestones.

Prevention. Replace milestone gates with weekly or bi-weekly payments tied to demonstrable progress: PRs merged with eval deltas, evals added, decisions logged. If the agency refuses to move off lump milestones, that is a commercial signal; they are protecting margin in the decompression weeks. The case against fixed-price AI development contracts covers the tradeoffs.
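One way to make "PRs merged with eval deltas" concrete is to attach a per-metric diff to each PR. The sketch below assumes eval results are stored as plain metric-name-to-score dictionaries; the function names and the report format are illustrative, not part of any particular tool.

```python
def eval_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric change (candidate minus baseline) for metrics present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {name: round(candidate[name] - baseline[name], 4) for name in sorted(shared)}


def format_report(deltas: dict[str, float]) -> str:
    """Render the deltas as plain text lines suitable for a PR description."""
    return "\n".join(f"{name}: {delta:+.4f}" for name, delta in deltas.items())
```

A weekly invoice can then point at the merged PRs and their reports instead of a sign-off meeting.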

Anti-pattern 4: the scope spec without an eval rubric

The SOW runs 14 pages. It enumerates features, describes user flows, includes wireframes. It does not define what “working” means in measurable terms. Acceptance criteria read “the system shall produce relevant responses” and “the chatbot shall handle edge cases gracefully.” Contract signed. Six weeks later the agency demos a system that meets the SOW as written, and the buyer says “this is not what I wanted.”

Why it fails. Natural-language acceptance criteria do not survive contact with a probabilistic system. “Relevant” and “graceful” mean whatever the agency wants at sign-off and whatever the buyer wishes at delivery. The dispute is unwinnable because the spec was unfalsifiable from the start. This is the procurement-department fingerprint on AI projects: a spec written by someone who has rarely shipped an AI system and accepted because nobody wanted to delay kickoff.

What it looks like in practice. The proposal’s “AI features” section reads like a marketing brochure. Acceptance criteria use “intelligent” three or more times. No eval suite is mentioned in the SOW. No threshold. No benchmark dataset. When the buyer asks how the system will be tested, the agency says “user acceptance testing in phase four.”

Prevention. Refuse to sign an SOW without an eval rubric attached. The rubric specifies a representative test set the buyer agrees is fair, a quantitative threshold for each capability (accuracy, recall, refusal rate, latency, cost-per-query), and a procedure for adding new evals when failures appear in production. If the agency cannot help write this rubric in the proposal phase, they cannot deliver against it in implementation. The eval rubric is the contract; the SOW is the wrapper.
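A rubric of this kind can be written down as data that both sides can read and run. This is a minimal sketch: the metric names mirror the list above, but every threshold is an illustrative placeholder, and treating refusal rate, latency, and cost as lower-is-better is an assumption that depends on the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, measured: float) -> bool:
        """Compare a measured value to the threshold in the right direction."""
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Illustrative numbers only; real thresholds are negotiated with the buyer.
RUBRIC = [
    Criterion("accuracy", 0.90),
    Criterion("recall", 0.85),
    Criterion("refusal_rate", 0.05, higher_is_better=False),
    Criterion("p95_latency_s", 3.0, higher_is_better=False),
    Criterion("cost_per_query_usd", 0.02, higher_is_better=False),
]


def failed_criteria(measured: dict[str, float], rubric=RUBRIC) -> list[str]:
    """Names of criteria that fail or have no measurement at all."""
    return [
        c.metric
        for c in rubric
        if c.metric not in measured or not c.passes(measured[c.metric])
    ]
```

Counting a missing measurement as a failure is a deliberate choice in this sketch: a capability nobody measured has not been accepted.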

Anti-pattern 5: account-manager-mediated communication

Every question goes through an account manager. The buyer’s CTO sends a question. The AM reformats and forwards it 18 hours later. The engineer answers in two sentences. The AM expands it back to 12 sentences and sends it 24 hours after the original. A two-message Slack exchange becomes a four-day round trip.

Why it fails. The AM layer is friction priced as service. It protects engineering time and homogenizes the agency’s voice, but on AI engagements it is catastrophic because AI projects need fast, technical, ambiguous conversation between domain experts and engineers. The AM cannot answer the question, cannot judge whether it is urgent, and cannot make the technical trade-off it is asking about. By week six, the agency’s engineers have a different mental model than the buyer’s, and the AM is reconciling them in slide decks.

What it looks like in practice. The buyer rarely gets a Slack message directly from an engineer. The engineering lead is on the kickoff and the launch, nowhere in between. Status updates are PM-speak (“tracking green to milestone two”) instead of engineering-speak (“the eval suite caught a regression in the legal-clause classifier on Monday and we rolled back the prompt change”). Technical questions get vague answers because they are filtered through someone who does not own the answer.

Prevention. Demand engineer-to-engineer channels from day one. Buyer’s technical lead and agency’s engineering lead share a Slack channel, GitHub thread, or Linear project, somewhere they can talk directly and asynchronously. The AM can attend the weekly check-in but should not be the bottleneck on every question.

Anti-pattern 6: post-launch ghost mode

The agency ships. Launch dinner. Champagne. LinkedIn post. The handoff doc is 43 pages of Confluence nobody reads. Six weeks later the buyer’s team is in production, the model has been silently updated by the provider, an eval is failing, costs have tripled, and the lead engineer is on a different project answering Slack on a 36-hour delay. The “30 days of post-launch support” clause runs out exactly when the first real production incident hits.

Why it fails. AI systems have a long tail of post-launch surprises that do not exist in traditional software: model deprecations, silent provider updates, retrieval drift as the corpus grows, cost spikes from changing usage patterns, prompt-injection attempts. A 30-day window covers the warranty surface, not the failure surface. The agency’s incentive at end-of-engagement is to disengage cleanly; the buyer’s is to have a partner for the period the system is still finding its production failure modes, which is roughly the first six months, not the first 30 days.

What it looks like in practice. The handoff is a single two-hour calendar event labeled “knowledge transfer.” Architecture in 90 minutes, questions in 30. Recordings filed. Slack channels archived. The engineer who built the eval suite does not return calls in week 12 because they have rolled onto a new client. The buyer’s team owns a system they did not build, documented in someone else’s voice, with a model layer they have rarely debugged.

Prevention. Negotiate a 90-day post-launch retainer at reduced rate: named individuals on call, defined response-time SLA, weekly check-ins for the first month tapering to monthly. Insist the lead engineer, not a junior, is the named contact. Most production AI failures happen in the first 90 days. The 9 unmistakable signals it’s time to fire your AI development agency covers signals that first appear in this window; a properly structured retainer either prevents them or surfaces them early.

How to use this taxonomy

The six anti-patterns are diagnostic, not prescriptive. The point is not to grade an agency; the point is to recognize the anti-pattern early enough to course-correct without ending the engagement. Most failed engagements had two or three anti-patterns visible by end of week three. The buyer either did not name them, did not have leverage, or assumed they would self-resolve. None of those assumptions hold.

Across the failed AI engagements we have observed, the six anti-patterns roll up to one root cause: a gap between the agency’s operating model and the actual demands of shipping production AI in 2026. The deck-first kickoff is a 2018 management-consulting habit applied to a 2026 engineering problem. Eval-as-afterthought is a holdover from the era when “AI” meant “demo.” Milestone payments are a procurement convention from a world where deliverables were fungible. The scope spec without an eval rubric is what happens when legal teams write contracts engineers cannot execute. AM-mediated communication is the structure of a 2015 dev shop billing 2026 rates. Post-launch ghost mode is the natural consequence of an industry that still treats AI projects as one-and-done. The partnership failures we have seen across 40 AI agency engagements make the case with receipts.

All six anti-patterns are loud once you know what to listen for. The first 14 days are diagnostic enough to spot many of them. If by end of week two you have a working repo, an eval suite with thresholds, weekly billing, an SOW with a measurable rubric, direct engineer-to-engineer communication, and a post-launch plan written into the contract, you are in a working engagement. If any of those six are missing, the fix is cheaper now than in six weeks. The conversation to have is not “are we on track” (that question rarely gets a useful answer) but “which of the six anti-patterns are we currently exhibiting, and what is the smallest change that gets us out of it.”

Agencies that ship work in 2026 do not avoid these anti-patterns by accident. They have built operating systems that make them structurally hard to fall into: working kickoffs by default, evals as week-one deliverables, weekly billing, eval-rubric SOWs, direct comms, 90-day post-launch retainers. That is what an AI dev partner should be, and the gap between agencies that operate that way and agencies that do not is the cleanest signal in the market right now.


Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has shipped, inherited, or audited dozens of AI engagements over the last two years and has seen each of these anti-patterns more times than he would like to admit.

Frequently Asked Questions

What are the most common anti-patterns in failed AI agency engagements?

Six recurring anti-patterns surface across failed AI engagements: the deck-first kickoff (no code in week one, only slides), eval-as-afterthought (evals deferred to phase three or skipped entirely), the milestone-as-payment-gate (lump-sum payments tied to four discrete milestones, decompression in between), the scope spec without an eval rubric (natural-language acceptance criteria that cannot be falsified), account-manager-mediated communication (every technical question filtered through a non-technical AM), and post-launch ghost mode (30-day support windows that end before the first real production incident).

How early can these AI agency anti-patterns be spotted?

All six anti-patterns are visible in the first three weeks of an engagement, and most are visible in the first 72 hours. The deck-first kickoff is obvious by end of day three. Eval-as-afterthought shows up the moment the agency cannot produce a week-one eval suite. AM-mediated communication appears the first time an engineer’s answer takes 36 hours and arrives reformatted. The first 14 days of an engagement are diagnostic enough to spot all six failure modes, and the fix is dramatically cheaper at week three than at month three.

Why is the deck-first kickoff a problem?

A deck-first kickoff produces zero artifacts that can be tested, run, or merged. By the end of week two, the deck is stale, the Gantt is fictional, and nobody on either side has touched a model. The first meaningful technical decisions get made in week four under deadline pressure, by people who have not yet built shared context. Real engineering teams treat the kickoff as a working session: a repo gets created, a hello-world model call gets shipped, a placeholder eval suite gets committed, and the buyer’s domain expert gets added as a reviewer on day one.

Why is eval-as-afterthought the most damaging anti-pattern?

Evals are not a QA artifact; they are the spec. An AI system without an eval suite is a system without a definition of working. If you cannot measure whether the model is doing the right thing, you cannot tell when it stops doing the right thing, which it will every time a model gets updated, every time a prompt gets edited, every time retrieval shifts. Eval-as-afterthought engagements ship something that looks fine on the demo and breaks two weeks after launch in ways nobody can diagnose because there is no baseline.

How should AI agency engagements structure payments to avoid the milestone trap?

Replace milestone-gated payments with weekly or bi-weekly payments tied to demonstrable progress: PRs merged with eval deltas, evals added, decisions logged. Quarterly milestones with 25 percent payments concentrate the buyer’s leverage entirely at sign-off and create a sprint-and-decompress rhythm where technical debt accumulates in the four weeks after each sign-off. A continuous-cadence engagement keeps both sides accountable in real time. If the agency refuses to move off lump milestones, that is a commercial signal; they are protecting margin in the decompression weeks.

What should an SOW for an AI engagement contain?

An SOW for an AI engagement must include an eval rubric as an attachment. The rubric specifies a representative test set the buyer agrees is fair, a quantitative threshold for each capability (accuracy, recall, refusal rate, latency, cost-per-query), and a procedure for adding new evals when failures appear in production. Natural-language acceptance criteria like ‘the system shall produce relevant responses’ do not survive contact with a probabilistic system. The eval rubric is the contract; the SOW is the wrapper.

Why is account-manager-mediated communication an anti-pattern on AI projects?

AI projects need fast, technical, ambiguous conversation between domain experts and engineers. The AM cannot answer the technical question, cannot judge whether it is urgent, and cannot make the trade-off the question is asking about. Signal degrades at every translation. By week six, the agency’s engineers have a different mental model of the project than the buyer’s, and the AM is reconciling them in slide decks. The fix is direct engineer-to-engineer channels (a shared Slack channel, GitHub thread, or Linear project), with the AM attending the weekly check-in but not bottlenecking every question.

How long should post-launch support last on an AI engagement?

A 90-day post-launch retainer at reduced rate is the right structure for AI engagements, with named individuals on call (the lead engineer, not a junior), a defined response-time SLA for incidents, and weekly check-ins for the first month tapering to monthly for the next two. AI systems have a long tail of post-launch surprises (model deprecations, silent provider updates, retrieval drift, cost spikes, prompt injection) that do not exist in traditional software. A 30-day support window covers the warranty surface, not the failure surface, and most production AI failures happen in the first 90 days.

What if an engagement already has two or three of these anti-patterns?

Two anti-patterns is a problem. Three means the project will not ship as currently structured. Four means the engagement is a write-off in motion and the only question is how much budget you save by ending it now. The fix at week three is to name the anti-pattern explicitly, propose the smallest structural change that addresses it (weekly billing instead of milestone, a week-one eval suite, a shared Slack channel), and renegotiate. The conversation to have is not ‘are we on track’ (which rarely gets a useful answer) but ‘which of the six anti-patterns are we currently exhibiting, and what is the smallest change that gets us out of it.’

Last Updated: May 30, 2026

Arthur Wandzel
