
When to fire your AI development agency: 9 unmistakable signals


Fire your AI development agency the day you can name three of the nine signals below, not the day you finally feel “ready.” Buyers often wait four to six months too long to terminate underperforming AI vendors, because the cost of a bad agency is paid in stalled roadmaps and team morale, both easy to rationalize as “early-stage friction.” The cost of replacing one is paid in a single quarter of awkward transition.

If your vendor matches the forward-deployed AI dev partner described in the manifesto, none of these signals will ring true. Two or three: escalate. Five or more: you have already been fired by your roadmap. Show the signals, give a four-week cure window where reasonable, then pull the plug.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.


Signal 1: No production code in your repo by week six

The signal. Six weeks in: a kickoff deck, a Notion workspace, two Figma flows, a Streamlit demo on a vendor URL. Nothing merged to a branch your team controls.

Why it matters. The forward-deployed pattern described in “what the first 14 days should look like” assumes a trivial PR by day 1 and an eval-gated feature PR by day 14. Six weeks is three times that window, which puts the engagement in staff-augmentation-cosplay territory.

What to do. Escalate in writing: we expected merged code by week N, we have none, the next two weeks must produce one production-grade PR through your CI, failure triggers termination. If your contract has no clause firing on shipped artifacts, bake one into the next relationship using AI agency contract negotiation.

How to document. Access timeline, “deliverables” produced in lieu of code, proposed vs actual sprint plan. The replacement uses this to scope the rebuild.

Signal 2: They cannot show you an eval suite

The signal. You ask for the eval suite for the most recent feature shipped. They pivot to a slide on “AI quality assurance methodology.” They name-drop Promptfoo or LangSmith but cannot screen-share an actual evals/ directory with a threshold and a passing CI run. Pressed, they call evals “extensive QA” or “human-in-the-loop review.”

Why it matters. Evals are the contract between buyer and vendor in 2026. A model is non-deterministic; a demo is one sample point; only an eval suite turns vibes into a measurable specification. An agency without eval discipline ships software whose behavior nobody can prove in court, in regulation, or in a postmortem.

What to do. Within ten business days, the agency must produce a written threshold, scaffold the suite, and merge it. Refusal, missed deadline, or notebooks-as-evals is the termination trigger. See AI model evaluation testing services.
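As a rough illustration of what “a written threshold, a scaffolded suite, and a merge” can look like at its smallest, here is a sketch of an eval gate. Everything in it is hypothetical: `fake_model` stands in for the real model call, the three cases and the substring check are placeholders for real graders, and the threshold value is whatever you and the agency put in writing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # simplest possible grader; real suites use richer checks


def pass_rate(model: Callable[[str], str], cases: list) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("an empty eval suite proves nothing")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in model(case.prompt).lower()
    )
    return passed / len(cases)


def gate(model: Callable[[str], str], cases: list, threshold: float) -> bool:
    """True when the suite meets the written threshold; CI should fail otherwise."""
    return pass_rate(model, cases) >= threshold


# Hypothetical stand-in for the real model call.
def fake_model(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I am not sure."


# Hypothetical cases; in practice these come from real failure modes.
CASES = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Can I get a refund after two weeks?", "30 days"),
    EvalCase("Which plan includes SSO?", "enterprise"),
]
```

In CI, a wrapper script would exit non-zero whenever `gate(...)` returns `False`, so a feature PR cannot merge below the agreed threshold. The point is not this particular grader but that the threshold lives in the repo and is checked on every run.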

How to document. Save the QA checklist, demo screencast, and slide deck they produced. Save the eval gaps too: the failure modes the system has shown in production that a real eval would have caught. The next vendor can use those as its first eval cases.

Signal 3: Your bill includes opaque inference markup

The signal. The invoice has a line for “AI infrastructure” that does not match an Anthropic, OpenAI, Google Cloud, or AWS receipt you can verify. The agency holds the API keys; you do not see the token bill; the markup is somewhere between 30% and 200% over wholesale and the agency will not say which.

Why it matters. Token arbitrage is the most predictable failure mode in the 2026 market. Inference is a usage-based cost that should be passed through unchanged; any vendor hiding it is monetizing a dimension that should be transparent. OpenAI’s pricing page and the equivalent pages from Anthropic and Google are public; reconciliation takes ten minutes. Hidden markup signals a business model that assumes long-term information asymmetry.

What to do. Move the API keys to a buyer-owned account this week. Buyer holds keys, pays the provider directly, agency optimizes inference cost as part of scope and reports cost-per-feature each sprint. Resistance is a fire-now signal.

How to document. Six months of invoices. Estimate retail cost using public pricing and reported volumes. The delta is the markup, and it becomes the savings line in the replacement’s pitch.
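The reconciliation is simple arithmetic, sketched below. The per-million-token prices and the usage figures are placeholders, not any provider's actual rates; substitute the numbers from the public pricing page for the model you use and the volumes the agency reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyUsage:
    input_tokens: int
    output_tokens: int
    invoiced_usd: float  # the "AI infrastructure" line on the agency invoice


def retail_cost_usd(usage: MonthlyUsage,
                    input_price_per_mtok: float,
                    output_price_per_mtok: float) -> float:
    """Estimated cost at public list prices (USD per million tokens)."""
    return (usage.input_tokens / 1_000_000 * input_price_per_mtok
            + usage.output_tokens / 1_000_000 * output_price_per_mtok)


def markup_pct(usage: MonthlyUsage,
               input_price_per_mtok: float,
               output_price_per_mtok: float) -> float:
    """Markup over estimated retail, as a percentage."""
    retail = retail_cost_usd(usage, input_price_per_mtok, output_price_per_mtok)
    if retail <= 0:
        raise ValueError("retail cost must be positive to compute a markup")
    return (usage.invoiced_usd - retail) / retail * 100


# Illustrative numbers only: 400M input tokens, 50M output tokens,
# placeholder prices of $3 and $15 per million tokens, $3,900 invoiced.
example = MonthlyUsage(input_tokens=400_000_000,
                       output_tokens=50_000_000,
                       invoiced_usd=3_900.0)
```

With these placeholder figures, estimated retail is $1,950 and the invoice implies a 100% markup. Run the same calculation per month across the six invoices to see whether the markup is stable or drifting.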

Signal 4: Account managers outnumber engineers on calls

The signal. Your status call has six attendees: account director, project manager, “delivery lead,” junior engineer, and two of your people. The senior engineer who pitched has not been on a call in five weeks. Technical questions return “we will follow up after syncing internally.”

Why it matters. The non-engineer-to-engineer ratio is a near-perfect predictor of agency economics. High-overhead shops route most interaction through account management because senior engineers are over-allocated. The forward-deployed alternative removes those layers; engineers are the project managers because shipped artifacts are the status updates. The two models are contrasted in AI agency vs in-house team decision.

What to do. Cancel the standing weekly call. Replace with an async Slack standup the engineer writes and a 30-minute biweekly demo of merged PRs with the engineer driving. If the senior engineer who pitched is not running these, the engagement is ending. One-week cure.

How to document. Attendee lists from your last ten calls. Who spoke; which questions were deferred. The “deferred to internal sync” pattern is the artifact.

Signal 5: They cannot debrief a production incident

The signal. A production AI feature degrades: wrong outputs, latency spikes, a hallucination that embarrasses a customer. You ask for a postmortem. You get a one-paragraph email apologizing and proposing “additional QA review.” No timeline, no root cause, no fix, no eval case.

Why it matters. Incident discipline is the bright-line test between agencies that have shipped real software and agencies that have only ever demoed it. Senior engineers debrief conversationally: detection time, the diff that caused the regression, the fix, the new eval case. People who only ship demos cannot, because they have rarely been on call. The postmortem culture documented by the SRE community made the discipline standard outside AI; agencies not practicing it are a generation behind.

What to do. Demand a written postmortem within five business days, in your format: timeline, root cause, fix, regression test, eval case added. Failure to deliver at quality is a fire-now signal.
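One way to make “in your format” checkable is to treat the five required sections as a structure and reject any submission that leaves one blank. The sketch below assumes a hypothetical `Postmortem` record; the field names simply mirror the list above, and the example contents are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class Postmortem:
    # The five sections the buyer requires; an empty string means "missing".
    timeline: str = ""
    root_cause: str = ""
    fix: str = ""
    regression_test: str = ""
    eval_case_added: str = ""


def missing_sections(pm: Postmortem) -> list:
    """Names of required sections that are empty or whitespace-only."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name).strip()]


# Hypothetical submission: an apology email with a timeline and little else.
draft = Postmortem(
    timeline="09:12 alert fired; 09:40 rollback completed",
    root_cause="   ",
)
```

Here `missing_sections(draft)` returns the four sections the agency still owes. A postmortem with any missing section has not been delivered “at quality.”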

How to document. Customer-facing impact, internal detection log, agency response. The triplet is the audit trail your next vendor reads on day one.

Signal 6: The architecture is hard-coded to one model vendor

The signal. Six months in, nearly every prompt calls one provider’s SDK directly. No provider abstraction. The agency dismisses router patterns as “premature optimization.” When a frontier model from another provider outperforms the incumbent on your task, the agency proposes a six-week rewrite to switch.

Why it matters. Models change every few months. Pricing curves are unstable; provider availability has wobbled in the last 18 months; new entrants regularly leapfrog incumbents on specific tasks. A thin router abstraction (200 to 400 lines wrapping a provider-agnostic client) adds days up front and saves quarters of forced rewrites. Not architecting for this is evidence of inexperience or a kickback relationship with the provider whose tokens are being marked up.

What to do. Router refactor as the next sprint: provider-agnostic client, model-id in env config, routing by task, eval re-runs across two providers. One to two weeks for a non-trivial codebase. Refusal or a five-figure change-order for two days of work is the termination trigger.
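To make the shape of that refactor concrete, here is a minimal sketch of a task router with model ids read from environment-style config. It is an illustration under assumptions, not a reference design: the `ROUTE_<TASK>` variable convention, the `provider:model_id` route format, and the two stub providers are all hypothetical, and real provider adapters would wrap each vendor's SDK behind the same two-argument signature.

```python
import os
from typing import Callable, Iterable, Mapping

# A provider adapter is anything that maps (model_id, prompt) to text.
Provider = Callable[[str, str], str]


class Router:
    """Routes a task to a provider and model chosen by config, not by code."""

    def __init__(self, providers: Mapping[str, Provider],
                 routes: Mapping[str, str]) -> None:
        self._providers = dict(providers)
        self._routes = dict(routes)  # task -> "provider:model_id"

    def complete(self, task: str, prompt: str) -> str:
        if task not in self._routes:
            raise KeyError(f"no route configured for task {task!r}")
        provider_name, _, model_id = self._routes[task].partition(":")
        if provider_name not in self._providers or not model_id:
            raise ValueError(f"bad route for task {task!r}: {self._routes[task]!r}")
        return self._providers[provider_name](model_id, prompt)


def routes_from_env(tasks: Iterable[str],
                    environ: Mapping[str, str] = os.environ) -> dict:
    """Read ROUTE_<TASK> variables, e.g. ROUTE_SUMMARIZE=provider_a:model-x."""
    routes = {}
    for task in tasks:
        key = f"ROUTE_{task.upper()}"
        if key in environ:
            routes[task] = environ[key]
    return routes


# Hypothetical stub adapters; real ones would call each vendor's SDK.
def provider_a(model_id: str, prompt: str) -> str:
    return f"[a/{model_id}] {prompt}"


def provider_b(model_id: str, prompt: str) -> str:
    return f"[b/{model_id}] {prompt}"
```

With this in place, switching the `summarize` task from one provider to another is a config change (`ROUTE_SUMMARIZE=provider_b:model-y`) followed by an eval re-run, rather than a rewrite of every call site.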

How to document. The provider-coupling map: every file importing a single-vendor SDK, every prompt with provider-specific syntax, every hard-coded model id. The rebuild is priced against this map.
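The first column of that map, files importing a single-vendor SDK, can be drafted mechanically for a Python codebase. The sketch below assumes the SDKs appear as the top-level packages listed in `VENDOR_PACKAGES`; that list is an assumption to adjust for your stack, and the scan does not find hard-coded model ids or provider-specific prompt syntax, which still need a manual pass.

```python
import ast
from pathlib import Path

# Assumed top-level package names to flag; adjust to the SDKs in your codebase.
VENDOR_PACKAGES = {"openai", "anthropic"}


def vendor_imports(source: str) -> set:
    """Vendor packages imported (absolutely) by one Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in VENDOR_PACKAGES:
                found.add(root)
    return found


def coupling_map(repo_root: str) -> dict:
    """Map each .py file under repo_root to the vendor packages it imports."""
    result = {}
    for path in sorted(Path(repo_root).rglob("*.py")):
        try:
            hits = vendor_imports(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse; a real audit would log them
        if hits:
            result[str(path)] = hits
    return result
```

Calling `coupling_map(".")` from the repository root gives a file-by-file list to attach to the dossier. A short list concentrated in one adapter module suggests an abstraction already exists; a long list spread across feature code is the coupling the signal describes.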

Signal 7: Scope grows without shipped pilots

The signal. Three months in, no shipped pilot but two proposed scope expansions: a “data-readiness assessment” and a “responsible-AI governance program,” each with a six-figure rider. The original pilot has slipped twice.

Why it matters. Strategy work has a place: grounded in shipped pilots. Agencies that grow scope before shipping are running the 2018 management-consulting playbook: collect retainers for adjacent work while the original work asymptotically approaches done. BCG’s Where’s the Value in AI? finding (~10% from algorithms, ~90% from people, process, and integration) is wallpaper in agency proposals, used as a pretext for selling more strategy work.

What to do. Reject the expansion in writing. No new scope until the original pilot ships and meets its eval threshold. If they cannot ship the pilot, no amount of “data readiness” will rescue them.

How to document. Original SOW, every expansion proposal, slip history. The cleanest possible brief for the replacement.

Signal 8: They will not commit to a knowledge-transfer plan

The signal. Twelve months in, you ask for a transfer plan that would let an internal team or successor agency take over. They stall. They propose “an extended retainer” and cite “tribal knowledge that takes years to develop.”

Why it matters. A 2026 agency should commit in writing to making the buyer independent of them: documented architecture, evals runnable by your team, an inheritable runbook, help recruiting in-house engineers. An agency that resists is selling lock-in, not engineering. The lock-in is the deliverable, and the buyer is the product.

What to do. Add a transfer milestone to the contract: architecture doc, runbook, eval handoff, two paired-coding weeks with your team. Tie 20% of remaining fees to completion. Refusal is a termination trigger; timelines over 90 days are the same signal told politely.

How to document. Transfer dossier: architecture diagrams, eval ownership, runbooks, provider account access, secrets, on-call rotation, monitoring, recent postmortems. Dossier quality predicts switching cost.

Signal 9: Your own engineers stop asking them questions

The signal. Six months in, your engineers have stopped routing AI questions through the agency. When something breaks, they debug it. When a feature comes up, they spec it and sometimes share the spec afterward “for review.” The agency has become an expensive observer.

Why it matters. The quietest signal, often the most decisive. Your engineers stopped asking because they correctly assessed that the agency’s answers are slower, more expensive, and less relevant than what they produce themselves with a Cursor seat and a Claude Code subscription. Stack Overflow’s 2025 Developer Survey reported that 84% of respondents use or plan to use AI tools, and about half of professional developers use them daily. They have the leverage the agency was supposed to bring; what they need is senior judgment, eval discipline, and production operations experience. If the agency does not provide those, your team has already replaced them.

What to do. Run a 15-minute survey: when did engineers last ask the agency a substantive question, what was it, what was the answer, did they use it. The pattern is the verdict.

How to document. Survey results, anonymized. Engagement cost vs value contributed in the last 90 days (PRs, reviews, architectural decisions, incidents). The ratio is the case for termination and tells you whether to replace or insource; see AI agency vs in-house team decision.

How to fire an AI agency cleanly

A clumsy firing leaves stalled systems, ghosted handoffs, and legal exposure. A clean firing produces a transferred system and a vendor reference call.

  1. Align internally. CFO, CTO, legal, exec sponsor in one room. Name the signals. Pre-approve the replacement.
  2. Read the termination clause. Most MSAs require 30 to 60 days notice. If yours has none, that is the second-largest learning from this engagement.
  3. Send notice in writing. Polite, factual, brief. Cite the clause and effective date.
  4. Schedule the wind-down call. Two hours, engineers from both sides, checklist (knowledge transfer, repo access, key rotation, deploy access, monitoring credentials, runbook handoff). Output: a one-page plan with named owners and dates.
  5. Run the wind-down with kickoff discipline. Daily standups. Eval re-runs to baseline at handoff. A final architecture review the replacement attends.
  6. Close on a reference call. Done well, this turns termination into a relationship reset.

The sequence is 30 to 60 days, in parallel with the replacement search. Cost of firing is bounded; cost of not firing compounds.

Frequently Asked Questions

Q: How quickly should I act once I see one of these signals? A: One signal is escalation-grade, not usually firing-grade. Give a cure window of up to four weeks for signals 1, 2, 4, 6, 7, and 8, using the shorter deadline named in each section where one is given. Signals 3, 5, and 9 are fire-now: they reflect business-model misalignment, production discipline failure, or the team’s own verdict, none of which a cure changes.

Q: What if my contract penalizes early termination? A: Many “penalties” are recovery of unrecouped onboarding costs and are far smaller than the cost of continuing. Compare worst-case termination to the next 90 days of fees plus opportunity cost. The math almost always favors termination.

Q: How do I avoid a gap between firing and replacement? A: Run the replacement search in parallel with the cure-window escalations. Identify two to three candidates before the decision is final.

Q: Should I bring the work in-house instead? A: If AI is core to product strategy and you can hire fast enough, in-house wins on long-term economics. If hiring is too slow, hire a forward-deployed replacement and let them recruit in parallel. If AI is non-core, a replacement agency is almost always right.

Q: Will firing damage my reputation in the AI vendor market? A: Done politely and factually, no. A well-run termination signals a sophisticated buyer with clear standards.

Q: How do I know if the next agency is any better? A: Use a structured evaluation; the 90-minute field guide for evaluating an AI agency is the rubric I run when a portfolio company asks me to sit in on a vetting call.

Q: What if I cannot get the API keys back? A: Rotate them on the providers’ side immediately on notice. Anthropic, OpenAI, and Google support revocation in seconds. Highest-priority handoff item.

Q: Should I write a postmortem on the engagement itself? A: Yes. A short internal document (what we hired them to do, what they delivered, what we missed in vetting, what we would change in the next contract) is the highest-leverage artifact a failed engagement produces.

Q: When should I re-evaluate the next engagement against these signals? A: At the 14-day mark, the 6-week mark, and quarterly thereafter.
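The early-termination comparison from the FAQ above (worst-case penalty versus the next 90 days of fees plus opportunity cost) reduces to a few lines of arithmetic. The function names and all dollar figures below are illustrative assumptions; the opportunity-cost estimate in particular is a judgment call you supply.

```python
def cost_of_staying(monthly_fee: float,
                    monthly_opportunity_cost: float,
                    months: int = 3) -> float:
    """Fees plus estimated opportunity cost over the comparison window."""
    return months * (monthly_fee + monthly_opportunity_cost)


def cheaper_to_terminate(termination_penalty: float,
                         monthly_fee: float,
                         monthly_opportunity_cost: float,
                         months: int = 3) -> bool:
    """True when the worst-case penalty is below the cost of staying."""
    staying = cost_of_staying(monthly_fee, monthly_opportunity_cost, months)
    return termination_penalty < staying


# Illustrative figures only: $40k penalty, $60k monthly fee,
# $25k per month of estimated roadmap delay.
staying = cost_of_staying(60_000, 25_000)
decision = cheaper_to_terminate(40_000, 60_000, 25_000)
```

With these placeholder numbers, three more months cost $255,000 against a $40,000 penalty, so termination is cheaper. The comparison flips only when the penalty exceeds the full 90-day cost of staying.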

Closing

Firing an AI agency is a managerial skill, not a moral act. The nine signals turn “things feel off” into a falsifiable diagnostic. Acting one quarter too late usually costs more than acting one quarter too early.

The best preventive medicine is hiring a vendor who would rarely trigger the signals: the shape is in the AI agency manifesto, the questions are in the field guide.

Arthur Wandzel, CEO, SFAI Labs

Last Updated: May 22, 2026


Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
