The single highest-leverage operating decision an AI agency makes in 2026 is which engagements it refuses. Every shape of work an agency takes on is either compounding (building muscle, evals, infra, and reputation that the next engagement inherits) or melting wax: revenue today, scar tissue tomorrow, and a portfolio of half-built systems that nobody will write a case study about. After running this firm through enough cycles to see the pattern repeat, we have a written kill list: five engagement shapes that we now decline regardless of price, brand of buyer, or how interesting the underlying problem looks at first glance. This is internal hygiene, and we publish it deliberately because the buyers we want should be glad to see it.
The frame is simple. An AI agency in 2026 lives or dies on three things: shipped systems with measurable eval deltas, a referenceable book of clients, and a margin profile that lets it pay senior engineers and still invest in tooling. Each of the five engagement types below quietly attacks one of those three. Saying yes to them is not a neutral act of capacity allocation; it is an act of strategic self-harm dressed up as revenue. Saying no is not a luxury; it is the work.
This is not a manifesto piece. The manifesto already exists. This is the operating consequence of the agency manifesto we published earlier this year: if you take that frame seriously, certain engagements become incompatible with it, and you have to write down which ones and why.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Type 1: “Build us a chatbot for everything”
- Type 2: “We have no usable data, but we want fine-tuning”
- Type 3: “Demo for our investor pitch deck in two weeks”
- Type 4: “We’ll provide the prompts, you just code it”
- Type 5: “We’ll pay on a success metric we define mid-project”
- The arithmetic of saying no
- How we communicate a decline
Type 1: “Build us a chatbot for everything”
The pitch arrives roughly weekly. A mid-market company has watched its competitors ship customer-facing assistants, has read three McKinsey notes about generative AI productivity gains, and has decided that the answer is a single conversational interface (“a ChatGPT for our data”) that will simultaneously handle customer support, internal HR questions, sales enablement, and product documentation lookup. The budget is real. The timeline is six months. The scope is the entire enterprise.
We decline these. The reason is not technical squeamishness. The reason is that an undefined-scope chatbot is a graveyard for evals, and an engagement without evals is one we cannot run. When the surface area is “everything an employee or customer might ever ask,” there is no way to write a 50-case ground-truth eval suite that is representative. The eval baseline becomes a fiction; the threshold becomes a wish; and at month four the conversation devolves into “the bot got this one weird question wrong” anecdotes, which is the failure mode we are professionally allergic to.
The revenue we lose by declining is real: usually $300k to $900k per engagement, paid in fat monthly retainers. The margin we save is larger. A wide-scope chatbot project burns roughly twice the hours of a tightly-scoped one because every new department added to the surface area introduces a new eval dimension, a new edge case, and a new internal stakeholder with veto power. The post-launch trajectory is worse: usage numbers disappoint, the system is quietly retired by month nine, and the case study we hoped to publish rarely materializes. We have watched this exact movie three times. We do not need a fourth screening.
When a buyer presents this shape, the productive counter-offer is to pick one workflow (a single high-volume customer support intent, a single sales-enablement query class) and run a 14-day forward-deployed engagement on that wedge with a real eval suite. Roughly one in three buyers accept the narrowing. The other two go elsewhere, build something that fails on schedule, and sometimes circle back the following year.
Type 2: “We have no usable data, but we want fine-tuning”
The buyer has read that fine-tuning is how serious teams get serious results. They have watched a competitor announce a fine-tuned domain model. They want one. When we ask to see the training set, the answer is some combination of: “We have a ticketing system going back five years,” “Our knowledge base is in Confluence,” and “We can pull together what we need.” Pressed on labels, schema, deduplication, PII boundaries, and per-example quality control, the answers thin out fast.
We decline these because fine-tuning a frontier model in 2026 is not the bottleneck the buyer thinks it is. With Claude Opus 4.7 in its 1M-context configuration and GPT-5 holding state across hour-long agentic tasks, the marginal performance you get from a custom fine-tune over a strong retrieval-augmented prompt is small enough that it is dominated by the data-quality problem the buyer has not solved. The OpenAI fine-tuning documentation is unusually direct about this: “We recommend first attempting to get good results with prompt engineering, prompt chaining, and function calling.” Anthropic’s guidance on fine-tuning Claude models, published as part of their Claude Developer Platform docs, pushes the same prerequisite (clean, deduplicated, label-aligned training pairs), and the work to produce those is roughly four times the work of the fine-tune itself.
What the engagement requires, then, is a six-to-ten-week data-quality project the buyer did not budget for and does not believe they need. We have learned not to bury that scope inside a “fine-tuning project” line item; the resulting engagement is one in which the agency spends eight weeks doing data labeling that the client wanted to skip, the client spends eight weeks frustrated that the model has not been touched yet, and the eventual fine-tune produces an improvement that does not justify the price. Margins compress, the case study evaporates, and the senior engineers who should have been deployed elsewhere are stuck doing CSV hygiene. We say no, and we publicly explain the precondition so the buyer can either fix it and come back, or hire a different firm that is willing to discover the problem on the buyer’s dime.
Type 3: “Demo for our investor pitch deck in two weeks”
The startup has a Series A pitch in three weeks. They want a working AI demo to show in the room. They have a clear vision, a clean codebase, and a budget. The work is technically straightforward. The contract is short. The buyer is a smart founder.
We decline these. The reason is not capacity, and it is not a snobbery about pre-revenue companies. The reason is that “demo for investor pitch” is the canonical shape of disposable work. Whatever we ship is a single-use artifact: it must look real for 12 minutes in a room, it does not need to handle production traffic, it does not need to be eval-gated, it does not need to survive a model API change, and after the round closes it will be discarded and replaced with whatever the in-house team builds. The fastest path to a “working demo” is a hardcoded happy path with a thin layer of LLM cosmetics on top, and that is precisely the shape of work our engineers refuse to do.
The economic argument against taking these jobs is straightforward. The hourly rate is competitive, but the engagement produces nothing reusable: no eval suite to carry forward, no architecture decision record, no production system, no published case study (because the founder cannot reference it without disclosing which round they were raising). The reputational argument is sharper. We have seen demo-only engagements lead to “but you built it” disputes a year later when the in-house team’s production system performs differently than the demo did. There is no clean exit from a deliverable that was never specified to be production-grade and that everyone in the room treated as if it were.
The productive counter-offer here is a 14-day forward-deployed engagement with the explicit understanding that the output is production-eval-gated code, not a demo. Founders with strong technical instincts take this trade roughly half the time. The other half want the demo and only the demo, and we send them to a contractor network that runs the demo-shop business model honestly. That is a healthier outcome for everyone, including the founder, and we keep our pipeline full of work that compounds.
Type 4: “We’ll provide the prompts, you just code it”
The buyer’s engineering or product leader has spent six months becoming a competent prompt engineer. They have a folder full of system prompts, few-shot examples, and tool definitions. What they want from us is “implementation”: wrap their prompts in a service, ship it to production, do the boring DevOps. They will own the prompts. They will own the model selection. They will own the architecture. We will own the keyboard.
We decline these. The pitch sounds reasonable until you trace the failure trajectory. The prompts the buyer hands us are tuned against a particular model (usually whatever model they were experimenting with three months ago) and against a tiny set of inputs in a Jupyter notebook. The moment those prompts hit the variance of real production traffic, they fail in ways that require restructuring the prompt, switching the model, redesigning the tool boundary, or restructuring the agent loop entirely. The buyer who hired us for “implementation” now has a pile of failures and a contract that says we cannot touch the prompts. The conversation that follows is the worst conversation in this business.
The deeper problem is that “you provide the prompts” is a polite refusal of the agency’s opinion, and an agency without an opinion is a staffing firm. We are not a staffing firm. The Anthropic model card for Claude Opus 4.7, published at the end of 2025, is unambiguous on this point: prompt design is inseparable from model selection, tool design, eval design, and agent loop architecture. If the buyer wants us to take responsibility for the system performing in production, they have to let us own the entire stack. If they want to own the prompts, they need to own the system, and we are the wrong shape of partner.
The productive counter-offer is collaborative ownership: the buyer’s prompt engineer joins our weekly working session, we propose changes against an eval baseline, and the prompts evolve under joint authorship with us as the technical lead. Roughly two in three buyers find this acceptable once it is named clearly. The third drops out, which is the correct outcome for both sides.
Type 5: “We’ll pay on a success metric we define mid-project”
The contract is a flat rate for setup plus a performance bonus tied to a metric. The metric will be defined “once we have a baseline.” The buyer is enthusiastic about the alignment of incentives. The agency is, on first read, also enthusiastic. Performance pricing sounds modern and sounds like the right thing.
We decline these without exception. The reason is that “we’ll define the metric mid-project” is a structural mechanism for payment retraction, and we have learned this the expensive way. By the time the metric is defined, the project is half-built, the agency is committed, the buyer has all the leverage, and the metric will be selected, not maliciously but inevitably, to land just barely outside the system’s actual performance. The performance bonus becomes the variable that absorbs every disagreement about scope, quality, and timing, and the agency ends up subsidizing the buyer’s discovery process with its own margin.
The honest version of performance pricing requires the metric, the threshold, and the measurement methodology to be defined and signed before the engagement begins, with eval cases written in advance and a baseline established in week one. We are happy to do this version, and roughly 15% of our engagements are structured this way. The dishonest version (“we’ll figure out the metric as we go”) is a price-discovery exercise on the agency’s dime, and we name it as such in writing during the sales process. Buyers who insist on the open-metric version are buyers who expect us to lose, and there is no engagement shape we are less willing to enter than one structured around our loss.
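The preconditions for honest performance pricing can be written down as a checklist. The sketch below is illustrative only: the `PerformanceTerms` record, its field names, and the example values are hypothetical, not a contract template the firm uses. It also assumes that signing on the kickoff date itself counts as "before the engagement begins."

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PerformanceTerms:
    """Hypothetical record of a performance-pricing clause."""
    metric: Optional[str]        # what is measured
    threshold: Optional[float]   # value the metric must reach for the bonus
    methodology: Optional[str]   # how, and on what data, the metric is measured
    eval_cases_written: int      # eval cases agreed in advance
    signed_on: Optional[date]    # date both parties signed the clause
    kickoff: date                # date the engagement begins


def missing_preconditions(terms: PerformanceTerms) -> List[str]:
    """Return every precondition the clause fails; an empty list means it passes."""
    problems: List[str] = []
    if not terms.metric:
        problems.append("metric is not defined")
    if terms.threshold is None:
        problems.append("threshold is not defined")
    if not terms.methodology:
        problems.append("measurement methodology is not defined")
    if terms.eval_cases_written <= 0:
        problems.append("no eval cases written in advance")
    # Assumption: signing on the kickoff date is acceptable; later is not.
    if terms.signed_on is None or terms.signed_on > terms.kickoff:
        problems.append("terms are not signed before kickoff")
    return problems


# An open-metric contract: everything deferred to "once we have a baseline".
open_metric = PerformanceTerms(None, None, None, 0, None, date(2026, 3, 2))

# A fully specified clause, signed ahead of kickoff (values are made up).
signed = PerformanceTerms(
    "resolution accuracy", 0.9, "held-out 50-case suite", 50,
    date(2026, 2, 20), date(2026, 3, 2),
)

print(missing_preconditions(open_metric))
print(missing_preconditions(signed))
```

The open-metric contract fails every check, while the signed clause returns an empty list. The point of the sketch is that each precondition is a yes-or-no question that can be answered during the sales process, before any work starts.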
For more on the surrounding pattern of engagement red flags, see the field guide to recognizing the warning signs in an AI consulting engagement, which catalogs adjacent shapes (the eternal pilot, the moving deliverable, the prompt-as-asset trap) that share a structural family with the open-metric kill-list entry.
The arithmetic of saying no
The five categories above represent 35% to 55% of inbound in any quarter. Declining them is a meaningful financial decision; the arithmetic that keeps us disciplined is straightforward.
A compounding engagement (a tightly-scoped, eval-gated, two-week increment shipping into production) produces three outputs: revenue, a referenceable case study, and internal infrastructure (an eval pattern, an architecture decision record, a tool wrapper) that lowers the cost of the next engagement. One in three becomes a published case study; one in two produces reusable tooling. The effective hourly rate, amortized across reuse, runs 1.4x to 1.8x our nominal rate.
A non-compounding engagement (a wide-scope chatbot, a fine-tune on dirty data, a disposable demo, a prompts-as-input job, a metric-defined-later contract) produces revenue and nothing else. No case study, no reusable infrastructure, no eval pattern that survives the project. The effective hourly rate is the nominal rate, minus the opportunity cost of the compounding engagement we did not run during those weeks. Run for long enough, the ratio compresses the firm into a generic dev shop that bills hours and forgets them.
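The comparison above can be worked through with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the nominal rate of 200 is a made-up figure, and "opportunity cost" is interpreted as the uplift above nominal that a compounding engagement would have earned in the same weeks. The article does not define that term precisely, so treat this reading as one plausible interpretation.

```python
def compounding_rate(nominal: float, reuse_multiplier: float) -> float:
    """Effective hourly rate once reuse (case study, tooling) is amortized."""
    return nominal * reuse_multiplier


def non_compounding_rate(nominal: float, reuse_multiplier: float) -> float:
    """Nominal rate minus the forgone uplift of the compounding work not run.

    Assumption: opportunity cost = compounding rate - nominal rate.
    """
    forgone_uplift = compounding_rate(nominal, reuse_multiplier) - nominal
    return nominal - forgone_uplift


NOMINAL = 200.0  # hypothetical hourly rate, for illustration only

for multiplier in (1.4, 1.8):  # the 1.4x to 1.8x range quoted above
    good = compounding_rate(NOMINAL, multiplier)
    bad = non_compounding_rate(NOMINAL, multiplier)
    print(f"multiplier {multiplier}: compounding {good:.0f}/h, "
          f"non-compounding {bad:.0f}/h")
```

Under this reading, the gap between the two shapes widens as the reuse multiplier grows, which is why the decline gets easier to justify the more infrastructure the firm has already accumulated.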
How we communicate a decline
We decline in writing, the same day, with three sentences: a one-line statement that we are not the right firm for this shape of work, a one-line explanation pointing to the structural reason, and a one-line referral to a firm or contract structure that fits better. We do not negotiate, soften with an “unless,” or offer a discount to make the engagement work. The decline is final, the reason is given, and the buyer can re-shape the request or take it elsewhere.
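The three-sentence structure described above is simple enough to template. The sketch below is a hypothetical helper, not the firm's actual wording: the function name, the sentence phrasing, and the example inputs are all assumptions made for illustration.

```python
def compose_decline(shape: str, structural_reason: str, referral: str) -> str:
    """Build a three-sentence decline: fit, structural reason, referral.

    Each argument is a short phrase without a trailing period.
    """
    message = " ".join([
        f"We are not the right firm for {shape}.",
        f"The structural reason is that {structural_reason}.",
        f"A better fit would be {referral}.",
    ])
    # The decline is final, so reject any wording that softens it.
    if "unless" in message.lower():
        raise ValueError("a decline must not be softened with an 'unless'")
    return message


print(compose_decline(
    shape="an enterprise-wide chatbot",
    structural_reason="an undefined scope leaves no representative eval suite",
    referral="a single-workflow engagement with a written eval suite",
))
```

Keeping the message to three fixed slots makes it hard to slip a negotiation or a discount into the text, which matches the intent that the decline is final and the reason is given.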
Roughly 30% of declined buyers re-shape and come back with an engagement that fits. Roughly 50% go elsewhere and we rarely hear from them again. Roughly 20% come back six to twelve months later, having watched the original shape fail, and ask for a 14-day engagement to do it correctly. The 30% and the 20% are the business case; the 50% who go elsewhere are the firm’s filter system working as designed. Internal hygiene, when it is good, is also a marketing artifact, and we would rather have ours written down than implicit and unevenly applied.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has personally signed off on every engagement decline the firm has issued in the last 18 months.
Frequently Asked Questions
Why does an AI agency publish a kill list of engagements it declines?
Publishing the kill list is internal hygiene that doubles as a buyer-filter. Buyers who recognize their own near-mistakes self-correct before the sales conversation; buyers offended by the list would have been bad fits regardless. The agency benefits from a pre-qualified pipeline, and serious buyers benefit from clarity about which engagement shapes the firm has structurally decided not to run. An unwritten kill list tends to be applied unevenly; a written one is enforced consistently across the partner group.
What is wrong with a ‘build us a chatbot for everything’ engagement?
The surface area is undefined, which makes a representative ground-truth eval suite impossible to construct. Without an eval baseline, the engagement cannot be measured, the threshold becomes a wish, and at month four the conversation devolves into anecdotes about specific failed queries. The productive counter-offer is to narrow scope to one workflow with a 14-day forward-deployed engagement against a real eval suite. Roughly one in three buyers accept the narrowing.
Why decline a fine-tuning project when the buyer has no usable training data?
Fine-tuning a frontier model in 2026 is not the bottleneck. With Claude Opus 4.7 in 1M-context configuration and GPT-5 holding state across long agentic tasks, the marginal performance gain over a strong retrieval-augmented prompt is small enough that it is dominated by data quality. Fine-tuning without clean, deduplicated, label-aligned training pairs is a six-to-ten-week data-quality project the buyer did not budget for. Both OpenAI and Anthropic explicitly recommend exhausting prompt engineering and retrieval before considering a fine-tune.
What makes a ‘demo for an investor pitch deck’ engagement disposable?
A demo for an investor pitch must look real for 12 minutes in a room. It does not need to handle production traffic, does not need to be eval-gated, and will be discarded after the round closes. The fastest path to a working demo is a hardcoded happy path with a thin layer of LLM cosmetics on top, which is the shape of work the agency refuses to do. Nothing reusable comes out: no eval suite, no architecture decision record, no production system, no publishable case study. The productive counter-offer is a 14-day engagement with the explicit understanding that the output is production-eval-gated code, not a demo.
Why is ‘we provide the prompts, you just code it’ a structural problem?
Prompts handed to the agency are tuned against a particular model and a tiny set of inputs in a Jupyter notebook. The moment they hit production traffic variance, they fail in ways that require restructuring the prompt, switching the model, redesigning the tool boundary, or rebuilding the agent loop. A contract that forbids the agency from touching the prompts then traps the engagement. Prompt design is inseparable from model selection, tool design, eval design, and agent architecture. Splitting the responsibility makes the system unownable.
What is the problem with mid-project performance metrics?
Defining the success metric mid-project is a structural mechanism for payment retraction. By the time the metric is defined, the project is half-built, the agency is committed, the buyer has all the leverage, and the metric will be selected to land just barely outside the system’s actual performance. Honest performance pricing requires the metric, threshold, and measurement methodology to be defined and signed before the engagement begins, with eval cases written in advance and a baseline established in week one.
How does the agency communicate a decline to a buyer?
Same day, in writing, in three sentences. One line stating the firm is not the right fit for the shape of work described, one line pointing to the structural reason (which of the five kill-list categories applies), and one line referring the buyer to a firm or contract structure that fits better. The decline is not negotiated, softened with an ‘unless,’ or discounted into acceptance. The buyer can re-shape the request and come back, or take it elsewhere, both of which are healthy outcomes for both sides.
What share of inbound revenue does the kill list typically reject?
Between 35% and 55% of inbound revenue opportunity in a typical quarter falls into one of the five kill-list categories. Declining all of it is a meaningful financial decision, and the arithmetic only works because compounding engagements (tightly-scoped, eval-gated, two-week increments shipping to production) carry a 1.4x to 1.8x effective hourly rate after amortizing case studies and reusable internal tooling, while non-compounding engagements produce only revenue with no second-order returns.
What happens to buyers after they are declined?
Roughly 30% re-shape the request and come back with an engagement that fits, often within the same quarter. Roughly 50% go elsewhere and the agency rarely hears from them again. Roughly 20% return six to twelve months later, having watched the original shape of the project fail at a different vendor, and ask for a 14-day forward-deployed engagement to do it correctly. The 30% plus the 20% combined are the long-run business case for publishing and enforcing the list.
Is the kill list compatible with growth, or does it cap the firm at a small size?
The kill list is a growth instrument rather than a brake. Compounding engagements produce reusable internal infrastructure (eval patterns, ADR templates, tool wrappers) that lowers the marginal cost of the next engagement of the same shape, which raises capacity over time without raising headcount linearly. A firm that says yes to everything cannot reuse much across engagements, so each new project starts at zero and capacity is bounded by hours. The kill list is the mechanism by which the firm trades short-term revenue for long-run leverage.