Why your AI agency should refuse some of your requests

The most useful sentence an AI agency can say is “no.” Not as a posture; as a structural part of the contract. If your AI agency says yes to every request you make in the first 90 days, they are not a good agency. They are an agency that has chosen short-term client comfort over the working system you are paying them to build, and they will be fired or sued or both somewhere between months six and eighteen, when the silently broken parts of the system surface in front of customers, regulators, or a board.

This is the second-most-uncomfortable claim in the AI agency manifesto: the agency is an advisor, not an order-taker, and a good portion of the value the client is paying for is delivered through refusals. Doctors, lawyers, and structural engineers all bill paying clients while routinely refusing requests; the refusals are the value. AI agencies operate in the same shape, except the failure modes are stranger and the public is less acclimated to the discipline.

What follows is a working list of request types a good AI agency will refuse, the frame they should use to refuse them, and what they should propose instead. None of these are hypothetical. Every one has been requested of SFAI Labs in the last 18 months, and every project where we held the line is in production today.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

1. “Just expand scope, we’ll figure out the evals later”

The request: a feature is added mid-engagement (a new domain, a new user, a new input type), and the client asks the team to “just ship it under the existing eval gate.”

The refusal frame: scope expansions without re-baselining evals are not expansions, they are silent regressions. The original eval suite was built around a specific input distribution. The new scope sits outside that distribution. Continuing to ship under the old eval gate produces a number that says the system is healthy while the new scope is entirely untested. A month from now, when the new-scope inputs hit production, the failures will be blamed on the model, the prompt, or “AI being unreliable.” They will not be blamed on the missing eval cases, because the cases were never written.

What to do instead: pause the feature work, write 20–50 ground-truth cases for the new scope with the client domain expert, run the suite, set a new threshold, and add it to CI. This is one to three days of work. It does not slow the engagement; it reveals what the engagement is. Clients who push back on this are usually doing so because an internal stakeholder promised the feature on a deadline, and the right move is to renegotiate the deadline, not to forge the eval gate.
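As a minimal sketch of what a re-baselined gate could look like, the snippet below scores each scope separately and blocks the release if any scope with a threshold falls short. The names (EvalCase, pass_rate_by_scope, gate), the scope labels, and the exact-match scoring rule are all hypothetical simplifications chosen for illustration; a real suite would likely use a richer grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    scope: str     # hypothetical label, e.g. "contracts" or "support_email"
    prompt: str
    expected: str  # ground truth agreed with the client's domain expert


def pass_rate_by_scope(cases: List[EvalCase],
                       system: Callable[[str], str]) -> Dict[str, float]:
    """Return the fraction of passing cases for each scope separately."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for case in cases:
        totals[case.scope] = totals.get(case.scope, 0) + 1
        # Exact match is an assumption; swap in whatever grader the suite uses.
        if system(case.prompt).strip() == case.expected:
            passed[case.scope] = passed.get(case.scope, 0) + 1
    return {scope: passed.get(scope, 0) / n for scope, n in totals.items()}


def gate(rates: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the scopes that block a release.

    A scope that has a threshold but no eval cases counts as failing, so a
    new scope cannot ride through on the old suite's number.
    """
    return sorted(s for s, t in thresholds.items() if rates.get(s, 0.0) < t)


def toy_system(prompt: str) -> str:
    # Stand-in for the real system: it only "knows" contracts.
    return "indemnity" if "contract" in prompt else "unknown"


cases = [
    EvalCase("contracts", "contract clause 1", "indemnity"),
    EvalCase("contracts", "contract clause 2", "indemnity"),
    EvalCase("support_email", "refund email", "refund_request"),
]
rates = pass_rate_by_scope(cases, toy_system)
blocking = gate(rates, {"contracts": 0.9, "support_email": 0.9})
print(rates, blocking)
```

The design choice worth noting is that the gate reports per-scope rates rather than one blended number; a blended pass rate is exactly how an untested new scope hides behind a healthy old one.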

2. “Make it work for everything”

The request: the system was built for contract review, and now the executive sponsor wants it to also handle customer support emails, RFP responses, and onboarding documents; same engine, no new scoping.

The refusal frame: AI systems do not generalize for free. A model tuned to perform well on contracts will perform poorly on emails not because the model is incapable but because every layer of the system above it (the prompts, the retrieval corpus, the evals, the tool boundaries, the failure-mode catalog) was designed around contracts. Asking the system to handle an open-ended set of tasks is asking for a system that handles none of them at the threshold the client originally agreed to. The “make it work for everything” request is almost always coming from someone who has not yet internalized that AI systems are domain-specific software products, not general-purpose appliances.

What to do instead: re-scope the second use case as a separate engagement with its own eval baseline, retrieval corpus, and failure-mode catalog. The shared infrastructure (model routing, observability, deploy pipeline) carries over; the domain-specific work does not. The agency that says yes to “make it work for everything” because it sounds like more revenue is the agency that ships a system that does nothing well.

3. “Skip the evals to ship faster”

The request: a deadline is imminent, the eval suite is producing red numbers, and someone proposes shipping the feature with the eval gate disabled “for this release only.”

The refusal frame: shipping under a disabled eval gate is the moment the engagement stops being engineering and becomes theater. Disabled gates do not stay disabled for one release. They become the new normal because the next deadline is always imminent and the eval gate is always inconvenient. Six months later, the team has shipped 15 features, none of them has a measured eval delta, and the only reason the system “works” is that no customer has hit the failure mode yet.

What to do instead: ship the parts of the feature that pass the eval gate today, defer the parts that do not, write a one-paragraph note in the release explaining the gap, and book the eval-improvement work for the next sprint. Stakeholders will accept “we shipped 70 percent of the feature with measured quality” much more readily than they fear, because the alternative, “we shipped 100 percent, but we have no idea how well it works,” is the version that produces the next-quarter incident report. The agency’s job is to refuse the version that produces the incident, even when the request is wrapped in deadline pressure. This is one of the disciplines listed in the AI agency trust ladder; agencies that fold here are reseller-shaped, not operator-shaped.
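The partial-ship path can be sketched in a few lines, assuming each part of the feature already has its own measured eval score. The function names, the part names, and the 0.90 threshold below are hypothetical and only illustrate the split-and-document step.

```python
from typing import Dict, List, Tuple


def split_release(scores: Dict[str, float],
                  threshold: float) -> Tuple[List[str], List[str]]:
    """Split feature parts into (ship, defer) by their measured eval score."""
    ship = sorted(name for name, s in scores.items() if s >= threshold)
    defer = sorted(name for name, s in scores.items() if s < threshold)
    return ship, defer


def release_note(ship: List[str], defer: List[str], threshold: float) -> str:
    """Write the short note that names the gap instead of hiding it."""
    return (f"Shipped with the eval gate active (threshold {threshold:.2f}): "
            f"{', '.join(ship) or 'nothing'}. "
            f"Deferred pending eval improvement: {', '.join(defer) or 'nothing'}.")


# Hypothetical per-part scores for a contract-review feature.
scores = {"clause_extraction": 0.94, "risk_summary": 0.91, "redline_suggestions": 0.72}
ship, defer = split_release(scores, threshold=0.90)
print(release_note(ship, defer, threshold=0.90))
```

The gate stays on throughout; what changes is the release contents and the written record of what was held back.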

4. “Remove the citations, it sounds more confident without them”

The request: a UX iteration on a retrieval-grounded answer system, where someone has decided the citations clutter the page and wants the model output rendered as a clean, confident block of prose.

The refusal frame: citations are not visual decoration; they are the audit trail that distinguishes a grounded answer from a hallucination. Removing them does not make the answer more accurate. It makes the inaccuracy harder to detect, harder to dispute, and harder to defend in the next regulatory or customer-trust incident. The request is almost always motivated by a stakeholder who finds the citations aesthetically unappealing, not by a user who has reported difficulty using them. Acceding to it is choosing the stakeholder’s aesthetics over the user’s protection.

What to do instead: redesign the citation rendering; do not remove it. Footnotes, hover-cards, collapsible inline references, end-of-section provenance bars: there is a large design space here, and the visual weight of citations is genuinely a UX concern worth solving. Stripping the audit trail does not solve it. When the next “the AI made something up” incident lands, the team that kept its citations has a defensible answer. The team that stripped them is in a settlement conversation.

5. “Fine-tune the model on this customer data, don’t worry about the DPA”

The request: a domain-tuning project where the proposed training corpus contains PII (customer names, account numbers, support transcripts) and the legal paperwork to authorize that use either does not exist or has not been reviewed.

The refusal frame: fine-tuning on PII without a data processing agreement in place is not a paperwork issue, it is the kind of decision that ends companies. The model weights, once trained, contain a lossy but recoverable echo of the training data. The data leaves the customer’s tenant the moment it goes into the training run. Without a DPA, the legal posture is indefensible, the customer notification obligation is unclear, and the eventual incident (a model trained on customer A’s data being deployed for customer B, or a regulator audit that discovers the training set) will be career-ending for the executives who signed off and existential for the agency that did the work.

What to do instead: pause the fine-tuning, scope the DPA work with the client’s legal team, and in parallel run the same eval-improvement campaign with retrieval-augmented prompting and synthetic-augmented data, both of which usually produce 60–80 percent of the lift of fine-tuning at zero data-residency risk. If the DPA is signed, the fine-tune happens on the now-authorized corpus; if it is not signed, the project still has a path forward through the synthetic and retrieval routes. Refusing the fine-tune until the paperwork lands is the agency protecting the client from a self-inflicted wound the client does not yet see. This is the same category of refusal documented in the red flags when hiring an AI consulting company: firms that volunteer to fine-tune on PII without a DPA are signaling they will cut every corner the client cannot see.

6. “Use a cheaper model, and don’t tell us if quality drops”

The request: a cost-optimization conversation that ends with the suggestion to swap the production model for a cheaper or smaller variant, with an explicit or implicit request that the swap not be re-evaluated against the eval suite: “if it works, do not stir the pot.”

The refusal frame: the model swap is fine; the silence is fraud. Cost optimization is real, and switching from a frontier model to a cheaper or smaller model is often the right move. The part the agency must refuse is the request that the swap proceed without re-running the eval suite. An undisclosed model swap is the exact failure mode the eval gate was built to prevent: the team is making a change that materially affects output quality, and choosing not to measure it because the measurement might produce an inconvenient number.

What to do instead: every model swap re-runs the eval suite, the eval delta is reported in writing, and the change is gated on the delta meeting the agreed threshold. If the swap fails the gate, the engagement has three honest options: revert the swap, renegotiate the threshold with the stakeholders, or invest in prompt-and-retrieval work to recover the quality. All three of those options are conversations the agency can have with its head up. The fourth option (silently swap the model and hope nobody notices the quality drop) is the option that ends with the agency being fired and the client being embarrassed. Refusing the silence is non-negotiable.
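A hedged sketch of the written delta report follows. It assumes the eval suite yields a single score per model on a 0 to 1 scale and that the agreed threshold is an absolute floor on the candidate's score; SwapReport, its fields, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapReport:
    baseline_score: float   # current production model on the eval suite
    candidate_score: float  # cheaper candidate model on the same suite
    threshold: float        # agreed floor (assumed to be an absolute score)

    @property
    def delta(self) -> float:
        """Signed change in eval score caused by the swap."""
        return self.candidate_score - self.baseline_score

    @property
    def passes(self) -> bool:
        """The swap is allowed only if the candidate meets the agreed floor."""
        return self.candidate_score >= self.threshold

    def summary(self) -> str:
        """One line suitable for the written report to stakeholders."""
        verdict = ("PASS" if self.passes else
                   "FAIL: revert, renegotiate the threshold, or recover quality")
        return (f"baseline={self.baseline_score:.2f} "
                f"candidate={self.candidate_score:.2f} "
                f"delta={self.delta:+.2f} "
                f"threshold={self.threshold:.2f} -> {verdict}")


report = SwapReport(baseline_score=0.92, candidate_score=0.85, threshold=0.90)
print(report.summary())
```

Whether the threshold is an absolute floor or a maximum allowed drop is a contract decision; the point of the sketch is only that the number is computed and written down on every swap.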

7. “Skip the post-mortem, we don’t want to scare leadership”

The request: an incident has happened (a hallucinated output reached a customer, a tool call failed in production, an eval regression slipped through), and the client asks the team to handle it informally, “no need to write it up, leadership has enough on their plate.”

The refusal frame: a suppressed post-mortem is worse than the original incident. The original incident is a known, bounded event. The suppressed post-mortem is the mechanism by which the same incident recurs, often in a worse form. Without the written artifact, the failure mode is not catalogued, the eval cases are not added, the architectural assumption is not revised, the on-call runbook is not updated. The next time the same conditions arise, and they will, the team will rediscover the bug from scratch, except now the customer impact will be larger because the system has been deployed more widely.

What to do instead: write the post-mortem, scope the audience precisely, and let leadership see the version that names the engineering record accurately. Post-mortems can be communicated up with diplomatic framing without distorting the technical content. What cannot be done is to skip the writing entirely; that turns the next incident into the same incident with higher stakes. Agencies that fold on this request are accepting that they will be debugging the same failure mode in three months with twice the customer impact, which is exactly the conversation the engagement is supposed to prevent.

What the refusal sounds like

Across all seven of these patterns, the shape of a healthy refusal is the same. It names the request, it names the failure mode the request would produce, it names the alternative the agency proposes, and it commits to the alternative on a timeline. “We cannot ship under a disabled eval gate, and here is the partial-feature path that ships in five days with the gate active” is the move. “That is risky” is not the move; it is the lazy version of the same instinct, and it does not produce the alternative the client needs to make a decision.

The principled refusal is also paired with risk-acceptance language for the cases where the client decides to override. If the client hears the refusal, hears the alternative, and still wants the original request, the agency does not block; it documents. The original request is recorded, the risk is named in writing, the eval threshold or SLA is amended, and the engagement proceeds with eyes open. This is not the agency caving; it is the agency making sure that when the failure mode lands, the record is clear about who chose which trade-off and why. Agencies that pretend they can absorb every override silently are the ones whose engagements end in litigation.

The yes-agency is a bad agency

The shape of an AI engagement that produces working systems in 2026 is asymmetric: the agency holds the line on a small set of disciplines that protect the system, and is endlessly flexible on everything else. Project management cadence, communication channels, demo formats, sprint length, branch naming, ticket structure: all of that is the client’s call, and the agency follows. Eval discipline, model-swap visibility, citation integrity, PII handling, scope-expansion procedure, post-mortem hygiene: none of that is the client’s call alone, and the agency holds the line.

A client who has hired an agency for the second list and is upset to be refused on items in it has hired the wrong agency, or has hired the right agency and not yet noticed. The right move from the client’s side is not to escalate or shop for a more compliant firm; it is to listen to the alternative the agency proposed and to take the refusal as evidence that the agency is doing its job. The agencies that say yes to every request are not the ones shipping working systems in 18 months; they are the ones shipping invoices, and shipping the next round of remediation contracts to the firms that come in to clean up after them.

The advisor framing is not a marketing posture. It is the job description. An AI agency that cannot refuse requests cannot ship systems, because shipping AI systems in 2026 is fundamentally a discipline of choosing which requests to honor and which to redirect. Hire the firm whose first hard “no” comes early. The ones that never say no are the ones whose next failure mode is already on the calendar.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI agency in San Francisco. He has refused, in writing, every request catalogued here at least twice in the last 18 months.

Frequently Asked Questions

Why should an AI agency ever refuse a paying client’s request?

Because the agency is hired to ship a working system, not to execute every instruction. AI systems fail in ways traditional software does not: silently, statistically, and across a long tail of inputs. An agency that says yes to every request is optimizing for short-term client comfort and against the system the client is paying for. The right contract is advisory, not order-taking: the agency commits to outcomes, flags requests that endanger the outcome, and refuses the ones that would destroy it. Yes-agencies are bad agencies because their incentive structure forces them to ship known-broken systems on schedule rather than working systems on a renegotiated schedule.

What kinds of requests should a good AI agency push back on?

Seven recurring patterns: scope expansions without eval re-baselining, demands that the system ‘work for everything,’ pressure to skip evals to ship faster, UX requests to remove citations to make outputs sound more confident, fine-tuning on PII without a data processing agreement, cost-cutting via silent model swaps with no quality monitoring, and requests to suppress post-mortems to protect leadership feelings. Each of these has a predictable failure mode, and each is a request that any honest agency has to decline or restructure before agreeing.

Isn’t pushing back just an excuse to charge more?

It can be in a body-shop pricing model; agencies that bill hourly are incentivized to invent objections that extend the engagement. The signal that pushback is principled rather than commercial is that it is paired with a concrete refusal frame and a ‘do this instead’ alternative. An agency that says ‘we cannot fine-tune on PII without a DPA, but we can run the same eval-improvement campaign with synthetic-augmented data in two weeks’ is doing its job. An agency that says ‘this is risky, we need an extra 60 hours’ is not. The test is whether the alternative solves the underlying problem the client raised.

What does ‘eval re-baselining’ mean and why is it tied to scope changes?

An eval baseline is the set of ground-truth examples and pass/fail criteria the system is being measured against. When the scope expands (a new domain, a new user type, a new input modality), the original eval cases no longer cover the new behavior. Continuing to ship under the old eval gate creates the illusion of quality while the new scope is untested. Re-baselining means writing new eval cases for the new scope and re-running the suite before the new feature ships. It typically takes one to three days; agencies that skip it are trading a few days of work for a system that will fail silently in the new scope a month later.

Why is ‘make it work for everything’ a bad request?

AI systems do not generalize for free. A model tuned to perform well on contract review will perform poorly on customer support emails not because it ‘cannot do’ email but because the prompts, retrieval corpus, evals, and failure modes were all built around contracts. Asking the system to handle an open-ended set of tasks is asking for a system that handles none of them at the threshold the client originally specified. The honest response is to scope the second use case as a separate engagement with its own eval baseline, and to refuse the framing that the existing system ‘should just handle’ the new domain.

When should the agency refuse to remove citations from model outputs?

When the request is motivated by ‘it sounds less confident with citations’ rather than a substantive UX concern. Citations are the audit trail that distinguishes a retrieval-grounded answer from a hallucination; removing them does not make the answer more accurate, it makes the inaccuracy harder to detect. Legitimate UX iterations on citation rendering (making them collapsible, moving them to footnotes, redesigning the visual treatment) are negotiable. Removing them entirely to make the output ‘sound smarter’ is a request the agency has to refuse, because the next regulatory or trust incident will be traced directly to that decision.

What is the right response to ‘use a cheaper model and don’t tell us if quality drops’?

Refuse the silence, accept the model swap. Cost optimization is a legitimate goal and switching to a cheaper or smaller model is often the right move; the part the agency must refuse is the ‘don’t tell us if quality drops’ clause. The contract should be: every model swap re-runs the eval suite, the eval delta is reported, and if quality drops below the agreed threshold, the swap is reverted or the threshold is renegotiated openly. An agency that swaps models silently to hit a cost target is committing fraud against the eval gate it was hired to enforce.

Why is suppressing a post-mortem worse than the underlying incident?

Because the underlying incident is a known event with a bounded blast radius, while a suppressed post-mortem creates an unbounded blast radius across every future incident the system has. Without a written post-mortem, the failure mode is not catalogued, the eval cases are not added, the architectural assumption is not revised, and the incident will recur, often in a worse form. Agencies that agree to skip the post-mortem ‘so we do not scare leadership’ are accepting that the next incident will be the same one with higher stakes. The honest response is to write the post-mortem, scope the leadership audience, and let the engineering record stay accurate.

How should clients react when their AI agency refuses a request?

First by listening to the alternative. A principled refusal is usually paired with a ‘do this instead’ that addresses the underlying need. If the alternative is reasonable and the client still wants the original request, the conversation moves to risk-acceptance: the request is documented, the risk is named in writing, the eval threshold or SLA is updated to match the new reality, and the engagement proceeds with eyes open. The wrong response is to escalate or shop for a more compliant agency; the agencies that say yes to every request are not the ones that ship working systems, they are the ones that ship invoices.

Is the ‘advisor’ framing compatible with paid client work?

Yes, and it is the only framing that produces good AI work. Lawyers, doctors, and architects all bill clients while routinely refusing client requests; the refusals are the value the client is paying for. AI agencies operate in the same shape: the hard part of the work is judgment about which requests will produce a working system and which will produce a failure mode that detonates after the engagement ends. Clients paying for an order-taker can hire a body shop; clients paying for an outcome are paying the agency to know when to push back. The advisor framing is not a posture, it is the job description.

