
The AI agency RFP is broken. Here is what replaces it.

The RFP was designed to procure a pre-defined deliverable from a market of interchangeable suppliers. AI work is neither. That mismatch is why most AI RFPs I have seen in the last two years either selected the wrong vendor, wasted four months in procurement theatre, or both. The fix is not a better RFP template. The fix is to retire the RFP entirely for AI engagements and replace it with a paid 2-week pilot: named eval, named pull-request deliverable, named senior engineer, fixed budget, and a kill clause that lets you walk on day 14 with nothing owed past the cap.

This is not a contrarian take inside the buying community. The forward-deployed engineering teams that ship production AI for serious companies in 2026 already operate this way; the agencies that still respond to 60-page RFPs are the ones losing the work that matters. The procurement function has not caught up. This piece argues why it should and what specifically replaces the RFP, down to the brief format you put in front of three vendors next Monday.

If you want the longer argument for what an AI development partner should be in 2026, the AI agency manifesto sits upstream of this piece. This article is the operational consequence: if that is the partner you want, the RFP is the wrong instrument to find them.

Why the RFP fails for AI work

The traditional RFP assumes you can specify the deliverable in writing, score five vendors against the same rubric, and pick the lowest qualified bidder. Each assumption breaks for AI engagements.

1. The scope cannot be rigid. A 60-page AI RFP that locks scope on day one is locking it before the buyer has met the data. The first two weeks of any AI project surface things the spec did not anticipate: class imbalance, label noise, a retrieval corpus that is dirtier than the SME claimed, a latency budget that forces a smaller model than the requirements document specified. Rigid scope means the agency either pads the bid to absorb the unknowns or burns the unknowns silently and ships a worse system. Neither outcome is what you wanted.

2. The RFP is deck-first, not artifact-first. RFP responses are written by proposal teams, not engineers. You get a polished narrative, three logos of past clients, a Gantt chart, and zero evidence of how the team ships. The artifacts that predict success (pull requests, eval suites, post-mortems, architecture sketches) do not fit in an RFP response template, so they are excluded. You are scoring writing quality, not engineering quality. Those correlate weakly.

3. There is no eval in the RFP. Most AI RFPs do not name a single evaluation metric. They say “the system shall achieve high accuracy” and leave the agency to decide what high means. The agency that wins is the one most willing to claim a number it cannot defend. A serious eval (a labeled test set, a threshold, a who-set-it rationale) is the single sharpest predictor of project success. The RFP format treats it as an implementation detail.

4. The lowest-bidder bias is fatal. Procurement scoring rubrics weight price heavily because that is how the function justifies itself. For commodity work, that is fine. For AI work, the gap between a 19-of-20 team and a 12-of-20 team is six months of your roadmap and one or two failed launches. A 30 percent discount from the wrong team is not a saving; it is a tax. The RFP rubric cannot see this, because it cannot see engineering velocity or eval rigor, only price and headcount.

5. Team velocity is invisible. A team that has shipped five production AI systems sounds completely different from a team that has shipped 50 prototypes. None of that is captured in an RFP. You are picking on logos and rate cards, not on whether the named senior engineer ships a PR a day or has not touched code in three years. The RFP does not even ask who specifically will work on the project, let alone require named engineers in the response.

The combined failure mode is predictable: a four-month procurement cycle that ends with a contract awarded to the safest-looking vendor, a kickoff six weeks later, and a quiet realization in month five that the team you bought is not the team you needed. By then sunk-cost reasoning takes over and you finish the engagement instead of cutting it.

What replaces the RFP: a paid 2-week pilot

The replacement is structurally smaller and informationally larger. You issue a one-page brief to three vendors, fund a paid 2-week pilot at each, and pick the winner based on what they ship. The pilot is real production code on a real subset of your problem, with a real eval and a real PR that gets reviewed by your engineering team. At the end of week 2, you have three artifacts to compare instead of three decks. The decision is artifact-driven, not narrative-driven.

The five non-negotiable elements of the pilot:

Element | What it means
Named eval | A specific labeled test set, a threshold, and a written rationale tying the threshold to a business outcome
Named PR | A pull-request deliverable into your repo (or a fork), reviewable by your engineers, mergeable in principle
Named senior engineer | The actual person doing the work, identified by name and GitHub handle, with a 50%+ time commitment
Fixed budget cap | A capped fee (typically $20K–$40K per pilot), paid on day 14 regardless of outcome
Kill clause | Either side can walk on day 14 with nothing owed past the cap and no obligation to continue

A pilot is not a free trial. It is not a “proof of concept” that lives in a Notion page. It is two weeks of paid engineering producing a reviewable diff against a real eval. If three vendors run the pilot in parallel, you have spent at most $120K and four calendar weeks to see exactly how three teams work, including who is on the call, who writes the code, and who responds when an eval fails. That is more signal than any RFP has ever produced for ten times the cost.

Two-week pilots also self-select for the right vendors. Agencies that operate as proposal shops will refuse a paid pilot or quote $200K for one. Agencies that ship production AI accept it on a Tuesday and start on the next Monday. The first call after the brief goes out tells you which is which. For a deeper procurement-side breakdown of what to look for in vendor responses, the AI consulting RFP template is a useful adjunct; it covers the structured questions to ask once a pilot is in motion.

The brief format that replaces the RFP

The brief is one page. Five sections. No marketing language, no scoring rubric, no procurement boilerplate. Send it to three vendors, expect a yes/no within 48 hours, and start the pilots within 10 business days.

1. Problem statement (one paragraph)

Describe the user, the workflow, and the failure mode you are trying to fix. Concretely. “Our claims-adjudication team manually reads 1,200 incoming claim documents per week, takes 9 minutes per document, and the error rate on routing-to-the-right-handler is roughly 14 percent. We want to know whether an LLM-based routing system can cut that to under 5 percent at under $0.04 per document.” Specific user, specific time, specific failure rate, specific cost ceiling. Not “we want to use AI to be more efficient.”

2. Eval rubric (one paragraph + a table)

Specify the labeled test set, the threshold, and the metric. “We will provide a test set of 500 claim documents with hand-labeled correct routes. Pass threshold for the pilot is 85 percent top-1 routing accuracy. Stretch is 90 percent. Per-document cost ceiling is $0.04 at the chosen model. Latency p95 ceiling is 4 seconds.” If you cannot produce a labeled test set in the next 5 business days, you are not ready to issue the brief; go produce the test set first. The willingness to invest in the eval is itself the qualification gate.
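The rubric above is concrete enough to be checked mechanically. The following is a minimal sketch of such a gate, assuming each pilot run emits one record per test document; the names `Thresholds`, `DocResult`, `p95`, and `grade` are illustrative and not part of any particular eval tool, and the nearest-rank percentile is only one of several common definitions. Mean cost per document is used for the cost ceiling, which is also an assumption; a stricter brief might cap the maximum instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Numbers copied from the example rubric; replace with your own.
    pass_accuracy: float = 0.85
    stretch_accuracy: float = 0.90
    max_cost_per_doc_usd: float = 0.04
    max_p95_latency_s: float = 4.0


@dataclass(frozen=True)
class DocResult:
    predicted_route: str
    expected_route: str   # hand-labeled route from the test set
    cost_usd: float       # inference cost for this document
    latency_s: float      # end-to-end latency for this document


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def grade(results, t=Thresholds()):
    """Score a pilot run against the rubric and return a small report."""
    if not results:
        raise ValueError("the test set is empty")
    n = len(results)
    correct = sum(r.predicted_route == r.expected_route for r in results)
    accuracy = correct / n
    mean_cost = sum(r.cost_usd for r in results) / n
    latency = p95([r.latency_s for r in results])
    within_limits = (mean_cost <= t.max_cost_per_doc_usd
                     and latency <= t.max_p95_latency_s)
    return {
        "accuracy": accuracy,
        "mean_cost_usd": mean_cost,
        "p95_latency_s": latency,
        "passed": accuracy >= t.pass_accuracy and within_limits,
        "stretch": accuracy >= t.stretch_accuracy and within_limits,
    }
```

Cost and latency ceilings gate both the pass and the stretch result here, so a vendor cannot hit the accuracy bar by buying it with a slower or more expensive model.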

3. Success criteria (bullet list)

Three to five concrete deliverables, each tied to an artifact:

  • A pull request into the named repo (or a designated fork) implementing the routing service
  • An eval suite under evals/ with the test set, threshold, and CI configuration
  • A README explaining how to run the eval locally and in CI
  • A 1-page architecture note describing the model, retrieval (if any), and failure-mode boundaries
  • A 30-minute readout of the pilot, attended by the named senior engineer, with results against the eval

If any of these are missing on day 14, the pilot has not satisfied the criteria. Pay the agreed fee, do not extend, do not pick that vendor.
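The file-based part of that day-14 check can be scripted against a checkout of the vendor's PR branch. This is a rough sketch under stated assumptions: only the `evals/` directory and a README are named in the brief, so the README filenames and the `docs/architecture.md` location are hypothetical, as is the helper name `missing_deliverables`.

```python
from pathlib import Path

# Only `evals/` and a README are named in the brief; the other paths
# below are hypothetical locations chosen for illustration.
REQUIRED_ARTIFACTS = {
    "eval suite": ("evals",),
    "README": ("README.md", "README"),
    "architecture note": ("docs/architecture.md",),
}


def missing_deliverables(repo_root):
    """Return the names of file-based deliverables absent from the checkout."""
    root = Path(repo_root)
    return [
        name
        for name, candidates in REQUIRED_ARTIFACTS.items()
        if not any((root / candidate).exists() for candidate in candidates)
    ]
```

An empty list means the files are present, not that they are good. The pull request itself and the 30-minute readout still need a human to confirm.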

4. Budget cap (one line)

A single number. “$30,000 fixed fee, paid on day 14 against deliverables. No change orders. No expense pass-through above $1,500.” The fixed fee is what makes the pilot honest. It removes the incentive to pad the scope, removes any room for hidden inference markup, and forces the agency to staff the pilot with the team that can finish it in two weeks. Cheap pilots fail on staffing, not budget.

5. Kill clause (one line)

“Either party may terminate at the end of week 2 with no further obligation. The buyer retains all code, evals, and documentation produced during the pilot under standard work-for-hire terms.” This is the line that separates partners from extractors. An agency that wants to lock you into a 12-month MSA before the first eval runs is telling you something about how they make money.

That is the brief. One page. No scoring rubric. No two-day vendor conferences. No 90-day procurement cycle. You will know in 4 calendar weeks whether the partner is right, and you will spend less on the entire selection process than you would on the legal review of a single conventional RFP.

How the pilot answers the questions an RFP cannot

The reason a 2-week pilot is more diagnostic than a 60-page RFP is that it forces the vendor to make decisions under time pressure with real artifacts. Each decision answers a question the RFP could only ask hypothetically.

  • Who ships the code? The named senior engineer commits or they do not. A name on an RFP page tells you nothing; a name on a commit tells you everything.
  • Do they have eval discipline? They either deliver a real eval suite by day 14 or they deliver a notebook with vibes. The RFP cannot test this. The pilot tests it as a side effect of the deliverable.
  • How do they handle reality colliding with scope? The first surprise (a noisy label, a missing field, a 4-second p95) is the most informative event in the engagement. Watching the agency handle it in week 1 is worth ten reference checks.
  • Do they communicate like engineers or like account managers? The Slack channel during the pilot tells you. Engineers ask sharp specific questions and post short status updates. Account managers send weekly recap decks.
  • Do they fit your stack? The PR is in your codebase against your conventions. By day 14 you know whether their engineers can navigate your repo or whether everything looked clean only because their RFP mockup was stand-alone.

The pilot also surfaces the inverse: it tells you something about your own readiness. If you cannot produce a labeled test set in 5 days, your project is not eval-shaped yet. If you cannot give the agency repo access in 48 hours, your security posture will throttle any agency you pick. Better to learn that on a $30K pilot than on a $1.5M engagement.

For a structured comparison of what to do with the three pilot outputs once you have them, the comparing AI development proposals guide walks through scoring artifacts side by side. It is the procurement-side rubric that the RFP rubric should have been all along.

The objection from procurement (and why it does not hold)

The standard objection is: “We cannot fund three parallel paid pilots without a competitive process upstream.” This is procedural, not structural. Three $30K pilots are $90K total, which is within director-level discretionary budget at most enterprises and below the threshold that triggers formal RFP requirements. Where formal procurement is required, the one-page brief can be issued as a qualified vendor solicitation, a lighter vehicle most policies allow under $250K.
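The arithmetic behind that answer is simple enough to hand to procurement directly. A small sketch, assuming expenses up to the pass-through cap in the sample budget line are billed on top of the fixed fee (one possible reading of that line, not something the brief states outright); `worst_case_spend` and `fits_under` are illustrative names, and the $250K ceiling is the figure cited above, which your own policy may set differently.

```python
def worst_case_spend(vendors, fixed_fee_usd, expense_cap_usd=0):
    """Upper bound on selection spend: every pilot paid in full plus capped expenses."""
    if vendors < 0 or fixed_fee_usd < 0 or expense_cap_usd < 0:
        raise ValueError("inputs must be non-negative")
    return vendors * (fixed_fee_usd + expense_cap_usd)


def fits_under(ceiling_usd, vendors, fixed_fee_usd, expense_cap_usd=0):
    """True when the worst case stays strictly below a policy ceiling."""
    return worst_case_spend(vendors, fixed_fee_usd, expense_cap_usd) < ceiling_usd


print(worst_case_spend(3, 30_000))            # 90000
print(worst_case_spend(3, 30_000, 1_500))     # 94500
print(worst_case_spend(3, 40_000))            # 120000
print(fits_under(250_000, 3, 40_000, 1_500))  # True
```

Even at the top of the $20K–$40K range with expenses included, three pilots stay well under the $250K figure.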

The second objection is “we do not have engineering capacity to evaluate three pilots in parallel.” That is a real constraint and it is also the point. If your team cannot review three PRs against a defined eval in two weeks, the RFP would not have helped you either; you would have picked on the deck and learned the truth after kickoff. The pilot moves that learning forward by four months.

The third objection is “we want a partner for the long term, not a pilot.” Good. A 2-week pilot is the cheapest path to that partnership. The agencies that win pilots win the multi-year engagement that follows. You are not replacing the partnership; you are replacing the selection mechanism.

What this changes about how AI agencies should sell

Agencies that still build their pipeline around RFP responses are competing for the wrong work. The companies serious about shipping production AI in 2026 are running paid pilots and picking on artifacts; agencies optimizing for the RFP funnel are converting the buyers least likely to pay for high-quality engineering work, at price points where margin is thinnest. That is not a winning quadrant.

The agencies winning the work that matters in 2026 publish eval frameworks, post open-source pilot templates, and let prospective clients audit a real shipped repo before the first call. They have made themselves illegible to RFP procurement. The buyer-side fix in this piece is, not coincidentally, the seller-side strategy that attracts those agencies. The RFP is broken because both sides are quietly walking away from it. Make that explicit, and the procurement timeline collapses from 18 weeks to 4.


Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run paid pilots as the qualifying step for AI engagements at Series B through enterprise scale and has watched conventional RFP processes fail in real time often enough to argue, here, for retiring them entirely for AI work.

Frequently Asked Questions

Why is the traditional RFP broken for AI work in 2026?

The RFP assumes you can specify the deliverable in writing on day one, score interchangeable vendors against the same rubric, and pick the lowest qualified bidder. AI projects break each assumption. The first two weeks of any AI engagement surface things the spec did not anticipate: class imbalance, label noise, retrieval drift, latency budgets that force smaller models. Rigid scope forces the agency to either pad the bid or burn the unknowns silently. The RFP also rewards proposal-team writing rather than engineering quality, ignores eval discipline, and applies a lowest-bidder bias that is fatal when the gap between a top team and a mediocre team is six months of roadmap.

What is the AI agency RFP alternative?

A paid 2-week pilot. You issue a one-page brief to three vendors, fund a paid pilot at each, and pick the winner based on what they ship. The pilot has five non-negotiable elements: a named eval (specific labeled test set and threshold), a named pull-request deliverable into your repo, a named senior engineer with a 50%+ time commitment, a fixed budget cap (typically $20K–$40K), and a kill clause that lets either side walk on day 14. Three parallel pilots cost at most $120K and 4 calendar weeks, far less than the legal review of one conventional RFP.

How long should a paid pilot AI agency engagement run?

Two weeks is the right duration. Long enough for the agency to ship a real PR against a real eval suite in your codebase, short enough that the cost is bounded and the kill clause has teeth. Pilots shorter than two weeks degrade into demos; pilots longer than three weeks lose the time-pressure that makes them diagnostic. The fixed-fee structure is what keeps it honest: it removes the incentive to pad scope and forces the agency to staff the pilot with the team that can finish it.

What goes in the pilot brief that replaces the RFP?

Five sections on one page. (1) Problem statement: specific user, specific workflow, specific failure mode with numbers. (2) Eval rubric: a labeled test set, a pass threshold, a per-document or per-call cost ceiling, a latency ceiling. (3) Success criteria: a PR into the named repo, an evals/ directory with CI, a README, a 1-page architecture note, a 30-minute readout with the named engineer. (4) Budget cap: a single fixed-fee number, no change orders. (5) Kill clause: either party may terminate at the end of week 2 with no further obligation, and the buyer retains all artifacts under work-for-hire terms.

Why does AI vendor selection in 2026 require a named eval?

Eval discipline is the single sharpest competence signal in AI engineering. The tooling is mature (Promptfoo, LangSmith, Braintrust, Anthropic eval tools, OpenAI evals), and any agency operating in 2026 without a serious eval practice is a 2023 archetype at 2026 prices. A named eval (specific labeled test set, written threshold, a who-set-it rationale tied to a business outcome) forces the agency to commit to a falsifiable bar in week 1 instead of claiming high accuracy in writing. Most AI RFPs do not name a single evaluation metric, which is why the agency that wins is the one most willing to claim a number it cannot defend.

What is the kill clause and why is it necessary?

The kill clause is a single line in the brief: either party may terminate at the end of week 2 with no further obligation, and the buyer retains all code, evals, and documentation under work-for-hire terms. It is necessary because it separates partner-style agencies from extractive ones. An agency that wants to lock you into a 12-month MSA before the first eval runs is signalling how it makes money. A kill clause keeps the relationship voluntary on day 14, which is exactly when you have the most signal (three pilots’ worth of artifacts) to make a real decision.

Why fund three parallel paid pilots instead of one?

Because the cost of selecting the wrong AI vendor is six months of your roadmap, and three artifacts beat one every time. With three parallel pilots you compare three real PRs against the same eval, see how three teams handle the first surprise (a noisy label, a missing field, an inflated p95), and observe three Slack channels’ worth of engineering communication. Three $30K pilots are $90K, well below most procurement RFP thresholds and a fraction of what a botched 12-month engagement costs. The signal-per-dollar of three pilots is roughly 50x that of three RFP responses.

What if our procurement function requires a formal RFP?

Two paths. First, in most enterprises three $30K pilots ($90K total) sit below the threshold that triggers formal RFP requirements and within director-level discretionary spend. Second, where formal procurement is required, the one-page brief can be issued as a qualified vendor solicitation, a lighter procurement vehicle that most policies allow under $250K. The brief format is fully compatible with vendor-management compliance: it is auditable, fixed-price, and produces objective deliverables. The procurement objection is procedural, not structural, and most procurement teams accept the pilot model when the cost-versus-risk math is shown.

What does a paid pilot reveal that an RFP response cannot?

Five things that an RFP cannot test. Who ships the code (the named senior engineer either commits or they do not). Whether the agency has eval discipline (they deliver an eval suite by day 14 or they deliver a notebook). How they handle reality colliding with scope (the first surprise is the most informative event in the engagement). Whether they communicate like engineers or like account managers (the Slack channel during the pilot tells you). Whether their team can navigate your codebase under your conventions (the PR is in your repo, not theirs). None of this fits an RFP response template.

Will good AI agencies accept a paid pilot brief instead of a full RFP?

The good ones prefer it. Forward-deployed engineering teams accept a 2-week paid pilot on a Tuesday and start on the next Monday because it lets them compete on the strength of their engineering rather than on proposal-team writing. Agencies that decline a paid pilot, or that quote $200K to run one, are telling you they make money on procurement theatre rather than on shipped code. The first call after the brief goes out tells you which is which. The pilot model also self-selects the agencies whose business model is aligned with yours.

Last Updated: May 21, 2026
