
The AI Agency Manifesto: What an AI Dev Partner Should Actually Be in 2026


An AI development agency in 2026 is a forward-deployed engineering team that ships production code, owns the eval suite, and bills in transparent line items — not a strategy practice that ends in a slide deck. The 2023 archetype — a roomful of “prompt engineers” producing Streamlit demos and PDF playbooks — is now a $20-per-month ChatGPT subscription with extra steps. Anything between those two poles is a transitional artifact, and most of it will not survive the next eighteen months.

That is the entire thesis. The rest of this manifesto is what an AI agency owes you, written down, numbered, and signed.

Why This Manifesto Exists

Buyers procuring AI development work in 2026 face a market with three populations of vendors mixed under one label:

  1. Strategy shops still selling 2023-era engagements: vendor matrices, capability maps, six-month “AI roadmaps” that produce no shipped software.
  2. Prompt-engineering boutiques that mistook a temporary skill arbitrage for a durable business and now compete with Cursor, Claude Code, and Codex CLI as if those tools did not exist.
  3. Forward-deployed engineering organizations that treat AI work as software work — disciplined evals, model-agnostic architectures, on-call rotations, transparent inference billing.

The first two groups are pricing themselves at six figures per month for outputs that an internal engineer with a Claude Code Max subscription can replicate over a long weekend. According to Stack Overflow’s 2025 Developer Survey, 84% of professional developers reported using AI tools in their daily workflow, with the majority using paid tiers like Cursor Pro and GitHub Copilot Business. Anthropic and OpenAI have both reported double-digit-billion run rates driven heavily by individual developer subscriptions, not enterprise consulting passthroughs.

That is the macro signal. The market for “AI strategy” is collapsing into the price of the underlying tools. The market for forward-deployed engineering judgment is expanding, because models keep getting better while teams keep failing to operationalize them — BCG’s 2024 study Where’s the Value in AI? reported that only ~10% of enterprise AI value comes from algorithms; the remaining 90% comes from people, process, and integration work.

This manifesto names the eleven commitments a 2026 AI agency must make in writing — to itself, to its engineers, and to every buyer it signs with. Each commitment is deliberately specific enough to quote back to a vendor in a procurement call.


The Eleven Commitments

1. Code Is The Deliverable

The deliverable is running code in your repository, behind your auth, on your infrastructure. Not a slide deck. Not a Streamlit demo on Hugging Face Spaces. Not a private Notion page. Code, in your version control, with commit history that survives the engagement.

The 2023 consulting model treated AI work as an analyst exercise: capability mapping, vendor selection, “use-case prioritization.” That work was billable because the underlying technology was unfamiliar enough that translation was the value. The technology is no longer unfamiliar. McKinsey’s State of AI in early 2025 found that 78% of organizations now use AI in at least one function — translation is no longer scarce. What is scarce is the engineering discipline to take a model output and turn it into a production system that does not embarrass you.

A 2026 AI agency commits, in the SOW, to one or more of: a deployed service, a merged PR, a published package, a running data pipeline. The artifact must be inspectable by a senior engineer at the buyer’s organization without the agency present. If the only artifact a buyer can show their CTO is a deck and a demo URL, the agency has shipped nothing.

This is not anti-strategy. Strategy work has its place — sequencing, organizational design, build-vs-buy framing. But it must be grounded in a shipped pilot, not in slideware. Anchor strategy in artifact, or do not bill for it.

2. Evals Are The Contract

A model is non-deterministic. A demo is one sample point on a non-deterministic system. The only durable contract between a buyer and an AI agency is an evaluation suite: a fixed set of inputs, expected behaviors, and quality thresholds, checked into version control, runnable on every commit.

Without evals, the buyer cannot detect regressions when the model is silently updated — and frontier models are updated constantly. In the eighteen months ending Q2 2026, OpenAI shipped GPT-5 then GPT-5.4, Anthropic shipped Claude Opus 4 through Opus 4.6, Google shipped Gemini 3.1 Pro, and the open-weights frontier moved from Llama 3 to Llama 4 Scout. Each transition silently changes the behavior of any system pinned to a “stable” model alias.

A 2026 AI agency delivers, with every shipped feature:

  • A named eval suite (Promptfoo, LangSmith, Ragas, Anthropic’s eval tooling, or a custom harness — tool choice is negotiable; presence is not).
  • A documented threshold the buyer agreed to before launch (“≥85% pass on the regulatory-question set, ≥0.92 cosine similarity on retrieval, p95 latency under 4s”).
  • A CI integration that fails the build when thresholds are missed.
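A custom harness in this spirit can be very small. The sketch below is illustrative, not any of the named tools: the `model` function is a stand-in for a real provider call, and the cases and threshold are invented for the example.

```python
# Minimal eval-harness sketch: a fixed input set, expected behaviors, and a
# pass-rate threshold a CI job can enforce. The `model` function is a
# stand-in for a real provider call; cases and threshold are illustrative.

EVAL_CASES = [
    {"input": "Is a W-2 required for contractors?", "must_contain": "1099"},
    {"input": "What is the tax filing deadline?", "must_contain": "April"},
]

PASS_THRESHOLD = 0.85  # agreed with the buyer before launch


def model(prompt: str) -> str:
    """Stand-in for a provider call; swap in the real client here."""
    canned = {
        "Is a W-2 required for contractors?": "No; contractors receive a 1099.",
        "What is the tax filing deadline?": "Typically April 15 in the US.",
    }
    return canned.get(prompt, "")


def run_evals() -> float:
    """Return the pass rate over the fixed eval set."""
    passed = sum(
        1
        for case in EVAL_CASES
        if case["must_contain"].lower() in model(case["input"]).lower()
    )
    return passed / len(EVAL_CASES)


def ci_gate() -> bool:
    """A CI wrapper exits non-zero (fails the build) when this is False."""
    return run_evals() >= PASS_THRESHOLD
```

The point is the shape, not the tooling: the cases live in version control, the threshold is named, and the gate runs on every commit.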

Buyers should ask, on every vetting call, to see an eval suite from a recent project and read a postmortem from a regression caught by it. The presence of postmortems is the strongest single positive signal in agency vetting. Their absence is the only signal a buyer needs to keep looking. We expand on the practice in AI model evaluation testing services.

3. Forward-Deployed By Default

Forward-deployed engineering means the agency’s engineers operate inside the buyer’s environment — codebase, communications, ticketing, on-call rotation — rather than in a vendor sandbox handing artifacts over a wall.

The pattern was popularized by Palantir’s forward-deployed engineers in the 2010s and has become the dominant delivery model for serious AI work for one reason: AI features fail in production for client-specific reasons. A particular data shape. A particular usage pattern. A particular regulatory constraint that nobody flagged in scoping. Those failures are invisible from outside the buyer’s environment.

A 2026 AI agency commits to:

  • Engineers in the buyer’s GitHub organization, not a separate “vendor repo.”
  • Engineers in the buyer’s Slack, not a vendor liaison channel.
  • Engineers on the buyer’s incident rotation for systems they ship, with documented escalation paths.
  • A single point of accountability — usually a tech lead — who is reachable directly, not via account management.

The opposite pattern — vendor sandboxes, weekly status meetings, deliverables-over-the-wall — is a 2018-era staff augmentation model. It survives in 2026 only because procurement habits are sticky. The cost of that stickiness shows up in coordination overhead, which routinely consumes a meaningful slice of any traditional agency engagement budget. The cleaner build-vs-outsource framing is laid out in AI agency vs in-house team decision.

4. Model-Agnostic Architecture

Hard-coding to a single model vendor is the most predictable architectural mistake in AI projects. Models are deprecated on six-to-eighteen-month cycles. Pricing curves are unstable — OpenAI cut GPT-4-class pricing by ~80% over 2023–2024, and Anthropic made comparable cuts via prompt caching in 2024. A system architecturally married to one provider is a system designed to be rewritten.

A 2026 AI agency commits to a thin model-router abstraction — usually 100 to 300 lines of code — that lets the buyer:

  • Route cheap requests (classification, simple extraction) to small fast models.
  • Route expensive requests (reasoning, planning, generation) to frontier models.
  • Swap providers when pricing or capability shifts.
  • Run the same eval suite across providers to validate equivalence before migration.

Tools like LiteLLM, Portkey, and Vercel’s AI SDK have made this practice cheap and standard. There is no defensible reason in 2026 to ship a system that imports openai or anthropic directly from application code. If an agency proposes one, ask why — and accept “we move faster initially” only if there is a documented migration plan and a budgeted sprint to add the abstraction before the system carries production load.
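The core of such a router is a single lookup that application code calls instead of a provider SDK. The sketch below uses invented provider and model names; a real implementation would hand the chosen route to a library like LiteLLM or a provider client.

```python
# Thin model-router sketch: application code asks for a task class, the
# router picks a provider/model pair. Provider and model names here are
# placeholders, not real model IDs.

from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str
    model: str


# Single place to repoint when pricing or capability shifts.
ROUTES = {
    "cheap": Route("anthropic", "small-fast-model"),    # classification, extraction
    "frontier": Route("openai", "frontier-model"),      # reasoning, planning, generation
}

CHEAP_TASKS = {"classify", "extract"}


def pick_route(task: str) -> Route:
    """Map a task class to a provider/model pair."""
    tier = "cheap" if task in CHEAP_TASKS else "frontier"
    return ROUTES[tier]
```

Because the route is data, swapping providers is an edit to `ROUTES`, and the same eval suite can be replayed against the old and new routes before migration.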

5. No Token Arbitrage

A token-arbitrage agency is one that holds the model API keys, bills the buyer a flat monthly fee that includes inference, and quietly captures the spread between the buyer’s monthly check and the agency’s actual model spend. It is a cost-plus consulting trick dressed up as a managed service.

The honest model is direct billing: the buyer’s Anthropic, OpenAI, or Google account, the buyer’s bill, the agency’s job is to make that bill predictable, observable, and small. If an agency resists this structure, the structure is the answer.

A 2026 AI agency commits to:

  • Direct provider billing in the buyer’s accounts.
  • Per-feature inference cost dashboards, refreshed at least weekly.
  • A documented unit-economics model: dollars-per-request, dollars-per-active-user, dollars-per-document-processed.
  • Quarterly cost-reduction reviews against named target ratios.
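The unit-economics model in the second bullet is arithmetic, not infrastructure. A sketch, with placeholder per-million-token prices rather than any provider's current rates:

```python
# Unit-economics sketch: dollars-per-request and dollars-per-active-user
# from token counts and a price table. Prices are placeholders, not
# current provider rates.

PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens (assumed)


def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Inference cost of one request at the assumed rates."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
    )


def cost_per_active_user(requests_per_user: int, avg_in: int, avg_out: int) -> float:
    """Monthly inference cost per active user at the assumed usage profile."""
    return requests_per_user * cost_per_request(avg_in, avg_out)
```

At these assumed rates, a request with 2,000 input and 500 output tokens costs about $0.0135; a user making 100 such requests a month costs about $1.35. Those are the numbers a weekly dashboard should surface per feature.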

Token arbitrage is invisible to the buyer until they hire a second agency or a senior engineer who looks at the actual provider bills and the actual usage patterns. By that point, six to twelve months of margin has been quietly capitalized into the first agency’s books. Read the cost discipline in detail in AI development agency cost in 2026.

6. Software Engineering Judgment First

“AI experience” is the wrong primary filter for a 2026 vendor. Cursor, Claude Code, and similar tools have collapsed the gap between a senior engineer who has been writing AI code for three years and a senior engineer who started six months ago. The remaining gap is software engineering judgment: system design, failure-mode analysis, observability, security, on-call discipline. Those skills compound over a ten-to-twenty-year career. A six-month “AI specialization” does not.

The practical implication: prefer agencies whose senior engineers have shipped non-AI production systems at meaningful scale. Ask about distributed-systems incidents, database migrations under load, security postmortems, dependency-supply-chain decisions. Engineers who can answer those questions conversationally for several recent examples will figure out the AI parts. Engineers who can only talk about prompts and embeddings will be exposed by the first production incident.

This commitment also reframes hiring. A 2026 AI agency is, structurally, a senior software engineering shop that happens to specialize in agentic systems — not a “GenAI talent network.” Compare the consulting-vs-agency framing in AI consulting vs development agency.

7. Demos Are Not Evidence

The 2023 consulting selling motion was: pitch deck, capability demo, statement of work, six months of meetings, a final readout. Buyers learned to read decks for sophistication. They did not learn to read demos for production-readiness, because most 2023 demos were never going to be production.

A demo proves three things:

  • Someone can wire an API to a UI.
  • The chosen example does not embarrass the demoer.
  • The model, on this prompt, on this day, with this latency tolerance, gave an acceptable answer.

A demo does not prove: error handling, retry semantics, idempotency, rate-limit recovery, content-filter handling, latency under load, cost under load, behavior on adversarial inputs, behavior on malformed inputs, behavior on the input distribution the buyer actually has.

A 2026 AI agency commits to evidence over demos:

  • Postmortems from real production incidents, redacted of client specifics but readable.
  • Eval-suite snapshots showing pass rates over time, including regressions and recoveries.
  • Load-test reports against realistic input distributions.
  • A reference call with a buyer whose system has been in production for at least six months.

If an agency cannot provide all four, they have not yet shipped what they say they ship. That is acceptable for a junior agency at junior pricing — it is not acceptable at senior pricing. We catalogue the harder vetting questions in evaluate AI developer portfolios and check AI developer references.

8. Maintenance Economics, Disclosed

AI systems decay. Prompts drift as models update. Retrieval indexes go stale. Agent loops that worked at launch hit edge cases at scale. Someone has to pay for that work, and the engagement contract should name them in advance.

A 2026 AI agency commits, in the SOW, to:

  • A named maintenance cadence (e.g., monthly model-version regression sweep, quarterly retrieval-quality audit, semi-annual prompt-registry review).
  • A named owner — buyer engineering, agency, or a hybrid retainer — for each maintenance dimension.
  • A model-deprecation clause: who pays to migrate when the underlying model is retired or repriced, and on what timeline.
  • A handoff inventory: prompts, evals, fine-tunes, RAG indexes, agent configurations, named with version and ownership.

The 2023 default was to bill the build, ship the system, and let the buyer discover six months later that “the AI feature isn’t working as well as it used to.” That is a structural failure of the engagement contract, not a surprise of the technology. We build the timeline economics out further in AI development agency ROI timeline.

9. The Spec Is A Living Document

A 2026 AI feature is rarely fully specifiable up front. The model’s capabilities define what is possible. The data defines what is reliable. The user behavior defines what is desired. None of those are knowable on day one of a four-month engagement.

The traditional fixed-bid SOW assumes a stable spec. AI work breaks that assumption. A 2026 AI agency commits to a living-spec discipline:

  • Initial SOW scoped to a Discovery phase (typically 2–4 weeks) with named exit criteria, not a final feature list.
  • Working agreements that allow scope to compress or expand based on what the evals reveal.
  • A weekly written change log of spec deltas, signed by both sides — not buried in Slack.
  • A clear stop-loss: if Discovery reveals the feature is not feasible at the buyer’s quality bar, both parties exit cleanly without sunk-cost capture.

The opposite — locking a fixed-feature SOW months before the team understands the data — is the single largest source of mid-engagement disputes. The right structure is described in AI consulting discovery phase.

10. Persistence Is The Moat

The 2025–2026 frontier shifted from single-shot LLM calls to persistent agentic loops — Anthropic’s Claude Code, OpenAI’s Codex CLI, Cursor’s Composer, and the broader class of long-running agents that hold state, retry, plan, and operate against tools over hours or days. The SWE-Bench Verified leaderboard moved from ~12% (early 2024) to consistently above 65% (late 2025) precisely because of this shift in agent design, with frontier systems now exceeding 80% on selected slices.

The practical implication for buyers: the durable competitive advantage in AI software is no longer “we have access to GPT-4.” Everyone has access to frontier models. The durable advantage is the persistence layer — the agent harness, the eval suite, the tool integrations, the failure-recovery logic, the institutional memory of what works on the buyer’s specific data shape. This is the architecture pattern we explore in agentic AI development tool use.

A 2026 AI agency commits to building the persistence layer with the buyer, not for the buyer. It belongs to the buyer when the engagement ends. The agent definitions, the eval threshold history, the prompt registry, the tool schemas — all in the buyer’s repository, all under the buyer’s license, all migratable to a different vendor without a forklift rewrite.
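One small piece of that persistence layer, sketched under assumptions: a tool call wrapped in retry-with-backoff that records every attempt, so the harness accumulates a memory of what failed. The function and log shape are illustrative, not any particular framework's API.

```python
# Failure-recovery sketch for a persistent agent loop: retry a flaky tool
# call with exponential backoff and record every attempt. Names and the
# log format are illustrative assumptions.

import time


def call_with_recovery(tool, *args, retries=3, base_delay=0.5, log=None):
    """Run `tool`, retrying on failure; append each attempt to `log`."""
    log = log if log is not None else []
    for attempt in range(1, retries + 1):
        try:
            result = tool(*args)
            log.append(("ok", attempt))
            return result
        except Exception as exc:
            log.append(("error", attempt, str(exc)))
            if attempt == retries:
                raise  # out of retries; surface the failure to the planner
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In a real harness the log would be persisted with the agent's state, which is exactly the kind of artifact that should end the engagement in the buyer's repository.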

11. We Will Help You Replace Us

The cleanest test of an honest AI agency is whether they help the buyer hire the in-house team that will eventually replace them. The unhealthy alternative — vendor lock-in via tribal knowledge, undocumented prompts, opaque agent behavior — is the dominant failure mode of the 2018-era staff-augmentation industry, and AI agencies that import that failure mode will not survive the next contract cycle.

A 2026 AI agency commits to an explicit succession path:

  • Documentation written for an engineer joining the buyer’s team six months from now, not for the agency’s own hand-off file.
  • Pair programming hours with the buyer’s engineering team, billed at the same rate as solo work.
  • A named “first hire” recommendation: the role profile, the comp band, the interview loop, the evaluation rubric. Where useful, candidate referrals.
  • A documented exit ramp: a 30-60-90 day plan from “agency-led” to “buyer-led” for any system the agency ships.

This commitment changes the agency business model from “longest possible engagement” to “highest-leverage engagement.” The economics work because forward-deployed senior engineering judgment, applied at high leverage, is genuinely scarce — much more so than agency time. Agencies that build the succession path become the obvious choice for the next system the buyer needs, not because of lock-in, but because they have already proven they will not exploit one. The build-vs-hire trade-off is dissected in AI agency vs in-house team decision.

What Buyers Should Do With This

These eleven commitments are designed to be quotable. The recommended use is operational, not philosophical:

  1. Send this manifesto to every shortlisted vendor before a vetting call. Ask which commitments they currently make in writing, which they push back on, and why.
  2. Score each vendor on the eleven commitments. Treat any vendor scoring under 8/11 as a red flag. Most 2023-era vendors will score 3–5.
  3. Negotiate the missing commitments into the SOW. A vendor unwilling to commit on paper is unwilling to commit in practice.
  4. Re-score quarterly during the engagement. Drift from the manifesto correlates strongly with drift from the working software. The vetting framework expands in AI agency contract negotiation, and the reference-call discipline is laid out in check AI developer references.

The manifesto is not a finished document. It will be updated as the underlying technology shifts. The current version is anchored to the state of the field as of Q2 2026 — frontier models exceeding 65% on SWE-Bench Verified, agent harnesses now standard, inference costs collapsing, model lifecycles shortening. The commitments will tighten as the floor rises.

Frequently Asked Questions

What is the difference between a 2023 AI consulting firm and a 2026 AI development agency?

A 2023 AI consulting firm sold strategy decks, vendor matrices, and proof-of-concept demos at engineering rates. A 2026 AI development agency embeds engineers in the buyer’s codebase, ships production systems, owns the eval suites, and operates inside the buyer’s tooling and on-call rotation. Strategy work, where it remains useful, is grounded in shipped pilots rather than slideware. The economics inverted because solo founders with $200/month Cursor seats and Claude Code Max subscriptions can now build what 2023 agencies billed six figures to demo.

How do I tell whether an AI agency is just running token arbitrage on me?

Ask a single question: who holds the model API keys? If the agency holds them and bills a flat monthly fee that includes inference, the buyer is probably paying a markup on a usage-based cost that should be transparent. The honest model is direct billing — the buyer’s Anthropic, OpenAI, or Google account, the buyer’s bill, the agency’s job is to make that bill predictable and small. Token-arbitrage agencies will resist this structure because it removes their margin on a hidden line item.

Do I really need eval suites? Can’t I just review the demo?

A demo is a single sample point on a non-deterministic system. The buyer is purchasing a feature that will run thousands of times against inputs they cannot predict, with a model that will be silently updated every few months. An eval suite — a fixed set of inputs with expected behaviors and quality thresholds — is the only mechanism that detects regressions when the model changes or the prompt shifts. Promptfoo, LangSmith, and Anthropic’s evaluation tooling have made this practice cheap and standard. An agency that does not deliver one is delivering a 2023 artifact at a 2026 price.

How much should an AI development engagement actually cost in 2026?

It depends on scope, but a useful sanity check is that a forward-deployed senior AI engineer in 2026 runs roughly $25,000 to $45,000 per month all-in, depending on geography and seniority. A reasonable pilot that ships production code with an eval suite typically takes one to two engineers for 8 to 16 weeks. If the buyer is being quoted six-figure-per-month numbers for a strategy engagement that ends in a deck rather than a deployed system, they are paying 2023 prices for 2023 outputs.

Why is model-agnostic architecture so important?

Because the underlying models change every few months. In the eighteen months ending Q2 2026, OpenAI shipped GPT-5 and GPT-5.4, Anthropic shipped two major Claude generations and an Opus refresh, Google released Gemini 3.1 Pro, and the open-weights frontier moved from Llama 3 through Llama 4 Scout. A system hard-coded to one model becomes a forced rewrite every time the underlying model deprecates or a better one ships. A thin model-router abstraction adds maybe 200 lines of code and saves those rewrites — and lets the buyer route cheap calls to small models and expensive calls to frontier models.

Should I hire an AI agency or build an in-house team?

It depends on whether AI work is core to product strategy and whether the buyer can hire senior engineers fast enough to start. If AI is core and hiring is feasible, build in-house — the long-term economics are better. If AI is core and hiring is too slow, hire a forward-deployed agency now and let them help recruit the in-house team in parallel. If AI is not core, an agency is almost always the right call.

What is forward-deployed engineering and why does it matter?

Forward-deployed engineering is a delivery model where the agency’s engineers work inside the client’s codebase, communication channels, ticketing, and on-call rotations, rather than working from a vendor sandbox and handing artifacts over a wall. It matters because AI features fail in production for client-specific reasons — a particular data shape, a particular usage pattern, a particular regulatory constraint — and those failures are invisible from outside the client’s environment.

How do I know if an agency actually has the engineering chops, beyond their case-study page?

Ask the team to walk through a hard production failure in detail — what broke, how they detected it, what the fix was, what monitoring they added afterward. Senior engineers can do this conversationally for several recent examples. People who have only ever shipped demos cannot. Also ask to see an eval suite from a recent project and ask to read a postmortem. The presence of postmortems and eval suites is a strong positive signal; their absence is the only signal a buyer needs to keep looking.

Are AI agencies still relevant given how good Cursor and Claude Code are now?

Yes, but the value proposition has changed. Cursor and Claude Code give the buyer’s existing engineers leverage they did not have in 2024. They do not give the buyer senior engineering judgment, production operations experience, eval design discipline, or model-agnostic architecture patterns. A 2026 AI agency’s value is in the judgment that surrounds the agentic-coding tools, not in the agentic-coding tools themselves.

What is the single most important question to ask an AI agency in a vetting call?

“Show me an eval suite from your most recent shipped project, and walk me through the threshold you set with the client and why.” The answer reveals whether they understand the practice (eval-driven development), whether they actually shipped production work (not demos), and whether they treat the buyer as a partner in the spec (collaborative threshold-setting) or as someone to deliver a deck to. Three signals from one question.

Closing

The AI agency category is being reset in real time. Not by another wave of model releases — those are now expected — but by the gap between agencies that operate as engineering organizations and agencies that operate as consulting practices wearing engineering branding.

The eleven commitments above are how a buyer can tell the difference in a thirty-minute vetting call. The honest agencies will sign them, line by line, into the SOW. The dishonest ones will hedge.

We are publishing this because the asymmetry between buyers and vendors in AI procurement is wider than at any point since the early-2010s cloud-migration wave. Buyers deserve a checklist they can hold vendors to. Vendors who are doing the work deserve a way to differentiate. And the field, broadly, deserves a higher floor.

Code is the deliverable. Evals are the contract. Persistence is the moat. Sign your name.

— Arthur Wandzel, CEO, SFAI Labs

Last Updated: May 8, 2026


Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
