The AI agency portfolio is dead. Here is what founders should ask for instead.

The AI agency portfolio; the deck of logos, the redacted case-study screenshots, the eight-page PDF with a tasteful chart; is the single most useless artifact in the 2026 buying process. With AI software, the gap between a screenshot and a system has widened to absurdity. A demo that looks identical in a portfolio can hide a $40K-a-month inference bill, a 14% hallucination rate on the customer’s actual data, or a single-prompt monolith that the agency cannot evolve without rewriting from scratch. Founders keep asking for portfolios because that is what procurement asked for in 2019. Agencies keep producing them because that is what wins meetings. Both sides are pretending.

The fix is not a better portfolio; it is a different artifact set, one that proves engineering judgment instead of decorating it. The kind of artifacts a forward-deployed AI agency produces as a byproduct of doing the work; and the kind a deck-first agency cannot produce on demand. Below is the replacement set: seven artifacts a founder should ask for in the first vetting call, and what each one tells you.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why the portfolio fails as a signal

A portfolio is a curated set of outputs. AI engineering quality is almost entirely about judgment under failure modes you cannot see in outputs. The portfolio shows the screenshot of the chatbot. It does not show:

Whether the model is pinned or auto-upgrades silently in production.
Whether the eval suite caught the last regression or whether a customer did.
Whether the cost-per-conversation is $0.04 or $1.20 and whether anyone is tracking it.
Whether the system has a fallback when the primary provider has an outage, which many three major providers had in the last 12 months.
Whether the prompt is version-controlled, reviewed in PR, and rolled out behind a flag, or whether it lives in a Notion doc and a junior engineer can change it on a Friday afternoon.

You are not buying screenshots. You are buying the reasoning behind the choices that produced them. A portfolio cannot transmit reasoning. The artifacts below can.

There is also a darker reason portfolios persist: they let agencies sell the appearance of a track record they do not have. A logo on a slide does not mean the agency shipped production code at that company; it might mean a one-week prototype, a rarely-deployed POC, or a sub-contracting relationship through another firm. The decoding the AI agency case study guide covers how to pull the real engagement story out of a polished case study.

The replacement artifacts share three properties. They are dated, so you can verify recency. They are diffable, so you can read the agency’s actual reasoning. And they are awkward to fake, because faking them costs nearly as much as producing the real thing.

Artifact 1: A real eval test set, with the rationale doc

Ask: show me the eval set you shipped for your most recent client. Read me five test cases. Then show me the doc that explains why these cases and not others.

A real eval test set is a structured Promptfoo, LangSmith, or Braintrust suite (or equivalent) with 50 to 5,000 test cases, each with an input, an expected behavior, a pass/fail predicate, and ideally a tag indicating which production failure first motivated it. The rationale doc explains the threshold, who set it (the client’s domain expert, by name and role), and the tradeoff between false positives and false negatives in the customer’s business context.

What this artifact tells you:

The agency knows what eval-driven development means in 2026; the eval suite is the spec, not a vibe check after the fact.
Thresholds are negotiated with the customer’s domain experts rather than set at 90% because it sounds professional.
Adversarial and edge cases sit alongside happy paths, meaning somebody has run the system in production and learned where it breaks.

The red flags: generic “GPT vs. Claude on MMLU” benchmarks dressed up as evals; LLM-as-judge with no ground-truth examples; or “we test it manually.” If the agency has nothing to show here, end the call. The evaluate AI developer portfolios guide goes deeper on what a credible eval suite looks like.

Artifact 2: Live system trace screenshots from a deployed product

Ask: open your tracing tool. Show me the last 20 production requests on a real client system. I want to see input, retrieved context, tool calls, intermediate model outputs, latency, and total cost.

A trace is a structured per-request record from a tool like LangSmith, Helicone, Langfuse, OpenLLMetry, or a custom OpenTelemetry pipeline. Trace screenshots show that the agency has observability; the discipline of reading what their system did, not what they hoped it would do.

What this artifact tells you:

Whether the system was built to be debuggable in production, or to demo well and pray it does not break.
Whether requests have measured latency budgets, or whether the agency ships 17-second response times because they rarely checked.
Whether tracing covers retrieval, tool calls, and downstream side effects, or only the top-level model call; making everything between input and output a black box.

The red-flag answer is “we use the OpenAI dashboard” or “we have logs.” Neither is tracing. An AI engineering team without structured per-request observability is one production incident away from a multi-day debugging spiral with nothing to look at.

Artifact 3: Cost-per-action telemetry over time

Ask: show me a chart of cost per request, cost per session, or cost per business outcome on a real client system over the last 90 days.

Cost-per-action is the unit economic that decides whether your AI feature is a margin-positive product or a slow-burn fire. It is the dollar amount the system spends on inference, retrieval, and tool calls to produce one unit of customer value; one resolved support ticket, one drafted contract, one summarized meeting, one whatever-your-product-does. A real agency tracks this on a chart, alerts on regressions, and knows the levers to pull (model choice, context size, caching, batching, distillation) when the chart goes the wrong way.

What this artifact tells you:

Whether the agency views cost as a first-class engineering concern or as the customer’s problem after handoff.
Whether they have ever optimized a real system and can describe the before/after; “we cut cost-per-resolved-ticket from $0.34 to $0.06 over six weeks by routing intent classification to a smaller model and caching retrieval at the embedding layer.”
Whether their cost monitoring is connected to their eval monitoring, so a 60% cost reduction that lost 4 quality points triggers a review.

The red flag here is the agency that does not understand the question, or that quotes a project cost (engineering hours) rather than a runtime cost (per-action inference). Those are different numbers and the conflation is usually deliberate.

Artifact 4: A post-mortem with a documented regression

Ask: send me a post-mortem from the last 90 days. Specifically one where an eval regression made it to production and you had to roll back or hotfix.

A post-mortem is a written, dated, blame-free document describing an incident, its root cause, its detection mechanism, its remediation, and the structural change made afterward to prevent recurrence. A regression post-mortem is the most diagnostic kind; it means the team had a baseline, the baseline drifted, the drift was caught, and somebody wrote down what they learned. Post-mortems are a forcing function for engineering culture. Agencies that operate production systems write them. Agencies that ship demos do not.

What this artifact tells you:

Whether the agency runs production-grade software or whether the customer is the QA team.
Whether the team has a learning loop: do action items have owners, dates, and follow-up status, or do they evaporate by the next sprint?
Whether the failure modes the agency has encountered are the kind your project will encounter; silent model upgrades, embedding-model swaps that broke retrieval, prompt edits that cascaded across unrelated features, tool-call loops that ran up a five-figure bill in a weekend.

“We have not had any regressions” means they have either not shipped enough, not detected what they shipped, or not been honest about what they detected.

Artifact 5: The actual PR diff that shipped a fix

Ask: screen-share the pull request that fixed the regression in the post-mortem above. I want to see the diff, the review comments, and the eval results in the CI pipeline.

This is the artifact the deck-first agency physically cannot produce. A PR diff is the raw artifact of engineering work; the file-by-file change that fixed the problem, with reviewer comments visible, with CI checks visible, with the eval delta posted as a bot comment showing what improved or regressed. It is the most honest 90-second view of how an engineering team works that you will ever get on a sales call.

What this artifact tells you:

Whether engineers review each other’s prompt changes the same way they review code changes.
Whether the eval suite is part of the merge gate or a polite afterthought.
Whether the reasoning behind the fix is captured in commit messages and PR description, so a new engineer joining in month seven can read the history.
Whether the team writes small, focused PRs that ship continuously, or massive end-of-sprint dumps that nobody can review.

Acceptable substitutes when NDAs apply: a redacted PR description, a synthetic PR built from a public repo using the same patterns, or an internal tool the agency built for itself. The unacceptable substitute is a deck. If the agency cannot screen-share any PR; not even one; they do not have the engineering practice they claim to have.

Artifact 6: The prompt-registry version history

Ask: show me the version history of your most-used prompt on a live client system. I want to see who changed what, when, and why, and which version is currently in production.

In 2026, prompts are code. They live in a registry; a database, a Git-tracked repo, a tool like PromptLayer or Pezzo, or a homegrown YAML setup; that versions them, attributes changes to humans, ties versions to eval runs, and supports rollback. The version history tells you whether the agency treats prompts as engineering artifacts or as Slack messages someone copied into the codebase.

What this artifact tells you:

Whether prompt changes go through review or whether anyone with access can edit production behavior in three keystrokes.
Whether evaluations gate prompt promotions, or whether prompts ship the way blog posts do.
Whether the team can A/B test prompts and roll back instantly when a new version regresses, or whether “rolling back” means digging through Slack to find the previous wording.

This is the artifact that catches the agency that hired a great prompt engineer in 2024 and rarely industrialized the practice. If the prompts they ship live in Notion, in .txt files, or in the model SDK call site with no version control, most prompt change is an unobservable production incident waiting to happen.

Artifact 7: The architecture decision records

Ask: send me three ADRs from the most recent project. Specifically the ones for model choice, retrieval design, and the eval framework.

An Architecture Decision Record is a short markdown file; usually one to three pages; that captures a single architectural decision, the alternatives considered, and the reasoning for picking one over the others. A team that writes ADRs is a team that thinks in tradeoffs. A team that does not is a team that picks defaults and hopes for the best.

What this artifact tells you:

Whether the agency considered multiple model providers, embedding strategies, and retrieval architectures, or whether everything is “we are a Claude shop” by default.
Whether the reasoning is anchored in the customer’s specific constraints; latency budget, data residency, regulatory regime, cost ceiling; or whether it reads like a generic best-practices essay.
Whether the agency has a memory: do new ADRs reference older ones, build on prior decisions, or contradict them with no explanation?

ADRs are the closest thing to a transcript of engineering judgment that exists as a written artifact. Five minutes reading one tells you more about how an agency thinks than five hours of pitch decks.

How to run the request

Send the seven-artifact request before the first scheduled call, in writing, with a 48-hour window. Tell the agency you will spend 60 minutes on the call walking through what they sent, which artifacts can be redacted, and that you would rather see four real artifacts than seven curated ones. Then watch what happens in the 48 hours.

The agency that will be a good partner replies within the day with a structured response: which artifacts they can share at full fidelity, which require redaction, which they cannot share but can substitute, and a meeting agenda that walks through them in priority order. They send a Loom of the trace tool the day before so the call can move faster.

The agency that will not be a good partner does one of three things. They negotiate the request down (“can we just send the case study deck?”). They miss the deadline. Or they send a polished response that looks comprehensive on first read but, on second read, contains zero of the seven artifacts in their actual form; only descriptions, summaries, and screenshots of dashboards. You have your decision before the call starts.

For founders running a fuller vetting process, this artifact set pairs naturally with a structured reference call to previous customers; the artifacts tell you what the agency says it built, the reference call tells you whether the customer agrees.

What replaces the portfolio is not less; it is more honest

The portfolio is dead because the format is structurally incapable of carrying the signal that matters in AI engineering. Logos do not transmit eval discipline. Screenshots do not transmit cost ownership. Case-study PDFs do not transmit production observability. Founders who keep asking for portfolios get the agencies that produce portfolios, a sub-population correlated with marketing investment, not engineering investment.

The seven artifacts shift the burden of proof from the buyer’s intuition to the seller’s evidence. They are awkward to produce on demand if the agency does not already produce them as a byproduct of the work. The asymmetry of effort is the entire point; it is what turns a 90-minute vetting call into a falsifiable test.

Stop asking for the deck. Ask for the diff. Ask for the trace. Ask for the eval set, the cost chart, the post-mortem, the prompt history, the ADRs. The agencies that can show you those will be the ones still building production AI software in five years. The agencies that cannot will keep selling portfolios; to somebody else.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has reviewed more than 200 AI engagement artifact requests on behalf of founders and CTOs in the last 24 months.

Frequently Asked Questions

Why is the AI agency portfolio a weak signal in 2026?

A portfolio is a curated set of outputs, but AI engineering quality is almost entirely about judgment under failure modes that outputs cannot show. The portfolio shows the chatbot screenshot. It does not show whether the model is pinned, whether the eval suite caught the last regression, whether cost-per-action is $0.04 or $1.20, whether the system has a provider fallback, or whether the prompt is version-controlled. Founders are not buying screenshots; they are buying the reasoning behind the choices that produced them, and a portfolio cannot transmit reasoning.

What artifacts should I ask an AI agency for instead of a portfolio?

Seven concrete artifacts: a real eval test set with the rationale doc, live system trace screenshots from a deployed product, cost-per-action telemetry over time, a post-mortem with a documented regression, the actual PR diff that shipped a fix, the prompt-registry version history, and three architecture decision records from the most recent project. Each artifact is dated, diffable, and awkward to fake; which is the entire point of asking for them.

What does a real eval test set look like?

A real eval test set is a structured Promptfoo, LangSmith, or Braintrust suite (or equivalent) with 50 to 5,000 test cases. Each case has an input, an expected behavior, a pass/fail predicate, and ideally a tag indicating which production failure first motivated it. It comes with a one-page rationale doc explaining the threshold, who set it (the client’s domain expert by name and role), and the tradeoff between false positives and false negatives. Generic ‘GPT vs. Claude on MMLU’ benchmarks dressed up as evals do not count.

Why ask for cost-per-action telemetry?

Cost-per-action is the unit economic that decides whether an AI feature is a margin-positive product or a slow-burn fire. It is the dollar amount the system spends on inference, retrieval, and tool calls to produce one unit of customer value; one resolved support ticket, one drafted contract, one summarized meeting. A real agency tracks this on a chart, alerts on regressions, and knows the levers (model choice, context size, caching, batching, distillation). An agency that does not understand the question, or quotes a project cost rather than a runtime cost, is signalling that cost is the customer’s problem after handoff.

Why is the PR diff the artifact deck-first agencies cannot produce?

A PR diff is the raw artifact of engineering work; the file-by-file change with reviewer comments, CI checks, and an eval-delta bot comment showing what improved or regressed. It is the most honest 90-second view of how an engineering team works. A deck-first agency cannot produce a PR diff because there is no PR; the work happens in notebooks, in Notion, or directly on production. Acceptable substitutes when NDAs apply are redacted PR descriptions, synthetic PRs from public repos, or a PR from an internal tool the agency built for itself. The unacceptable substitute is a deck.

What is a prompt registry and why does it matter?

In 2026, prompts are code. A prompt registry is a database, Git-tracked repo, or tool like PromptLayer or Pezzo that versions prompts, attributes changes to humans, ties versions to eval runs, and supports rollback. The version history shows whether prompt changes go through review, whether evals gate prompt promotions, and whether the team can A/B test prompts and roll back instantly. Agencies whose prompts live in Notion docs or.txt files alongside the SDK call site are running most prompt change as an unobservable production incident waiting to happen.

What is an Architecture Decision Record and why ask for one?

An Architecture Decision Record (ADR) is a one-to-three-page markdown file capturing a single architectural decision, the alternatives considered, and the reasoning for picking one over the others. Ask for ADRs on model choice, retrieval design, and the eval framework. They show whether the agency considered multiple providers and architectures or defaulted to ‘we are a Claude shop,’ whether the reasoning is anchored in the customer’s specific constraints; latency budget, data residency, cost ceiling; and whether the agency has memory: do new ADRs build on prior ones or contradict them with no explanation? ADRs are the closest thing to a written transcript of engineering judgment.

How should I send the seven-artifact request?

Send the request in writing before the first scheduled call, with a 48-hour window. Tell the agency you will spend 60 minutes on the call walking through what they sent, that artifacts can be redacted, and that you would rather see four real artifacts than seven curated ones. The good-partner agency replies within a day with a structured response: which artifacts they share at full fidelity, which require redaction, which they substitute, and a meeting agenda. The not-good-partner agency negotiates the request down (‘can we just send the case study deck?’), misses the deadline, or sends a polished response containing zero of the artifacts in their actual form.

Are these artifacts realistic to ask for at the vetting stage?

Yes. Most one of the seven artifacts is produced as a byproduct of running a production AI engineering practice; they are not ceremonial outputs, they are the working files the engineering team uses most day. Asking for them at the vetting stage is exactly the right time, because the asymmetry of effort is the entire point: agencies that already produce these artifacts can share them in a few hours of redaction work, agencies that do not produce them cannot manufacture them in 48 hours without lying. The request is calibrated to be cheap for real engineering teams and impossible for sales-led ones.

The AI agency portfolio is dead. Here is what founders should ask for instead.

Decision Scope

Why the portfolio fails as a signal

Artifact 1: A real eval test set, with the rationale doc

Artifact 2: Live system trace screenshots from a deployed product

Artifact 3: Cost-per-action telemetry over time

Artifact 4: A post-mortem with a documented regression

Artifact 5: The actual PR diff that shipped a fix

Artifact 6: The prompt-registry version history

Artifact 7: The architecture decision records

How to run the request

What replaces the portfolio is not less; it is more honest

Frequently Asked Questions

Why is the AI agency portfolio a weak signal in 2026?

What artifacts should I ask an AI agency for instead of a portfolio?

What does a real eval test set look like?

Why ask for cost-per-action telemetry?

Why is the PR diff the artifact deck-first agencies cannot produce?

What is a prompt registry and why does it matter?

What is an Architecture Decision Record and why ask for one?

How should I send the seven-artifact request?

Are these artifacts realistic to ask for at the vetting stage?

See how companies like yours are using AI

Related articles

The 10x Developer Used to Be a Unicorn — Now We're Approaching the 1000x Paradigm

A field guide to evaluating an AI agency in under 90 minutes

Agentic AI Development: Tool Use and Function Calling

Where ideas become AI products

Company

General

Case Studies

Services

Resources