
The AI Agency Capability Matrix: What to Actually Verify Before Signing

The pitch deck is not the agency. Every AI agency selling forward-deployed, eval-driven, production-grade work in 2026 will tell you the same story, and a meaningful fraction of them will tell it without the underlying capability to deliver it. The buyer’s problem is not detecting whether the agency knows the right vocabulary; the vocabulary is now uniform. The buyer’s problem is verifying which dimensions of capability are real and which are performed for sales. The instrument that does that is the capability matrix.

This is a spoke to the AI agency manifesto. The manifesto declares what an AI development partner should be. This piece operationalizes the verification: seven axes, the artifacts each one demands before signing, the failure tells that surface when the artifact is missing, and a 0-3 scoring scale. The output is a single grid the buyer can hold up against any candidate agency in the same way a CFO holds a deal up against unit economics.

Why a matrix instead of a checklist

A checklist asks whether a capability exists. A matrix asks how strongly it exists, in what artifact form, and where it breaks down. The difference matters because the failure mode of the AI agency market in 2026 is not absence of capability; it is uneven capability, where one strong axis carries the marketing and three weak axes carry the production risk.

A matrix forces the buyer to look at all seven axes simultaneously and refuse to trade one off against another. An agency with a perfect eval suite and a hollow observability stack will degrade in production within ninety days; the matrix surfaces the imbalance the way a checklist cannot. The companion framing for filtering is the AI agency trust ladder, which screens out resellers; the matrix is for the agencies that pass that screen.

The other reason for a matrix: artifacts compose. An eval report cites observability traces. A cost breakdown references eval thresholds. A senior engineer’s prior work shows in retrieval choices. A buyer scoring all seven axes against artifacts pulled from the same engagement gets a coherent picture; a buyer asking seven separate questions in a sales call gets seven separate sales answers.

The seven axes

The matrix is seven axes scored 0-3 each, sum out of 21:

  1. Retrieval engineering: does the agency build retrieval that grounds?
  2. Agent and tool-use design: does the agency reason about agent boundaries, not just chains?
  3. Eval discipline: does the agency treat evals as a contract, not a slide?
  4. Production observability: can the agency tell what is happening after deploy?
  5. Cost engineering: does the agency know the unit economics of what it ships?
  6. Security and safety: has the agency thought past the OWASP one-pager?
  7. Senior engineering bench: are the people on the proposal the people on the work?

The scoring scale is uniform across axes: 0 is no capability, 1 is partial capability with material gaps, 2 is operating capability with artifacts, 3 is institutional capability with versioned, repeatable artifacts. A passing engagement requires a 2 or 3 on every axis with a sum of at least 16. Any 0 disqualifies. We expand this approach in verify AI agency technical expertise, where the same logic applies axis-by-axis to the engineering interview.

Axis 1: Retrieval engineering

What the capability covers: chunking and indexing strategy, hybrid retrieval (dense + lexical), reranking, evaluation of retrieval recall and precision, handling of structured data, citation and grounding patterns.

Proof artifact: a redacted retrieval evaluation report from a past engagement. The report should name the corpus, the query distribution, the chunking decisions, the recall@k and precision@k numbers before and after intervention, and the failure modes that drove the decisions. One paragraph of the report should be a debrief of a specific retrieval failure, for example: “in week three, we discovered the embedder was conflating SKUs across product lines; we fixed it by adding a metadata filter and reindexing.”

Failure tells: vendor talks about “RAG” without naming chunking strategy; cannot describe a retrieval failure they have personally diagnosed; uses recall@k and precision@k as marketing rather than as instrumentation; conflates “we use vectors” with “we built a retrieval system.”

Scoring:

  • 0: no retrieval system in production; only document upload.
  • 1: basic dense retrieval with no rerank, no recall measurement, no failure analysis.
  • 2: hybrid retrieval, reranking, eval reports for at least one engagement.
  • 3: versioned retrieval evaluation methodology applied across multiple engagements with named failure modes and corrections.
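
The recall@k and precision@k figures a retrieval evaluation report should carry are cheap to recompute, which is part of what makes them instrumentation rather than marketing. A minimal sketch, assuming each query comes with a rank-ordered list of retrieved document IDs and a hand-labeled set of relevant IDs; the function names and sample IDs are illustrative, not drawn from any engagement:

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Hypothetical single-query sample: ranked results and labeled ground truth.
sample_retrieved = ["d3", "d7", "d1", "d9", "d4"]
sample_relevant = {"d1", "d2", "d3"}
```

For this sample, both metrics come to 2/3 at k=3, and precision drops to 2/5 at k=5 while recall stays at 2/3. A real report would average these over the full query distribution and show the numbers before and after each intervention.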

Axis 2: Agent and tool-use design

What the capability covers: when to use a single-shot prompt vs. a multi-step agent, tool definition and schema design, error handling between tools, recursion and budget controls, when to refuse to build an agent at all.

Proof artifact: an architecture document from a past engagement that names a use case where the agency declined to build an agent and instead shipped a deterministic pipeline. The document should explain why agents would have been the wrong choice (error compounding, cost variance, observability cost, audit requirements) and what the deterministic alternative looked like.

Failure tells: every problem solved with an agent regardless of fit; tool-use schemas that are JSON Schema in name only and 200-character free-text in practice; no budget caps on agent runs; “we use AutoGen / CrewAI / LangGraph” stated as architecture rather than dependency.

Scoring:

  • 0: agency builds chains and calls them agents.
  • 1: uses a framework correctly but cannot defend the framework choice.
  • 2: has shipped multi-step agents in production with budget controls and named failure modes.
  • 3: has refused to ship an agent and shipped a deterministic alternative when that was the correct call.
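
Budget controls on agent runs can be very small in code; what matters is that they exist and fail loudly. A minimal sketch of a per-run cap on steps and spend, framework-independent; `RunBudget`, `BudgetExceeded`, and the cap values are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an agent run exhausts its step or cost budget."""


@dataclass
class RunBudget:
    max_steps: int
    max_cost_usd: float
    steps_used: int = 0
    cost_used_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one model or tool call; raise once either cap is exceeded."""
        self.steps_used += 1
        self.cost_used_usd += cost_usd
        if self.steps_used > self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} exceeded")
        if self.cost_used_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of ${self.max_cost_usd:.2f} exceeded")
```

The agent loop would call `budget.charge(cost)` after every model or tool call and treat `BudgetExceeded` as a terminal state that is logged and surfaced, not retried. An agency at a score of 2 should be able to show the equivalent of this in its own codebase and name the runs it has stopped.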

Axis 3: Eval discipline

What the capability covers: producing eval suites that match the production input distribution, gating deploys on eval thresholds, versioning evals as code, surfacing regressions to humans, distinguishing automatic from human-judged metrics.

Proof artifact: a redacted eval suite from a past engagement, with at least 100 evaluation cases spanning happy-path, adversarial, edge-case, and production-distribution samples. The suite should be paired with a CI/CD log showing at least one deploy that was blocked by an eval threshold miss. The combination, the suite plus the gating evidence, is the artifact; neither counts alone.

Failure tells: eval suite of 20 demo cases; no version control on the eval suite; no CI gate on eval thresholds; eval scoring entirely automatic with no human-judged tier; no record of an eval-driven deploy block.

Scoring:

  • 0: no eval suite; quality assessed by demo.
  • 1: small eval suite, no gating, no versioning.
  • 2: versioned eval suite with CI gating and at least one block on record.
  • 3: eval suite versioned, gated, distribution-matched, with a documented re-evaluation cadence after model upgrades.
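
Gating a deploy on eval thresholds reduces to a comparison that a CI step can run. A minimal sketch, assuming the eval run has already produced a pass rate per tier; the tier names follow the categories above, and the threshold values are placeholders rather than recommendations:

```python
from typing import Dict, List

# Hypothetical thresholds; real values would live in the versioned eval config.
THRESHOLDS: Dict[str, float] = {
    "happy_path": 0.95,
    "adversarial": 0.80,
    "edge_case": 0.85,
}


def gate(pass_rates: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per failed tier; an empty list means the deploy may proceed."""
    failures = []
    for tier, minimum in thresholds.items():
        observed = pass_rates.get(tier, 0.0)  # a tier with no results counts as a miss
        if observed < minimum:
            failures.append(f"{tier}: {observed:.2f} < {minimum:.2f}")
    return failures
```

A CI job would exit non-zero whenever `gate(...)` returns a non-empty list, and the log of that exit is the deploy block the buyer should ask to see. Treating a missing tier as a failure is a design choice in this sketch: it prevents a silently skipped eval tier from passing the gate.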

Axis 4: Production observability

What the capability covers: tracing every model call with input, output, latency, cost, and routing decisions; alerting on quality regression in production; surfacing low-confidence outputs to human review; replaying production traces against new model versions before promotion.

Proof artifact: a redacted dashboard screenshot or read-only access to a sandboxed observability environment showing trace volume, cost-per-trace distribution, latency percentiles, and at least one annotated quality regression that was caught and triaged. The annotation matters more than the dashboard; a screenshot without a story is a screenshot.

Failure tells: “we use LangSmith / Helicone / Langfuse / Arize” stated without describing what is monitored or alerted on; no replay capability on production traces; quality regressions detected by user complaint rather than instrumentation; cost dashboards that aggregate without breaking down by route.

Scoring:

  • 0: no production tracing; logs only.
  • 1: tracing in place but no alerting, no replay.
  • 2: tracing, alerting, and at least one regression caught by instrumentation.
  • 3: tracing, alerting, replay, and a documented re-evaluation playbook for model upgrades.
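
The dashboard numbers named above (cost broken down by route, latency percentiles) are aggregations over per-call trace records. A minimal sketch, assuming a simplified trace with only a route label, a latency, and a cost; the `Trace` shape is hypothetical and far thinner than what a real tracing tool records:

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence


@dataclass
class Trace:
    route: str        # which model or pipeline handled the call
    latency_ms: float
    cost_usd: float


def cost_by_route(traces: Iterable[Trace]) -> Dict[str, float]:
    """Total spend per route, so an aggregate cost figure can be broken down."""
    totals: Dict[str, float] = defaultdict(float)
    for trace in traces:
        totals[trace.route] += trace.cost_usd
    return dict(totals)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

A cost dashboard that cannot be reproduced by something like `cost_by_route` is the "aggregates without breaking down by route" failure tell; the percentile helper is what sits behind a p50 or p95 latency panel.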

Axis 5: Cost engineering

What the capability covers: unit-cost modeling per query / per resolved ticket / per generated artifact; routing between models on cost-quality tradeoffs; caching strategy; batching strategy; pass-through inference billing on buyer-owned keys.

Proof artifact: a unit-economics breakdown for a past engagement showing input tokens, output tokens, embedding calls, retrieval-system overhead, and the proportion of cost reduced by caching, batching, or routing. The breakdown should be paired with a billing record matching the underlying inference invoices, redacted but verifiable in structure.

Failure tells: agency bills inference through its own keys and does not pass through underlying numbers; cannot quote a unit cost for a typical query; “we use cheaper models” without naming the routing logic; no caching strategy or treats caching as an afterthought.

Scoring:

  • 0: no unit economics; flat hourly billing only.
  • 1: unit cost quoted but not broken down; inference behind a black box.
  • 2: unit economics broken down with routing, caching, and pass-through billing on buyer keys.
  • 3: versioned cost playbook applied across engagements, with a documented re-optimization cadence.
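
A unit-economics breakdown of the kind described above can be expressed as a short formula over token counts. A minimal sketch under stated assumptions: prices are per million tokens, the figures in the usage note are placeholders rather than any vendor's real rates, and a cache hit is assumed to skip the generation call while still paying for the query embedding and retrieval overhead:

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per million tokens; values are supplied by the caller, not real vendor prices."""
    input_per_mtok: float
    output_per_mtok: float
    embedding_per_mtok: float


def cost_per_query(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    embedding_tokens: int,
    retrieval_overhead_usd: float = 0.0,
    cache_hit_rate: float = 0.0,
) -> float:
    """Expected cost of one query, given the share of queries served from cache."""
    generation = (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    embedding = embedding_tokens * pricing.embedding_per_mtok / 1_000_000
    # Assumption: only the generation call is avoided on a cache hit.
    return (1 - cache_hit_rate) * generation + embedding + retrieval_overhead_usd
```

With placeholder prices of 3.00 (input), 15.00 (output), and 0.10 (embedding) per million tokens, a query of 2,000 input tokens, 500 output tokens, 50 embedding tokens, and 0.0005 of retrieval overhead costs about 0.0140 uncached and about 0.0100 at a 30% cache hit rate. The difference between those two numbers is the "proportion of cost reduced by caching" line the breakdown should show.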

Axis 6: Security and safety

What the capability covers: prompt injection defenses, data-handling boundaries, PII redaction, output filtering, jailbreak monitoring, secure tool definitions, regulatory framing (GDPR, HIPAA, finance, the EU AI Act).

Proof artifact: a written security review from a past engagement covering at least the OWASP LLM Top 10, plus a section on the engagement-specific threat model. The review should include at least one named issue that was caught in pre-production and remediated, with the remediation visible in the eval suite.

Failure tells: security treated as compliance copy on the proposal; no prompt-injection test cases in the eval suite; PII handling described in the abstract without naming the redaction layer; no incident playbook for a jailbreak in production.

Scoring:

  • 0: security mentioned only as compliance language.
  • 1: security review exists but covers generic threats; no engagement-specific threat model.
  • 2: engagement-specific threat model with eval-suite coverage of injection / PII / jailbreak.
  • 3: versioned security methodology applied across engagements, with documented response playbooks.

Axis 7: Senior engineering bench

What the capability covers: named senior engineers on the proposal with verifiable history; continuity of those engineers through the engagement; senior-to-junior ratio at or above 1:2 on production work; senior pair-programming and review cadence.

Proof artifact: three things in combination: names of the assigned engineers with public footprints (GitHub, blog, conference talks, prior employers); a recorded video of one of those engineers leading a customer postmortem or architecture discussion; and a written SOW commitment that the named engineers hold continuity for the engagement and rotation requires buyer consent.

Failure tells: senior engineers named on the proposal but not on the calendar; “senior” used to mean “five years experience”; no continuity language in the SOW; the engineer in the pairing session changes more than twice over an eight-week engagement.

Scoring:

  • 0: engineers not named on the proposal; assignment by ticket queue.
  • 1: senior named but with thin public footprint; no continuity language.
  • 2: senior named with verifiable history; continuity language in SOW; documented ratio above 1:2.
  • 3: verified history; continuity in SOW; ratio above 2:1; recorded customer-facing artifact.

The reference-call companion to this axis is check AI developer references, which extends the verification past the proposal into the prior-client conversation.

How to score

The matrix is scored as a single grid: seven axes, 0-3 each, sum out of 21. The decision rules:

  • Sum ≥ 18, no axis below 2. Operator-grade. Proceed to commercial terms; this is the rare case.
  • Sum 16-17, no axis below 2. Operating capability with named gaps. Negotiate the gaps into the SOW with measurable commitments and milestone gates.
  • Sum 14-15, with one axis at 1 on observability, evals, or senior bench. Red flag. The gap is in a cascading axis; do not proceed without remediation pre-signing.
  • Any axis at 0. Disqualifying. Production AI work cannot be delivered with a missing capability; the absence is not survivable in a forward-deployed engagement.
  • Sum below 14. The agency is below the operator threshold for 2026 work. Filter at the trust-ladder stage instead.
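
The decision rules above can be written down as a small function, which also makes their edges visible. A minimal sketch; the axis keys and band labels are shorthand chosen here, and the final fallback is an addition of this sketch for score combinations the rules as written do not name (for example, seven 2s summing to 14):

```python
from typing import Dict

AXES = ("retrieval", "agents", "evals", "observability", "cost", "security", "senior_bench")
CASCADING = {"observability", "evals", "senior_bench"}


def classify(scores: Dict[str, int]) -> str:
    """Map seven 0-3 axis scores to one of the decision bands."""
    if set(scores) != set(AXES) or any(s not in (0, 1, 2, 3) for s in scores.values()):
        raise ValueError("expected a 0-3 score for each of the seven axes")
    total = sum(scores.values())
    lowest = min(scores.values())
    if lowest == 0:
        return "disqualified"
    if total < 14:
        return "below operator threshold"
    if lowest >= 2 and total >= 18:
        return "operator-grade"
    if lowest >= 2 and total >= 16:
        return "operating capability with named gaps"
    ones = [axis for axis, s in scores.items() if s == 1]
    if total <= 15 and len(ones) == 1 and ones[0] in CASCADING:
        return "red flag"
    # Fallback added by this sketch: the stated rules do not cover this combination.
    return "not covered by the stated rules"
```

Checking for a zero first mirrors the rule that any 0 disqualifies regardless of the sum. Combinations that land in the fallback need a judgment call from the buyer rather than a mechanical answer.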

The matrix takes about four to six hours of buyer time across two to three weeks. The agency does the heavy lifting: assembling redacted artifacts and walking the buyer’s CTO through each one in a 45-minute pairing block. The buyer’s role is reviewing the artifacts and asking follow-up questions, not re-deriving the underlying analysis. Agencies that resist the artifact request, plead client confidentiality without offering redaction, or take more than three weeks to produce the materials are signaling either no library or a dishonest one. The verification phase is itself a capability test.

Frequently asked questions

What is the AI agency capability matrix and why seven axes?

The capability matrix is a seven-axis framework for verifying an AI agency before signing: retrieval engineering, agent and tool-use design, eval discipline, production observability, cost engineering, security and safety, and senior engineering bench. The seven axes are chosen because they are the smallest set that, taken together, separates operators from resellers in 2026. Fewer axes (the trust-ladder framing of three to six signals) catch most resellers but miss agencies that look operator-grade in pitch and fail on one specific dimension, typically observability or cost. Each axis is scored 0-3 against artifacts the agency must hand over before signing: not slides, not case studies, the actual artifact.

Why score artifacts instead of asking case-study questions?

Case-study questions are answerable by anyone with a content team. Artifact requests are answerable only by agencies that have done the work. Asking “tell me about a time you debugged a retrieval failure” produces a polished narrative; asking “show me a redacted retrieval evaluation report from a past engagement, with the trace IDs that surfaced the failure mode” produces either a real document or an embarrassed pause. The matrix substitutes artifact-based scoring for narrative-based scoring across all seven axes, which is the only way to keep the conversation honest at the pre-signing stage.

How long does the seven-axis verification take?

About four to six hours of buyer time across two to three weeks. The agency does the heavy lifting: assembling redacted artifacts, redacting client-confidential material, and walking the buyer’s CTO through each one in a 45-minute pairing block. The buyer’s role is reviewing the artifacts and asking follow-up questions; the buyer should not need to re-derive any of the underlying analysis. Agencies that resist the artifact request, plead client confidentiality without offering redaction, or take more than three weeks to produce the materials are signaling either no library or a dishonest one. The verification phase is itself a capability test.

What is a passing score on the matrix?

A 2 or 3 on every axis with no zeros, and a sum of at least 16 out of 21. A score of 1 on any single axis is a yellow flag that becomes a red flag if it is on observability, evals, or senior bench, the three axes where deficiency cascades into all the others. A score of 0 on any axis disqualifies the agency for production AI work; the absence of one capability is not survivable in a forward-deployed engagement. Agencies that score 18 or higher across the matrix are rare and command pricing accordingly.

Which axis matters most if I have to compress the evaluation?

Eval discipline, by a wide margin. An agency with a strong eval system can usually compensate for weakness on retrieval, agent design, or even cost engineering, because the eval suite surfaces the weakness in production and forces correction. An agency with a weak eval system will degrade on every other axis over time, because nothing tells them they have degraded. If the buyer has only one hour to evaluate an agency, the right hour is the one spent reviewing the agency’s eval reports from a past engagement.

How do I verify the senior engineering bench?

Three artifacts in combination. First, named individuals on the proposal with public footprints (GitHub history, blog posts, conference talks, prior employers) that match the seniority claim. Second, a recorded video of one of those engineers leading a customer postmortem or architecture discussion, where the buyer can verify language, judgment, and pacing under disagreement. Third, a written commitment in the SOW that the named engineers will hold continuity through the engagement and any rotation requires written buyer consent. All three are verifiable before signing; agencies that resist any of the three are protecting a staffing model the buyer would not approve of.

What does a zero score on cost engineering look like?

A zero on cost engineering is an agency that cannot produce a unit-economics breakdown for a past engagement: cost per query, cost per resolved support ticket, or cost per generated draft, depending on the use case. The breakdown should include input tokens, output tokens, embedding calls, retrieval-system overhead, and the proportion of cost saved by caching, batching, or model routing. An agency that bills inference through its own keys and does not pass through the underlying numbers is also at zero, because the buyer cannot verify the unit economics. The capability either exists with artifacts or it does not.

Can a smaller agency score 3 on all seven axes?

Yes, and the smaller agencies that hit 3 on all seven are typically the most desirable counterparties. A 12-person operator-grade boutique with disciplined artifact production usually outperforms a 200-engineer body shop because the disciplines are sustained across every engagement instead of being theatrical for sales. Size becomes a problem at the boundary where the agency cannot apply senior judgment on every engagement; below that boundary, smallness is the operating advantage. The matrix is size-agnostic by design.

What if an agency scores 3 on most axes but 1 on observability?

Treat it as a red flag, not a yellow flag. Observability is the axis that tells the agency whether anything else is working. A 1 on observability means the agency is running production AI workloads partially blind, and every other strong score becomes brittle: a strong eval suite is hollow if the agency cannot tell whether the eval is firing in production, and a strong cost-engineering capability collapses if the agency cannot break down spend by route. The buyer can negotiate observability uplift into the SOW, but should not pay senior pricing on the assumption that observability will appear later.

How does the matrix relate to the trust ladder and the manifesto?

The manifesto defines what an AI agency should be in 2026; the trust ladder names six signals that separate operators from resellers; the capability matrix is the operational instrument the buyer uses to verify those signals before signing. The manifesto is normative, the trust ladder is diagnostic, the matrix is procedural. A buyer reads the manifesto to understand the standard, applies the trust ladder to filter the candidate set, and runs the matrix on the remaining two or three agencies to make the final decision. They are designed to compose.

Key takeaways

  • The capability matrix is a seven-axis framework (retrieval, agents, evals, observability, cost, security, senior bench) scored 0-3 each, with a passing sum of at least 16 out of 21 and no zeros.
  • Each axis is verified by an artifact, not a narrative. The artifact is the redacted instrument the agency would have produced anyway in a real engagement; the absence of the artifact is the absence of the capability.
  • Eval discipline is the single most leveraged axis; observability is the most cascading. A weak score on either undermines every other strong score.
  • The matrix is size-agnostic. A 12-person boutique can score 21 of 21; a 200-engineer body shop frequently does not.
  • The verification phase itself (artifact production, redaction, walkthrough) is a capability test. Resistance, delay, or confidentiality theatre during this phase is signal, not friction.

Last Updated: May 30, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
