
The AI Capability Ladder: What Should Always Be Built, What Should Always Be Bought

Most AI capabilities are not edge cases. They have default verbs in 2026 that hold across the overwhelming majority of organizations, and those defaults can be written down. The capabilities that should usually be bought are the commodity rails: foundation models, vector indices, observability backends, agent frameworks. The capabilities that should usually be built are the moat layers: retrieval logic, agent orchestration, prompt design, eval suites, custom rerankers tuned to the workload. The capabilities that should usually be hired are the judgment positions: eval leadership, AI architecture, threshold-locking authority, model-selection expertise. The remaining capabilities sit on a contested middle that shifts year over year as tooling matures and conditions change. This piece names the twelve-rung ladder that captures the usually-build, usually-buy, and usually-hire defaults; identifies the rungs that have shifted between 2024 and 2026; and flags the boundary cases that organizations need to score case-by-case rather than defaulting to the ladder.

This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s eighth principle is that the default sourcing shape is compose: buy the rails, build the moat, hire the judgment. This piece operationalizes that default as a specific, named ladder of capabilities.

How to read the ladder

The ladder has twelve rungs. Each names a category of AI capability and the default verb attached to it. The default holds for at least 80 percent of organizations operating non-trivial AI in 2026. The remaining 20 percent (boundary cases driven by unusual scale, regulatory constraints, or extreme talent depth) are scored case-by-case using the four-question framework.

The ladder is not exhaustive. It names the twelve most consequential capabilities: the ones that consume the most engineering capacity, the most contract spend, or the most strategic attention. There are roughly thirty more capabilities in a mature AI stack; most of them follow the patterns established by the twelve named rungs.

The ladder is also dated. It captures defaults as of Q2 2026. Some rungs that are clear today were contested in 2024 and may be re-contested by 2028. The matrix’s seventh principle requires quarterly re-litigation; the ladder is the reference point that re-litigation compares actual decisions against.

The usually-buy rungs

Five rungs whose verb is unambiguously buy in 2026. These are the rails of the AI stack: commodity layers where building produces no advantage and consumes capacity that should be in moat work.

Rung 1: foundation model access. Anthropic, OpenAI, Google, and a handful of fast-followers ship overlapping capability at overlapping prices. Multi-provider contracts are the default; single-provider lock-in is a 20 to 40 percent risk premium on AI spend. The buy is permanent per the matrix’s third principle. Building a foundation model in 2026 costs in the high nine figures and ships behind the frontier inside two quarters.

Rung 2: vector index and storage. Pinecone, Weaviate Cloud, Qdrant Cloud, Turbopuffer, and pgvector all ship production-grade offerings with predictable pricing. Self-managed Faiss is now an exception case driven by extreme latency or extreme scale, not the default. The detail on which to choose is in the Pinecone vs Weaviate comparison and the broader vector database options analysis.

Rung 3: observability backend. Langfuse, Helicone, Arize Phoenix, Braintrust, and LangSmith all ship structured trace storage with eval hooks at production grade. Self-built observability is now plumbing in the moat-vs-plumbing sense, consuming engineering capacity that should be elsewhere. The eval logic running on top of the bought backend is build (rung 8); the backend itself is buy.

Rung 4: agent framework and tool-call dispatcher. The OpenAI Agents SDK, the Anthropic agent harness, LangGraph (production-grade since 2025), Pydantic AI, and AutoGen all handle the agent loop, structured output validation, retry semantics, and tool-call routing. The orchestration logic on top is build (rung 7); the framework underneath is buy.

Rung 5: deployment and inference infrastructure. Cloud-managed inference endpoints, model gateways (Portkey, OpenRouter, LiteLLM), and standard ML deployment infrastructure all ship as production products. Custom deployment is an exception case driven by regulatory or extreme latency requirements.

The pattern across the usually-buy rungs: the capability is generic, integration depth is low to medium, decision velocity is high (which favors buy because it sheds the technology-aging risk to vendors), and competitive position does not depend on uniqueness.

The usually-build rungs

Five rungs whose verb is unambiguously build in 2026. These are the moat: workload-specific layers where bought solutions either do not exist or visibly cap the ceiling.

Rung 6: retrieval logic and chunking strategy. Generic retrieval against generic chunking against generic embeddings produces generic results. Workload-specific chunking, hybrid retrieval (dense plus BM25 plus structured filters), reranker fine-tuning against the org’s relevance feedback, and retrieval-aware evaluation are where retrieval quality compounds. Vendors selling “retrieval-as-a-service” cap at generic; the org’s data is too specific to outsource. The depth on this is in the retrieval optimization guide.
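A minimal sketch of the hybrid retrieval shape described above: a structured filter applied first, then a dense ranking and a BM25 ranking fused into one list. The fusion method here is reciprocal rank fusion, which is an assumption chosen for illustration; the source does not name a fusion method. The function names, the `k=60` constant, and the toy document ids and metadata are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); high ranks contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_retrieve(dense_ranking, bm25_ranking, metadata, required, top_n=3):
    """Apply the structured filter, then fuse the dense and BM25 rankings."""
    def passes(doc_id):
        return all(metadata[doc_id].get(key) == value
                   for key, value in required.items())

    dense = [d for d in dense_ranking if passes(d)]
    sparse = [d for d in bm25_ranking if passes(d)]
    return reciprocal_rank_fusion([dense, sparse])[:top_n]


# Toy inputs: rankings would normally come from an embedding index and a
# BM25 index; here they are hard-coded lists of hypothetical doc ids.
metadata = {
    "a": {"lang": "en"},
    "b": {"lang": "en"},
    "c": {"lang": "de"},
    "d": {"lang": "en"},
}
dense_ranking = ["a", "c", "b", "d"]
bm25_ranking = ["b", "c", "d", "a"]

results = hybrid_retrieve(dense_ranking, bm25_ranking, metadata, {"lang": "en"})
print(results)  # ['b', 'a', 'd']
```

The workload-specific part is everything this sketch hard-codes: how documents are chunked, which filters matter, and how the two rankings are weighted against the org’s own relevance feedback.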

Rung 7: agent orchestration logic. The framework underneath (rung 4) is buy; the logic running on it (which agents do which work, how they hand off, how they share state, how their failures cascade, how their outputs feed evaluation) is the heart of the AI product. This is deliberately build, per the matrix’s fourth principle.

Rung 8: eval suites tuned to the workload. The harness underneath (Promptfoo, Inspect, OpenAI Evals) is buy or open-source; the suites running on it (the test cases, the scoring rubrics, the threshold-locking criteria, the regression triage workflow) are workload-specific and must be built. Per the matrix’s fifth principle, eval is build or hire, rarely buy. Vendors selling generic eval-as-a-service consistently sell harnesses with thin domain wrappers that score problems nobody had.
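As an illustration of what the built layer of an eval suite contains, the sketch below separates the four pieces named above: test cases, a scoring rubric, a locked threshold, and a triage decision. It does not use any real harness; every class, function, and number here (`EvalCase`, `LockedThreshold`, `gate`, the 0.7 and 0.05 values, the toy system) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


@dataclass(frozen=True)
class LockedThreshold:
    min_pass_rate: float   # release gate agreed by the threshold owner
    max_regression: float  # largest tolerated drop versus the last release


def exact_match(output: str, expected: str) -> bool:
    """A deliberately simple rubric: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_suite(system: Callable[[str], str], cases: list[EvalCase],
              scorer: Callable[[str, str], bool] = exact_match) -> float:
    """Return the fraction of cases the system passes."""
    if not cases:
        raise ValueError("eval suite has no cases")
    passed = sum(scorer(system(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def gate(pass_rate: float, previous_pass_rate: float,
         threshold: LockedThreshold) -> str:
    """Turn a pass rate into a ship / triage / block decision."""
    if pass_rate < threshold.min_pass_rate:
        return "block: below locked threshold"
    if previous_pass_rate - pass_rate > threshold.max_regression:
        return "triage: regression versus last release"
    return "ship"


def toy_system(prompt: str) -> str:
    """Stand-in for the AI system under test."""
    answers = {
        "refund window?": "30 days",
        "ship to canada?": "Yes",
        "warranty length?": "6 months",
        "support hours?": "9 to 5",
    }
    return answers.get(prompt, "unknown")


cases = [
    EvalCase("refund window?", "30 days"),
    EvalCase("ship to canada?", "yes"),
    EvalCase("warranty length?", "1 year"),  # the toy system gets this wrong
    EvalCase("support hours?", "9 to 5"),
]
threshold = LockedThreshold(min_pass_rate=0.7, max_regression=0.05)
pass_rate = run_suite(toy_system, cases)
print(pass_rate, gate(pass_rate, 0.75, threshold))
```

The cases, the rubric, and the threshold values are the parts that depend on the workload; the loop that runs them is the part a bought or open-source harness already provides.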

Rung 9: prompt design tuned to the workload. Prompt management (versioning, deployment, A/B) is buy; it falls under rung 3, observability/registry. Prompt design, meaning the actual prompts that work for the org’s domain, vocabulary, error modes, and customer types, is build. The design is workload-specific and is where prompt-fluency compounds across the AI roadmap.

Rung 10: cost and latency optimization at the orchestration layer. Token-level cost optimization is increasingly automated by frameworks; orchestration-level optimization (which models for which workload, which sub-agents to parallelize, where to cache, where to fall back) is workload-specific and depends on the org’s specific traffic mix and unit economics.
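A rough sketch of orchestration-level optimization, assuming the simplest possible form: a routing table that maps each workload to an ordered list of models, a cache keyed on workload and prompt, and fallback to the next model when one fails. The `Router` class, the workload name, and the two fake models are hypothetical and do not correspond to any gateway product’s API.

```python
from typing import Callable

ModelFn = Callable[[str], str]


class Router:
    """Route each workload to an ordered list of models, with caching."""

    def __init__(self, routes: dict[str, list[str]], models: dict[str, ModelFn]):
        self.routes = routes    # workload -> model names, cheapest first
        self.models = models    # model name -> callable
        self.cache: dict[tuple[str, str], str] = {}

    def run(self, workload: str, prompt: str) -> str:
        key = (workload, prompt)
        if key in self.cache:
            return self.cache[key]  # cache hit: no model call at all
        errors = []
        for name in self.routes[workload]:
            try:
                result = self.models[name](prompt)
            except Exception as exc:  # fall back to the next model in the list
                errors.append(f"{name}: {exc}")
                continue
            self.cache[key] = result
            return result
        raise RuntimeError("all models failed: " + "; ".join(errors))


calls = []  # records which fake model was invoked, in order


def flaky_small_model(prompt: str) -> str:
    calls.append("small")
    raise TimeoutError("simulated timeout")


def large_model(prompt: str) -> str:
    calls.append("large")
    return f"large:{prompt}"


router = Router(
    routes={"classify": ["small", "large"]},
    models={"small": flaky_small_model, "large": large_model},
)
first = router.run("classify", "hello")   # small fails, large answers
second = router.run("classify", "hello")  # served from the cache
print(first, second, calls)
```

Which model sits first in each route, what is safe to cache, and when a fallback is acceptable are the decisions that depend on the org’s traffic mix and unit economics.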

The pattern across the usually-build rungs: the capability depends on the org’s data, workflows, or evaluation criteria, integration depth is medium to high, and competitive position depends on the capability being distinctively good rather than generically good.

The usually-hire rungs

Two rungs whose verb is unambiguously hire in 2026, though the hire can be permanent or rented depending on the org’s circumstances per the matrix’s sixth principle.

Rung 11: AI architecture and threshold-locking authority. These are the senior calls about which model, which threshold, and which architecture: the judgment layer per the matrix’s eighth principle. The judgment is hired into the org as permanent capability (an AI architect or AI lead) or rented as fractional expertise (an AI advisor or specialized consultancy). Buying judgment from a generic vendor produces a recommendation that does not survive contact with the org’s specific constraints. Building judgment without hiring the people means years of trial-and-error that the org cannot afford against the current pace of AI evolution.

Rung 12: eval engineering leadership. The senior engineer who designs eval suites, runs threshold-locking processes, and triages regressions across the workload. Eval engineers are scarce enough that hiring one takes 4 to 9 months and costs $400K to $700K fully loaded. Many organizations rent the role through specialized agencies for the first 12 to 18 months while hiring the permanent role in parallel. The full breakdown of how to evaluate this hire is in the AI development agency vs in-house team analysis.

The pattern across the usually-hire rungs: the capability is judgment that compounds over years of exposure to specific workloads, the talent supply is genuinely scarce, and neither pure build (no senior to architect the build) nor pure buy (vendor recommendations are too generic) produces a usable result.

The contested middle

The remaining capabilities sit on a contested middle where the verb depends on the org’s specific constraints. These are the cases where the four-question framework does most of its work; the ladder defaults are unreliable here.

Embedding model selection. Default is buy from the major providers. Exception is build (fine-tune) when the workload’s domain language is sufficiently far from the public training corpus that off-the-shelf embeddings produce visibly weak retrieval. The exception is rarer than teams claim; verify by running the off-the-shelf alternative against the eval suite before committing to fine-tune.

Reranker. Default is buy (Cohere Rerank, Voyage, or open-source models). Exception is build (fine-tune) when the workload has enough labeled relevance data to produce a meaningfully better reranker. The threshold for “meaningfully better” is 5 percent or more on the relevant metric; below that, the buy is correct.
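The threshold rule can be written as a one-line check. The sketch below assumes the 5 percent is a relative gain over the bought reranker on whatever metric the eval suite uses; the source does not say whether it means relative gain or absolute percentage points, so treat that reading, the function name, and the example scores as assumptions.

```python
def reranker_verb(bought_score: float, finetuned_score: float,
                  min_relative_gain: float = 0.05) -> str:
    """Return 'build' only if fine-tuning beats the bought reranker by the threshold."""
    if bought_score <= 0:
        raise ValueError("bought_score must be positive")
    relative_gain = (finetuned_score - bought_score) / bought_score
    return "build" if relative_gain >= min_relative_gain else "buy"


# Hypothetical scores on the same eval suite and metric.
print(reranker_verb(0.70, 0.72))  # about 2.9 percent better: buy
print(reranker_verb(0.70, 0.75))  # about 7.1 percent better: build
```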

Data labeling pipeline. Default is hire (specialized vendors like Scale, Surge, or Snorkel-managed services). Exception is build when the labeling task requires deep domain expertise the vendor cannot match (e.g., specialized medical or legal labeling). The exception requires the org to have the domain experts on staff already.

Fine-tuning infrastructure. Default is buy (managed fine-tuning from OpenAI, Anthropic, Together, Replicate). Exception is build for organizations with continuous fine-tuning workloads at scale where the managed cost exceeds self-managed by an order of magnitude.

Synthetic data generation. Default is build (because the org’s specific data distribution and quality criteria are workload-specific). Exception is buy through specialized vendors when the synthetic data needs are generic (general instruction tuning, general code, general chat).

Red-team tooling. Default is buy (Lakera, Robust Intelligence, the major eval frameworks’ red-team modules). Exception is build for organizations with unusual threat models that off-the-shelf tooling does not cover.

The contested middle is where most of the architecture group’s actual decision-making time goes. The ladder rungs above are decided by reflex; the contested middle is decided by the four questions.

Rungs that shifted between 2024 and 2026

The ladder is dated to Q2 2026 because rungs do shift. Three rungs in the usually-buy section were contested or differently positioned in 2024.

Vector indexing (rung 2) moved from contested-build to usually-buy. In 2024, self-managed Faiss was the default for serious AI engineering teams because the managed offerings had visible scale and reliability gaps. By 2026, the gap has closed; managed indexing is the default and self-managed is the exception case.

Agent framework (rung 4) moved from usually-build to usually-buy. In 2024, the framework landscape was a research zoo and serious teams built their own agent loops because no framework was production-grade. By 2026, multiple frameworks are production-grade and building from scratch is plumbing.

Observability backend (rung 3) moved from contested-buy to usually-buy. In 2024 the observability vendors were young and many teams built their own trace storage as a hedge. By 2026 the vendor category has consolidated and matured; buying is the default and self-built observability is plumbing.

The shifts are why the matrix’s seventh principle requires quarterly re-litigation. Three rungs moving in 24 months is the rate at which the ladder evolves. Organizations running with 2024-vintage decisions on these rungs are likely operating against a stale ladder; the piece on re-litigating 2024 decisions covers the diagnostic and the playbook for catching up.

Frequently asked questions

Why is foundation model access “usually buy” in 2026?

Because the cost of building a foundation model has converged into a small number of frontier labs whose unit economics depend on serving the entire industry. Switching cost between providers has compressed to weeks; capability gaps to single-digit percent on enterprise workloads. Building a foundation model in 2026 costs in the high nine figures and ships behind the frontier inside two quarters. The math closes for roughly four organizations on earth, and they are not reading this piece.

Why is agent orchestration “usually build” if the framework is “usually buy”?

The framework is the loop, retry semantics, tool-call routing, and structured output handling: the plumbing of the agent layer. The orchestration is which agents do which work, how they hand off, what tools they have access to, how their outputs are evaluated, and how they compose into the org’s specific workflow. The framework is generic; the orchestration is the moat. Buying the framework saves engineering time on plumbing; building the orchestration is the work the saved time should fund.

Why is eval “usually build or hire, rarely buy”?

Because eval tests the org’s specific workload, and there is no general-purpose eval that works for a workload it was not designed against. Vendors selling eval-as-a-service consistently sell generic harnesses with thin wrappers that score problems nobody had. The right verb is build (when eval-fluent engineers are on staff) or hire (when they are not). Frameworks like Promptfoo and Inspect are scaffolding, not products.

What if my org is small and cannot fund a permanent AI architect hire?

Rent the role. Fractional AI advisors, specialized agencies, and consultancies all offer the architecture and threshold-locking judgment as a contracted service for organizations that are not ready for the permanent hire. The contract should specify joint ownership of the architecture documents and eval suites so that when the org eventually hires permanently, the artifacts transfer cleanly. The full breakdown of when this works and when it does not is in the technical cofounder vs AI agency comparison.

Does the ladder apply to startups and enterprises identically?

The verbs are identical; the depth on each rung varies. Startups buy the usually-buy rungs lighter (cheapest provider, single contract, less hedging) and build the usually-build rungs lighter (smaller eval suites, less retrieval optimization, simpler orchestration). Enterprises buy heavier (multi-vendor contracts, more hedging) and build heavier (richer eval, more retrieval depth, more sophisticated orchestration). The ladder is the same; the investment per rung scales with the org.

What is the most common mistake in applying the ladder?

Skipping the usually-hire rungs. Organizations build moat capabilities (retrieval, eval, orchestration) without first hiring the senior judgment that should architect the build. The result is moat work that does not produce moat: the capabilities ship, but they are mis-architected, the eval thresholds are wrong, and the orchestration patterns do not survive scale. Hire the judgment first; build the moat against the judgment.

How does this ladder relate to the four-question framework?

The ladder gives default verbs for common capabilities, the ones that almost always score the same way. The four-question framework is the work behind the defaults, used when a capability is on the contested middle or when the default produces an answer that does not feel right. The ladder is the shortcut; the four questions are the rigor.

What about regulatory or compliance constraints?

Those are overrides applied on top of the ladder. A capability whose default verb is buy can flip to build (or self-hosted buy) when the data flowing through the capability cannot leave the org’s network. Most organizations have one or two regulatory overrides, not twenty. The ladder gives the default; the override applies when the data type or jurisdiction requires it.

How often does the ladder itself need updating?

Annually for the rung definitions; quarterly for the contested middle. The major shifts (which rungs are usually-X vs contested) happen on the 12-to-24-month timescale. The smaller shifts (which exceptions are common in the contested middle) happen quarterly. The architecture group should review the ladder annually as part of the broader AI strategy refresh.

What’s the relationship between this ladder and the AI economics manifesto?

The AI project economics manifesto describes how to budget AI projects against evaluation cost; this ladder describes which capabilities the budget is allocated to via which verb. The economics tells finance how to organize the spend; the ladder tells architecture how to organize the work. Both run on the same capability ledger, with finance attaching cost lines to verbs the ladder has assigned.

Key takeaways

  • Five capabilities are usually buy in 2026: foundation models, vector index, observability backend, agent framework, deployment infrastructure.
  • Five capabilities are usually build: retrieval logic, agent orchestration, eval suites, prompt design, cost-and-latency optimization.
  • Two capabilities are usually hire: AI architecture and threshold-locking authority, eval engineering leadership.
  • The contested middle (embedding selection, reranker, data labeling, fine-tuning, synthetic data, red-team tooling) is decided case-by-case using the four-question framework.
  • Three rungs shifted between 2024 and 2026 (vector indexing, agent frameworks, observability backends), all moving from contested or build into usually-buy.

The ladder is the shortcut for AI sourcing decisions. Most organizations can use the default verb on twelve of the eighteen capabilities named here and reserve their architecture-group attention for the contested middle and the org-specific overrides. The discipline is to apply the ladder by reflex, score the contested middle by framework, and re-litigate the whole thing quarterly. Organizations that internalize this end up with capability ledgers whose verbs compound; organizations that do not end up with ledgers whose verbs are an accumulation of historical accidents that nobody quite remembers deciding.

Last Updated: May 9, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
