Half of the failures we audit in AI systems are caused by over-building the wrong things: the build trap. The other half are caused by over-buying the wrong things: the buy trap. The buy trap is harder to see because the failure surface is different. Build-trap failures show up as engineering capacity drained on commodity infrastructure; buy-trap failures show up as products that run smoothly until they fail in ways that reveal the org’s standards were never encoded into the system. The five capabilities below are the most common buy-trap targets. Each looks like infrastructure that can be outsourced; each is in fact the org’s specific judgment about its own product, customers, and risk tolerance. Outsourcing them delegates that judgment to people who cannot make the calls correctly. This piece names the five, explains why each must stay inside the team, and lays out what to encode to keep them there.
This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s eighth principle says the default verb is compose: buy the rails, build the moat, hire the judgment. The five capabilities below are the moat-with-judgment subset: things that look like rails but encode product judgment in ways no vendor can absorb. The buy trap is mistaking this subset for rails.
What the buy trap is
The build trap drains capacity. The buy trap drains judgment. A team that buys what should have been built ends up with a system that runs but does not represent the org’s actual standards. The product behaves competently on the happy path; it behaves wrong on the stress path because the rules governing stress behavior were written by someone who did not have the org’s specific judgment.
The trap is harder to see because the cost is delayed. A bought eval test set scores green for two quarters; the regression that ships in quarter three reveals that the test set was missing the case that mattered. A bought routing configuration runs for six months; the incident that surfaces an out-of-policy model call reveals that the configuration was never aligned to the org’s policy. A bought set of observability rules pages the on-call engineer twelve times a quarter; none of the pages are about the failure that matters when it happens.
The five capabilities below share a common property: they encode judgment, not work. The judgment is specific to the org. The judgment cannot be transferred via a contract or a vendor agreement. Outsourcing the work is fine; outsourcing the judgment is the trap.
Capability 1: the eval test set
Per the case for buying the eval stack and building the evaluator, the runtime is buy and the evaluator is build. The test set is the highest-value artifact inside the evaluator. It is the 200 to 2,000 cases that capture the org’s actual workload, tagged by failure mode, curated by domain experts, versioned alongside the prompts and data they exercise.
A vendor selling “eval as a service” cannot produce this test set. They do not know the org’s workload, customers, edge cases, or tolerance for specific kinds of failure. What they ship is a generic harness with a thin domain wrapper that scores a problem the org does not have. The wrapper looks credible in the demo and fails the moment it meets the org’s actual workload.
A contractor can help curate the test set, but the org must own the curation decisions. The case selection (“which 30 customer scenarios are most representative of the workload?”) and the correctness rubric (“what counts as a satisfactory answer for this case?”) are judgment calls that the org’s domain expert must own. The contractor produces drafts; the org accepts, revises, or rejects.
What to keep inside: every test case in the suite, every grader specification, every threshold value, every failure-mode tag. What to source externally: the harness underneath, the regression UI, the storage, the runtime.
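To make the split concrete, here is a minimal Python sketch of what one org-owned test case might carry. The field names (`rubric`, `failure_mode`, `threshold`, `approved_by`) and both helper functions are illustrative assumptions, not the schema of any particular eval harness; the scores are assumed to come from whatever bought runtime the team uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One org-owned test case. All field names are hypothetical."""
    case_id: str
    prompt: str
    rubric: str        # what counts as a satisfactory answer, written by the domain expert
    failure_mode: str  # tag used to slice results
    threshold: float   # minimum grader score for this case to pass
    approved_by: str   # named internal owner who accepted the case


def failing_cases(cases, scores):
    """Return ids of cases scoring below their own threshold (a missing score fails)."""
    return [c.case_id for c in cases if scores.get(c.case_id, 0.0) < c.threshold]


def pass_rate_by_failure_mode(cases, scores):
    """Group pass rates by failure-mode tag so a regression is visible per tag."""
    totals, passed = {}, {}
    for c in cases:
        totals[c.failure_mode] = totals.get(c.failure_mode, 0) + 1
        if scores.get(c.case_id, 0.0) >= c.threshold:
            passed[c.failure_mode] = passed.get(c.failure_mode, 0) + 1
    return {mode: passed.get(mode, 0) / n for mode, n in totals.items()}
```

The harness that produces `scores` can be bought; the cases, rubrics, thresholds, and tags in this structure are the part the org curates and versions.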
Capability 2: the prompt registry content
The prompt registry stack (version management, deployment, A/B testing, eval-suite linkage) is buy. PromptLayer, Langfuse, Helicone, and LangSmith ship this. The integration takes hours.
The prompt registry content (the actual prompts being managed, plus the rules governing what gets deployed when, plus the rollback semantics) is not buy. The prompts encode product behavior; the deployment rules encode the org’s tolerance for prompt-driven incidents. A change to a system prompt can shift how the product handles thousands of customer interactions per day. The decision about whether that change ships, when, with what rollout cadence, and with what rollback trigger is the org’s product judgment.
Outsourcing the prompt content puts product behavior in the hands of someone outside the team. Outsourcing the deployment rules puts the org’s incident tolerance in the hands of someone who does not own the consequences. Both are common buy-trap failures.
What to keep inside: every prompt in the registry, the deployment rules, the A/B configuration, the rollback triggers. What to source externally: the registry stack, the diff UI, the deployment automation.
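A rollout cadence and rollback trigger can be written down as a small, reviewable policy. The sketch below is one hypothetical shape for it in Python; `RolloutPolicy`, `next_action`, and every threshold are assumptions made for illustration, and none of this is the API of the registry tools named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutPolicy:
    """Org-owned deployment rule for a prompt change. All values are hypothetical."""
    stages: tuple               # traffic fractions, in rollout order
    max_eval_regression: float  # largest tolerated drop in eval score
    max_complaint_rate: float   # largest tolerated complaint rate at any stage
    approver: str               # named internal owner who signs off


def next_action(policy, stage_index, eval_delta, complaint_rate):
    """Decide what happens after observing the current rollout stage."""
    regressed = eval_delta < -policy.max_eval_regression
    too_many_complaints = complaint_rate > policy.max_complaint_rate
    if regressed or too_many_complaints:
        return "rollback"
    if stage_index + 1 < len(policy.stages):
        return f"advance to {policy.stages[stage_index + 1]:.0%}"
    return "hold at full traffic"
```

The bought registry can execute the advance or the rollback; the numbers inside `RolloutPolicy` are where the org’s incident tolerance actually lives.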
Capability 3: model-routing configuration
The model router itself (OpenRouter, Portkey, LiteLLM) is buy. The integration takes hours; the cost savings from sophisticated routing logic typically recover 30 to 40 percent of inference spend.
The configuration that decides which model handles which request is not buy. That configuration encodes the org’s cost-vs-quality tradeoffs (“we use the cheaper model for this customer tier”), regulatory constraints (“PII queries must route to the EU-region provider”), customer-tier policies (“enterprise customers get the higher-quality model”), and fallback semantics (“when the primary fails, route to the secondary if and only if the request is non-sensitive”).
Outsourcing this configuration to a contractor or vendor produces silent policy drift the org learns about during an incident. A misrouted query that hits a non-compliant region, a customer-tier mismatch that gives a free-tier customer the premium model, a fallback that triggered when it should not have: all of these surface during incidents and reveal that the configuration was written by someone who did not have the policy in their head.
What to keep inside: every routing rule, every fallback semantic, every customer-tier mapping, every compliance constraint. What to source externally: the router, the cost-tracking, the metrics aggregation.
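The routing policy described above is small enough to express as a pure function that the org can review and test on its own. This Python sketch assumes hypothetical model names and a simplified `Request` type; it is not the configuration format of OpenRouter, Portkey, or LiteLLM, and a real deployment would translate these rules into whatever the chosen router accepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    tier: str           # e.g. "enterprise" or "free" (hypothetical tier names)
    contains_pii: bool


# Hypothetical model identifiers standing in for the org's real choices.
TIER_MODEL = {"enterprise": "premium-model", "free": "economy-model"}
EU_MODEL = "eu-region-model"
SECONDARY_MODEL = "secondary-model"


def route(req: Request, primary_healthy: bool = True) -> Optional[str]:
    """Return the model that should serve the request, or None to refuse."""
    if req.contains_pii:
        # Compliance rule: PII goes to the EU-region provider only.
        # Sensitive requests never fall back; refuse if the provider is down.
        return EU_MODEL if primary_healthy else None
    primary = TIER_MODEL.get(req.tier, "economy-model")
    # Fallback rule: secondary is allowed only because the request is non-sensitive.
    return primary if primary_healthy else SECONDARY_MODEL
```

Writing the policy this way gives the security and product leads something to sign off on line by line, and gives the team a test suite that fails when the deployed configuration drifts from it.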
Capability 4: observability rules
Per the AI agency observability stack analysis, trace storage and dashboards are buy. Langfuse, Helicone, Phoenix, Braintrust, and LangSmith all ship production-grade trace storage with structured schemas, cost dashboards, and latency breakdowns.
The rules that govern when those traces become alerts are not buy. The rules define what counts as an alert-worthy event (“p99 latency over 8s for the customer-support workload triggers a page”), what threshold triggers what severity (“3 hallucination flags in 60 seconds is a P1, 1 in 10 minutes is a P3”), what runbook each alert maps to, and what counts as resolution.
A contractor producing observability rules without org-specific context produces alerts that page the wrong people about the wrong things. The on-call engineer gets paged twelve times a quarter for noise; when the actual failure happens, the alert that matters is buried in the same flow.
What to keep inside: every alert rule, every severity mapping, every runbook, every escalation path. What to source externally: the trace storage, the dashboards, the structured schema, the eval-on-trace hooks.
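The example thresholds quoted above translate directly into code. The following Python sketch encodes them as two small functions; the function names and the use of plain timestamps in seconds are assumptions for illustration, and the thresholds are the article’s examples rather than recommendations.

```python
def hallucination_severity(flag_times, now):
    """Map recent hallucination flags to a severity.

    flag_times: timestamps (in seconds) at which a flag was raised.
    Returns "P1", "P3", or None when no alert is warranted.
    """
    last_60s = sum(1 for t in flag_times if 0 <= now - t <= 60)
    last_10m = sum(1 for t in flag_times if 0 <= now - t <= 600)
    if last_60s >= 3:   # 3 flags in 60 seconds is a P1
        return "P1"
    if last_10m >= 1:   # 1 flag in 10 minutes is a P3
        return "P3"
    return None


def latency_page(workload, p99_seconds):
    """Page only for the customer-support workload when p99 exceeds 8 seconds."""
    return workload == "customer-support" and p99_seconds > 8.0
```

The bought trace store supplies `flag_times` and the p99 figure; which numbers appear in these functions, and which runbook each severity maps to, is the part the on-call lead owns.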
Capability 5: kill switches
Kill switches are the controls that disable an AI feature, route around a failing model, or revert to a baseline behavior when something is wrong. They encode the org’s worst-case product judgment: what the system does when the model is producing bad output, when costs are spiking, when a customer reports a serious failure, when a regulator asks the org to pause.
Outsourcing kill switches outsources the org’s ability to respond to an incident. The contractor or vendor who built the kill switch is not on call; they cannot pull the switch at 3am; they cannot make the judgment call about which switch to pull when multiple failure modes overlap. The team that owns the product must own the controls that disable the product.
This is the single most important capability on the buy-trap list. A team that bought everything else but built the kill switches has a recoverable position. A team that outsourced the kill switches has no incident response; they are dependent on someone outside the team to respond, and that someone almost always cannot respond fast enough.
What to keep inside: every kill switch implementation, every trigger rule, every fallback behavior, every operator runbook. What to source externally: nothing. Kill switches are build-only.
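A kill switch does not need to be elaborate to be owned. The Python sketch below shows one hypothetical minimal form: an in-process flag with a named owner, an audit log, and a baseline path. The class and function names are assumptions for illustration; a production version would likely persist the flag somewhere shared across processes, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitch:
    """An org-owned control for one AI feature. All names are hypothetical."""
    name: str
    owner: str     # named internal owner, not a vendor contact
    runbook: str   # where the operator instructions live
    engaged: bool = False
    log: list = field(default_factory=list)

    def pull(self, operator, reason):
        """Disable the AI path and record who pulled the switch and why."""
        self.engaged = True
        self.log.append((operator, reason))


def serve(switch, model_call, baseline_call, request):
    """Use the model unless the switch is engaged, then revert to baseline."""
    if switch.engaged:
        return baseline_call(request)
    return model_call(request)
```

The point of keeping this in the team’s own code is that the person on call at 3am can read it, pull it, and know exactly what the product does afterwards.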
The control-plane pattern
The five capabilities share a structural property: they are control-plane capabilities. They govern how the system behaves under stress, under exception, under failure. The happy path can be bought: the request comes in, the model responds, the user is served. The control plane cannot.
Per the AI agency quality system analysis, the control plane is what distinguishes a system the team operates from a system the team merely runs. Operating a system means owning the rules that govern its stress behavior. Running a system means accepting whatever defaults the bought components ship with.
Founders who buy aggressively on the data plane (model access, vector storage, trace storage) and stay inside on the control plane (test set, prompt content, routing config, observability rules, kill switches) ship products that survive their first major incident. Founders who buy aggressively on both planes ship products that look identical on day one and diverge sharply during the first incident.
The build-with-help model
The five capabilities are build-only at the artifact level, but the team does not have to do every line of work. The build-with-help model:
- A contractor produces a starter test set draft. The org’s domain expert curates which cases are kept, which are added, which are dropped. The artifact lives inside the org.
- A consultant configures the model router with a baseline rule set. The org’s security and product leads review and sign off on the rules. The configuration lives inside the org.
- An agency drafts the observability rule set. The on-call engineering lead revises the rules against the org’s actual incident history. The rules live inside the org.
- An external eval-fluent engineer pairs with the internal team on building the kill switches. The implementation, triggers, and operator runbook are all owned by the internal team.
This model lets the team move faster without outsourcing judgment. The pattern is hire-with-handoff, not buy. Per the matrix’s hiring principle, this is the appropriate use of rented capacity: drafting work that the org reviews and owns.
What to encode
A short list of structural decisions.
- Five named control-plane owners. The eval test set, the prompt registry content, the model-routing configuration, the observability rules, and the kill switches each have a named owner inside the team. Not “the platform team”; a specific name on the org chart.
- Vendor-write boundary explicit. The architecture review names which artifacts are vendor-writable (the harness, the trace storage, the router) and which are org-only (the test set, the prompt content, the routing config, the alert rules, the kill switches).
- Audit cadence on each control-plane artifact. Quarterly review checks whether each artifact is current, whether the owner is still operating it, whether new failure modes have been added.
- Incident retrospectives include control-plane review. Every incident triggers a review of which control-plane artifact failed to catch it. If the test set missed it, the test set gets a new case. If the alert rules missed it, the rules get an update. If a kill switch was missing, one gets built.
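The first two decisions in this list, named owners and an explicit vendor-write boundary, can be checked mechanically. Here is a minimal Python sketch under stated assumptions: the artifact keys, the owner placeholders, and the `change_allowed` rule (org-only artifacts change only with the named owner’s sign-off) are all hypothetical and would need to match how a given team actually reviews changes.

```python
# Artifacts a vendor or contractor may write to directly (from the list above).
VENDOR_WRITABLE = {"eval_harness", "trace_storage", "router"}

# Org-only artifacts mapped to a named owner. Owner values are placeholders;
# None marks an artifact that nobody currently owns.
CONTROL_PLANE_OWNERS = {
    "eval_test_set": "owner-1",
    "prompt_content": "owner-2",
    "routing_config": "owner-3",
    "alert_rules": "owner-4",
    "kill_switches": None,
}


def missing_owners(owners):
    """List control-plane artifacts that have no named owner."""
    return sorted(name for name, owner in owners.items() if not owner)


def change_allowed(artifact, approved_by, owners):
    """Vendor-writable artifacts pass; org-only ones need the owner's sign-off."""
    if artifact in VENDOR_WRITABLE:
        return True
    owner = owners.get(artifact)
    if owner is None:
        return False  # unowned or unknown artifacts cannot change
    return approved_by == owner
```

Run as part of the quarterly audit, `missing_owners` surfaces the gap before an incident does; in this sample data it would flag the kill switches.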
Frequently asked questions
What is the AI buy trap?
The mirror of the build trap. Teams over-correct on the buy side and outsource capabilities that cannot be safely outsourced because they encode the org’s specific judgment. The five capabilities covered in this piece are the most common targets. Outsourcing them produces a system that runs but does not represent the org’s actual standards.
Why can’t the eval test set be outsourced?
The test set encodes the org’s specific knowledge: what counts as a valid input, a correct output, an edge case worth catching. A vendor or contractor producing a generic test set scores generic correctness against a problem the org does not have. The test set must be curated by people who understand the domain at customer-by-customer detail.
Why can’t the prompt registry be outsourced?
The registry stack is buy. The prompt content and deployment rules are not. The prompts encode product behavior; the rules encode incident tolerance. Outsourcing either delegates product behavior or incident tolerance to someone outside the team.
Why can’t model-routing config be outsourced?
The router is buy. The configuration is not. The configuration encodes cost-vs-quality tradeoffs, regulatory constraints, customer-tier policies, and fallback semantics. Outsourcing the configuration produces silent policy drift the org learns about during an incident.
Why can’t observability rules be outsourced?
Trace storage and dashboards are buy. The rules that turn traces into alerts are not. They define what counts as alert-worthy, what severity, what runbook. A contractor producing them without org context produces alerts that page the wrong people about the wrong things.
Why can’t kill switches be outsourced?
Kill switches encode worst-case product judgment. They define what the system does when something is wrong. Outsourcing them outsources incident response. The team that owns the product must own the controls that disable it.
How is the AI buy trap different from the AI build trap?
The build trap is over-investing in commodity infrastructure that vendors ship better. The buy trap is under-investing in the workload-specific judgment layer. They are mirror failures. Build-trap costs surface as drained capacity; buy-trap costs surface as products that fail in ways that reveal standards were never encoded.
What do these five capabilities have in common?
All five are control-plane capabilities: they govern stress behavior, not happy-path behavior. The happy path can be bought; the control plane cannot. Each encodes judgment specific to the org’s product, customers, risk tolerance, and standards.
Can a contractor or agency do any of these correctly?
A contractor or agency can produce drafts; the org must own the final form. The pattern is build-with-help, not buy. Hiring help is fine; outsourcing the artifact is not.
What does this mean for the build-vs-buy-vs-hire matrix?
It refines principles seven and eight. Some capabilities are build-only regardless of how attractive the buy options look, because the org’s specific judgment is what the capability encodes. The five control-plane capabilities are canonical examples, part of the moat layer even when they look like infrastructure.
Key takeaways
- The buy trap is the mirror of the build trap. It drains judgment instead of capacity, and the cost surfaces during incidents rather than at quarterly reviews.
- Five capabilities cannot be safely outsourced: the eval test set, the prompt registry content, the model-routing configuration, the observability rules, and the kill switches.
- All five are control-plane capabilities: they govern how the system behaves under stress, exception, and failure.
- Each capability has a buy-able stack underneath it (harness, registry tooling, router, trace storage, infrastructure for switch implementation) and an org-only artifact on top.
- The build-with-help model lets the team move faster without outsourcing judgment: contractors draft, the org reviews and owns.
The AI buy trap is harder to see than the build trap because the failure is delayed. The system runs smoothly until it does not. When it fails, the org discovers the rules governing stress behavior were never written by anyone who had the org’s judgment in their head. The fix is naming the five control-plane capabilities, owning them inside the team, and using contractors and vendors only for drafts the org reviews. The cost of not doing it is the next incident.
Arthur Wandzel