The mirror of the previous piece. There are founders who should not build AI in-house and should outsource. There are also founders for whom outsourcing is a category error: the wrong frame, not the wrong tradeoff. The four conditions below name the situations where outsourcing AI development produces a worse outcome than building in-house, regardless of the agency’s quality. Most founders we audit who outsourced AI in 2024 and now regret it were sitting on at least two of these conditions and did not see them. The cost of the mistake is roughly 1.5x the original engagement, plus the customer trust damaged in the failure window. The four conditions are proprietary data depth, regulatory boundary accountability, IP-critical defensibility, and eval-set tacit judgment. When two or more are present, refuse the outsource.
This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s first principle says most AI capability resolves to a named verb. The second says moat density, integration depth, and decision velocity are the axes that determine the verb. The four conditions below are the canonical signals of high moat density: situations where build is the only verb that produces the right outcome.
The four refusal conditions
There are four conditions under which a founder should refuse to outsource AI development to an agency or contractor, regardless of how strong the partner is. Each is a structural signal that the work cannot survive a contract boundary intact. Any one is a yellow flag. Any two together are a red flag, and the right answer is build.
The conditions are:
- The work depends on proprietary data the org cannot legally or commercially share at the depth the build requires.
- The work sits on a regulatory boundary the org alone is accountable for.
- The IP being produced is the org’s primary defensible asset.
- The eval set encodes tacit judgment that cannot be transferred in writing.
These are not contract problems. They are knowledge-locality problems. Each is a situation where the relevant judgment lives inside the org’s people and the org’s data, and any handoff to a contractor produces a system that approximates that judgment without containing it.
Condition 1: proprietary data depth
The first condition is the most common. The org’s AI capability depends on data that is unique to the org (customer transactions, internal documents, sensor readings, support conversations, financial flows, clinical records), and the depth of intuition required to build a good system against that data exceeds what a contractor can absorb.
The trap is that the data looks shareable. The org runs a security review, signs an NDA, exports a sanitized sample, and the agency builds against the sample. The system handles the sample well. It ships. Three months later, the long tail surfaces: the 8 percent of cases the sample did not contain, the schema variations that appeared after the export, the customer-specific quirks that only show up at scale. The system fails on the cases that mattered.
The structural reason: data depth is built through repetition over time. A domain expert who has lived inside the data for two years has internalized 4,000 edge cases that no document captures. An agency engineer with three months of access has captured maybe 200. The system the agency ships is built against the 200; the org’s customers exercise the 4,000. The gap is not closeable inside a contract.
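The size of that gap can be made concrete with a few lines of arithmetic. This is a minimal sketch using the illustrative figures above (4,000 and 200); the function name `coverage_gap` is hypothetical, and it assumes every edge case the contractor captured is also one the in-house expert knows.

```python
def coverage_gap(expert_cases: int, contractor_cases: int) -> tuple[float, int]:
    """Return (share of known edge cases the contractor captured, count left uncovered).

    Assumes the contractor's cases are a subset of the expert's cases.
    """
    if expert_cases <= 0 or not 0 <= contractor_cases <= expert_cases:
        raise ValueError("expected 0 <= contractor_cases <= expert_cases, expert_cases > 0")
    coverage = contractor_cases / expert_cases
    uncovered = expert_cases - contractor_cases
    return coverage, uncovered


# Figures from the paragraph above: 4,000 internalized vs. 200 captured.
coverage, uncovered = coverage_gap(4000, 200)
print(f"coverage: {coverage:.0%}, uncovered edge cases: {uncovered}")
```

On these figures the contractor-built system is designed against 5 percent of the edge cases the expert knows, which leaves 3,800 unaddressed.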
This is also where legal risk concentrates. Per the AI agency security review most CTOs should run, the controls around data export to an agency environment are a hard ceiling on what work can be outsourced. When the data is regulated or commercially sensitive, exporting it is a control surface most legal teams will not approve, and the depth-of-intuition problem becomes moot: the contractor rarely sees the data at the required depth in the first place.
Condition 2: regulatory boundary accountability
The second condition is regulatory boundaries. Regimes such as HIPAA, GDPR, SOC 2, FedRAMP, banking regulations, and FDA requirements each make the org accountable for the AI system’s behavior in ways that cannot be delegated. A contractor can build to the regulation. The org signs the audit, pays the fine, and loses the customer when the system fails.
The accountability cannot be outsourced. The work that produces the accountability (the prompt content, the eval rules, the routing policy, the kill switches) should not be outsourced either, because that work is what the audit examines. An auditor asking “why did your AI system make this decision in this case?” expects an answer that traces back to a named person inside the org, not a vendor’s engagement manager.
A common failure pattern: the agency builds a regulated AI system in 2024 and the engagement ends. The first regulatory examination lands in 2025. The org’s compliance lead discovers that the policy logic, audit-log structure, and eval rules reflect the agency’s general practice rather than the org’s specific posture. The remediation is a rewrite under audit pressure at roughly 2x the original build cost.
Regulated AI work is build by default. Hire support to compress the timeline; outsource the rails per the case for buying your model gateway and building your prompt library. Do not outsource the regulated layer.
Condition 3: IP-critical defensibility
The third condition is IP-critical work. The AI capability is the org’s primary defensible asset: what an acquirer pays for, what a competitor cannot replicate, what investors underwrite. Outsourcing IP-critical work creates two structural risks.
First, contractor learning leaks. An agency engineer who works on the org’s IP-critical capability for 6 months absorbs patterns that adjacent clients benefit from. The contract restricts what the engineer can copy verbatim; it does not restrict what the engineer can carry as intuition. The org’s defensibility erodes quietly in proportion to how generalizable the patterns are.
Second, IP locality. The IP itself sits in a contract environment the org does not fully control: source code in the agency’s repos, prompts in the agency’s registry, eval cases in the agency’s storage. The contract specifies handoff terms; in practice, the handoff is incomplete, and the org spends 2 to 4 quarters reconstructing what the contract said it had. By the time the org owns the artifact, the artifact is already older than the moat-density window can justify.
For most defensible-AI businesses, the IP-critical layer is exactly the layer that should be built per the matrix’s fourth principle: the agent orchestration, the eval suite, and the prompt library. These are also exactly the layers that founders most often outsource to agencies who promise speed. The speed is real; the IP cost is also real and surfaces 18 months later as a flat acquisition multiple.
Condition 4: eval-set tacit judgment
The fourth condition is the eval set. The eval set encodes the org’s judgment about what counts as correct behavior. That judgment is mostly tacit: it lives in domain experts who recognize that a given output is wrong but cannot fully articulate why.
A contractor producing the eval set generates a credible-looking artifact: 200 to 500 cases with rubrics, severity levels, expected outputs. The artifact scores green on the demos. It also scores green on cases the org does not care about, because the contractor selected cases by what was easiest to articulate, not by what mattered most. The cases that mattered most were the ones the domain expert could not fully articulate; the contractor rarely asked about those because the contractor did not know to ask.
Per “stop scoping AI projects in features, scope them in evaluations,” the eval set is the project’s center of gravity. When the eval set is wrong, the project ships on time against the wrong target. When the eval set is curated by a contractor without the tacit judgment, the eval set is wrong by construction.
The fix is to keep eval-set authorship inside the org and use the contractor for everything around the eval set: the harness, the regression tooling, the dashboard, the storage. Per the case for buying the eval stack and building the evaluator, this is the standard split. The eval set itself does not survive the contract boundary; the rest of the eval stack does.
Why NDAs and security reviews are insufficient
A common counter to all four conditions: “We have an NDA. We have a security review. We have a data-handling agreement.” All true and all insufficient.
NDAs and security reviews fix legal exposure. They do not fix the depth-of-intuition problem, the accountability locality problem, the IP-leak-through-pattern problem, or the tacit-judgment problem. Each of those is structural, not contractual. The contractor with full legal access to the data still does not have the years of exposure that the in-house domain expert has. The contractor with full audit access still does not sign the audit. The contractor under a full IP-rights restriction still carries patterns. The contractor with a full transcript of conversations with domain experts still does not have the unspoken judgment.
The contract layer governs what is allowed. The knowledge layer governs what is possible. The four conditions above are about knowledge, not allowance.
The hybrid that does work
When two or more conditions are present, the right shape is build-with-help, not buy. Hire one or two senior AI engineers who own the moat layer end to end. Bring in an agency or contractor for the rails: the model gateway, the observability backend, the eval harness, the vector store, the basic agent framework. Per the AI hybrid playbook, the inverse split applies here: the agency owns roughly 30 percent (the rails); the in-house team owns roughly 70 percent (the moat).
This shape is more expensive than full outsourcing for the first six months and roughly the same cost as full outsourcing over 18 months. The probability of the system passing its first regulatory examination, surviving its first acquisition diligence, and earning its first eval-driven competitive moment is dramatically higher.
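One way to make the build-with-help split enforceable is to check a proposed agency scope against an explicit ownership map before signing. The sketch below is illustrative only: the component names follow this piece, while the identifiers `RAILS`, `IN_HOUSE`, and `review_scope` are hypothetical, and a real org would list its own components.

```python
# Components an agency or contractor can own (the rails).
RAILS = {
    "model gateway",
    "observability backend",
    "eval harness",
    "vector store",
    "basic agent framework",
}

# Components that stay with the in-house team.
IN_HOUSE = {
    "agent orchestration",
    "eval set",
    "prompt library",
    "policy layer",
}


def review_scope(agency_scope: set[str]) -> dict[str, set[str]]:
    """Split a proposed agency scope into allowed, refused, and unclassified items."""
    return {
        "allowed": agency_scope & RAILS,
        "refused": agency_scope & IN_HOUSE,
        # Anything not in either set needs a human decision before signing.
        "unclassified": agency_scope - RAILS - IN_HOUSE,
    }


proposed = {"vector store", "eval harness", "prompt library", "billing UI"}
for bucket, items in review_scope(proposed).items():
    print(bucket, sorted(items))
```

Keeping an explicit "unclassified" bucket is a deliberate choice: a component nobody has assigned to either side should prompt a decision, not default to the agency.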
What to encode
For founders deciding whether to outsource AI development, encode the four conditions as a checklist that runs before any agency engagement is signed.
- The data test. Does the build require depth of intuition over data the org cannot share at that depth? If yes, the data layer stays in-house.
- The regulatory test. Is the org accountable for the system’s behavior under a named regulation? If yes, the policy layer stays in-house.
- The IP test. Is the AI capability the org’s primary defensible asset? If yes, the moat layer stays in-house.
- The eval test. Does the eval set require tacit judgment that lives only in domain experts? If yes, the eval set stays in-house.
When two or more answers are yes, refuse the outsource. Run the build-with-help shape instead. The agency engagement that respects these boundaries produces a stronger system than the engagement that absorbs them, and the founder who refuses on these grounds is making a sourcing decision that respects what the work is.
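The checklist above can be encoded in a few lines. This is a minimal sketch, not a prescribed tool: the class name `RefusalChecklist`, its field names, and the verdict strings are hypothetical, and the thresholds follow the rule stated in this piece (one condition is a yellow flag, two or more is a refusal).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefusalChecklist:
    """Answers to the four tests; True means the condition is present."""

    data_depth: bool           # the data test
    regulatory_boundary: bool  # the regulatory test
    ip_critical: bool          # the IP test
    tacit_eval: bool           # the eval test

    def layers_kept_in_house(self) -> list[str]:
        """Map each 'yes' answer to the layer the checklist says stays in-house."""
        mapping = [
            (self.data_depth, "data layer"),
            (self.regulatory_boundary, "policy layer"),
            (self.ip_critical, "moat layer"),
            (self.tacit_eval, "eval set"),
        ]
        return [layer for present, layer in mapping if present]

    def verdict(self) -> str:
        count = len(self.layers_kept_in_house())
        if count >= 2:
            return "red flag: refuse the outsource, run build-with-help"
        if count == 1:
            return "yellow flag: keep the flagged layer in-house"
        return "no refusal condition present"


checklist = RefusalChecklist(
    data_depth=True, regulatory_boundary=True, ip_critical=False, tacit_eval=False
)
print(checklist.verdict())
print(checklist.layers_kept_in_house())
```

Running it before an engagement is signed forces each of the four questions to get an explicit yes or no, which is the point of the checklist.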
Frequently asked questions
When should a founder refuse to outsource AI development?
When two or more of the following hold: the work depends on proprietary data the org cannot share at depth, sits on a regulatory boundary the org alone is accountable for, produces IP that is the org’s primary defensible asset, or relies on an eval set that encodes tacit domain judgment. Any one of them alone is a yellow flag.
Why is proprietary data a refusal condition?
Data depth is built through repetition over time. A contractor with three months of sample access captures roughly 200 edge cases; the in-house domain expert has internalized 4,000. The system the contractor ships handles the 200 well and fails on the 4,000.
What does a regulatory boundary mean here?
Regimes like HIPAA, GDPR, SOC 2, FedRAMP, banking regulations, and FDA requirements make the org accountable in ways that cannot be delegated. The work that produces the accountability should not be outsourced either, because the auditor’s questions land on a named person inside the org.
When is AI development IP-critical?
When the AI capability is what an acquirer pays for, what a competitor cannot replicate, and what investors underwrite. Outsourcing IP-critical work leaks pattern intuition to adjacent clients and lets IP sit in contractor environments the org does not fully control.
Why is the eval set untransferable?
Most of the judgment is tacit. Contractors generate credible-looking eval artifacts that score green on cases the org does not care about, because the cases that mattered were the ones the domain expert could not fully articulate.
Can NDAs and security reviews fix the proprietary data problem?
They reduce legal exposure but do not change the depth-of-intuition problem. The contractor with full data access still does not have the years of exposure that the in-house expert has.
How does this principle reconcile with the AI agency manifesto?
The AI agency manifesto describes what an agency should be when an agency is the right move. This piece names the conditions where it isn’t, regardless of agency quality.
What happens when founders ignore these conditions?
They end up with systems that pass user acceptance and fail in production along the proprietary-data, regulatory, IP-leak, or eval-judgment axis. Failures surface 6 to 18 months in; remediation costs roughly 1.5x the original build.
What does this principle imply for the build-vs-buy-vs-hire matrix?
It refines the first and second principles. Some capabilities are build-only because moat density is high enough that no other verb produces the right outcome. The four conditions are the canonical moat-density indicators.
Key takeaways
- Outsourcing fails on proprietary data because depth-of-intuition cannot be transferred inside a contract window.
- Regulatory accountability cannot be delegated; the work that produces accountability should not be either.
- IP-critical work outsourced to agencies leaks patterns to adjacent clients and produces flat acquisition multiples 18 months later.
- Eval sets that encode tacit judgment do not survive contractor authorship; they must be curated in-house.
- NDAs and security reviews fix legal exposure, not knowledge locality.
- When two or more conditions are present, run the build-with-help shape: in-house moat layer, agency-supplied rails.
Return to the AI build-vs-buy-vs-hire decision matrix manifesto, the anchor that frames this refusal pair.
Arthur Wandzel