Enterprise Software 17 min read

Build, Buy, or Fine-Tune? A Decision Frame for Foundation-Model Choices

The foundation-model question is not “build or buy”; it is “buy, fine-tune, or build,” and almost every enterprise gets the priors wrong on at least two of the three. Build (training a foundation model from scratch) is a $50M+ capital decision with a 24-to-36-month payback and is correct for fewer than 100 organizations on the planet in 2026; the rest are buying from a frontier closed-source provider. Fine-tune (small-model distillation or domain adaptation) is correct in two specific cases (token-cost reduction at scale, and domain expertise that frontier models cannot achieve through prompting) and is wrong almost everywhere else. Buy (frontier closed-source as default) is the right answer for 80 to 90 percent of enterprise foundation-model decisions in 2026, and the most common mistake is treating it as the boring answer instead of the correct one. This piece is the decision frame: the priors, the two questions that gate fine-tune, the rare conditions that gate build, and the trap of mistaking experimental curiosity for strategic justification.

This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s third principle is that foundation models are buy, permanently; this piece is the operational frame that names when fine-tuning is the right exception and when it is misapplied.

The default is buy. Most teams get this wrong.

In 2026 the right starting point on every foundation-model decision is to buy from a frontier closed-source provider. That is not a controversial position; it is the position the rest of the industry has converged on, with overwhelming evidence. GPT-5 and the leading frontier closed-source models are state of the art on most benchmarks that matter for enterprise workloads, ship quarterly improvements that compound, and price aggressively enough that the unit economics work for almost any production AI feature.

The default is buy, and the default is right.

The reason most teams get this wrong is not technical. It is psychological. Engineering teams have a cultural preference for building because building is more interesting than integrating an API. Engineering leaders have a strategic instinct that “we can’t depend on a vendor” because the dependency feels like a vulnerability. Founders have an investor pitch that needs “proprietary AI” to feel differentiated. All three pressures push the decision away from the correct default and toward fine-tune or build, where the engineering work is more visible and the differentiation feels more legible.

The pressures produce predictable mistakes. Teams fine-tune small open-source models when they should buy frontier closed-source. Teams build training pipelines when they should run a prompt-evaluation harness against the buy alternatives. Teams justify the work with reasoning that survives only as long as the comparison is not made; once the buy alternative is benchmarked seriously, the fine-tune or build justification typically collapses.

The decision frame in this piece is the structural counterweight to the psychological pressures. It names the priors, names the conditions under which the default is overridden, and names the patterns that look like override conditions but are the pressures in disguise.

The decision tree

The decision tree has three branches with explicit gates. The order matters: the default is buy, fine-tune is reached only when buy fails specific tests, and build is reached only when fine-tune fails specific tests. The cascade is asymmetric; most decisions terminate at the first branch.

Branch 1 (default): buy a frontier closed-source provider. Reached unless one of the two fine-tune gates fires.

Branch 2 (fine-tune): distillation to a smaller model for cost or domain. Reached when buy fails the cost-at-scale gate OR the prompting-ceiling gate. Both gates have specific quantitative criteria; intuition does not qualify.

Branch 3 (build): train a foundation model from scratch. Reached only when the data-volume gate AND the strategic-asset gate AND the capital-payback gate fire. Three gates, all of which must fire. Build is reached by fewer than 100 organizations in the world in 2026.

The structure is intentional: each branch has a higher bar than the prior one, and the bars are quantitative, not aspirational. Most teams get the wrong answer because they treat the gates as conversations rather than as criteria.
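The cascade can be written down as a few lines of code. The following is a minimal sketch, not a definitive implementation: the function name `choose_branch` and the `Branch` enum are hypothetical, and each gate is assumed to have already been evaluated to a boolean by the criteria described later in this piece.

```python
from enum import Enum


class Branch(Enum):
    BUY = "buy"
    FINE_TUNE = "fine-tune"
    BUILD = "build"


def choose_branch(
    cost_at_scale_gate: bool,
    prompting_ceiling_gate: bool,
    data_volume_gate: bool,
    strategic_asset_gate: bool,
    capital_payback_gate: bool,
) -> Branch:
    """Walk the cascade: buy by default, then fine-tune, then build."""
    # Branch 1: neither fine-tune gate fires, so the default holds.
    if not (cost_at_scale_gate or prompting_ceiling_gate):
        return Branch.BUY
    # Branch 3: the three build gates are joined by AND, not OR.
    if data_volume_gate and strategic_asset_gate and capital_payback_gate:
        return Branch.BUILD
    # Branch 2: a fine-tune gate fired and build did not qualify.
    return Branch.FINE_TUNE


# A high-volume workload with no strategic-asset case lands on fine-tune.
print(choose_branch(True, False, False, False, True).value)
```

Note the asymmetry in the sketch: the fine-tune gates are joined by OR, the build gates by AND, and a workload that fires no fine-tune gate never reaches the build check at all.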

Buy: the frontier closed-source default

The buy branch is the right answer for 80 to 90 percent of enterprise foundation-model decisions. The reasoning is straightforward.

First, frontier models are state of the art. GPT-5, Claude Opus 4.7, Gemini 3, and the equivalent tier from leading frontier closed-source providers outperform most open-source models on most benchmarks that matter for enterprise reasoning, extraction, agentic workflows, and most domain-specific tasks. The performance gap to the best open-source models is typically 6 to 18 months and is widening on the frontier-of-frontier capabilities.

Second, the unit economics work. Token prices have fallen 60 to 80 percent since 2024 and continue to fall. The cost per useful task (not per token) is now competitive with self-hosted small-model inference for the vast majority of workloads. Self-hosting only wins on cost for workloads with extreme scale (millions of calls per day with low intelligence requirements) or specific compliance constraints.

Third, the operational burden is shifted. The buy decision means the foundation-model layer (model improvements, safety evals, infrastructure scaling, regulatory compliance) is the provider’s problem, not the org’s. The org’s engineering capacity is freed for the moat work that compounds (per the AI plumbing-vs-moat piece).

The buy default is overridden only by the two fine-tune gates. If neither gate fires, the default holds and the decision is closed.

Fine-tune case 1: token-cost distillation

The first fine-tune gate is cost at scale. The gate fires when both conditions are true: the workload runs at scale (typically 10 million calls per day or more, with stable intelligence requirements), and the buy alternative’s token cost is the binding constraint on the unit economics.

The classic distillation case is a high-volume, narrow-scope workload (classification, extraction, summarization at scale) where a frontier model is overkill. The frontier model can do the task perfectly; it is also charging frontier prices for it. A distilled small model, fine-tuned from frontier-model outputs as the training signal, can match the frontier model’s quality on the specific task at 1/10 to 1/50 the inference cost.

The distillation case is real and is the strongest case for fine-tuning in 2026. The detail on the case is in the AI project distillation case: when a smaller fine-tune beats a bigger model.

But the gate has specific criteria, and the criteria matter:

  • Volume threshold: 10 million calls per day or higher. Below this volume, the distillation engineering cost (model training, eval pipeline, deployment infrastructure, ongoing eval drift monitoring) typically exceeds the token savings.
  • Stability threshold: the intelligence requirements must be stable. If the task definition changes meaningfully every quarter, the distilled model goes stale faster than the savings accumulate.
  • Quality bar: the distilled model must score at least 95 percent of the frontier model’s score on the org’s quality eval. Below 95 percent, the cost savings are eaten by quality regressions that the buy alternative would not have produced.

When all three are true, fine-tune wins. When any is false, fine-tune is the wrong answer and the buy default should hold even at scale.
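The criteria above are concrete enough to check mechanically. The sketch below is one possible encoding, under stated assumptions: the class and field names are hypothetical, the two thresholds are the ones quoted in this section, and "stable" and "binding" are treated as judgments a reviewer records as booleans rather than something the code measures.

```python
from dataclasses import dataclass

MIN_CALLS_PER_DAY = 10_000_000  # volume threshold from the text
MIN_QUALITY_RATIO = 0.95        # distilled score relative to frontier score


@dataclass
class CostGateInputs:
    calls_per_day: int
    token_cost_is_binding: bool  # token spend is the constraint on unit economics
    task_is_stable: bool         # task definition does not shift quarter to quarter
    distilled_score: float       # distilled model on the org's own eval
    frontier_score: float        # frontier model on the same eval


def cost_at_scale_gate_fires(x: CostGateInputs) -> bool:
    """Return True only when every cost-at-scale criterion holds."""
    if x.frontier_score <= 0:
        raise ValueError("frontier_score must be positive")
    quality_ratio = x.distilled_score / x.frontier_score
    return (
        x.calls_per_day >= MIN_CALLS_PER_DAY
        and x.token_cost_is_binding
        and x.task_is_stable
        and quality_ratio >= MIN_QUALITY_RATIO
    )


# Hypothetical workload: enough volume, but the distilled model misses the quality bar.
workload = CostGateInputs(25_000_000, True, True, distilled_score=0.85, frontier_score=0.92)
print(cost_at_scale_gate_fires(workload))
```

The example workload clears the volume threshold by a wide margin and still fails the gate, which is the point of the quality bar: scale alone does not justify the fine-tune.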

Fine-tune case 2: domain expertise prompting cannot reach

The second fine-tune gate is domain expertise. The gate fires when the task requires domain-specific behavior that prompting cannot reliably elicit from frontier models, even with extensive few-shot examples and well-crafted system prompts.

The classic domain case is a heavily specialized field (legal contract drafting in a specific jurisdiction, medical diagnosis support with specific clinical guidelines, financial analysis with specific regulatory frameworks) where the frontier model has surface-level competence but does not reliably produce the depth of reasoning the work requires. Prompting can carry the model further than most teams realize, but there is a ceiling, and some domains have a ceiling that prompting cannot reach.

The gate has specific criteria:

  • Prompting exhaustion: the team has run a serious prompting effort (typically 20 to 40 hours of structured prompt engineering with eval feedback) and hit a quality ceiling that frontier models cannot exceed.
  • Domain data availability: the org has 10,000+ high-quality domain-specific examples available for fine-tuning. Below this volume, fine-tuning typically cannot produce reliable improvement over a well-prompted frontier model.
  • Domain stability: the domain knowledge is stable enough that a fine-tuned model will not need quarterly retraining. If the domain evolves rapidly (e.g., regulatory landscape that changes monthly), the fine-tune treadmill can exceed the buy alternative’s ongoing cost.

When all three are true, fine-tune wins. The most common failure mode is failing on the first criterion: the team did not run a serious prompting effort and is fine-tuning to fix problems that prompting would have solved.
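Because the usual failure is a skipped criterion rather than a missed threshold, it can help to have the check report which criteria are unmet instead of returning a bare yes or no. A minimal sketch, assuming hypothetical parameter names and using the lower bound of the 20-to-40-hour range as the floor:

```python
MIN_PROMPTING_HOURS = 20      # lower bound of the 20-to-40-hour range in the text
MIN_DOMAIN_EXAMPLES = 10_000  # high-quality domain examples from the text


def prompting_ceiling_gate_failures(
    prompting_hours: float,
    quality_plateaued: bool,
    domain_examples: int,
    domain_is_stable: bool,
) -> list[str]:
    """List the unmet criteria; an empty list means the gate fires."""
    failures = []
    if prompting_hours < MIN_PROMPTING_HOURS or not quality_plateaued:
        failures.append("prompting exhaustion")
    if domain_examples < MIN_DOMAIN_EXAMPLES:
        failures.append("domain data availability")
    if not domain_is_stable:
        failures.append("domain stability")
    return failures


# Hypothetical team: plenty of data, but only six hours of prompting so far.
print(prompting_ceiling_gate_failures(6, False, 40_000, True))
```

The returned names can be pasted straight into a decision memo, so the reason a proposal stalled is recorded alongside the decision.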

When fine-tune is wrong but feels right

Fine-tuning has a gravitational pull because the engineering work is interesting and the differentiation feels legible. Two patterns produce wrong fine-tunes that feel right.

Pattern 1: the “we want our own model” pattern. A team or executive decides that the org needs its own model for strategic reasons: independence from vendors, “AI moat,” investor pitch differentiation. The fine-tune effort starts as a strategy decision, not as a workload-driven decision. The output is a small fine-tuned model that doesn’t beat the frontier alternative on quality, costs more in engineering capacity than it saves in tokens, and provides “moat” that is at most cosmetic. The pattern is wrong because the strategic reasoning never crossed either gate; it survived only as a narrative.

Pattern 2: the “we have data, let’s use it” pattern. A team has a large internal dataset and decides to fine-tune because the data exists. The fine-tune produces a model that is mediocre on the task because the data is large but not high-quality, the eval set is undersized, and the frontier model with retrieval would have produced better results with one-tenth the engineering effort. The pattern is wrong because data availability is necessary but not sufficient; the prompting-exhaustion criterion was skipped.

The diagnostic question for both patterns is: has the team produced a written comparison of the proposed fine-tune to a well-prompted frontier model on the same eval, with cost-per-useful-task and quality-percentile numbers? If not, the fine-tune decision has not been made; an aspiration has been mistaken for a decision.

Build: the rare correct answer

Building a foundation model from scratch is a $50M+ capital decision with a 24-to-36-month payback. Fewer than 100 organizations in the world have it as the correct answer in 2026: frontier labs, hyperscalers, a few governments, and a small handful of vertically-specialized incumbents with domain advantages too large to fine-tune.

All three gates must fire:

  • Data-volume gate: the org has access to a proprietary corpus that is at least 100 billion tokens of domain-relevant data, of high quality, that frontier providers cannot access. Below this scale, the build cannot produce a model that meaningfully exceeds a fine-tune of an open-source base.
  • Strategic-asset gate: the model itself is the org’s strategic asset, not just an input to a strategic asset. This is true for OpenAI (they sell models), Anthropic (they sell models), and a few enterprise platforms whose product is the model. It is not true for almost any company whose product is a workflow that uses AI.
  • Capital-payback gate: the org has the capital to spend $50M to $300M on training and the patience to wait 24 to 36 months for payback. Both are rare.

If all three fire, build is the right answer and the decision is structural. If any fails, build is the wrong answer and the decision should fall back to fine-tune or buy. Most organizations that consider build fail on the strategic-asset gate; the model is an input, not the asset. Some fail on data-volume even when they think they have it; the threshold is steep.

How to enforce the decision

The decision frame survives only if the org has a forcing function that makes the gates explicit. Three practices.

Practice 1: the foundation-model decision memo. Every foundation-model decision is preceded by a written memo that names the workload, the candidate verbs (buy, fine-tune, build), the gates, and the evidence for whether each gate fires or doesn’t. The memo is reviewed by senior engineering leadership before any engineering work starts. The memo is the artifact that produces the decision; without the memo, the decision is made by drift.

Practice 2: the head-to-head benchmark. Every fine-tune or build proposal includes a documented head-to-head benchmark against the buy alternative on the org’s own eval set, with cost-per-useful-task and quality-percentile numbers. The benchmark is run before the engineering work, not after. If the benchmark shows the buy alternative wins, the fine-tune or build proposal does not proceed.

Practice 3: the quarterly re-litigation per the matrix’s seventh principle. Every fine-tune in production is re-litigated quarterly against the buy alternative. The buy alternative is improving every quarter; the fine-tune is not. The re-litigation surfaces the moment when the fine-tune’s advantage has been eroded by the buy alternative’s progression, typically 4 to 8 quarters into the fine-tune’s lifetime. The detail on the quarterly cadence is in why AI build-vs-buy decisions made in 2024 should be re-litigated this quarter.
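The head-to-head benchmark in Practice 2 reduces to two numbers per candidate. The sketch below shows one way to compute them for a cost-driven proposal. Everything named here is an assumption made for illustration: "cost per useful task" is taken to mean total inference spend divided by the number of eval tasks that met the quality bar, and the rule that the candidate must hold 95 percent of the buy alternative's pass rate is borrowed from the distillation quality bar rather than stated for the benchmark itself. A domain-driven proposal would compare quality first.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    tasks_attempted: int
    tasks_passed: int      # eval tasks that met the org's quality bar
    total_cost_usd: float  # inference spend for the whole eval run

    @property
    def pass_rate(self) -> float:
        return self.tasks_passed / self.tasks_attempted

    @property
    def cost_per_useful_task(self) -> float:
        # Spend is divided by passed tasks only; failed tasks still cost money.
        if self.tasks_passed == 0:
            return float("inf")
        return self.total_cost_usd / self.tasks_passed


def proposal_proceeds(
    buy: BenchmarkResult,
    candidate: BenchmarkResult,
    min_quality_ratio: float = 0.95,
) -> bool:
    """Candidate proceeds only if it holds quality and is cheaper per useful task."""
    holds_quality = candidate.pass_rate >= min_quality_ratio * buy.pass_rate
    is_cheaper = candidate.cost_per_useful_task < buy.cost_per_useful_task
    return holds_quality and is_cheaper


# Hypothetical numbers for a 200-item eval set.
frontier = BenchmarkResult("frontier-api", 200, 180, total_cost_usd=40.0)
distilled = BenchmarkResult("distilled-small", 200, 175, total_cost_usd=4.0)
print(proposal_proceeds(frontier, distilled))
```

Re-running the same comparison each quarter with the current buy alternative gives the re-litigation in Practice 3 a fixed, comparable artifact.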

Frequently asked questions

What about open-source frontier models like Llama or Mistral?

Open-source frontier models are the buy alternative for organizations with specific data residency, customization, or cost-at-scale requirements. They are not a separate branch on the decision tree; they are a buy variant. The same gate logic applies: default is the best frontier model (closed or open), fine-tune is reached when buy fails the cost or domain gate, and build is reached only when all three build gates fire.

Doesn’t fine-tuning give us “moat”?

Almost never. The moat in 2026 lives in proprietary data, eval discipline, integration depth, and user interaction patterns, not in the model weights. A fine-tuned model is a piece of infrastructure that erodes against the buy alternative every quarter. The detail on where moat lives is in the AI moat audit.

What about regulated industries that can’t send data to frontier providers?

Regulated industries with strict data residency or sovereignty requirements often need on-prem or VPC-isolated inference. That is a deployment constraint, not a decision-tree branch. The same gates apply, but the buy alternative becomes “frontier model deployed via VPC partner” or “open-source frontier model self-hosted” rather than “frontier model via public API.” The fine-tune or build branches are reached by the same gate criteria.

How often should we revisit a fine-tune decision?

Quarterly. The buy alternative is improving every quarter; the fine-tune is fixed. Most production fine-tunes have a 4-to-8-quarter useful life before the buy alternative has improved enough that the fine-tune is no longer winning. The quarterly re-litigation surfaces the transition.

What’s the cost of getting the decision wrong?

A wrong fine-tune (should have bought) costs 2 to 5 engineer-quarters in capacity that produced nothing the buy alternative wouldn’t have, plus ongoing eval-drift monitoring. A wrong buy (should have fine-tuned) costs 30 to 50 percent unit-economics premium on a high-volume workload and is catchable on the next quarterly review. A wrong build (should have fine-tuned or bought) costs $50M+ and 24+ months. The asymmetry is intentional; the higher branches have higher cost of error.

How do we evaluate whether prompting is “exhausted” before fine-tuning?

The prompting-exhaustion criterion requires structured evidence: at least 20 to 40 hours of structured prompt engineering by a senior practitioner, an eval set with 200+ items, documented experiments with system prompts, few-shot examples, chain-of-thought variants, and retrieval-augmented variants, and a quality-vs-prompt-iteration curve that has clearly plateaued. Without that evidence, the prompting effort was not serious and the gate has not fired.
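The "clearly plateaued" part of that evidence can be checked from the eval scores logged per prompt iteration. The following is a rough sketch under stated assumptions: the function names, the three-iteration window, and the one-point minimum gain are all hypothetical choices, not thresholds from this piece; only the 20-hour and 200-item floors come from the text.

```python
def has_plateaued(scores: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """True when the last `window` iterations beat the earlier best by less than `min_gain`."""
    if len(scores) <= window:
        return False  # not enough history to call a plateau
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < min_gain


def prompting_exhausted(hours: float, eval_items: int, scores: list[float]) -> bool:
    """Combine the effort, eval-size, and plateau evidence into one check."""
    return hours >= 20 and eval_items >= 200 and has_plateaued(scores)


# Hypothetical eval scores after each documented prompt iteration.
curve = [0.60, 0.68, 0.74, 0.78, 0.785, 0.78, 0.787]
print(prompting_exhausted(hours=32, eval_items=250, scores=curve))
```

A curve that is still climbing fails the check even with 40 hours logged, which keeps the gate tied to evidence rather than to time spent.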

What if the foundation-model decision crosses several teams?

The decision is centralized. A single foundation-model decision per workload, owned by a senior engineering decision-maker, with the memo and benchmark practice. Distributed foundation-model decisions tend to drift toward fine-tune (because the team that proposes it is the team that benefits from doing it) and need a centralized counterweight.

How do we handle agencies that propose fine-tuning?

Apply the same gates. An agency proposing a fine-tune is subject to the same memo and benchmark requirement as an in-house team. If the agency cannot produce the head-to-head benchmark against the buy alternative, the proposal is incomplete. The detail on agency proposals is in the AI hybrid playbook.

Are there workloads where fine-tune is obviously right without running through the gates?

Two: real-time speech-to-text on commodity hardware (the latency and cost requirements force a small specialized model), and embedding generation at extreme scale (the per-token economics force distillation). For both, the gates fire by default given the workload shape. Outside those, the gates need to be run explicitly.

What about hybrid: a fine-tune plus a frontier model in the same system?

This is the dominant production pattern in 2026: frontier model for the high-value or low-volume calls, fine-tuned small model for the high-volume narrow-scope calls, routed by the model-routing config (per the AI hybrid playbook). The hybrid is not a fourth branch; it is the result of running the decision tree per workload and getting different answers for different parts of the system.
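As an illustration only, a routing config of this kind can be as small as a lookup table with a default. The workload names and model identifiers below are hypothetical placeholders, not names from the AI hybrid playbook or from any provider.

```python
# Workloads that passed a fine-tune gate map to their distilled model.
# Every name here is a hypothetical placeholder.
FINE_TUNED_ROUTES = {
    "ticket-classification": "distilled-classifier-v1",
    "invoice-extraction": "distilled-extractor-v1",
}

# Anything not listed falls through to the buy default.
DEFAULT_MODEL = "frontier-api"


def route(workload: str) -> str:
    """Return the model that serves a workload, defaulting to the frontier model."""
    return FINE_TUNED_ROUTES.get(workload, DEFAULT_MODEL)


for name in ("ticket-classification", "contract-review"):
    print(name, "->", route(name))
```

Making the frontier model the fall-through mirrors the decision tree: a workload has to earn its entry in the fine-tuned table by passing a gate, and removing the entry after a quarterly review returns it to buy.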

Key takeaways

The foundation-model decision in 2026 has three branches with explicit gates: buy (default), fine-tune (cost at scale or domain expertise prompting cannot reach), build (rare, three gates). Most teams get the priors wrong by treating fine-tune as the default and reaching it through aspirational reasoning rather than gate-based criteria.

Buy is the right answer for 80 to 90 percent of decisions. The frontier closed-source providers are state of the art, the unit economics work, and the operational burden is shifted to the provider. The default is overridden only when one of two specific quantitative gates fires.

Fine-tune wins on cost-at-scale (10M+ calls per day with stable intelligence requirements) or on domain expertise (after prompting is exhausted with documented effort and 10,000+ high-quality examples are available). Outside those gates, fine-tune is wrong even when it feels right; the “we want our own model” and “we have data” patterns are the most common wrong fine-tunes.

Build is for fewer than 100 organizations in the world in 2026. Three gates, all must fire: 100B+ tokens of proprietary data, model-as-strategic-asset, $50M+ capital with 24-to-36-month payback patience. Most organizations considering build fail on the strategic-asset gate.

Decision enforcement is three practices: the decision memo, the head-to-head benchmark, the quarterly re-litigation. Without the practices, the decision drifts from buy toward fine-tune and from fine-tune toward build under cultural and psychological pressures. The frame is the structural counterweight; the practices are how the frame survives.

Last Updated: Jun 16, 2026

Arthur Wandzel
