The AI project TCO comparison: in-house team vs AI agency vs hybrid (with worked numbers)

Most TCO comparisons between in-house AI teams and AI agencies stop at salary versus invoice and call it a day. That comparison is wrong by roughly 40 percent because it omits the load-bearing 2026 line items: eval engineering, model-upgrade re-evaluation, observability, and the recruiting cost of building an AI bench from scratch. This piece walks the full 24-month TCO for the same project under three delivery models (in-house, AI agency, and hybrid), with the numbers behind every line. The model is not academic. It is the spreadsheet we put in front of CFOs before they sign.

The comparison sits inside the AI project economics manifesto: if evaluation is the unit of account, then a TCO comparison without eval cost is a TCO comparison of imaginary projects.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

The reference project
The full line-item TCO model
Scenario A: in-house team
Scenario B: AI agency
Scenario C: hybrid
Side-by-side 24-month totals
What moves the answer
Frequently asked questions
Key takeaways

The reference project

To compare delivery models honestly we need a single project specification that all three execute against. The reference project below is a composite of typical mid-market enterprise AI engagements we have priced in 2025–2026.

Workload. A B2B SaaS company is building an AI agent that drafts customer support replies, escalates according to policy, and writes structured follow-up tasks into the CRM. Volume scales to 80,000 actions per month by month 12.

Eval bar. 87 percent weighted score on a 350-prompt eval set, with cost per action at or below $0.04 and P95 latency under 4 seconds.
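The eval bar reads naturally as a three-part gate. The sketch below is one way to express it, not the harness used on any engagement. The names `EvalBar`, `weighted_score`, `p95`, and `meets_bar` are hypothetical, the per-prompt weights are an assumption (the bar only says "weighted"), and the nearest-rank percentile is one of several reasonable P95 definitions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalBar:
    """Thresholds from the reference project's eval bar."""
    min_weighted_score: float = 0.87   # 87 percent weighted score
    max_cost_per_action: float = 0.04  # USD, at or below
    max_latency_p95_s: float = 4.0     # seconds, strictly under


def weighted_score(results):
    """results: (weight, score in [0, 1]) pairs, one per eval prompt (weights assumed)."""
    total_weight = sum(weight for weight, _ in results)
    return sum(weight * score for weight, score in results) / total_weight


def p95(latencies_s):
    """Nearest-rank 95th percentile of observed latencies."""
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def meets_bar(results, cost_per_action, latencies_s, bar=EvalBar()):
    """True only when all three thresholds hold at once."""
    return (
        weighted_score(results) >= bar.min_weighted_score
        and cost_per_action <= bar.max_cost_per_action
        and p95(latencies_s) < bar.max_latency_p95_s
    )


# Hypothetical run: 310 of 350 equally weighted prompts pass.
run = [(1.0, 1.0)] * 310 + [(1.0, 0.0)] * 40
latencies = [2.0] * 340 + [5.0] * 10
print(meets_bar(run, cost_per_action=0.04, latencies_s=latencies))
```

Treating the bar as a single boolean keeps the threshold unambiguous: a run that clears the score but misses cost or latency does not count as deployable.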

Timeline. 24 months total: 6 months build to first deployable threshold; 18 months operating, including three model-upgrade cycles and quarterly eval-set expansions.

Scope. Full ownership of orchestration, evals, observability, prompt registry, and tool integrations. Inference is rented from frontier vendors. Observability platform is rented (Langfuse, Arize, or equivalent).

This is the project. Now we walk the three delivery models against it.

The full line-item TCO model

Every honest AI TCO model carries seven cost categories. A model with fewer than seven is mispriced. We borrow this decomposition from the 7 cost lines most CFOs miss piece, with one addition for the make-or-buy decision (recruiting cost on the in-house line).

Engineering build cost: the load-bearing direct cost of getting to first deployable threshold.
Eval engineering cost: test set construction, harness build, regression triage, threshold-locking. Typically 30 to 40 percent of build cost.
Model-upgrade re-evaluation cost: two to four engineering weeks per upgrade, three to five upgrades per year.
Observability cost: platform license, retention, dashboard build. Sized at 15 to 25 percent of inference spend plus a one-time build line.
Inference cost: pass-through. The same number across delivery models if the workload is the same.
Maintenance retainer / steady-state engineering: ongoing capacity for regression triage, prompt iteration, eval expansion.
Recruiting and onboarding cost (in-house only): the load-bearing hidden line for the in-house path.

We will price each line in each scenario in 2026 dollars.
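A quick way to apply the seven-line rule to a budget is to check which lines it leaves out. The sketch below is illustrative only: the identifiers in `SEVEN_LINES` and the helper `missing_lines` are hypothetical labels for the categories listed above, not part of any published template.

```python
# The seven cost lines named above; identifiers are illustrative labels.
SEVEN_LINES = (
    "engineering_build",
    "eval_engineering",
    "model_upgrade_reeval",
    "observability",
    "inference",
    "maintenance",
    "recruiting_onboarding",  # applies to the in-house path only
)


def missing_lines(budget: dict[str, float], in_house: bool) -> list[str]:
    """Return the cost lines a budget omits, in the order listed above."""
    required = [
        line for line in SEVEN_LINES
        if in_house or line != "recruiting_onboarding"
    ]
    return [line for line in required if line not in budget]


# A salary-versus-invoice comparison prices only the build line (in $K).
salary_only = {"engineering_build": 520.0}
print(missing_lines(salary_only, in_house=True))
```

A budget that bundles a line into another (as the agency scenario does with eval engineering) should still list it, with the bundling stated, so the omission check stays meaningful.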

Scenario A: in-house team

The in-house team scenario assumes the company hires three AI engineers and one ML platform engineer to deliver the project, with shared support from existing data and infra teams.

Engineering build cost. Four FTEs at a fully-loaded $260K average for 6 months: $520K. This is direct headcount through first deployable threshold.

Eval engineering cost. Internal teams pay this cost in calendar time rather than separate billable hours, but it is real. Empirically the eval lift consumes roughly 35 percent of the same FTE capacity over the build window: $182K of attributed cost. Test sets, harness, regression workflows, prompt-evaluation loops.

Model-upgrade re-evaluation cost. Three upgrade cycles in months 7–24 at three engineering weeks each. At a marginal cost of roughly $15K per engineering-week (loaded), that is $135K across the operating period.

Observability cost. Platform license at $36K per year for 18 months ($54K) plus a one-time $40K build-out for dashboards and alerting: $94K.

Inference cost. Workload at 80K actions per month average, ramped, with a cost-per-action of $0.04: $33.6K per month at steady state. Across the operating period (months 7–24, with ramp), $432K cumulative.

Maintenance / steady-state engineering. Two of the four FTEs persist as the steady-state team for months 7–24. Eighteen months at $260K loaded, two FTEs: $780K.

Recruiting and onboarding. The hidden line. Hiring four AI engineers in 2026 takes 6 to 9 months of recruiting calendar time per hire, with a 25 to 30 percent agency placement fee on each successful hire. Loaded recruiting cost across the four hires comes to roughly $120K in total. Onboarding ramp (3 months at 50 percent productivity per engineer): an effective cost of $130K. Total: $250K.

24-month in-house TCO: $2,393K (≈ $2.4M).
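The Scenario A arithmetic can be reproduced in a few lines. This is a minimal sketch using the figures stated above, in thousands of 2026 dollars. The variable names are ours, and the inference, recruiting, and onboarding figures are taken as stated rather than derived.

```python
# All figures in thousands of 2026 US dollars, taken from Scenario A above.
LOADED_FTE_PER_YEAR = 260
BUILD_MONTHS, OPERATE_MONTHS = 6, 18

build = 4 * LOADED_FTE_PER_YEAR * BUILD_MONTHS / 12         # four FTEs, six months
eval_engineering = 0.35 * build                              # 35 percent of build capacity
reeval = 3 * 3 * 15                                          # 3 cycles x 3 weeks x $15K per week
observability = 36 * OPERATE_MONTHS / 12 + 40                # license for 18 months + build-out
inference = 432                                              # cumulative figure as stated
maintenance = 2 * LOADED_FTE_PER_YEAR * OPERATE_MONTHS / 12  # two FTEs, 18 months
recruiting = 120 + 130                                       # recruiting + onboarding ramp

in_house_tco = (build + eval_engineering + reeval + observability
                + inference + maintenance + recruiting)
print(f"24-month in-house TCO: ${in_house_tco:,.0f}K")
```

Keeping each line as its own variable makes it easy to swap in your own loaded cost or headcount and see which line moves the total.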

The in-house line is dominated by the cost of building and operating the bench. Most internal teams under-estimate the recruiting and onboarding line by roughly half. The number above is conservative. We discuss the structural reasons in the build-vs-outsource piece.

Scenario B: AI agency

The agency scenario assumes the company hires a competent forward-deployed AI agency on an eval-threshold engagement model with a 30 percent threshold-tied holdback.

Engineering build cost. Six-month engagement at $80K per month for a fractional-team-of-three with senior tenure: $480K. Slightly under in-house FTE-equivalent cost because the agency does not carry recruiting drag and operates pre-formed teams.

Eval engineering cost. Built into the engagement at the 30 to 40 percent ratio the manifesto names. Inside the $480K. Not a separate line.

Model-upgrade re-evaluation cost. Covered under a $12K per-month maintenance retainer that includes the planning horizon of three upgrades per year. Months 7–24: $216K.

Observability cost. Same $94K; observability platform cost is delivery-model-invariant. Build-out is bundled into the engagement; the license cost is the same.

Inference cost. Same $432K. Pass-through is structural.

Maintenance / steady-state engineering. $12K per month retainer covers the standard 18-month operating envelope. Already counted under model-upgrade above.

Recruiting and onboarding. Zero. The agency carries the bench cost.

24-month agency TCO: $1,222K (≈ $1.2M).
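The same arithmetic for Scenario B, again as a sketch in thousands of 2026 dollars. One assumption is ours and labeled in the code: that the 30 percent holdback applies to the build fee. The scenario does not say which base it is computed on.

```python
# All figures in thousands of 2026 US dollars, taken from Scenario B above.
BUILD_MONTHS, OPERATE_MONTHS = 6, 18

build = 80 * BUILD_MONTHS        # $80K per month, eval engineering bundled in
retainer = 12 * OPERATE_MONTHS   # $12K per month, covers re-eval and steady state
observability = 94               # same as in-house
inference = 432                  # pass-through, same as in-house

agency_tco = build + retainer + observability + inference

IN_HOUSE_TCO = 2393              # Scenario A total as stated
gap = IN_HOUSE_TCO - agency_tco

# Assumption: the 30 percent threshold-tied holdback is taken on the build fee.
holdback_at_risk = 0.30 * build

print(f"24-month agency TCO: ${agency_tco:,}K")
print(f"Gap to in-house: ${gap:,}K")
print(f"Build fee held until threshold: ${holdback_at_risk:,.0f}K")
```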

The agency line is materially lower than in-house, by roughly $1.17M over 24 months. Two structural reasons: agencies amortize recruiting cost across many engagements, and agencies have eval and observability practice that internal teams build from scratch. We discuss the gating risk on the agency line in the AI agency tax piece; the 30 percent coordination tax is the failure mode that erodes this advantage when the agency is not run on a 2026 operating model.

Scenario C: hybrid

The hybrid scenario assumes the company hires the agency for the build phase (months 1–6) and transitions ownership to a smaller in-house team (two FTEs) for months 7–24, with the agency on a smaller maintenance retainer for re-eval cycles.

Engineering build cost. Same $480K agency engagement for months 1–6.

Eval engineering cost. Inside the $480K agency engagement.

Model-upgrade re-evaluation cost. Smaller agency retainer at $5K per month covering re-eval planning and regression-triage support for the in-house team. Months 7–24: $90K.

Observability cost. Same $94K.

Inference cost. Same $432K.

Maintenance / steady-state engineering. Two in-house FTEs at $260K loaded for months 7–24: $780K.

Recruiting and onboarding. Two AI hires (lower bench than scenario A): roughly $125K total recruiting plus onboarding ramp.

24-month hybrid TCO: $2,001K (≈ $2.0M).
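A short sketch of the Scenario C arithmetic, in thousands of 2026 dollars, plus one derived number of our own: where the hybrid total lands between the agency and in-house totals. The `position` calculation is an illustration added here, not a figure from the model.

```python
# All figures in thousands of 2026 US dollars, taken from Scenario C above.
build = 80 * 6                     # agency engagement, months 1-6
reeval_retainer = 5 * 18           # smaller agency retainer, months 7-24
observability = 94
inference = 432
maintenance = 2 * 260 * 18 / 12    # two in-house FTEs, months 7-24
recruiting = 125                   # two hires, recruiting plus onboarding

hybrid_tco = (build + reeval_retainer + observability
              + inference + maintenance + recruiting)

AGENCY_TCO, IN_HOUSE_TCO = 1222, 2393   # totals as stated in Scenarios B and A
# 0.0 would equal the agency total, 1.0 the in-house total.
position = (hybrid_tco - AGENCY_TCO) / (IN_HOUSE_TCO - AGENCY_TCO)

print(f"24-month hybrid TCO: ${hybrid_tco:,.0f}K")
print(f"Position between agency and in-house: {position:.2f}")
```

On these inputs the hybrid sits about two-thirds of the way from the agency total toward the in-house total, because the steady-state line is the same $780K the in-house scenario pays.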

The hybrid line sits between in-house and agency. The build phase captures the agency’s bench advantage. The operating phase captures the in-house team’s lower marginal cost on steady-state work. The trade-off is integration risk at the handoff; see the AI agency manifesto on what an honest handoff looks like.

Side-by-side 24-month totals

Line	In-house	Agency	Hybrid
Engineering build	$520K	$480K	$480K
Eval engineering	$182K	(incl.)	(incl.)
Model-upgrade re-eval	$135K	$216K	$90K
Observability	$94K	$94K	$94K
Inference (pass-through)	$432K	$432K	$432K
Maintenance / steady-state	$780K	(incl.)	$780K
Recruiting and onboarding	$250K	$0	$125K
24-month TCO	$2,393K	$1,222K	$2,001K

The agency line is roughly half the in-house line. Most CFOs encountering this number for the first time push back on it because their mental model is “agencies are 1.5x to 2x in-house.” That mental model was correct for 2018 software because the recruiting line was a few months of effort and the eval and observability lines did not exist. In 2026 it is wrong.

The two numbers in the table that move the most across engagements are recruiting and steady-state engineering. If the in-house team is built in one of the AI hub cities (San Francisco, New York, or comparable), the recruiting line can run twice the figure above. If the company has an existing AI bench, the recruiting line goes to zero and the in-house number compresses by $250K.
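The table and the recruiting sensitivity can be checked together. The sketch below encodes the table rows in thousands of dollars and rescales only the recruiting line. The dictionary keys and the `tco` helper are hypothetical names, and cells marked "(incl.)" are entered as 0 because their cost is bundled into another line.

```python
# Rows of the side-by-side table, in thousands of USD.
# Cells marked "(incl.)" are entered as 0 because they are bundled elsewhere.
TABLE = {
    "in_house": {"build": 520, "eval": 182, "reeval": 135, "observability": 94,
                 "inference": 432, "maintenance": 780, "recruiting": 250},
    "agency":   {"build": 480, "eval": 0, "reeval": 216, "observability": 94,
                 "inference": 432, "maintenance": 0, "recruiting": 0},
    "hybrid":   {"build": 480, "eval": 0, "reeval": 90, "observability": 94,
                 "inference": 432, "maintenance": 780, "recruiting": 125},
}


def tco(lines, recruiting_multiplier=1.0):
    """Sum a scenario's lines, scaling only the recruiting line."""
    return sum(
        value * recruiting_multiplier if name == "recruiting" else value
        for name, value in lines.items()
    )


base = {name: tco(lines) for name, lines in TABLE.items()}
hub_city = tco(TABLE["in_house"], recruiting_multiplier=2.0)        # recruiting doubles
existing_bench = tco(TABLE["in_house"], recruiting_multiplier=0.0)  # recruiting drops out

for name, total in base.items():
    print(f"{name:>9}: ${total:,.0f}K")
print(f"in-house, hub city:       ${hub_city:,.0f}K")
print(f"in-house, existing bench: ${existing_bench:,.0f}K")
print(f"agency / in-house ratio:  {base['agency'] / base['in_house']:.2f}")
```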

What moves the answer

Five sensitivities materially change which delivery model wins.

Existing AI bench. If the company already has 4+ AI engineers and an eval practice, the recruiting line drops out of the in-house scenario and in-house becomes competitive. We see this with FAANG-graduate teams. Most other organizations do not have this bench.

Project duration. At 36 months instead of 24, in-house catches agency because the build cost is amortized over more steady-state. At 12 months, agency is unambiguously cheaper because the build cost dominates.

Workload volume. At 800K actions per month instead of 80K, inference dominates and delivery model becomes nearly irrelevant. At 8K actions per month, the build cost dominates and agency wins by a wider margin.

Eval-engineering maturity. A team with a mature eval practice from a prior project compresses the eval line by 30 to 50 percent. A team starting from zero on evals expands it by the same amount. The agency advantage is structurally tied to portable eval practice.

Model-upgrade pace. If frontier model upgrades slow to 2 per year, the re-eval line drops 40 percent across all scenarios. If they accelerate to 6 per year, it doubles. The pace in 2026 sits at the high end of the 3-to-5 range. We discuss the operational mechanics in the AI project budget anti-patterns piece.
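To make the eval-engineering maturity sensitivity concrete, the sketch below applies the 30 to 50 percent swing to the in-house eval line of $182K. The helper `adjusted_eval_line` is a hypothetical name, and applying the swing to the in-house figure specifically is our choice for illustration.

```python
EVAL_LINE_K = 182  # in-house eval engineering line, thousands of USD


def adjusted_eval_line(percent_change: int, base_k: float = EVAL_LINE_K) -> float:
    """Scale the eval line by a signed percentage (negative = mature practice)."""
    return base_k * (100 + percent_change) / 100


# Mature eval practice compresses the line by 30 to 50 percent.
mature_range = (adjusted_eval_line(-50), adjusted_eval_line(-30))
# A team starting from zero expands it by the same amount.
greenfield_range = (adjusted_eval_line(30), adjusted_eval_line(50))

print(f"Mature practice:    ${mature_range[0]:.1f}K to ${mature_range[1]:.1f}K")
print(f"Starting from zero: ${greenfield_range[0]:.1f}K to ${greenfield_range[1]:.1f}K")
```

On the in-house figure this puts the eval line anywhere from about $91K to about $273K, a spread wide enough to matter against a $250K recruiting line.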

Frequently asked questions

Why is the in-house line higher than the agency line at 24 months?

Because the recruiting and onboarding line, the eval engineering build, and the absence of bench amortization combine to a structural disadvantage that does not show up in salary-versus-invoice comparisons. The in-house team is paying the cost of becoming an AI team in addition to the cost of building the project.

Doesn’t the in-house team get cheaper after the first project?

Yes. The eval and recruiting lines drop sharply on the second and third projects because the bench and the eval practice carry forward. We discuss this compounding effect in the AI project compounding return piece. The TCO comparison above is for the first AI project of meaningful scope.

Is the agency margin reasonable?

Yes. The agency at $80K per month for a senior fractional team of three carries a roughly 30 to 40 percent gross margin after engineering costs, recruiting amortization, and overhead. That margin pays for the bench you do not maintain, the eval practice you do not build, and the recruiting risk you do not absorb.

What if the agency engagement runs over scope?

The eval-threshold engagement model with a 30 percent holdback is what protects the buyer here. We discuss the contract mechanics in the AI project pricing models piece. A fixed-bid engagement does not protect the buyer; an eval-threshold engagement does.

Why does the hybrid line not split the difference?

Because the build cost is concentrated in the first 6 months and the steady-state cost is spread across 18. The hybrid captures the agency’s build advantage but pays the in-house team’s full steady-state cost. It pays agency rates on build and in-house rates on operate, which lands closer to in-house overall than to agency.

Can the hybrid handoff fail?

Yes. The most common hybrid failure is a handoff at month 6 to a team that was not embedded enough during build to operate the eval suite. The agency should pair-staff during the last 6 weeks at minimum, and the in-house team should be hired by month 4. We discuss the handoff mechanics in the AI project sunk-cost piece; a botched handoff often resembles a sunk-cost trap.

What changes if we use only open-source models?

The inference line drops 40 to 60 percent but the observability and eval-engineering lines rise because open-source serving infrastructure carries operational cost the closed-vendor path absorbs. Net TCO is similar at most volumes; the breakeven sits at high inference workloads.

How does the make-or-buy tree connect to this?

The make-or-buy decision tree is upstream. It produces a “build” or “hybrid” answer; this TCO comparison decides who builds. Run the two in order.

Key takeaways

Honest 2026 AI TCO requires seven line items: engineering build, eval engineering, model-upgrade re-eval, observability, inference, maintenance, and recruiting (in-house only).
For a 24-month, mid-market AI project, the in-house TCO runs about $2.4M, the agency TCO about $1.2M, and the hybrid TCO about $2.0M.
The agency line is materially lower because agencies amortize recruiting cost across engagements and carry portable eval and observability practices.
The answer flips on existing bench, project duration, workload volume, eval-engineering maturity, and model-upgrade pace.
Run the make-or-buy tree first; this comparison decides who builds once you know you should build.
