The Hidden Cost of AI Evals: Where 35% of Project Budget Actually Goes

Across mature AI engineering organizations in 2026, eval work (test set construction, eval harness build, regression triage, and model-upgrade re-evaluation) runs 30 to 40 percent of total project cost. The line is structurally invisible in the budgets that fail and explicitly named in the budgets that succeed. It shows up as scope creep not because the work is creeping, but because the original budget did not have a category for it. That single accounting failure is the most reliable cause of cost surprise in 2026 AI projects.

This piece decomposes the eval cost line, explains why it stays invisible in early budgets, and shows how to surface it upfront with citations to public eval-tooling cost data and eval engineering practice. It is a spoke under the AI project economics manifesto, which argues for the broader economics framework this line decomposition fits inside.

The 30–40% number

The 35-percent figure is the midpoint of a range that holds remarkably consistent across mature 2026 AI engineering practice. It is corroborated by three independent lines of evidence.

Public eval framework cost data. OpenAI’s Evals framework, Anthropic’s Claude eval engineering posts, the Promptfoo project, and the UK AI Security Institute’s Inspect framework all publish enough material (example test suites, harness configurations, regression case studies) to construct realistic cost models. The labor required to build, run, and maintain a serious eval suite has been documented in public engineering writeups; back-of-envelope estimates from those writeups consistently put eval engineering in the 30–40 percent range as a fraction of total project cost.

Eval engineering job market data. As of 2026, “eval engineer” and “ML evaluation engineer” are distinct hires from feature engineers at Anthropic, OpenAI, Google DeepMind, and most serious applied AI shops. The comp band overlaps with senior backend roles, and the ratio of eval engineers to feature engineers on production AI teams is roughly 1:2 to 1:3. That ratio implies the eval line cannot be smaller than 25 percent of headcount cost on a serious project, before infrastructure or model-upgrade re-eval cost is added.
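The headcount arithmetic behind that floor can be sketched in a few lines. This is a minimal illustration, assuming eval and feature engineers cost roughly the same per head (consistent with the overlapping comp bands above); the function name `eval_headcount_share` is hypothetical.

```python
from fractions import Fraction


def eval_headcount_share(eval_engineers: int, feature_engineers: int) -> Fraction:
    """Share of engineering headcount occupied by eval engineers."""
    return Fraction(eval_engineers, eval_engineers + feature_engineers)


# The two ends of the 1:2 to 1:3 ratio quoted in the text.
for eval_n, feature_n in [(1, 2), (1, 3)]:
    share = eval_headcount_share(eval_n, feature_n)
    print(f"{eval_n}:{feature_n} -> {float(share):.0%} of headcount")
```

A 1:3 ratio is one eval engineer in a team of four, which is where the 25 percent floor comes from; 1:2 puts the share at about a third.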

The agency tax decomposition. As we argue in the AI agency tax piece, a 30 percent coordination tax appears on engagements running on legacy SOW-and-PMO templates, that is, engagements whose budgets did not name eval engineering as a separate line. The cost shows up later as scope creep, because the engagement is now spending on eval work the original budget did not contain. The 30 percent agency tax and the 30–40 percent eval cost are the same money, mis-categorized.

The line decomposes into four sub-budgets. Each is auditable.

Sub-line 1: Test set construction

What it is. Curating 200–2000 representative inputs with rubric-graded or ground-truth outputs, against the buyer’s actual workload distribution. Domain experts annotate. Senior engineers review for representativeness. Edge cases are deliberately included rather than filtered out.

Why it costs what it costs. A serious test set requires actual domain knowledge to construct. Outsourcing test set construction to scale-labelers without senior review reliably produces test sets that look fine but fail to surface the failure modes the production system will hit. The work is closer to “writing exam questions for a hard exam in a specialized field” than to “labeling images of cats and dogs.” The cost is human-expert labor in the loop.

Defensible range. 8–12 percent of total project cost on a serious project. Test sets in the 200–500 input range support narrow agentic workflows; 800–2000 inputs support broader systems with multiple capability claims to test against.

What finance verifies. A draft test set or a representative sample by kickoff. A named test-set owner with domain expertise, not just an engineer who will “build it as we go.” A versioning policy: test sets evolve, and frozen test sets are how regressions go undetected.
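One possible shape for the versioning policy, as a sketch only: fingerprint the test set contents so every eval run records which version it scored. The `EvalCase` fields, the sample cases, and the `fingerprint_test_set` helper are illustrative assumptions, not taken from any particular framework.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected: str            # ground-truth answer or rubric reference
    is_edge_case: bool = False


def fingerprint_test_set(cases: list[EvalCase]) -> str:
    """Return a short, order-independent fingerprint of the test set contents."""
    canonical = json.dumps(
        [asdict(c) for c in sorted(cases, key=lambda c: c.case_id)],
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical cases for illustration only.
v1_cases = [
    EvalCase("refund-001", "Customer asks for a refund after 45 days", "deny, cite policy"),
    EvalCase("refund-002", "Refund request with no order number", "ask for order number", True),
]
v2_cases = v1_cases + [EvalCase("refund-003", "Refund in a foreign currency", "escalate", True)]

# Adding a case changes the fingerprint, so v1 and v2 scores are never compared by accident.
print(fingerprint_test_set(v1_cases) != fingerprint_test_set(v2_cases))
```

Storing the fingerprint next to each score makes it visible when two runs were graded against different test sets.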

Sub-line 2: Eval harness build

What it is. The infrastructure that runs the eval suite: on every PR (so regressions get caught at the diff level), on every model swap (so model upgrade impact is visible), on every prompt change (so prompt iterations are scored, not vibe-checked). Plus the dashboards that visualize eval performance against thresholds, the report templates that go to the buyer, and the integration with CI/CD that makes the suite a first-class build artifact.

Why it costs what it costs. The harness is application-specific. Off-the-shelf eval frameworks (Promptfoo, Inspect, OpenAI Evals, LangSmith, Braintrust) provide the primitives, but the integration with the project’s actual system, its specific test set format, its scoring rubrics, and its CI pipeline is engineering work. As of 2026 the OSS landscape has matured to the point where a competent senior can build a harness in three to five engineering weeks; doing it badly takes two to three times that and produces flaky results.

Defensible range. 5–8 percent of total project cost, weighted to the early phase of the project.

What finance verifies. Harness running on every PR by week six of the project. A named harness owner. Reports available to the buyer through eval-suite read access (per the manifesto, the buyer has read access from kickoff). If month four arrives and there is no PR-blocking eval check, the harness has been deferred and the project is structurally under-priced.
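To make the PR-blocking check concrete, here is a minimal sketch of the gate at the end of a harness run. It assumes a suite of (input, expected) pairs, an exact-match scorer, and a hypothetical locked threshold of 0.90; `toy_system` stands in for the real system under test, and none of these names come from Promptfoo, Inspect, or the other frameworks mentioned above.

```python
from typing import Callable, Sequence

Case = tuple[str, str]  # (input, expected output)


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output matches the expected text, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_suite(
    cases: Sequence[Case],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Mean score of the system across every case in the suite."""
    if not cases:
        raise ValueError("empty eval suite")
    return sum(scorer(system(inp), exp) for inp, exp in cases) / len(cases)


def gate_exit_code(score: float, locked_threshold: float) -> int:
    """Return 0 to let the PR merge, 1 to block it."""
    return 0 if score >= locked_threshold else 1


def toy_system(prompt: str) -> str:
    # Stand-in for the real model call.
    return prompt.upper()


cases = [("a", "A"), ("b", "B"), ("c", "x")]
score = run_suite(cases, toy_system)
print(f"score={score:.2f} exit={gate_exit_code(score, 0.90)}")
```

In CI, a wrapper script would pass the returned code to `sys.exit` so that a score below the locked threshold fails the build.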

Sub-line 3: Regression triage

What it is. When the eval suite goes red (a score drops, a previously-passing test case starts failing, a new test case surfaces a failure), somebody has to work out why and decide what to do. Triage is engineering judgment, not running a script. It involves reading reasoning traces, comparing against the last green run, hypothesizing the cause (prompt change, retrieval change, model change, content drift), validating the hypothesis with targeted tests, and deciding whether to fix forward, roll back, or accept the regression as a deliberate trade-off.

Why it costs what it costs. Regression triage is the most expensive eval sub-line because it is the one that runs continuously across the entire engagement. Early in the project the suite is small and triage is cheap. As the suite grows and the system becomes more capable, triage becomes the bottleneck on shipping. Mature AI engineering teams budget 1–2 days per week of senior engineering time on triage as a steady-state load.

Defensible range. 10–14 percent of total project cost across a 12-month engagement.

What finance verifies. A named triage process: who looks at red eval runs, on what cadence, with what authority to ship or block. A 48-hour SLA on triage is typical and reasonable; 7-day SLAs on triage are how serious regressions go uninvestigated. The buyer should be in the triage loop on regressions that affect locked thresholds.
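The first triage step, comparing a red run against the last green run, can be sketched as a simple diff of per-case results. The result format (a mapping from case ID to pass/fail) and the function name `diff_runs` are assumptions for illustration.

```python
def diff_runs(last_green: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket case IDs by how their pass/fail status changed since the last green run."""
    regressed, fixed, new_failures = [], [], []
    for case_id, passed in sorted(current.items()):
        if case_id not in last_green:
            # Case was added after the last green run.
            if not passed:
                new_failures.append(case_id)
        elif last_green[case_id] and not passed:
            regressed.append(case_id)
        elif not last_green[case_id] and passed:
            fixed.append(case_id)
    return {"regressed": regressed, "fixed": fixed, "new_failures": new_failures}


# Hypothetical results for illustration only.
last_green = {"case-01": True, "case-02": True, "case-03": False}
current = {"case-01": True, "case-02": False, "case-03": True, "case-04": False}

report = diff_runs(last_green, current)
print(report)
```

The diff only narrows the search. Deciding whether `case-02` regressed because of a prompt, retrieval, or model change is still the judgment work described above.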

Sub-line 4: Model-upgrade re-eval

What it is. Three to five times per year, frontier model providers (Anthropic, OpenAI, Google) ship non-trivial upgrades. Each upgrade requires re-running the full eval suite on the new model, triaging the regressions (typically 5–15 percent of test cases shift), adjusting prompts and retrieval to the new model’s behavior, and re-locking the threshold. That is two to four engineering weeks per upgrade on a serious project.

Why it costs what it costs. Model upgrades are not free. Even when the new model is “better on average,” the distribution of behavior shifts and prompt-engineered behaviors that depended on specific quirks of the old model break. The harder failure mode: a new model is better on the eval suite but worse on a critical edge case that was implicitly relying on a quirk. Catching that requires running the full suite plus targeted edge-case tests, not just a representative sample.

Defensible range. 6–10 percent of total project cost annualized, with the higher end on engagements that span multiple frontier model release cycles.

What finance verifies. A re-eval reserve in the budget: an explicit allocation, not absorbed elsewhere. A named SLA on re-eval timing, typically two weeks from a major model release to a re-eval report. A clause in the maintenance retainer covering re-eval as a named activity rather than a change order.
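The size of the reserve follows from multiplying the figures above. A small sketch, where the 1,000-case suite size is a hypothetical example and the function names are illustrative:

```python
def reeval_weeks(upgrades_per_year: int, weeks_per_upgrade: tuple[int, int] = (2, 4)) -> tuple[int, int]:
    """Low and high engineering weeks per year spent on model-upgrade re-evals."""
    low, high = weeks_per_upgrade
    return upgrades_per_year * low, upgrades_per_year * high


def shifted_cases(suite_size: int, shift_pct: tuple[int, int] = (5, 15)) -> tuple[int, int]:
    """Low and high count of test cases expected to shift on one upgrade."""
    low, high = shift_pct
    return suite_size * low // 100, suite_size * high // 100


# Three to five upgrades per year, two to four weeks each (figures from the text).
for upgrades in (3, 4, 5):
    low, high = reeval_weeks(upgrades)
    print(f"{upgrades} upgrades/year: {low} to {high} engineering weeks")

# Hypothetical 1,000-case suite: how many cases need triage per upgrade.
print(shifted_cases(1000))
```

At four upgrades a year the load is 8 to 16 engineering weeks; the extremes of both ranges span 6 to 20.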

Why eval cost stays invisible in early budgets

Three structural reasons.

The legacy budget template did not have a category for it. 2018 software budgets had engineering, infra, and a small support contingency. Eval discipline at the level required for production AI did not exist as a discipline in 2018, so it was not in the template. Templates that have not been updated for 2026 AI work fail to surface eval engineering as a line, and budgets default to whatever categories the template provides.

Vendors selling against the legacy template do not surface it. An agency bidding against a 2018-shaped RFP will produce a proposal that fits the RFP’s categories. If the RFP has a single “engineering” line, the agency rolls eval engineering into engineering. The cost is not hidden adversarially; it is hidden categorically. The buyer’s RFP did not ask for it as a separate line; the agency’s response did not provide it as a separate line. Six months in, the agency surfaces “eval engineering scope” as a change order. The buyer feels surprised. The buyer should not feel surprised.

Engineering teams under-name it because it does not feel like a feature. Engineers building features want to ship features. Eval engineering is the work that proves the features work, which is structurally less satisfying than building them. Internal AI teams without a named eval owner reliably under-invest in eval discipline at the start of the project, then over-invest in regression triage at the end of the project, because the suite arrived late and the regressions accumulated. The cost is the same. The visibility is not.

The fix in all three cases is identical: name eval engineering as a separate budget line, with a separate owner, on the budget template, in the RFP, in the SOW, in the headcount plan. The number is 30–40 percent of project cost. The discipline is making sure that 30–40 percent shows up as a line everyone can audit, rather than as a surprise everyone gets to argue about.

How to surface eval cost upfront

Five practical moves.

One. Add an “eval engineering” line to the budget template, sized at 30–40 percent of project cost, decomposed into the four sub-lines (test set 8–12%, harness 5–8%, regression triage 10–14%, model-upgrade re-eval 6–10%).

Two. Add an “eval engineer” or “eval engineering owner” role to the headcount plan, distinct from feature engineers. The role can be filled by a senior engineer with eval discipline; it cannot be filled by “everyone will own evals.”

Three. Add eval-suite read access from kickoff to the SOW. The buyer sees the test set, the harness reports, and the threshold-locking process from week one. Not “delivered at the end.”

Four. Add a model-upgrade re-eval clause to the maintenance retainer. The clause names the activity, sizes it as 6–10 percent of annualized retainer, and sets a 14-day SLA from major model release.

Five. Add an eval-bar progression review to the quarterly portfolio cadence. The eval bar should be rising over time: newer test cases, harder edge cases, stricter thresholds. A flat eval bar is a sign the system is in maintenance mode rather than capability expansion mode, which is fine but should be priced accordingly.

Five moves, all paperwork, none of which require new tooling or new vendor relationships. The cost of doing them is approximately zero. The cost of not doing them is the 30–40 percent eval line surfacing as scope creep between months four and eight.
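The sizing in move one can be sketched as a small budget calculation. The sub-line percentages are the ones given in this piece; the 1,000,000 project budget and the names `SUB_LINES` and `eval_budget` are hypothetical.

```python
# Sub-line ranges as integer percentages of total project cost (from the text).
SUB_LINES = {
    "test set construction": (8, 12),
    "eval harness build": (5, 8),
    "regression triage": (10, 14),
    "model-upgrade re-eval": (6, 10),
}


def eval_budget(total_cost: int) -> dict[str, tuple[int, int]]:
    """Low and high allocation per sub-line, in the same currency units as total_cost."""
    return {
        name: (total_cost * low // 100, total_cost * high // 100)
        for name, (low, high) in SUB_LINES.items()
    }


total = 1_000_000  # hypothetical project budget
lines = eval_budget(total)
for name, (low, high) in lines.items():
    print(f"{name:<24} {low:>9,} to {high:>9,}")

low_sum = sum(low for low, _ in lines.values())
high_sum = sum(high for _, high in lines.values())
print(f"{'eval engineering total':<24} {low_sum:>9,} to {high_sum:>9,}")
```

Summing the endpoints gives 29 to 44 percent, slightly wider than the 30–40 percent headline, presumably because a project rarely sits at the same extreme on every sub-line at once.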

Frequently asked questions

Where does the 35% eval cost number come from?

The midpoint of a 30–40 percent range corroborated by three lines of evidence: public eval framework cost data (OpenAI Evals, Anthropic eval engineering posts, Promptfoo, Inspect), 1:2 to 1:3 eval-to-feature-engineer ratios at serious applied AI shops in 2026, and the agency tax decomposition that surfaces the same money on engagements whose budgets did not name eval engineering as a separate line.

What are the four sub-lines of eval cost?

Test set construction (8–12%): curating 200–2000 representative inputs with rubric-graded outputs. Eval harness build (5–8%): infrastructure running the suite on every PR, model swap, and prompt change. Regression triage (10–14%): engineering judgment on red eval runs. Model-upgrade re-eval (6–10% annualized): re-running the suite for the three to five frontier upgrades per year.

Why is the eval cost line invisible in early budgets?

Three structural reasons. The 2018 software budget template did not have a category for eval engineering. Vendors bidding against templates without an eval line roll the work into “engineering.” Engineering teams under-name eval discipline because it does not feel like feature work. The fix in all three cases is the same: name eval engineering as a separate budget line, with a separate owner, on the budget template, in the RFP, in the SOW.

Can a project use existing OSS eval frameworks instead of building a harness?

Promptfoo, Inspect, OpenAI Evals, LangSmith, and Braintrust provide primitives, but integration with a project’s specific system, test set format, scoring rubrics, and CI pipeline is engineering work. A competent senior can integrate an OSS harness in three to five weeks; doing it badly takes two to three times that.

How big should the test set be?

Project-class-dependent. 200–500 inputs support narrow agentic workflows; 800–2000 inputs support broader systems with multiple capability claims and meaningful edge-case coverage. The number that matters is not count; it is representativeness against actual production workload.

What is regression triage and why is it so expensive?

Engineering judgment on red eval runs: reading reasoning traces, comparing against the last green run, hypothesizing the cause, validating, and deciding fix forward, roll back, or accept. Most expensive sub-line because it runs continuously across the engagement and grows with suite size. Mature teams budget 1–2 days per week of senior engineering time as steady-state load.

How often should we re-run the full eval suite for model upgrades?

Three to five times per year, triggered by major frontier model releases from Anthropic, OpenAI, or Google. Each re-eval is two to four engineering weeks: full suite run, regression triage, prompt and retrieval adjustments, threshold re-lock. At four upgrades a year, a project that ignores this is 8 to 16 engineering weeks under-budgeted across a 12-month engagement.

What is the maintenance retainer’s role in eval cost?

The retainer covers ongoing eval suite maintenance, model-upgrade re-evals, regression remediation, and eval-bar progression as named activities rather than change orders. Sized as 25 to 40 percent of build cost annualized. Without a retainer with an eval-named clause, post-launch eval cost gets billed as ad hoc work, which is the most expensive form of the line item.

How does this relate to the AI agency tax?

The 30 percent agency tax decomposed in the agency tax piece and the 30 to 40 percent eval cost decomposed here are the same money, mis-categorized. Engagements running on legacy SOW templates that do not name eval engineering surface the cost later as scope creep, change orders, and coordination overhead, which together are the agency tax. Engagements that name eval engineering upfront pay the same money as a planned line item rather than a recurring dispute.

Key takeaways

  • Eval engineering runs 30–40 percent of AI project cost. The midpoint number, 35 percent, is the right planning assumption for a serious project.
  • The line decomposes into four sub-budgets: test set (8–12%), harness (5–8%), regression triage (10–14%), model-upgrade re-eval (6–10% annualized).
  • The cost is invisible in early budgets because the 2018 template did not have a category for it, vendors do not surface it without prompting, and engineering teams under-name it because it does not feel like feature work.
  • The fix is paperwork: a named line on the budget template, a named owner on the headcount plan, eval-suite read access from kickoff, a re-eval clause in the maintenance retainer, an eval-bar progression review in the quarterly portfolio cadence.
  • The 30 percent agency tax and the 30–40 percent eval cost are the same money. Naming the line upfront is how it stops being scope creep and starts being a planned investment.

The eval line is not hidden adversarially. It is hidden categorically: by templates that do not name it, by RFPs that do not request it, by engineers who do not own it. The fix is naming.

Last Updated: Jun 7, 2026

Arthur Wandzel
