The AI Project IRR Tradeoff: Speed vs Evaluation Rigor

Internal rate of return on an AI project is not monotonic in speed. It rises with speed-to-revenue up to a point, then falls sharply as under-evaluation produces a regression tax that compounds across the first two quarters of production. The shape of the curve is an inverted U with the peak considerably to the left of where most CFOs expect, and considerably to the right of where most engineering leads default. The sweet spot is two to three eval-pass cycles before launch and eight to twelve weeks of production hardening, and the cost of being wrong on either side is large enough that the “ship it” instinct and the “polish it” instinct both lose money.

This is a spoke under the AI project economics manifesto, which argues that evaluation cost has replaced feature cost as the unit of account in 2026 AI work. IRR is what happens to that unit when projects optimize for speed without naming the eval rigor required to make speed pay.

The two costs that bend the IRR curve

Two costs determine where IRR peaks: opportunity cost and regression cost.

Opportunity cost is the standard one. A dollar of revenue earned in month four is worth more than a dollar earned in month nine; the discount factor depends on the cost of capital and the strategic clock the project sits on. This is the term most CFOs model cleanly. It pulls the IRR curve toward speed.

Regression cost is the term most legacy IRR models miss. AI systems shipped before their eval rigor has stabilized incur a tax that surfaces as customer-reported failures, manual override workload, eval-debt remediation sprints that displace planned capability work, and trust erosion that depresses adoption. Empirically, the regression tax on under-evaluated AI projects runs 15 to 35 percent of in-period revenue across the first two quarters post-launch. It pulls the IRR curve away from speed, and it pulls harder than buyers expect, because the unit of failure for AI is unbounded in a way the unit of failure for legacy software is not.

The two terms working against each other produce the inverted U.
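The opportunity-cost term can be illustrated with a small discounting calculation. This is a minimal sketch, not the article's model: the 25 percent annual rate is an assumption picked from inside the 22 to 30 percent range discussed in the FAQ, and `present_value` is a hypothetical helper name.

```python
def present_value(amount, month, annual_rate):
    """Discount `amount` received `month` months from now at a compounding annual rate."""
    return amount / (1.0 + annual_rate) ** (month / 12.0)


# Assumption: an illustrative discount rate, not a figure prescribed for any project.
ANNUAL_RATE = 0.25

pv_month_4 = present_value(1.0, 4, ANNUAL_RATE)
pv_month_9 = present_value(1.0, 9, ANNUAL_RATE)

print(f"PV of $1 earned in month 4: {pv_month_4:.3f}")
print(f"PV of $1 earned in month 9: {pv_month_9:.3f}")
print(f"Value lost to five months of delay: {1 - pv_month_9 / pv_month_4:.1%}")
```

Under that assumed rate, the month-nine dollar is worth roughly 9 percent less than the month-four dollar. This is the pull toward speed that the regression cost has to outweigh.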

The shape of the curve

Plot IRR on the vertical axis. Plot the project’s total pre-launch eval and hardening investment on the horizontal axis, measured as percent of total project budget. The curve has three regions.

| Region | Eval/hardening as % of budget | IRR behavior | Dominant term |
|---|---|---|---|
| Under-invested | 0 to 18 percent | Rising fast, then falls off a cliff | Regression cost dominates |
| Sweet spot | 25 to 35 percent | Plateau near peak | Terms balance |
| Over-invested | 45+ percent | Falling | Opportunity cost dominates |

The cliff on the left side is the part most teams cannot intuit. Going from 12 percent to 18 percent eval/hardening investment does not improve IRR linearly; it improves it by a step change as the system crosses the threshold where regressions become detectable before customers report them. Below the threshold, every percent of revenue earned early gets clawed back by a regression tax that compounds against trust. Above the threshold, the marginal eval dollar buys diminishing returns until the opportunity cost of further delay swamps the gain.

The plateau on the top of the curve is wide enough that the exact peak is not the planning target. Anywhere from 25 to 35 percent of budget on eval and hardening lands within a few IRR points of the maximum. The planning target is staying inside the plateau, not finding its precise peak.
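The three regions can be turned into a simple planning check. The sketch below is illustrative only: `classify_eval_share` is a hypothetical name, and the "transition" label for the 18 to 25 and 35 to 45 percent bands is an assumption, since the regions above leave those bands unlabeled.

```python
def classify_eval_share(pct_of_budget):
    """Map eval/hardening spend (percent of total budget) to a region of the IRR curve."""
    if not 0 <= pct_of_budget <= 100:
        raise ValueError("pct_of_budget must be between 0 and 100")
    if pct_of_budget <= 18:
        return "under-invested"   # regression cost dominates
    if 25 <= pct_of_budget <= 35:
        return "sweet spot"       # terms balance
    if pct_of_budget >= 45:
        return "over-invested"    # opportunity cost dominates
    # Assumption: the bands between the named regions are treated as transitions.
    return "transition"


for share in (12, 22, 30, 40, 50):
    print(share, classify_eval_share(share))
```

Because the plateau is wide, a check like this only needs to confirm that a budget lands inside the 25 to 35 percent band. It does not need to locate the peak.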

Why too-fast loses

The temptation to launch on the first eval-pass cycle is structural. Demos work. Pilot users are enthusiastic. Sales pipelines have been waiting. The internal narrative (“we’re shipping AI”) wants closure. Pressure to launch before the second cycle is constant in 2026 engagements.

Three failure modes follow.

Eval-set instability. The first eval set built for a new AI feature is wrong. Not subtly wrong; meaningfully wrong. Edge cases are missing, distributions skew toward the easy half of the workload, threshold values were guessed before any production traffic existed to calibrate them. Shipping after one cycle ships a system that passes a broken test. Shipping after two cycles lets the eval set stabilize against discovered reality. The hidden cost of evals across the project lifecycle is detailed in the hidden cost of AI evals.

Undetected subpopulation regressions. A model that achieves 0.84 weighted accuracy on the full eval set may be at 0.62 on a 12 percent subpopulation that maps cleanly to a high-value customer segment. One cycle does not reveal this. Two cycles, with stratified evaluation built into the second cycle, do. Shipping early ships a product whose worst failure mode is concentrated in your most strategic accounts.

Observability calibration debt. The second hardening month is when the observability stack stops being noise and traces start telling the on-call rotation what regressed, where, and against what threshold. Shipping before this calibration completes ships a system whose first failure is invisible until a customer reports it. The repair cycle in that posture is two to three times more expensive than the same repair caught by observability.
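The subpopulation failure mode is easy to reproduce with a stratified accuracy report. This is a minimal sketch on synthetic data built to match the 0.84 overall and 0.62 subpopulation figures above. The segment names, the sample counts, and the `stratified_accuracy` helper are all assumptions made for illustration.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Return (overall accuracy, accuracy per segment) from (segment, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, is_correct in records:
        totals[segment] += 1
        correct[segment] += int(is_correct)
    per_segment = {seg: correct[seg] / totals[seg] for seg in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_segment


# Synthetic eval results: 5,000 examples, 12 percent of them in a "strategic" segment.
records = (
    [("strategic", True)] * 372 + [("strategic", False)] * 228
    + [("other", True)] * 3828 + [("other", False)] * 572
)

overall, per_segment = stratified_accuracy(records)
print(f"overall: {overall:.2f}")
for segment, accuracy in sorted(per_segment.items()):
    print(f"{segment}: {accuracy:.2f}")
```

The headline score looks healthy because the remaining 88 percent of the eval set sits at 0.87. Only the per-segment breakdown exposes the weak segment.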

The IRR consequence: a project that ships eight weeks early but pays a 25 percent regression tax over six months is IRR-negative against the patient counterfactual. The math is unforgiving and most boards have not modeled it.

Why too-slow loses

The opposite failure is real. Engineering leads who internalized “AI failures are unbounded” sometimes optimize for an eval threshold that no plausible workload requires. Four eval cycles before launch. Sixteen weeks of hardening. Stratified evaluation across nine subpopulations the buyer’s actual customer base does not contain. The IRR cost of this posture is opportunity cost: revenue not earned, strategic ground not taken, competitor windows not closed.

Three signals indicate that a project has moved past the IRR plateau into over-investment.

The eval score is rising at a slope under 0.5 percent per cycle. When marginal cycles produce marginal accuracy gains, the eval set has saturated against the model’s current ceiling. Continuing to run cycles does not buy production reliability; it buys score on a frozen distribution. Subsequent reliability gains require model upgrades or retrieval changes, not more cycles on the current configuration.

Hardening is generating zero new findings per week. When two consecutive weeks of shadow-mode and observability calibration produce no new regressions to triage, the system has crossed into operational stability. Continuing to harden is paying calendar cost for findings that are not arriving.

Stakeholders are litigating the eval set rather than the system. When eval-set debates become longer than eval-result debates, the project has converted from engineering to ceremony. The cost of further rigor is paid in calendar; the gain is paid in nothing.
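The first two signals are measurable and can be checked mechanically; the third is a judgment about how meeting time is spent. The sketch below reads "0.5 percent per cycle" as percentage points of eval score, which is an assumption, and `over_investment_signals` is a hypothetical helper.

```python
def over_investment_signals(scores_pct, weekly_new_findings):
    """Flag the two measurable over-investment signals.

    scores_pct: eval score after each cycle, in percent (oldest first).
    weekly_new_findings: count of new hardening findings per week (oldest first).
    """
    # Signal 1: the latest cycle improved the score by less than 0.5 points.
    eval_saturated = (
        len(scores_pct) >= 2 and (scores_pct[-1] - scores_pct[-2]) < 0.5
    )
    # Signal 2: two consecutive weeks of hardening with zero new findings.
    hardening_quiet = (
        len(weekly_new_findings) >= 2
        and all(count == 0 for count in weekly_new_findings[-2:])
    )
    return {"eval_saturated": eval_saturated, "hardening_quiet": hardening_quiet}


print(over_investment_signals([78.0, 83.5, 83.8], [4, 1, 0, 0]))
print(over_investment_signals([78.0, 83.5], [4, 1]))
```

When both flags are true, further cycles and further hardening weeks are buying calendar cost rather than reliability.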

The IRR consequence: a project that lands its launch four months later than the plateau prescribes loses revenue compounding plus market position. Patient teams over-correct here, and the cost is invisible because there is no regression tax to point to, only a counterfactual that earned more.

The sweet spot: two to three eval cycles, eight to twelve weeks of hardening

The empirical operating range across mature 2026 AI engineering shops:

  • Two to three full eval-pass cycles before launch. One is not enough; four is too many.
  • Eight to twelve weeks of production hardening between code-complete and full-traffic launch. Under eight, the regression curve has not bottomed; over twelve, opportunity cost dominates.
  • 25 to 35 percent of total project budget on eval and hardening combined.

The customer-facing versus internal split matters. Customer-facing AI must run the full two to three cycles and eight to twelve weeks because the regression cost includes brand and trust components an internal tool does not pay. Internal tools have a smaller blast radius (failures are caught by the operator, not the customer), and the IRR sweet spot shifts left toward more speed. One eval-pass cycle and four to six weeks of hardening is often sufficient for internal-only deployments.

Stakes-stratified guidance:

| Deployment posture | Eval cycles | Hardening weeks | Eval/hardening % of budget |
|---|---|---|---|
| Internal tool, low blast radius | 1 to 2 | 4 to 6 | 15 to 22 |
| Customer-facing, standard | 2 to 3 | 8 to 12 | 25 to 35 |
| Regulated or high-stakes | 3 to 4 | 12 to 16 | 35 to 45 |

The regulated tier (financial advice, medical decision support, legal substantiation) pays a higher hardening tax not because the IRR curve has shifted but because the regression cost coefficient is larger. A regulator-noticed failure carries a tail risk that does not appear in the standard model.
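The stakes-stratified guidance can be encoded as a lookup that a project plan is checked against. This is a sketch under stated assumptions: the posture keys, the `Guidance` type, and the `out_of_range` helper are hypothetical names, and only the numeric ranges come from the guidance above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guidance:
    eval_cycles: tuple      # (min, max) eval-pass cycles before launch
    hardening_weeks: tuple  # (min, max) weeks of production hardening
    budget_pct: tuple       # (min, max) eval/hardening share of budget, percent


GUIDANCE = {
    "internal": Guidance((1, 2), (4, 6), (15, 22)),
    "customer_facing": Guidance((2, 3), (8, 12), (25, 35)),
    "regulated": Guidance((3, 4), (12, 16), (35, 45)),
}


def out_of_range(posture, eval_cycles, hardening_weeks, budget_pct):
    """List the plan fields that fall outside the guidance for a deployment posture."""
    guidance = GUIDANCE[posture]
    plan = {
        "eval_cycles": eval_cycles,
        "hardening_weeks": hardening_weeks,
        "budget_pct": budget_pct,
    }
    issues = []
    for name, value in plan.items():
        low, high = getattr(guidance, name)
        if value < low:
            issues.append(f"{name} below range ({value} < {low})")
        elif value > high:
            issues.append(f"{name} above range ({value} > {high})")
    return issues


print(out_of_range("customer_facing", eval_cycles=1, hardening_weeks=10, budget_pct=30))
print(out_of_range("regulated", eval_cycles=3, hardening_weeks=14, budget_pct=40))
```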

A decision tree for the launch call

The launch decision should be made against a structured tree, not a debated judgment call. The tree below resolves in five branches.

Branch 1: Has the system passed two full eval-pass cycles at the locked threshold?

  • No → Hold. Run the second cycle. Do not litigate this.
  • Yes → Continue.

Branch 2: Is the eval score still rising by more than 0.5 percent per cycle?

  • Yes → Continue cycles until the slope flattens. Do not launch on a rising curve unless opportunity cost is acute.
  • No → Continue.

Branch 3: Has shadow-mode hardening produced fewer than three new findings in the last two weeks?

  • No → Continue hardening. New findings mean operational instability remains.
  • Yes → Continue.

Branch 4: Is observability calibrated such that the on-call rotation can detect regressions before customers do?

  • No → Hold. Calibrate observability. Launching without this puts the regression tax on revenue.
  • Yes → Continue.

Branch 5: Is the launch window driven by a discrete strategic event (regulatory deadline, partner integration, named-account commit)?

  • Yes → Launch at the next defensible threshold; opportunity cost is acute.
  • No → Launch when branches 1-4 resolve. Calendar pressure without a strategic event is ceremony.

A team that runs every launch through this tree converges on the IRR plateau. A team that argues each launch from first principles burns calendar in litigation and is at best randomly distributed across the curve.
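The five branches above can be sketched as a single function. The field names and return strings are hypothetical. Treating "opportunity cost is acute" in Branch 2 as the same condition as the discrete strategic event in Branch 5 is an assumption, as is reading the 0.5 percent slope as percentage points per cycle.

```python
from dataclasses import dataclass


@dataclass
class LaunchState:
    cycles_passed: int                 # full eval-pass cycles at the locked threshold
    score_slope_pct: float             # latest eval score gain per cycle, in points
    new_findings_last_two_weeks: int   # shadow-mode hardening findings
    observability_calibrated: bool     # on-call can detect regressions before customers
    strategic_event: bool              # regulatory deadline, partner integration, etc.


def launch_decision(state):
    """Walk the five branches in order and return the first action that applies."""
    # Branch 1: two full cycles are required before anything else is considered.
    if state.cycles_passed < 2:
        return "hold: run the second eval cycle"
    # Branch 2: do not launch on a rising curve unless opportunity cost is acute.
    if state.score_slope_pct > 0.5 and not state.strategic_event:
        return "continue eval cycles until the slope flattens"
    # Branch 3: three or more new findings means operational instability remains.
    if state.new_findings_last_two_weeks >= 3:
        return "continue hardening"
    # Branch 4: regressions must be detectable before customers report them.
    if not state.observability_calibrated:
        return "hold: calibrate observability"
    # Branch 5: a discrete strategic event changes the timing, not the gates.
    if state.strategic_event:
        return "launch at the next defensible threshold"
    return "launch: branches 1-4 resolved"


print(launch_decision(LaunchState(1, 2.0, 5, False, True)))
print(launch_decision(LaunchState(3, 0.2, 1, True, False)))
```

Returning on the first unresolved branch keeps the conversation on one named gate at a time, which is the point of replacing a debated judgment call with a tree.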

Governance to make the sweet spot operational

Two governance changes convert the IRR tradeoff from a judgment call into a structured threshold review that finance and engineering share.

Milestones reference eval-pass cycles by name, not calendar dates. Milestone three is not “week eighteen.” It is “eval cycle three passing the locked threshold on eval-set v1.2.” This change ties the cost gate to the rigor gate. Buyers used to calendar milestones will resist; the ones who adopt this structure escape the runaway-project failure mode detailed in the anatomy of a runaway AI project.

Launch is gated on a post-cycle regression-rate forecast. At the close of cycle three, engineering produces a forecast of expected regression rate over the first two production quarters. The forecast is signed. If the forecast is above the contracted ceiling (typically 4 percent for customer-facing, 8 percent for internal), launch is blocked pending a remediation cycle. This converts the IRR tradeoff from “we feel ready” into “the forecast is inside the ceiling.”
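The forecast gate reduces to a comparison against the contracted ceiling. In this minimal sketch the ceilings are the typical values quoted above. The `launch_blocked` helper and the rule that an unsigned forecast never clears the gate are assumptions added for illustration.

```python
# Typical contracted ceilings on forecast regression rate, expressed as fractions.
CEILINGS = {"customer_facing": 0.04, "internal": 0.08}


def launch_blocked(posture, forecast_regression_rate, signed):
    """Return True when launch must wait for a remediation cycle."""
    # Assumption: a forecast that has not been signed cannot clear the gate.
    if not signed:
        return True
    # Launch is blocked only when the forecast is above the ceiling.
    return forecast_regression_rate > CEILINGS[posture]


print(launch_blocked("customer_facing", 0.05, signed=True))
print(launch_blocked("internal", 0.05, signed=True))
```

The same 5 percent forecast blocks a customer-facing launch and clears an internal one, which mirrors the stakes-based split in the hardening guidance.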

The two governance changes together produce a project where the IRR plateau is a structural outcome rather than an accident. Boards used to gating launch on feature completeness will find this odd. Boards that have been burned by AI regression cost on a feature-complete launch will find it familiar.

Frequently asked questions

Does the IRR curve shift if the AI project sits on top of a frontier model versus a fine-tuned smaller model? Yes. Frontier-model projects have a flatter sweet-spot plateau because model upgrades arrive every few months and reset the eval landscape. Fine-tuned smaller-model projects have a sharper plateau but a longer one because the model itself is a stable artifact. The eval cycle count holds; the hardening period is somewhat shorter on stable smaller models.

How does the curve interact with prompt-only versus tool-using systems? Tool-using systems and agents pay a higher regression cost coefficient because tool calls produce irreversible state. The IRR plateau shifts right, toward more eval and hardening, for any system whose failures cannot be undone.

Should the IRR model treat eval-set construction cost as a sunk cost or a running line? Running line. Eval sets drift with the underlying workload and require quarterly refresh. A model that treats eval-set construction as one-time underestimates total project cost by 8 to 14 percent.

What is the right discount rate for the opportunity cost term? Cost of capital plus a strategic-clock premium. For most 2026 enterprise AI projects, a 22 to 30 percent annual discount produces decisions that age well. Lower rates produce too-slow decisions; higher rates produce too-fast ones.

How does the curve respond to a sudden model-vendor price drop mid-project? The opportunity-cost term flattens because waiting is now cheaper. The plateau widens slightly. This is one of the few cases where slowing down a launch is unambiguously IRR-positive.

What is the right way to communicate IRR sweet-spot reasoning to a board impatient for revenue? Show the inverted U with both terms named. Boards that see only opportunity cost demand speed; boards that see both terms demand the plateau. The graph is the conversation.

Does the IRR sweet spot apply to evergreen AI products as well as project-shaped engagements? Yes. Evergreen products run the same launch tradeoff at every major release. The cycle and hardening counts apply per release, not per product lifetime.

How does the curve change for projects with regulatory hold periods? Regulatory holds shift the plateau right. The hardening tax is higher because the cost of a regulator-noticed failure carries tail risk that does not appear in standard regression cost models.

Is there a relationship between IRR curve shape and contract structure? Yes. Fixed-price contracts compress the curve toward speed because the agency bears the calendar cost. Eval-threshold pricing widens the plateau because the contract itself rewards landing inside it. Details are in the decline of the fixed-price AI project.

Does this argument apply to internal AI platform teams? Yes. The unit changes (calendar headcount cost rather than billable invoice) but the curve is the same. Internal teams often have a worse problem because the regression tax is paid by the operator’s team and is not visible to finance.

Key takeaways

  • AI project IRR is non-monotonic in speed. The curve is an inverted U with a wide plateau between 25 and 35 percent of budget on eval and hardening combined.
  • Too-fast loses to regression cost: empirically 15 to 35 percent of in-period revenue across the first two quarters of production.
  • Too-slow loses to opportunity cost: revenue not earned and competitor windows not closed during four-cycle ceremony.
  • The sweet spot is two to three eval-pass cycles before launch and eight to twelve weeks of production hardening for customer-facing systems.
  • Internal tools shift left to one to two cycles and four to six weeks; regulated systems shift right to three to four cycles and twelve to sixteen weeks.
  • The launch decision should run through a five-branch tree, not a debated judgment call.
  • Governance change one: milestones reference eval-pass cycles by name. Governance change two: launch gates on a signed regression-rate forecast.
  • Eval-threshold pricing widens the IRR plateau by aligning contractual incentive with the operating range. Fixed-price compresses it toward speed.

Last Updated: May 10, 2026

Arthur Wandzel
