The AI Project Cost-of-Rework Framework

Cost-of-rework is the third lens in the trio of project-economics frameworks, alongside cost-of-quality and cost-of-delay, and it is the one most under-applied to AI work. Capers Jones’ decades of software economics research established the canonical fix-cost curve: a defect caught in design costs 1x, in implementation 5x, in testing 10x, in production 30x. The 1-5-10-30 ratio is one of the most replicated findings in software engineering economics. AI projects have their own version of this curve, with three AI-specific rework drivers: eval threshold misses (defects discovered when the eval suite catches a regression below the locked threshold), model-upgrade regressions (defects introduced when the substrate model changes), and prompt registry rot (defects that accumulate from prompt edits made without eval discipline). The AI fix-cost curve runs steeper than the traditional one: defects caught at eval design cost 1x, defects caught by the pre-deploy eval suite cost 8x, defects caught in staging cost 20x, and defects caught in production cost 35 to 80x because of brand and SLA cost. This piece adapts cost-of-rework for AI projects, names the AI-specific rework drivers, and gives a quantified model for the AI 1-8-35 fix-cost curve.

This is a spoke under the AI project economics manifesto. The manifesto names eval engineering as the central cost line; cost-of-rework is the framework that names what happens to that line when the eval discipline is thin or skipped.

The Capers Jones fix-cost curve, AI version

The traditional Capers Jones curve, validated across thousands of software projects from the 1970s onward:

Stage caught | Fix cost multiplier
Design / requirements | 1x
Implementation | 5x
Testing | 10x
Production | 30x

The AI version, calibrated against AI projects we’ve audited from 2023 to 2026:

Stage caught | Fix cost multiplier
Eval design / spec | 1x
Eval suite (pre-deploy) | 8x
Staging / canary | 20x
Production | 35 to 80x

The AI curve is steeper at the production end because:

  • Production failures have brand cost. A traditional software bug is a customer service ticket. An AI hallucination or compliance failure can be a screenshotted-and-tweeted incident with multi-million-dollar brand consequences.
  • Production failures cascade through the eval suite. A discovered production failure usually means the eval suite did not cover the failure mode. Fixing it requires not just the immediate fix but expanding the suite, re-running it, re-locking thresholds, and re-validating prior fixes.
  • Production failures invalidate model-upgrade work. A live failure on the current model often requires re-validating that the same failure mode is also blocked on the next-generation model the team is preparing to migrate to. The single fix becomes 2 to 3 fixes across model versions.

The 1-8-35 to 1-8-80 curve means an AI defect caught at eval design costs roughly $500. The same defect caught in production costs $17K to $40K. A gap of one to two orders of magnitude is consistent with the manufacturing-economics finding that prevention is dramatically cheaper than rework, and the AI multiplier is steeper than traditional software’s, not gentler.
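The dollar figures above follow directly from the multipliers. A minimal sketch of that arithmetic, assuming the $500 base cost quoted in the text; the names `AI_CURVE` and `fix_cost` are hypothetical and exist only for this illustration:

```python
# Multipliers from the AI fix-cost table; each entry is a (low, high) range.
AI_CURVE = {
    "eval_design": (1, 1),
    "eval_suite": (8, 8),
    "staging": (20, 20),
    "production": (35, 80),
}


def fix_cost(stage: str, base_cost: float = 500.0) -> tuple[float, float]:
    """Return the (low, high) fix cost in dollars for a defect caught at `stage`."""
    low, high = AI_CURVE[stage]
    return base_cost * low, base_cost * high


low, high = fix_cost("production")
print(f"production: ${low:,.0f} to ${high:,.0f}")  # production: $17,500 to $40,000
```

The base cost is a parameter because the $500 figure is itself a rough estimate; the ratios between stages are what the curve claims.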

Three AI-specific rework drivers

Three failure modes drive most AI rework cost, and each is invisible in traditional software economics:

  1. Eval threshold miss. A change is made, the eval suite runs in CI, and one or more metrics fall below their locked threshold. The change must be reverted or fixed. If the team did not have the eval suite in place, the threshold miss surfaces in production instead, at a fix-cost multiplier 4 to 10x higher.

  2. Model-upgrade regression. Anthropic, OpenAI, or Google ships a new model. The team migrates. Five to fifteen percent of eval test cases regress in unexpected ways because the new model behaves differently on inputs that the team’s prompts were tuned for on the old model. Rework includes re-tuning prompts, re-validating retrieval, re-locking thresholds, and re-validating downstream system behavior.

  3. Prompt registry rot. Engineers edit prompts in production over weeks and months without re-running evals on each edit. Quality drifts quietly. The first eval run after a long edit-without-test interval surfaces 20 to 40 percent of test cases failing. Rework is the cleanup of accumulated drift, which compounds because the edits happened in different contexts and have to be untangled.

These three drivers account for roughly 70 to 80 percent of the rework cost on AI projects we have audited. The remaining 20 to 30 percent is traditional software rework (bugs in tooling, integration errors, infrastructure failures), which AI projects incur at the same rate as any other software project.

Eval threshold miss as a rework category

Eval threshold miss is the most preventable AI rework category. The structure:

  • The team has an eval suite with locked thresholds (e.g., F1 score ≥ 0.78 on the customer-support test set).
  • A code or prompt change is made.
  • CI runs the eval; F1 falls to 0.74.
  • The team has a choice: revert, fix and re-test, or accept and re-lock the threshold (with documented rationale).

The cost of catching this in CI is typically 1 to 4 hours of engineering time, or $200 to $800 at internal cost. The cost of catching the same threshold miss after deploy, when a customer reports degraded behavior, is typically 20 to 60 hours of engineering work plus brand and SLA cost, or $8K to $30K. Multiplier: 25x to 75x.

The discipline that prevents this is having the eval suite running in CI on every PR, with the threshold gate as a hard merge blocker. Roughly 60 percent of AI projects in 2026 have eval suites; roughly 30 percent of those have CI integration; roughly 15 percent have hard merge blocking on threshold violations. The 85 percent that don’t have hard blocking are paying the rework cost in production at 25-75x.
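A hard threshold gate can be very small. The sketch below is one possible shape, not a prescribed implementation: it assumes the eval run produces a dictionary of metric scores, and the names `LOCKED_THRESHOLDS`, `threshold_violations`, `gate`, and the metric key `customer_support_f1` are hypothetical. The threshold value mirrors the F1 example above.

```python
import sys

# Locked thresholds; the value mirrors the F1 >= 0.78 example in the text.
LOCKED_THRESHOLDS = {"customer_support_f1": 0.78}


def threshold_violations(results: dict[str, float],
                         thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its locked value."""
    violations = []
    for metric, minimum in thresholds.items():
        score = results.get(metric)
        if score is None:
            # A metric that was not reported counts as a miss, not a pass.
            violations.append(f"{metric}: no result reported")
        elif score < minimum:
            violations.append(f"{metric}: {score:.2f} < locked {minimum:.2f}")
    return violations


def gate(results: dict[str, float]) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    violations = threshold_violations(results, LOCKED_THRESHOLDS)
    for line in violations:
        print(f"THRESHOLD MISS {line}", file=sys.stderr)
    return 1 if violations else 0


# The scenario from the text: F1 falls to 0.74 against a 0.78 lock.
exit_code = gate({"customer_support_f1": 0.74})
print(exit_code)  # 1
```

In a CI job the return value would be passed to `sys.exit`, so that a non-zero code fails the check and the merge stays blocked. Treating a missing metric as a violation is a deliberate choice here: a gate that passes when the eval did not run is not a hard gate.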

The evaluating-ai-code-quality guide covers the CI eval integration in detail. The threshold-locking discipline is the single highest-ROI investment a team can make in AI rework prevention.

Model-upgrade regression as a rework category

Model-upgrade regression is partially preventable but not avoidable. The structure:

  • A frontier provider ships a new model.
  • The team plans to migrate to capture cost savings or capability gains.
  • The full eval suite is re-run against the new model; some percentage of cases regress.
  • Rework is the engineering work to re-tune prompts, re-validate retrieval, and re-lock thresholds against the new substrate.

Typical rework cost per model upgrade: 2 to 4 weeks of senior engineering time, roughly $20K to $40K at internal rates. Across 3 to 5 model upgrades per year, this lands at $60K to $200K annually for a single AI product, a number that surprises CFOs who treat model upgrades as “free improvements.”

The cost-amplification path on this category is when the team migrates without re-running the full suite. The team ships, customers report degraded behavior on edge cases that the team didn’t think to test, and the team is now doing emergency re-eval on a live system at 5 to 10x the cost of the planned migration. The model-routing economics piece covers the related dynamic where multiple models in production multiply the model-upgrade rework surface.

The discipline that minimizes this rework: a model-upgrade runbook that includes full suite re-run, threshold re-lock, and 7-day canary against the prior model before full cutover. The runbook adds maybe $5K of process cost per upgrade and reduces rework cost by 60 to 80 percent.
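The runbook economics can be checked with a short calculation. This is a rough sketch using only the figures quoted in this section; `annual_upgrade_cost` is a hypothetical helper, and pairing the 80 percent reduction with the low end and the 60 percent reduction with the high end is an assumption made here to bracket the range.

```python
def annual_upgrade_cost(upgrades: int, rework_per_upgrade: int,
                        runbook_cost: int = 0,
                        rework_reduction_pct: int = 0) -> float:
    """Annual cost = upgrades * (runbook cost + rework left after the reduction)."""
    remaining_rework = rework_per_upgrade * (100 - rework_reduction_pct) / 100
    return upgrades * (runbook_cost + remaining_rework)


# Without a runbook: 3 upgrades at $20K each, up to 5 upgrades at $40K each.
print(annual_upgrade_cost(3, 20_000))  # 60000.0
print(annual_upgrade_cost(5, 40_000))  # 200000.0

# With a $5K runbook per upgrade and a 60 to 80 percent rework reduction.
print(annual_upgrade_cost(3, 20_000, 5_000, 80))  # 27000.0
print(annual_upgrade_cost(5, 40_000, 5_000, 60))  # 105000.0
```

Under these assumptions the runbook roughly halves the annual upgrade bill even after paying for itself on every upgrade.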

Prompt registry rot as a rework category

Prompt registry rot is the most insidious AI rework category because it accumulates silently. The structure:

  • The team starts with a clean prompt registry: versioned, tested, documented.
  • Engineers make small prompt edits in response to specific bug reports or feature requests, often without running the full eval suite.
  • Over 6 to 18 months, the prompt registry accumulates dozens or hundreds of small edits, each individually defensible, none coordinated.
  • The first comprehensive eval run reveals widespread quality drift: 20 to 40 percent of test cases below threshold.

Typical rework cost when this surfaces: 2 to 6 weeks of senior engineering time, roughly $30K to $80K, plus 4 to 12 weeks of customer-trust recovery if the drift was visible to users. The full rework cost on a moderately rotted registry can land at $50K to $150K, equivalent to 10 to 30 percent of the original project cost paid back in cleanup.

The prevention discipline: every prompt edit triggers an eval run on at least the affected slice. No exceptions. Most rot stories I’ve audited start with a single “small fix” that skipped the eval, which set the pattern that subsequent edits also skipped the eval. The rot accumulates from there.
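One way to make the per-edit rule cheap is to map each prompt to the eval slices that exercise it, then run only those slices when the prompt changes. The sketch below assumes such a mapping exists; `PROMPT_TO_SLICES`, the prompt IDs, the slice names, and `slices_to_run` are all hypothetical.

```python
# Hypothetical registry: which eval slices exercise each prompt.
PROMPT_TO_SLICES = {
    "refund_policy_v3": {"refunds", "billing"},
    "greeting_v1": {"smoke"},
}


def slices_to_run(edited_prompts: list[str],
                  prompt_to_slices: dict[str, set[str]]) -> set[str]:
    """Return the union of eval slices touched by the edited prompts.

    An edited prompt with no registered slice raises an error, so an edit
    cannot skip the eval just because nobody mapped it to a slice.
    """
    selected: set[str] = set()
    for prompt_id in edited_prompts:
        if prompt_id not in prompt_to_slices:
            raise KeyError(f"no eval slice registered for prompt {prompt_id!r}")
        selected |= prompt_to_slices[prompt_id]
    return selected


print(sorted(slices_to_run(["refund_policy_v3"], PROMPT_TO_SLICES)))
# ['billing', 'refunds']
```

Failing loudly on an unmapped prompt is the design choice that enforces "no exceptions": the silent path is the one that lets rot start.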

Quantifying rework cost on a real project

A worked example. A team is running a 12-month AI customer support project with a $400K total budget. The team did not invest in eval infrastructure (“we’ll add it later”) and ran with a thin manual-spot-check QA process.

  • Eval threshold misses caught in production: 14 over 12 months, average rework cost $9K each = $126K (32 percent of project budget).
  • Model-upgrade regression: 3 model upgrades, 2 done without full re-eval, surfaced as production regressions at $25K each = $50K.
  • Prompt registry rot: Accumulated drift surfaced at month 9, comprehensive cleanup cost $65K (16 percent of project budget).
  • Traditional software rework: $20K (5 percent of project budget); typical rate.
  • Total rework cost: $261K; 65 percent of project budget.
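The tally above can be reproduced in a few lines. This sketch uses only the figures from the worked example; `REWORK_ITEMS` and `rework_summary` are names introduced here for illustration.

```python
BUDGET = 400_000  # total project budget from the worked example

# (category, incident count, cost per incident) from the worked example.
REWORK_ITEMS = [
    ("eval threshold misses caught in production", 14, 9_000),
    ("model-upgrade regressions", 2, 25_000),
    ("prompt registry rot cleanup", 1, 65_000),
    ("traditional software rework", 1, 20_000),
]


def rework_summary(items, budget):
    """Return per-category (name, cost, budget share) rows plus the grand total."""
    rows = [(name, count * unit, count * unit / budget)
            for name, count, unit in items]
    total = sum(cost for _, cost, _ in rows)
    return rows, total, total / budget


rows, total, share = rework_summary(REWORK_ITEMS, BUDGET)
for name, cost, pct in rows:
    print(f"{name}: ${cost:,} ({pct:.1%})")
print(f"total rework: ${total:,} ({share:.1%} of budget)")
```

Keeping the incident count and unit cost separate makes it easy to see which lever matters: here, the threshold misses dominate because of their count, not their unit cost.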

A second team running the same project with disciplined eval infrastructure typically spends $40K on eval suite construction (10 percent of budget), and pays $35K to $50K in total rework cost (9 to 12 percent). Net difference: the disciplined team spends $75K to $90K total on prevention plus rework; the undisciplined team spends $260K. The 3x cost difference is consistent across the AI projects we have audited and is the strongest argument for eval-infrastructure investment at project kickoff.

The anatomy-of-a-runaway-AI-project piece shows how cascading rework cost produces the larger budget overruns that define runaway projects.

Reducing rework: the four-discipline checklist

Four disciplines together reduce AI rework cost by roughly 60 to 80 percent:

  1. Eval suite in CI with hard threshold gates. No PR merges if any threshold falls below its locked value. This single discipline prevents most threshold-miss rework.

  2. Model-upgrade runbook. Full suite re-run, threshold re-lock, 7-day canary against prior model. Runbook adds ~$5K per upgrade, prevents most upgrade regression rework.

  3. Per-edit prompt eval requirement. Every prompt edit, no matter how small, runs the affected slice of the eval suite. Prevents prompt registry rot from accumulating.

  4. Quarterly comprehensive eval re-run. Full suite, full coverage, fresh threshold validation. Catches the rare drift modes that per-edit evals miss because the drift crosses slices.

These four disciplines together typically cost 5 to 10 percent of project budget to implement and operate. They reduce total rework cost from a 30 to 60 percent of budget range to a 5 to 15 percent range. The ROI is the steepest in software engineering, comparable to the 5x to 10x ROI of unit testing in traditional software, but with the additional benefit of preventing brand-cost incidents that traditional software does not produce.
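The net effect on budget can be bracketed with simple arithmetic. This is a rough sketch, assuming the percentage ranges in the paragraph above and the $400K budget from the worked example; `net_savings_pct` is a hypothetical helper, and combining the range endpoints into a worst case and a best case is an assumption of this illustration.

```python
def net_savings_pct(rework_before_pct: int, rework_after_pct: int,
                    discipline_cost_pct: int) -> int:
    """Net savings as a percent of budget: rework avoided minus discipline cost."""
    return (rework_before_pct - rework_after_pct) - discipline_cost_pct


BUDGET = 400_000  # budget from the worked example

# Worst case: least rework to begin with, smallest reduction, costliest disciplines.
worst = net_savings_pct(30, 15, 10)
# Best case: most rework to begin with, largest reduction, cheapest disciplines.
best = net_savings_pct(60, 5, 5)

print(worst, best)                                  # 5 50
print(BUDGET * worst // 100, BUDGET * best // 100)  # 20000 200000
```

Even the pessimistic pairing comes out positive, which is the point of the checklist: the disciplines pay for themselves across the whole quoted range.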

Frequently asked questions

What is cost-of-rework and how does it apply to AI projects?

Cost-of-rework is the project spend that goes to fixing defects discovered after their original creation point. Capers Jones’ canonical curve shows the cost grows roughly 5 to 30x depending on how late in the lifecycle the defect is caught. The AI version of the curve is steeper (1x at eval design, 8x at eval suite, 35 to 80x in production) because AI production failures carry brand cost, cascade through the eval suite, and invalidate model-upgrade work. Rework cost on AI projects without eval discipline typically runs 30 to 60 percent of project budget.

What’s the AI fix-cost curve?

A defect caught at eval design costs 1x. Caught at eval suite (pre-deploy), it costs 8x. Caught at staging or canary, it costs 20x. Caught in production, it costs 35 to 80x, with the upper end applying when brand or SLA cost is involved. The AI curve is steeper at the production end than traditional software’s 1-5-10-30 curve because AI failures have non-linear brand cost and cascade through the eval suite and across model versions.

What are the AI-specific rework drivers?

Three drivers account for 70 to 80 percent of AI rework cost. Eval threshold miss: a change ships, the eval suite (or production) detects a threshold violation, work is required to fix or revert. Model-upgrade regression: a new frontier model breaks 5 to 15 percent of eval cases that previously passed; re-tuning is required. Prompt registry rot: edits accumulate without per-edit eval runs and quality drifts silently until a comprehensive eval surfaces 20 to 40 percent failure rate.

How big is the cost difference between disciplined and undisciplined AI teams?

Roughly 3x on total spend. A disciplined team invests 10 percent of project budget in eval infrastructure and pays 9 to 12 percent in rework cost; total 19 to 22 percent. An undisciplined team invests minimal eval infrastructure and pays 30 to 60 percent in rework cost. The discipline difference is the largest single cost driver in AI projects we have audited.

Why is the AI fix-cost curve steeper than traditional software?

Three reasons. AI production failures have non-linear brand cost; a hallucination tweeted by a customer can have multi-million-dollar reputational consequences. AI production failures cascade through the eval suite; fixing one usually requires expanding the suite, which requires re-running, re-locking, and re-validating prior fixes. AI production failures invalidate cross-model-version work; a fix on the current model often requires revalidation against the next-generation model. The compounding effects produce the 35 to 80x multiplier at the production end.

What’s the highest-ROI rework prevention discipline?

Eval suite in CI with hard threshold gates. Catches 70 to 80 percent of preventable rework before it reaches staging or production. Cost to implement: 5 to 8 percent of project budget. Cost reduction: 20 to 40 percent of project budget. ROI is the steepest in AI engineering economics. Roughly 15 percent of AI projects in 2026 have this discipline; the other 85 percent pay the rework cost in production.

How does prompt registry rot accumulate?

Each prompt edit, individually, looks small and defensible. The first edit that skips the eval run sets a pattern that subsequent edits follow. Over 6 to 18 months, dozens or hundreds of small unevaluated edits compound into widespread quality drift. The first comprehensive eval run after a long unevaluated period typically reveals 20 to 40 percent of test cases below threshold. Cleanup cost runs $30K to $150K depending on registry size.

Should every prompt edit require an eval run?

Yes, for the affected slice. The cost of running the affected slice is small: typically $5 to $50 in compute and 5 to 15 minutes of engineering time. The cumulative cost of skipping it is the prompt-rot rework category, which runs 5 to 15 percent of project budget when it surfaces. The discipline scales; large registries automate the slice-eval-on-edit step into the development workflow so it is not a per-edit burden.

How do model upgrades produce rework cost?

Each frontier model upgrade introduces behavior changes that affect 5 to 15 percent of existing prompt-eval pairs. The team must re-run the full suite, identify regressions, re-tune prompts, re-validate retrieval, and re-lock thresholds. Rework cost per upgrade is typically 2 to 4 weeks of senior engineering time. Across 3 to 5 upgrades per year, this lands at $60K to $200K annually for a single AI product.

How does cost-of-rework relate to cost-of-quality and cost-of-delay?

The three frameworks are the trio of AI economics. Cost-of-quality names how to allocate prevention spend across the four CoQ buckets. Cost-of-delay names the cost of not-yet-shipping. Cost-of-rework names the cost of fixing defects after their original creation point. Healthy projects optimize all three: shifted toward prevention (CoQ), faster than competitors (CoD), low rework (CoR). The three frameworks collectively decompose the central manifesto claim that eval engineering is the dominant cost line.

Key takeaways

  • The AI fix-cost curve is 1x at eval design, 8x at eval suite (pre-deploy), 20x at staging or canary, and 35 to 80x in production. The AI curve is steeper than traditional software’s 1-5-10-30 curve because of brand cost, eval-suite cascade, and cross-model-version compounding.
  • Three AI-specific rework drivers account for 70 to 80 percent of AI rework cost: eval threshold miss, model-upgrade regression, and prompt registry rot.
  • Disciplined AI teams spend 19 to 22 percent of project budget on eval infrastructure plus rework. Undisciplined teams spend 30 to 60 percent of budget on rework alone, for a 3x total cost difference.
  • The four-discipline checklist (eval suite in CI with hard gates, model-upgrade runbook, per-edit prompt eval, quarterly comprehensive re-run) reduces rework cost by 60 to 80 percent at a cost of 5 to 10 percent of project budget.
  • Eval suite in CI with hard threshold gates is the highest-ROI single rework-prevention discipline, comparable to the ROI of unit testing in traditional software but with the additional benefit of preventing brand-cost incidents.

The cost-of-rework framework is the third leg of the AI economics trio, joining cost-of-quality and cost-of-delay. Together they decompose the manifesto’s central claim (eval engineering is the dominant cost line) into three temporal dimensions: prevention budget (CoQ), shipping pace (CoD), and fix-cost discipline (CoR). The 1-8-35 curve makes the case for eval infrastructure self-evident: every dollar of prevention saves 35 to 80 dollars of production rework.

Last Updated: Jun 13, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
