An AI project that ships every feature on its roadmap, on time and on budget, and still produces a system that fails its production eval is the most common quiet failure mode in enterprise AI in 2026. The roadmap reads green. The retro reads green. The QBR reads green. The system does not work. Measuring AI project success in features shipped is the practice that hides the failure mode in plain sight, and the practice has to end. This piece argues that feature-shipped metrics are structurally wrong for AI work, defines eval-pass success, and shows how to make the swap without losing the operational rhythm an engineering team needs.
It is a spoke of the AI project economics manifesto, which establishes evaluation as the unit of account. This piece is the success-metric corollary: if evaluation is the unit, then evaluation pass is the success criterion, and the legacy feature-count instrument is the wrong instrument.
Why features shipped looks like progress and is not
Features shipped is one of the most durable metrics in software because it correlates well with progress in legacy software work. A CRUD feature has roughly the same cost shape as the next CRUD feature, the same risk shape, the same relationship to user value. Shipping ten features means roughly ten units of value created. The metric is approximately right because the underlying work is approximately uniform.
AI work is not uniform.
The cost of building “an agent that schedules meetings” and the cost of building “an agent that reads contracts” are not in the same order of magnitude. Their eval bars are different. Their model choice and retrieval design are different. Their failure modes are different. Counting both as “one feature” is the metric pretending to be objective while smuggling in a unit conversion that does not hold.
More importantly, the relationship between feature shipped and value delivered is structurally broken in AI work in a way it is not in CRUD. A CRUD feature ships and either works (it does the thing) or breaks (it does not). The binary is honest. An AI feature ships and operates on a continuous spectrum: it does the thing 67 percent of the time, 81 percent of the time, 94 percent of the time. The feature is “shipped” at any of those numbers. The value to the buyer is wildly different across them. Counting feature ships without naming the eval threshold is counting a metric whose meaning depends on data the count is hiding.
Three failure modes follow predictably:
- The ‘green roadmap, red product’ failure. The team ships every feature on schedule. The eval suite is either missing or producing a flat curve. The board sees green status until the customer-visible regression arrives, at which point the board asks why no metric flagged it.
- The ‘feature factory’ failure. Engineering optimizes for feature count because feature count is what the org rewards. The eval suite is treated as overhead, not first-class scope. The eval bar slips quarter over quarter under the cover of high feature throughput.
- The ‘demo-driven development’ failure. Features ship as demos that pass eyeball inspection on cherry-picked inputs. The production eval, when it runs, exposes the gap. The team that has been celebrating feature ships for nine months suddenly has to explain why the production rollout is paused.
All three failure modes share the same root cause: the headline metric is measuring activity instead of measuring whether the system works. The fix is not to refine the feature-shipped metric. The fix is to retire it as the headline.
What eval-pass success means
Eval-pass success is the measurement that an AI feature passes its named eval set at a locked threshold, holding across recent production traffic, at a defensible cost per canonical unit. It has four parts and all four are required:
- A named eval set. Not “we test it.” A specific, versioned test set with a sample size and a method. For example: Production-eval-v1.4, 240 prompts, weighted rubric, locked threshold 0.82.
- A locked threshold. The threshold is set up front, against the workload that matters. Not “we’ll see.” Not “as good as the previous model.” A number, in writing, with the eval set version it applies to.
- Holding across production traffic. The system maintains threshold against recent production inputs, not against the synthetic test set in isolation. This is what eval suite freshness in the QBR is about: a high score on a stale set is a high score that does not predict production behavior.
- At a defensible unit cost. Pass at $0.04 per completion is success. Pass at $0.40 per completion is a hidden cost-of-goods problem masquerading as success.
When all four parts are present, eval-pass is a real success criterion. A feature has shipped, in the meaningful sense, when it passes eval at threshold against fresh traffic at a defensible unit cost. Anything short of that is “code in production,” which is not the same as a feature that works.
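The four-part criterion can be written down as a check that reports each part separately, so a failure names its cause. What follows is a minimal sketch, not a reference implementation: the `EvalSpec` and `EvalResult` schemas, the field names, and the $0.05 cost ceiling are assumptions made for illustration. Only the eval set name, the 240 prompts, the 0.82 threshold, and the $0.04 and $0.40 costs come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSpec:
    """A named, versioned eval set with its locked numbers (hypothetical schema)."""
    name: str             # e.g. "Production-eval-v1.4"
    sample_size: int      # number of prompts in the set
    threshold: float      # locked pass threshold, set up front
    max_unit_cost: float  # assumed ceiling on cost per completion, in dollars


@dataclass(frozen=True)
class EvalResult:
    """One measurement of a feature against its spec (hypothetical schema)."""
    bench_score: float    # score on the named eval set
    traffic_score: float  # score on a sample of recent production traffic
    unit_cost: float      # observed cost per completion, in dollars


def eval_pass(spec: EvalSpec, result: EvalResult) -> dict:
    """Check the four parts one by one; eval-pass requires all of them."""
    checks = {
        "named_eval_set": bool(spec.name) and spec.sample_size > 0,
        "meets_locked_threshold": result.bench_score >= spec.threshold,
        "holds_on_traffic": result.traffic_score >= spec.threshold,
        "defensible_unit_cost": result.unit_cost <= spec.max_unit_cost,
    }
    checks["eval_pass"] = all(checks.values())
    return checks


# Illustrative numbers: same quality, two different unit costs.
spec = EvalSpec("Production-eval-v1.4", 240, 0.82, 0.05)
print(eval_pass(spec, EvalResult(0.86, 0.84, 0.04))["eval_pass"])  # True
print(eval_pass(spec, EvalResult(0.86, 0.84, 0.40))["eval_pass"])  # False
```

Returning the individual checks rather than a single boolean is a deliberate choice here: a QBR reader needs to know whether the feature missed on quality, on freshness, or on cost.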
The vocabulary matters. Teams that say “the agent is shipped” when they mean “the code is deployed” are lying to themselves. Teams that say “the agent is at threshold” mean something specific and verifiable. Vocabulary discipline is the cheapest preventive medicine an AI project can take.
Three levels of eval-pass
Eval-pass is not a single bar; it is three increasingly demanding levels. A mature AI project knows which level it is reporting against and does not conflate them.
Level 1: bench-pass. The system passes the test set in a controlled run. Sample size is the test set, conditions are the test conditions, traffic is synthetic. This level is necessary and not sufficient. It is what most teams call “the agent passes eval” and it is the lowest level. We argue the case-study analog of this in the case studies eval scores piece.
Level 2: traffic-pass. The system passes against production traffic in shadow or canary mode, at a sample size that is statistically meaningful. Threshold holds across the distribution of real customer inputs, not just the curated test set. This is the level a sponsor should require before a feature is reported as “shipped at threshold.”
Level 3: cost-and-traffic-pass. The system passes at production traffic and threshold and at a unit cost the product economics can absorb. A system that passes at threshold but breaks the gross margin of the product is not a success; it is a quality target met at an indefensible price.
The QBR template should report which level each named feature is at, not just whether it has shipped. A feature at Level 1 is in pilot. A feature at Level 2 is in production. A feature at Level 3 is producing the unit economics the project was funded for. The three levels are the difference between a real success metric and a metric that pretends to be one.
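One way to keep the three levels from being conflated is to compute the level rather than assert it. The sketch below is illustrative only: the function signature, the `EvalLevel` names, and the use of a minimum sample count as a stand-in for "statistically meaningful" are assumptions, and a real project would pick its own test for sample adequacy.

```python
from enum import IntEnum
from typing import Optional


class EvalLevel(IntEnum):
    NONE = 0                   # does not pass the bench set
    BENCH_PASS = 1             # Level 1
    TRAFFIC_PASS = 2           # Level 2
    COST_AND_TRAFFIC_PASS = 3  # Level 3


# Status wording taken from the paragraph above; NONE is an added label.
QBR_STATUS = {
    EvalLevel.NONE: "below bench threshold",
    EvalLevel.BENCH_PASS: "in pilot",
    EvalLevel.TRAFFIC_PASS: "in production",
    EvalLevel.COST_AND_TRAFFIC_PASS: "producing the funded unit economics",
}


def eval_level(
    bench_score: float,
    traffic_score: Optional[float],  # None if no shadow/canary run exists yet
    traffic_samples: int,
    unit_cost: float,
    *,
    threshold: float,
    min_traffic_samples: int,        # assumed proxy for a meaningful sample
    max_unit_cost: float,
) -> EvalLevel:
    """Return the highest level the feature has earned; levels are cumulative."""
    if bench_score < threshold:
        return EvalLevel.NONE
    traffic_ok = (
        traffic_score is not None
        and traffic_samples >= min_traffic_samples
        and traffic_score >= threshold
    )
    if not traffic_ok:
        return EvalLevel.BENCH_PASS
    if unit_cost > max_unit_cost:
        return EvalLevel.TRAFFIC_PASS
    return EvalLevel.COST_AND_TRAFFIC_PASS


# Hypothetical feature: passes bench and traffic, but costs too much per unit.
level = eval_level(0.86, 0.84, 1200, 0.40,
                   threshold=0.82, min_traffic_samples=500, max_unit_cost=0.05)
print(level.name, "->", QBR_STATUS[level])  # TRAFFIC_PASS -> in production
```

Because the levels are ordered, a QBR table can sort or filter on them directly, and a feature cannot be reported at Level 3 without also satisfying Levels 1 and 2.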
How to swap the metric without losing operational rhythm
Engineering teams are not opposed to outcome metrics in principle. They are opposed to losing the operational rhythm that feature-level tracking provides: the daily and weekly cadence that makes work legible to the people doing it. The swap has to preserve the rhythm.
Three moves make the swap work in practice.
Move 1: keep feature-level tracking as a sub-metric, not a headline. Engineering still tracks features in the issue tracker; the team still uses the sprint cadence; the daily standup still surfaces what is in flight. The swap is at the QBR layer and the board memo layer, not at the engineering ticket layer. Feature counts move from the headline page to the appendix.
Move 2: introduce eval delta as the weekly engineering signal. Once a week, the eval suite runs against the locked threshold and the delta is reported alongside the burndown. Teams that track eval delta weekly stop being surprised by quarter-end eval reports because the weekly signal makes the trajectory visible. The eval delta is the team’s leading indicator; the feature count is the lagging activity log.
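As a rough illustration of the weekly signal, the sketch below turns a series of weekly eval scores into the two numbers a team would read next to the burndown: the change since last week and the margin against the locked threshold. The function name, the row layout, and the sample scores are all hypothetical.

```python
from typing import List, Optional


def weekly_eval_deltas(scores: List[float], threshold: float) -> List[dict]:
    """Build one report row per weekly eval run.

    delta  = this week's score minus last week's (None for the first run)
    margin = this week's score minus the locked threshold
    """
    rows = []
    previous: Optional[float] = None
    for week, score in enumerate(scores, start=1):
        rows.append({
            "week": week,
            "score": score,
            "delta": None if previous is None else round(score - previous, 4),
            "margin": round(score - threshold, 4),
            "at_threshold": score >= threshold,
        })
        previous = score
    return rows


# Hypothetical four weeks against a locked threshold of 0.82.
for row in weekly_eval_deltas([0.78, 0.80, 0.83, 0.81], threshold=0.82):
    print(row)
```

In this made-up series the feature crosses the threshold in week 3 and slips back under it in week 4. A weekly negative delta like that is the early warning that a quarter-end report would otherwise deliver late.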
Move 3: tie sprint demos to eval reports, not screen captures. The sprint demo for an AI feature is not a screen capture of the agent doing the right thing on a curated input. It is the eval delta on the test set the feature was supposed to move. A sprint that did not move the eval delta did not produce engineering value, regardless of how many tickets closed.
Teams that make these three moves keep their operational rhythm and gain a meaningful headline metric. Teams that try to swap the headline metric without preserving the rhythm produce friction, and the organization falls back on the legacy headline by default. The rhythm matters.
The institutional resistance and how to navigate it
Three institutional resistances will surface when an organization tries to retire features shipped as the headline metric. Each is real and each is navigable.
Resistance 1: PMs trained on feature-shipped throughput. Product managers have been measured on feature throughput for a decade. Asking them to measure on eval delta sounds like asking them to be measured on something they cannot directly control. The navigation: PMs control eval set design, threshold-locking decisions, and the prioritization of regressions. Those are the levers; the eval delta is the outcome of pulling them. Reframe the PM’s job as “owner of eval set quality and threshold strategy” and the resistance evaporates.
Resistance 2: leadership trained on velocity dashboards. Senior engineering leadership has decade-long muscle memory for sprint velocity, story points, and feature counts. Asking them to lead with eval delta feels like flying blind. The navigation: build the eval dashboard before retiring the velocity dashboard. Run them in parallel for one quarter. By quarter end the eval dashboard is the one people are looking at, because it is the one that predicts the customer outcome.
Resistance 3: finance trained on output-based budget defense. Finance teams defending the AI budget to the CFO have been trained to count outputs (features shipped, tickets closed, demos delivered). Eval delta and unit cost feel less concrete to a non-technical CFO. The navigation: report eval delta and unit cost trajectory in dollar terms, such as “the eval improvement saved $1.2M in projected regression cost” or “the unit cost reduction reduced annualized inference spend by 38 percent.” Finance teams translate; do not ask them to defend a metric they cannot translate.
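The unit-cost half of that translation is simple arithmetic: annualized spend is cost per completion times monthly volume times twelve. The sketch below shows the calculation under stated assumptions. The unit costs and the monthly volume are hypothetical, chosen only so the result lands on the 38 percent figure used as the example phrasing above, and the function names are illustrative.

```python
def annualized_inference_spend(unit_cost: float, completions_per_month: int) -> float:
    """Annualized spend in dollars, assuming flat monthly volume."""
    return unit_cost * completions_per_month * 12


def spend_reduction(old_unit_cost: float, new_unit_cost: float,
                    completions_per_month: int) -> dict:
    """Translate a unit-cost change into dollars and percent for a finance memo."""
    before = annualized_inference_spend(old_unit_cost, completions_per_month)
    after = annualized_inference_spend(new_unit_cost, completions_per_month)
    saved = before - after
    return {
        "before": round(before, 2),
        "after": round(after, 2),
        "saved": round(saved, 2),
        "percent": round(100 * saved / before, 1),
    }


# Hypothetical inputs: $0.050 -> $0.031 per completion at 1,000,000 completions/month.
print(spend_reduction(0.050, 0.031, 1_000_000))
# {'before': 600000.0, 'after': 372000.0, 'saved': 228000.0, 'percent': 38.0}
```

The point of the exercise is that the percent and the dollar figure come from the same two inputs the engineering team already tracks, so finance is translating a measured number rather than defending an abstract one.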
The pattern across all three: feature-shipped metrics survived because they were the easiest metric to defend. The eval-pass swap survives by being defended in the same translation layers (design, dashboards, dollars) that the legacy metric used.
Frequently asked questions
Why are features shipped a bad metric for AI projects?
Because AI features operate on a continuous spectrum (67 percent, 81 percent, 94 percent eval pass), not on a binary works/breaks axis like CRUD features. Counting feature ships without naming the eval threshold counts a metric whose meaning depends on data the count is hiding. The legacy metric was approximately right for legacy software because the work was approximately uniform; AI work is not uniform.
What is eval-pass success?
The measurement that an AI feature passes its named eval set at a locked threshold, holding across recent production traffic, at a defensible cost per canonical unit. All four parts are required: named set, locked threshold, traffic-holding, defensible unit cost. Anything short of all four is “code in production,” which is not the same as a feature that works.
What are the three levels of eval-pass?
Level 1 bench-pass (passes the test set in a controlled run on synthetic traffic). Level 2 traffic-pass (passes against production traffic in shadow or canary at statistically meaningful sample size). Level 3 cost-and-traffic-pass (passes at threshold and at a unit cost the product economics can absorb). The QBR should report which level each feature is at, not just whether it has shipped.
Does this mean engineering should stop tracking features?
No. Feature-level tracking stays in the issue tracker and the sprint cadence; it is the operational rhythm of doing the work. The swap is at the QBR layer and the board memo layer: features move from the headline page to the appendix, eval delta and unit cost trajectory move to the headline.
How does eval delta become a weekly engineering signal?
The eval suite runs once a week against the locked threshold. The delta is reported alongside the sprint burndown. Teams that track eval delta weekly stop being surprised by quarter-end eval reports because the weekly signal makes the trajectory visible. The eval delta is the leading indicator; the feature count is the lagging activity log.
What does a sprint demo look like under eval-pass success?
The sprint demo for an AI feature is the eval delta on the test set the feature was supposed to move, not a screen capture of the agent doing the right thing on a curated input. A sprint that did not move the eval delta did not produce engineering value, regardless of how many tickets closed.
How do PMs adapt when feature throughput is no longer the headline?
PMs reframe their role as owner of eval set quality and threshold strategy. They control which evals exist, what threshold gets locked, how regressions are prioritized. Those are the levers; the eval delta is the outcome of pulling them. The role is more strategic, not less, and the resistance to the swap evaporates when PMs see the levers they own.
How does this relate to the AI project quarterly review?
The QBR’s correctness cluster (eval score, eval delta, eval freshness, production traffic share) and economics cluster (cost per unit, four-quarter trajectory, retainer SLA) are the headline metrics that replace features shipped on the QBR. We unpack the eleven metrics in the QBR metrics piece.
Is this just about agencies, or also about internal teams?
Both. Internal AI teams pay the same cost of measuring on the wrong headline metric; they just pay it in headcount and trust rather than billable invoices. The eval-pass success swap applies identically to internal teams; the hardest part is that internal teams cannot renegotiate their own success criteria the way an SOW can be renegotiated.
What about productivity-substitution AI work where the workload is simpler?
For narrow productivity-substitution AI (high-volume, low-stakes tasks with a clean human baseline), feature-shipped metrics may be approximately correct because the eval bar is approximately uniform across features. Even there, the eval delta is a better metric; it is just less load-bearing. Capability-expanding and platform-building AI work is where the swap matters most.
Key takeaways
- Features shipped is a metric whose meaning depends on data the count is hiding. AI features operate on a continuous spectrum; counting ships without naming the eval threshold is counting noise.
- Eval-pass success has four required parts: a named eval set, a locked threshold, traffic-holding, and a defensible unit cost. All four are required.
- Three levels of eval-pass: bench-pass, traffic-pass, cost-and-traffic-pass. The QBR should report which level each feature is at.
- The swap is at the QBR and board-memo layer, not at the engineering ticket layer. Feature counts move from headline page to appendix.
- Eval delta is the weekly engineering signal. The sprint demo is the eval delta, not a screen capture.
- PMs reframe as owners of eval set quality and threshold strategy. Engineering leadership runs eval and velocity dashboards in parallel for one quarter, then retires velocity. Finance translates eval delta and unit cost into dollar terms.
- The fix is not to refine features shipped as a metric. The fix is to retire it as the headline. Feature-shipped metrics survived because they were defensible; eval-pass survives by being defended in the same translation layers.
The hardest part of measuring AI project success is admitting that the metric you have been using is the metric that hides the failure mode in plain sight. Once admitted, the swap is mechanical and the operational rhythm survives intact.
Arthur Wandzel