
The AI Project Burn-Rate Dashboard Every CTO Should Run

Most CTOs running AI projects in 2026 are flying blind on burn rate. The dashboards inherited from the SaaS era (feature velocity, sprint completion, story-point burndown) were designed for deterministic software where the dominant cost was engineer-hours and the dominant risk was scope. AI projects invert both. The dominant cost is inference and evaluation, not engineer-hours; the dominant risk is quality regression and budget burn, not feature scope. A CTO running AI on a sprint dashboard sees the project staying on schedule until the cost line breaks the budget and the eval line breaks the quality bar, at which point the project has already failed both gates and the post-mortem starts. This piece specifies nine named widgets for the AI project burn-rate dashboard: what each widget shows, the threshold that should trigger action, and how to instrument it without rebuilding the observability stack.

It is a spoke under the AI project economics manifesto, which argues AI economics has shifted from feature cost to evaluation cost. The dashboard is the operational instrument for that shift; it makes evaluation cost and budget burn visible at the cadence at which the CTO makes decisions.

Why feature dashboards fail for AI

The SaaS-era CTO dashboard tracks four things: velocity (story points per sprint), throughput (PRs merged), uptime, and incident count. All four are valid for AI projects, but none of them detect the failure modes that kill AI work.

A team can be at full velocity, full throughput, full uptime, and zero incidents while burning 3x its inference budget on a feature that does not pass the eval bar. The feature dashboard shows green; the project is dead. The CFO finds out at the next budget review when the inference invoice prints.

The structural fix is to add four signals the SaaS dashboard does not carry: unit cost, eval discipline, budget burn, and scope drift. Nine widgets cover those four signals at the granularity a CTO needs to act on weekly. They are not a replacement for the feature dashboard; they are an addition, owned by the AI platform team and reviewed by the CTO at the same cadence as the engineering operations review.

Widget 1: Cost-per-call rolling 7-day

What it shows. The blended cost in dollars per AI call, averaged across all production traffic, computed on a rolling 7-day window. A single number with a sparkline showing the last 30 days. Decomposed by action class (chat call, retrieval call, agent step, eval run) in a tooltip drilldown.

Threshold for action. A 20% week-over-week increase warrants a review. A 50% increase is a stop-the-line moment. The 7-day window smooths daily noise from traffic mix; weekly drift signals either a model upgrade with worse pricing, a prompt that has bloated, or a feature shipping with a heavier model than the budget assumed.

How to instrument. Tag every call at the inference gateway with (action_class, model, feature_id, team_id). Sum input plus output token cost per call, divide by call count, average across the window. Most teams already capture the data; the widget is a single SQL query and a sparkline render. The structural model is detailed in the AI cost-per-action framework.
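
As a rough sketch of the arithmetic, the code below rolls tagged call records into the 7-day blended figure and the week-over-week delta the threshold acts on. The CallRecord shape and field names are illustrative, not a real gateway schema.

```python
# Minimal sketch: rolling 7-day cost-per-call and its week-over-week delta.
# CallRecord is a stand-in for whatever the inference gateway actually logs.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CallRecord:
    ts: datetime
    action_class: str  # e.g. "chat", "retrieval", "agent_step", "eval_run"
    cost_usd: float    # input + output token cost for this call


def cost_per_call_7d(calls: list[CallRecord], now: datetime) -> float:
    """Blended dollars per call over the trailing 7-day window."""
    window = [c for c in calls if now - timedelta(days=7) <= c.ts <= now]
    return sum(c.cost_usd for c in window) / len(window) if window else 0.0


def week_over_week_delta(calls: list[CallRecord], now: datetime) -> float:
    """Fractional change vs. the prior window; 0.2 crosses the review threshold."""
    current = cost_per_call_7d(calls, now)
    previous = cost_per_call_7d(calls, now - timedelta(days=7))
    return (current - previous) / previous if previous else 0.0
```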

Widget 2: Eval-pass-rate

What it shows. The percentage of eval-suite runs in the last 7 days that passed the production threshold for each model and feature. A single number per model-feature pair, color-coded against a baseline. The baseline is the eval-pass-rate at the time the model-feature pair was last released to production.

Threshold for action. A 5-point drop below baseline triggers a review. A 10-point drop triggers a release freeze on that feature. Eval-pass-rate is the most direct quality signal a CTO has; every other quality metric (NPS, support tickets, A/B test wins) lags by weeks, while eval-pass-rate is real-time.

How to instrument. Run the eval suite on production traffic samples (or a held-out set) at the cadence appropriate to the feature; hourly for high-volume, daily for medium, weekly for low. Score against the threshold. Store the pass-rate time series. The eval suite itself is a precondition; if the team does not have one, the widget cannot be built and the project is uninstrumented. Detailed in stop budgeting AI projects in story points, budget them in eval runs.
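
A minimal sketch of the pass-rate math and the baseline comparison, assuming each eval run is stored as a boolean pass/fail per model-feature pair; the 5-point and 10-point thresholds follow the text.

```python
# Minimal sketch: eval-pass-rate for a model-feature pair, scored against the
# baseline captured at its last production release. Inputs are illustrative.
def pass_rate(results: list[bool]) -> float:
    """Percentage of eval runs in the window that cleared the production threshold."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def eval_status(current_pct: float, baseline_pct: float) -> str:
    """Color-code the pair: 5-point drop = review, 10-point drop = release freeze."""
    drop = baseline_pct - current_pct
    if drop >= 10:
        return "freeze"
    if drop >= 5:
        return "review"
    return "ok"
```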

Widget 3: Inference budget burn percentage

What it shows. Month-to-date inference spend divided by monthly budget, expressed as a percentage. Annotated with the calendar percentage of the month elapsed. If the spend percentage exceeds the time percentage, the project is burning ahead of pace; if it trails, behind it. Decomposed by feature in a drilldown.

Threshold for action. A burn percentage 10 points ahead of the time percentage on day 15 of the month warrants a forecast review. A burn percentage 25 points ahead triggers a feature-level intervention; typically prompt optimization, model downgrading on low-stakes calls, or a cap on a runaway agent loop.

How to instrument. Aggregate the cost-per-call data by month, sum to a running total, divide by the budgeted monthly inference spend, and render as a percentage with a calendar overlay. The budget number lives in the project plan; the widget pulls it from a config file. This widget catches the failure mode that kills most AI projects: silent burn that compounds for three weeks before anyone notices. The pattern is named in anatomy of a runaway AI project.
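
A minimal sketch of the burn-versus-calendar comparison, assuming month-to-date spend is already aggregated and the monthly budget comes from a config value; names and thresholds follow the text.

```python
# Minimal sketch: month-to-date burn percentage vs. calendar percentage, with
# the 10-point and 25-point intervention thresholds from the text.
import calendar
from datetime import date


def burn_status(spend_mtd: float, monthly_budget: float, today: date) -> dict:
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    burn_pct = 100.0 * spend_mtd / monthly_budget
    time_pct = 100.0 * today.day / days_in_month
    lead = burn_pct - time_pct  # positive = spending ahead of the calendar
    if lead >= 25:
        action = "intervene"        # prompt optimization, model downgrade, agent cap
    elif lead >= 10:
        action = "forecast-review"
    else:
        action = "ok"
    return {"burn_pct": burn_pct, "time_pct": time_pct, "lead": lead, "action": action}
```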

Widget 4: Regression count

What it shows. The number of eval-bar regressions opened in the last 7 days that have not been closed. A regression is opened when an eval falls below threshold for a model-feature pair; it is closed when the pair returns above threshold or when the threshold is intentionally adjusted with sign-off. A single number with a 4-week trend line.

Threshold for action. Any open regression warrants a review at the daily standup. More than three open regressions for more than 48 hours triggers a release freeze. Regressions that age past one week are escalated to the CTO directly; they indicate a quality issue that the team is not closing.

How to instrument. Wire the eval suite output into the issue tracker. Each below-threshold result auto-creates an issue with the action class, model, feature, and the threshold delta. The widget is a count query against the issue tracker filtered by label and state. The discipline of treating eval regressions as blocking issues is what separates AI projects with a working quality bar from ones without. Without it, the widget reads zero forever and the dashboard lies.
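
A minimal sketch of the count query, assuming each auto-created regression is represented as a record with an open/closed state and an opened-at timestamp; the 48-hour and one-week rules follow the text, and the issue-tracker wiring itself is left out.

```python
# Minimal sketch: open-regression count with the freeze and escalation rules.
# Regression records are stand-ins for issues auto-created by the eval suite.
from datetime import datetime, timedelta


def regression_widget(regressions: list[dict], now: datetime) -> dict:
    """regressions: [{"state": "open"|"closed", "opened_at": datetime, ...}]."""
    open_regs = [r for r in regressions if r["state"] == "open"]
    aged_48h = [r for r in open_regs if now - r["opened_at"] > timedelta(hours=48)]
    return {
        "open_count": len(open_regs),
        "release_freeze": len(aged_48h) > 3,  # >3 open regressions for >48 hours
        "escalate_to_cto": any(
            now - r["opened_at"] > timedelta(days=7) for r in open_regs
        ),
    }
```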

Widget 5: Model upgrade test status

What it shows. For each frontier model release in the last 30 days, whether the team has run the eval suite against the new model and what the pass-rate delta is versus the production model. A small grid: rows are model versions, columns are features, cells are pass-rate deltas with a color band.

Threshold for action. A new model release that has not been tested within 14 days of release warrants a review; the team is not keeping up with the upgrade cycle. A new model with a positive pass-rate delta and a lower price warrants a migration plan; a new model with a negative pass-rate delta warrants a no-go decision and an entry in the model deprecation log.

How to instrument. Track frontier model release dates from a watched feed (provider release notes, model registry). For each release, run the eval suite in shadow against the new model with production traffic samples. Capture the pass-rate and the cost-per-call delta. The widget is the pass-rate grid. Why this matters for the budget line is detailed in why your AI project budget should have a model deprecation reserve.
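
A minimal sketch of the grid logic, assuming each tracked release carries its release date and, once the shadow run has happened, a per-feature pass rate; record shapes and names are illustrative.

```python
# Minimal sketch: pass-rate-delta grid for new model releases vs. production,
# plus the 14-day "not yet tested" flag. Record shapes are illustrative.
from datetime import date, timedelta


def upgrade_grid(releases: list[dict], prod_pass_rate: dict[str, float],
                 today: date) -> list[dict]:
    """releases: [{"model": str, "released": date, "pass_rate": {feature: pct}}]."""
    rows = []
    for r in releases:
        tested = bool(r.get("pass_rate"))
        overdue = not tested and (today - r["released"] > timedelta(days=14))
        deltas = {
            feature: r["pass_rate"][feature] - prod_pass_rate.get(feature, 0.0)
            for feature in (r.get("pass_rate") or {})
        }
        rows.append({"model": r["model"], "overdue": overdue, "deltas": deltas})
    return rows
```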

Widget 6: Prompt-version drift

What it shows. The number of distinct prompt versions in production for each registered prompt, and the percentage of traffic on the canonical version versus drifted versions. A drift exists when production traffic is hitting a prompt version that is not the canonical one in the registry; typically because someone shipped a hotfix without updating the registry, or a feature flag is routing to an experimental version that did not get cleaned up.

Threshold for action. Any drift over 5% of traffic for more than 7 days warrants a cleanup. Drift over 20% triggers a registry hygiene sprint. Drifted prompts are eval-uncovered by default, so drift is a leading indicator of eval-bar erosion.

How to instrument. Compute a hash of the prompt text on every call at the gateway. Compare against the canonical hash from the prompt registry. Bucket traffic by hash. The widget is the bucket distribution. The instrumentation requires the prompt registry to be the source of truth; a project without a registry cannot measure its drift, and the widget cannot be built.
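
A minimal sketch of the hash-and-bucket step, assuming the registry exposes a canonical hash per registered prompt; the hash function choice is arbitrary and the percentages follow the thresholds in the text.

```python
# Minimal sketch: prompt-version drift from per-call prompt hashes compared
# against the registry's canonical hash. Names are illustrative.
import hashlib
from collections import Counter


def prompt_hash(prompt_text: str) -> str:
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def drift_report(call_hashes: list[str], canonical_hash: str) -> dict:
    """Share of traffic on the canonical version vs. drifted versions."""
    buckets = Counter(call_hashes)
    total = sum(buckets.values())
    drifted = total - buckets.get(canonical_hash, 0)
    drift_pct = 100.0 * drifted / total if total else 0.0
    return {
        "drift_pct": drift_pct,
        "cleanup": drift_pct > 5,        # >5% of traffic for >7 days warrants cleanup
        "hygiene_sprint": drift_pct > 20,
        "versions_in_prod": len(buckets),
    }
```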

Widget 7: On-call incident count

What it shows. The number of AI-specific on-call incidents in the last 7 days. AI-specific incidents are a separate category from generic infra incidents; they include eval threshold breaches, model API outages, jailbreak detections, hallucination escalations, and inference cost anomalies. A single number with a 4-week trend, broken down by category.

Threshold for action. More than two AI incidents per week warrants a root-cause review. The trend matters more than the count; a flat line at 3 incidents per week is a signal of operational maturity; a steepening curve is a signal that the system is degrading faster than the team is hardening it.

How to instrument. Add an ai_incident flag to the incident management system with a fixed taxonomy. Wire eval-bar regressions, model API errors, jailbreak red-team alerts, hallucination escalations, and inference cost anomalies as auto-incident generators. The widget is a count against the flag with category breakdown. The cost line for the budget is named in the AI project insurance line.
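
A minimal sketch of the count-by-category query, assuming incidents are records carrying the ai_incident flag and one of the taxonomy categories; field names are illustrative.

```python
# Minimal sketch: AI-specific incident count over the last 7 days, broken down
# by the fixed taxonomy named in the text. Record fields are illustrative.
from collections import Counter
from datetime import datetime, timedelta


def ai_incident_widget(incidents: list[dict], now: datetime) -> dict:
    """incidents: [{"opened_at": datetime, "ai_incident": bool, "category": str}]."""
    recent = [
        i for i in incidents
        if i.get("ai_incident") and now - i["opened_at"] <= timedelta(days=7)
    ]
    return {
        "count": len(recent),
        "by_category": Counter(i["category"] for i in recent),
        "root_cause_review": len(recent) > 2,  # more than two AI incidents per week
    }
```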

Widget 8: Cost-per-feature attribution

What it shows. The monthly inference cost attributed to each AI feature, ranked by cost. A bar chart, top 10 features. Each bar shows the cost, the call volume, and the cost per call. Annotated with the feature’s revenue or value attribution if known, so the CTO can see cost without value.

Threshold for action. A feature in the top three cost positions with no measurable value attribution warrants a kill review; the project is funding a feature that is not earning the cost. A feature with a cost-per-call 3x its peer average warrants a prompt or model review; the feature is over-paying for what it produces.

How to instrument. Tag every call with feature_id at the gateway. Sum cost by feature, by month. Join against the value attribution table maintained by product ops (or leave the cell blank if the team does not maintain one). The widget is a ranked bar chart with a value column. This widget catches the most common waste pattern: the legacy feature that nobody uses but still runs inference because it is wired into the default flow.
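
A minimal sketch of the attribution roll-up, assuming one record per call with a feature_id and dollar cost, and an optional value table keyed by feature; names are illustrative.

```python
# Minimal sketch: monthly cost by feature, ranked, with cost-per-call and an
# optional value column (None where product ops has no attribution yet).
from collections import defaultdict


def cost_per_feature(calls: list[dict],
                     value_by_feature: dict[str, float]) -> list[dict]:
    """calls: [{"feature_id": str, "cost_usd": float}], one record per call."""
    cost: dict[str, float] = defaultdict(float)
    volume: dict[str, int] = defaultdict(int)
    for c in calls:
        cost[c["feature_id"]] += c["cost_usd"]
        volume[c["feature_id"]] += 1
    rows = [
        {
            "feature_id": f,
            "cost_usd": cost[f],
            "calls": volume[f],
            "cost_per_call": cost[f] / volume[f],
            "value": value_by_feature.get(f),  # blank cell when unattributed
        }
        for f in cost
    ]
    return sorted(rows, key=lambda r: r["cost_usd"], reverse=True)[:10]
```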

Widget 9: Scope-delta queue

What it shows. The number of open scope-delta requests in the project queue that have not been resolved. A scope-delta is a feature request, an eval-bar adjustment, a prompt change, or a model change that exceeds the agreed scope of the current sprint. A single number with average age in days, broken down by request type.

Threshold for action. More than five open deltas for more than 14 days warrants a scope review. Deltas aging past 30 days indicate the project is no longer working from a defensible scope; every change is an exception. The CTO’s job is to keep the queue moving, either by approving deltas explicitly or rejecting them; deltas that sit are scope drift in slow motion.

How to instrument. Add a scope_delta label to the issue tracker. Require any change beyond the sprint commitment to file under the label with a resolution decision (approved-with-budget-impact, rejected, deferred). The widget is a count and average age query. The pattern is the operational version of the contract structure detailed in the decline of the fixed-price AI project.
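
A minimal sketch of the count-and-age query, assuming scope-delta issues carry the scope_delta label, an open/closed state, and an opened-at timestamp; thresholds follow the text and field names are illustrative.

```python
# Minimal sketch: open scope-delta count, average age, and the review triggers.
# Issue records are stand-ins for whatever the tracker exports.
from datetime import datetime


def scope_delta_widget(issues: list[dict], now: datetime) -> dict:
    """issues: [{"labels": [str], "state": str, "opened_at": datetime}]."""
    open_deltas = [
        i for i in issues
        if "scope_delta" in i["labels"] and i["state"] == "open"
    ]
    ages = [(now - i["opened_at"]).days for i in open_deltas]
    return {
        "open_count": len(open_deltas),
        "avg_age_days": sum(ages) / len(ages) if ages else 0.0,
        "scope_review": sum(1 for a in ages if a > 14) > 5,  # >5 deltas open >14 days
        "stale_30d": sum(1 for a in ages if a > 30),
    }
```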

How to roll out the dashboard

The dashboard does not have to be built in one quarter. The realistic rollout sequence is staged.

Stage 1 (week 1 to 4): Widgets 1, 3, and 8, the cost-side widgets. They require only the inference gateway tagging and are usually buildable from existing data within a sprint. They surface the budget burn problem before anything else can be built.

Stage 2 (week 4 to 8): Widgets 2 and 4, the eval-side widgets. They require the eval suite to be wired into CI and the issue tracker, which is more work but is the precondition for any quality discipline at all.

Stage 3 (week 8 to 12): Widgets 5, 6, and 7, the operational widgets. They require the prompt registry and the on-call category to be set up. Most teams hit a maturity inflection here: the dashboard now reflects an instrumented AI program.

Stage 4 (week 12+): Widget 9, the scope-delta widget. It requires a scope-delta workflow that the team is willing to enforce, which is a process commitment more than a technical one. Without it, the widget reads zero and lies.

The dashboard is reviewed weekly at the engineering operations review. The CTO’s job is not to read every widget (the platform team owns most of them) but to act on the threshold breaches that the platform team escalates and to maintain the cadence so the dashboard does not become wallpaper.

Frequently asked questions

Why does a feature dashboard fail for AI projects?

A feature dashboard tracks velocity, throughput, and incidents, all of which can be green while the project burns its inference budget and breaks its eval bar. AI projects fail on cost and quality, not on feature delivery. The burn-rate dashboard adds the four signals the feature dashboard misses: unit cost, eval discipline, budget burn, and scope drift.

What is the most important widget to build first?

Cost-per-call rolling 7-day, paired with inference budget burn percentage. The cost-side widgets are the cheapest to build (gateway tagging plus a SQL query) and surface the failure mode that kills most projects: silent budget burn. Eval-side widgets come second because they require the eval suite to be wired in.

What is the eval-pass-rate threshold for a release freeze?

A 10-point drop below the production baseline. The 5-point threshold triggers a review; the 10-point threshold triggers a freeze. Both numbers are conventions, not magic; the team picks the threshold that matches the risk profile of the feature. The discipline is having a threshold at all.

How does the burn-rate dashboard interact with the budget?

It instruments the budget. The inference budget burn percentage and the cost-per-feature attribution widgets are the operational view of the inference line in the project budget. Together with eval-pass-rate, they give the CTO the early signal to escalate to the CFO before the budget breaks rather than after.

Who owns the dashboard?

The AI platform team owns the dashboard infrastructure and most of the widgets. The CTO owns the review cadence and the threshold-breach escalations. Product owns the value attribution column on the cost-per-feature widget. The split is the same as the chargeback split; platform owns the instrument, consumer teams own the use case.

What is prompt-version drift and why is it on the dashboard?

Prompt-version drift is production traffic running against a prompt that is not the canonical version in the registry; usually because of a hotfix that did not get registered or an experimental version that was not cleaned up. Drifted prompts are eval-uncovered, so drift is a leading indicator of eval-bar erosion. Tracking it makes the registry the source of truth.

How does the scope-delta widget prevent runaway projects?

It surfaces scope drift before it compounds. AI projects that fail on scope fail because each individual change is small but the cumulative effect breaks the budget. The scope-delta queue puts every change in front of a review with a budget impact, which is what stops the compounding.

How is cost-per-feature attribution computed when revenue is hard to attribute?

Cost is usually computable; value attribution is the harder column. Teams that cannot attribute revenue use proxy value metrics; feature usage frequency, retention contribution, conversion contribution. Features with high cost and no proxy value are kill candidates regardless of whether revenue is precisely attributable. The point is to flag misalignment, not to compute exact ROI.

Does the dashboard work for in-house teams as well as agency-led projects?

Yes. The widgets are the same; the ownership split changes. In an agency-led project, the agency operates the platform side of the dashboard and shares it with the client CTO at the weekly review. In-house, the CTO and the platform team operate it together. The structural form is identical.

Key takeaways

  • Feature dashboards inherited from SaaS miss four signals AI projects fail on: unit cost, eval discipline, budget burn, and scope drift. Nine widgets cover those signals.
  • Cost-per-call (rolling 7-day), eval-pass-rate, and inference budget burn percentage are the three foundational widgets; they catch the failure modes that kill most AI projects before any other signal moves.
  • Regression count, model upgrade test status, prompt-version drift, and on-call incident count are the operational widgets; they instrument the discipline that keeps the eval bar from eroding silently.
  • Cost-per-feature attribution and scope-delta queue are the strategic widgets; they catch the misalignment between what the project is funding and what the project is delivering.
  • Roll out in four stages: cost-side (week 1 to 4), eval-side (week 4 to 8), operational (week 8 to 12), strategic (week 12+). The dashboard reflects program maturity; it cannot be built ahead of the program that produces the data.
  • The dashboard is the operational instrument of the AI project economics manifesto; feature cost is what the SaaS dashboard tracked; evaluation cost is what the AI dashboard adds.

The CTO who runs the burn-rate dashboard sees the budget and quality lines weeks before the CFO sees the invoice. The CTO who runs the feature dashboard alone finds out at the same time the CFO does, at which point the project is already failing both gates.

Last Updated: May 9, 2026

Arthur Wandzel
