Decoding "cost per query": a defensible unit-economics framework

“Cost per query” is the single most overused phrase in 2026 AI economics decks, and the single most likely to disappear under audit. The reason is that “query” is a vague unit by default: it describes neither the work the model did nor the deliverable the buyer paid for. This piece defines what makes cost-per-query defensible, walks the math, and answers the question every AI buyer asks: when does cost-per-query beat cost-per-action, and when does it not? The answer turns on whether the buyer is paying for inputs or outcomes, and most products misclassify themselves on that question.

The argument is a companion to the AI cost-per-action framework. Both metrics live underneath the AI project economics manifesto, and the choice between them is one of the higher-leverage decisions a finance team will make in the next twelve months.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why cost-per-query is overused
The four properties that make it defensible
The math, walked end to end
When cost-per-query beats cost-per-action
When cost-per-action beats cost-per-query
Hybrid models and the case against them
Implementation and CFO-defensible reporting
Frequently asked questions
Key takeaways

Why cost-per-query is overused

Three properties explain the phrase’s promiscuous use in 2026 AI decks.

It sounds quantitative without committing to a definition. “Our system runs at $0.012 per query” reads like unit economics and parses like marketing copy. The buyer hears a number; the vendor has not committed to what produces the number. Until the word “query” is defined, the metric is decorative.

It absorbs ambiguity at the boundary. When the engineering team adds intent classification, retrieval pre-fetch, or a self-critique pass, the question “is that one query or three?” rarely gets asked publicly. Each vendor answers it differently. The buyer cannot compare across vendors because the denominator differs.

It survives by hiding amortized cost. Cost-per-query that excludes embedding, eval, and observability (which describes almost every cost-per-query in the wild) understates real cost by 20 to 40 percent. The number looks attractive in the deck and gets smaller still under aggressive negotiation. Three quarters later the buyer’s gross margin shows the gap.

The point is not that cost-per-query is a bad metric. It is that cost-per-query without precise scope is a bad metric, and most cost-per-query in the wild is unscoped.

The four properties that make it defensible

A cost-per-query worth defending in front of a CFO has four explicit constraints. None of them is exotic, but all four together are rare.

Input-token range named. The cost-per-query metric carries a defined input-token interval; for example, “queries with 500 to 2,000 input tokens.” Outside the interval, the cost-per-query is a different number. RAG systems where retrieved context dominates have median input tokens that vary 3-5x between simple and complex queries. A single cost-per-query that does not name an interval is averaging across a range whose extremes are not comparable.

Output-token range named. Same logic, on the output side. The output-token tail is where surprises live: a query whose 95th-percentile output is four times the median can have a cost-per-query four times the headline number. Defensible cost-per-query quotes both median and 95th-percentile output tokens. This is the discipline that survives the model upgrades that change output verbosity by 20 to 50 percent.
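The median-versus-tail comparison can be sketched in a few lines, assuming per-query output-token counts are already being logged. The sample values, the nearest-rank percentile method, and the function names are illustrative assumptions rather than the output of any particular observability tool.

```python
from statistics import median

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point surprises in the rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def output_token_summary(output_tokens):
    """Return the median, the P95, and the tail-to-median ratio of output tokens."""
    med = median(output_tokens)
    p95 = percentile_nearest_rank(output_tokens, 95)
    return {"median": med, "p95": p95, "tail_ratio": p95 / med}

# Hypothetical sample of 20 logged output-token counts.
sample = [180, 190, 200, 205, 210, 215, 218, 220, 220, 220,
          220, 225, 230, 240, 260, 300, 380, 450, 540, 900]
summary = output_token_summary(sample)
print(summary)
```

A tail ratio well above 1 is the signal that a single headline number is hiding the expensive queries, which is why both figures belong in the quote.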

Eval-pass guarantee. A query whose output failed the eval suite is not a query that satisfies the contract. The defensible cost-per-query metric is conditioned on the eval pass: “$0.014 per query that scores ≥ 87 on the eval suite.” Without the eval condition, cost-per-query is an inference-cost measurement, not a unit-economics measurement. The eval condition is what aligns cost with what the buyer paid for.

Latency P95 named. A query that takes 18 seconds is not the same product as a query that takes 4 seconds, even at the same dollar cost. Defensible cost-per-query quotes a latency 95th-percentile target; for example, “$0.014 per query with eval pass ≥ 87 and latency P95 ≤ 4.5 seconds.” The latency constraint is what prevents the engineering team from quietly shifting to cheaper-but-slower configurations to hit the dollar number.

A cost-per-query that names all four (input range, output range, eval pass, latency P95) is defensible. A cost-per-query that names fewer is a marketing artifact.
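One way to make the four-constraint test mechanical is to model the envelope as a record in which an unnamed constraint is simply absent. This is a sketch under assumptions: the class name, field names, and the use of None for "not named" are hypothetical choices, and the example values are the illustrative ones used elsewhere in this article.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple

@dataclass(frozen=True)
class QueryEnvelope:
    """The four constraints of a cost-per-query quote; None means the quote did not name it."""
    input_tokens: Optional[Tuple[int, int]] = None    # (min, max) input tokens
    output_tokens: Optional[Tuple[int, int]] = None   # (median, P95) output tokens
    eval_pass_min: Optional[float] = None             # minimum eval score
    latency_p95_max_s: Optional[float] = None         # latency P95 ceiling, in seconds

    def unnamed(self):
        """List the constraints the quote leaves undefined, in declaration order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_defensible(self):
        """Defensible only when all four constraints are named."""
        return not self.unnamed()

scoped = QueryEnvelope((800, 2400), (220, 540), 87.0, 5.0)
headline_only = QueryEnvelope(latency_p95_max_s=5.0)
print(scoped.is_defensible(), headline_only.unnamed())
```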

The math, walked end to end

This section walks through the mechanics of computing a defensible cost-per-query for a 2026 RAG support agent. The system handles support questions in a SaaS product with a 2M-document knowledge base.

Step 1: Define the query envelope. Input tokens 800-2,400. Output tokens median 220, 95th-percentile 540. Eval pass ≥ 87 percent on a 400-case held-out set. Latency P95 ≤ 5.0 seconds. This envelope is the contract.

Step 2: Decompose the cost lines. These are the same six lines as cost-per-action: input tokens, output tokens, embedding amortized, retrieval, eval amortized, observability amortized. The decomposition is what makes the number portable across vendors. A vendor quoting cost-per-query that refuses the six-line breakdown is hiding which line the cost sits in.

Step 3: Compute each line at the envelope’s median and 95th percentile. At median input (1,400 tokens) and median output (220 tokens), the inference line for a Claude 4.7-class model is approximately $0.0042. At the 95th-percentile envelope, the inference line is approximately $0.0091. Both numbers should appear in the report. A single midpoint number hides the tail.

Step 4: Add the amortized lines. Embedding amortized over 12M monthly queries: $0.00018. Retrieval at the 95th-percentile chunk count: $0.00074. Eval amortized over weekly 800-case runs: $0.00041. Observability at full-trace depth: $0.00112. Sum of amortized lines: $0.00245.

Step 5: Compute the envelope’s two cost-per-query numbers. Median: $0.00665. 95th-percentile: $0.01155. The reported metric is “$0.0067 per query (median) / $0.0116 per query (P95), eval pass ≥ 87, latency P95 ≤ 5.0s.” That is a defensible cost-per-query.
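The arithmetic in Steps 3 to 5 can be reproduced in a short script. The dollar figures are the illustrative ones from the steps above; the variable and function names are assumptions, and the input and output token lines are combined into a single inference figure because that is how the steps quote them. Decimal is used so the sums are exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

# Inference line (input + output tokens) at each point of the envelope, USD per query.
INFERENCE = {"median": Decimal("0.0042"), "p95": Decimal("0.0091")}

# The remaining four lines, as quoted in Step 4, USD per query.
AMORTIZED = {
    "embedding": Decimal("0.00018"),
    "retrieval": Decimal("0.00074"),
    "eval": Decimal("0.00041"),
    "observability": Decimal("0.00112"),
}

def cost_per_query(inference_cost, other_lines):
    """Total cost per query: the inference line plus every other cost line."""
    return inference_cost + sum(other_lines.values())

amortized_total = sum(AMORTIZED.values())
median_cost = cost_per_query(INFERENCE["median"], AMORTIZED)
p95_cost = cost_per_query(INFERENCE["p95"], AMORTIZED)

print(f"amortized lines: ${amortized_total}")
print(f"median: ${median_cost}  P95: ${p95_cost}")
```

Keeping the lines in a dictionary rather than a single constant means a change to any one line (a new observability depth, a re-index interval) shows up as a one-line diff.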

An unscoped cost-per-query for the same system would have quoted only the median inference line, $0.0042, because the unscoped number quietly excludes the amortized lines and reports only the median. That is roughly 37 percent below the scoped median of $0.00665, and it says nothing about the P95. The gap is the gap between marketing cost-per-query and audit-grade cost-per-query.

When cost-per-query beats cost-per-action

Cost-per-query is the right unit in three product categories.

Product surfaces where the buyer is paying for inputs, not outcomes. A document Q&A app where the customer pays per question asked, regardless of whether the answer was useful, has “query” as the unit the buyer values. Cost-per-query is the unit that aligns with the revenue model. Cost-per-action would be over-engineered: there is no separable action beyond “answered the question.”

Read-only retrieval systems with one-shot interactions. A search-style API where one user input produces one model response, one time, with no multi-step agent loop. The query envelope is the action envelope. Cost-per-query and cost-per-action collapse into the same metric, and cost-per-query is the simpler name.

Internal-developer-platform pricing. When AI capability is exposed as an internal API to other teams, those teams pay for query volume, not for outcomes their downstream users produce. The internal API contract is naturally a cost-per-query contract: input range, output range, eval pass, latency P95. The downstream team owns the action-level metric on top of it.

In all three cases, the buyer’s mental model and the system’s call graph happen to align. Cost-per-query is the natural unit because it does not require translating between “what the system did” and “what the customer paid for.”

When cost-per-action beats cost-per-query

Cost-per-action is the right unit in three other categories.

Multi-step agent workflows. When a single user invocation triggers retrieval, generation, self-critique, and revision, “query” is ambiguous (one query? Four?) and “action” is unambiguous (one drafted email). Cost-per-query in agent systems is brittle by construction; the engineering team can rewrite the call graph and double the per-query count without changing the deliverable. Cost-per-action is invariant under those refactors.
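The brittleness is easy to see with a toy calculation. The sketch below assumes each model call is counted as one query (which is exactly the ambiguity at issue) and holds total spend constant across the refactor for illustration; a real refactor would usually change spend as well. All figures and names are hypothetical.

```python
def unit_costs(total_cost, model_calls, deliverables):
    """Compute both unit costs from the same total spend."""
    return {
        "per_query": total_cost / model_calls,
        "per_action": total_cost / deliverables,
    }

# Hypothetical month: 10,000 drafted emails for $200 of total spend.
before = unit_costs(200.0, model_calls=20_000, deliverables=10_000)  # 2 calls per email

# A refactor splits each call in two; the deliverable count is unchanged.
after = unit_costs(200.0, model_calls=40_000, deliverables=10_000)   # 4 calls per email

print(before)
print(after)
```

The per-query figure halves while the per-action figure does not move, even though nothing about cost or quality changed.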

Outcome-priced products. A sales assistant that prices per researched lead, per drafted email, or per scheduled meeting has actions, not queries, as the revenue unit. Reporting cost-per-query in an outcome-priced product creates a denominator mismatch that the CFO will not accept. The unit on the P&L should match the unit on the invoice.

Products undergoing rapid implementation churn. Any AI product whose call graph changes more than once a quarter (a 2026 baseline for products under active eval-driven optimization) should not denominate its economics in queries. The query-count denominator is unstable, which produces month-over-month cost-per-query swings that have nothing to do with cost or quality. Cost-per-action is invariant under those refactors and produces year-over-year trend lines that finance can defend.

The decision is mechanical. If the buyer pays per input, use cost-per-query. If the buyer pays per outcome, use cost-per-action. If the call graph is unstable, use cost-per-action. Hybrid products report both, with the relationship between them documented.
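Because the rule is mechanical, it fits in a single function. This is a sketch, not a prescribed tool: the function name and argument names are hypothetical, "unstable" is read as more than one call-graph change per quarter, and letting an unstable call graph override a per-input pricing model is an assumption about precedence that the rule itself does not spell out.

```python
def choose_unit(buyer_pays_per: str, call_graph_changes_per_quarter: int = 0) -> str:
    """Return the unit of account for a product.

    buyer_pays_per: "input" or "outcome" (the unit that appears on the invoice).
    call_graph_changes_per_quarter: how often the call graph is rewritten.
    """
    if buyer_pays_per not in ("input", "outcome"):
        raise ValueError("buyer_pays_per must be 'input' or 'outcome'")
    # Assumption: an unstable call graph wins, because the query
    # denominator itself cannot be trusted from month to month.
    if call_graph_changes_per_quarter > 1:
        return "cost-per-action"
    return "cost-per-query" if buyer_pays_per == "input" else "cost-per-action"

print(choose_unit("input"))
print(choose_unit("outcome"))
print(choose_unit("input", call_graph_changes_per_quarter=3))
```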

Hybrid models and the case against them

Some products attempt to report cost-per-query and cost-per-action simultaneously to “satisfy both audiences.” This usually fails, for two reasons.

The numbers diverge under stress and the divergence is not explained. When the engineering team adds a self-critique pass, cost-per-query rises 18 percent (more calls per action) while cost-per-action rises only 6 percent (the deliverable count is unchanged). The CFO seeing both numbers asks which one to use, and the team’s answer often depends on which number is more flattering this quarter. That is a governance failure.

Investor and procurement audiences want one number. The board deck has room for one unit-economics line. A team that reports two numbers is a team that has not made the unit-of-account decision. Make the decision once, document the decision, report against the decision. The other number is an internal diagnostic, not an external metric.

The exception worth making is when an AI product genuinely sells two different products (a per-query API and a per-action SaaS feature) that share infrastructure. Those products report cost-per-query for the API line and cost-per-action for the SaaS line, and the relationship between them lives in a single internal worksheet that surfaces only when the COGS roll-up is computed. Same infrastructure, different revenue contracts, different units, deliberate choice.

Implementation and CFO-defensible reporting

A cost-per-query reporting motion that survives audit looks like this.

Manifest YAML in the repo. The query envelope (input range, output range, eval pass, latency P95) lives in version control. Any change to the envelope is a pull request. The CFO can pull HEAD and read the current contract.

Six-line decomposition per envelope. The cost dashboard reports the six lines at the median and the 95th percentile. Every weekly report carries both numbers and the eval-pass percentage from the eval suite that gates production traffic.

Quarterly reset of the envelope. When the model changes, the input distribution changes, the eval suite expands, or the latency target moves, the envelope resets and the cost-per-query number resets with it. The previous envelope’s numbers stay in the historical record so trend analysis is still possible, but the “current” cost-per-query is always quoted against the current envelope.
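The reset triggers can be checked mechanically on a schedule. The sketch below uses the triggers named in this article, including the 15 percent input-distribution threshold given in the FAQ; the class and function names, the 90-day reading of "quarterly," and the idea of a single fractional shift figure are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class EnvelopeStatus:
    """What has changed since the envelope was last reset."""
    model_migrated: bool = False
    eval_suite_expanded: bool = False
    latency_target_changed: bool = False
    input_shift: float = 0.0      # fractional shift in the input distribution, e.g. 0.20
    days_since_reset: int = 0

def reset_reasons(status, shift_threshold=0.15, quarter_days=90):
    """Return every reason the envelope should be reset; an empty list means no reset is due."""
    reasons = []
    if status.model_migrated:
        reasons.append("model migration")
    if status.eval_suite_expanded:
        reasons.append("eval-suite expansion")
    if status.latency_target_changed:
        reasons.append("latency-target change")
    if status.input_shift > shift_threshold:
        reasons.append("input-distribution shift")
    if status.days_since_reset >= quarter_days:
        reasons.append("quarterly cadence")
    return reasons

quiet = EnvelopeStatus(input_shift=0.10, days_since_reset=30)
drifted = EnvelopeStatus(input_shift=0.22, days_since_reset=30)
print(reset_reasons(quiet), reset_reasons(drifted))
```

Returning the list of reasons, rather than a bare boolean, gives the pull request that changes the envelope a ready-made justification.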

Procurement-grade vendor scorecard. When evaluating a vendor that quotes cost-per-query, the buyer’s team asks the vendor to fill out the envelope (four constraints), the six-line decomposition, and the median-versus-P95 breakdown. Vendors that cannot produce these in 48 hours are not selling defensible economics; see a field guide to evaluating an AI agency in under 90 minutes for the broader vendor-vetting motion this slots into.
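A vendor response can be screened for completeness before anyone reads the numbers. This sketch assumes the response arrives as a nested dictionary; the key names and the response layout are hypothetical, and the zero cost values in the example are placeholders rather than real quotes.

```python
ENVELOPE_FIELDS = {"input_token_range", "output_token_range", "eval_pass", "latency_p95"}
COST_LINES = {"input_tokens", "output_tokens", "embedding", "retrieval", "eval", "observability"}

def scorecard_gaps(response):
    """Return a sorted list of the items a vendor response leaves out."""
    named = set(response.get("envelope", {}))
    gaps = [f"envelope.{name}" for name in ENVELOPE_FIELDS - named]
    # The six-line decomposition is required at both the median and the P95.
    for level in ("median", "p95"):
        lines = set(response.get("cost_lines", {}).get(level, {}))
        gaps += [f"cost_lines.{level}.{name}" for name in COST_LINES - lines]
    return sorted(gaps)

# Hypothetical response: no latency target, and only token lines at the P95.
vendor_response = {
    "envelope": {"input_token_range": [800, 2400], "output_token_range": [220, 540], "eval_pass": 87},
    "cost_lines": {
        "median": dict.fromkeys(COST_LINES, 0.0),
        "p95": {"input_tokens": 0.0, "output_tokens": 0.0},
    },
}
for gap in scorecard_gaps(vendor_response):
    print(gap)
```

An empty list does not make the economics sound; it only means the vendor has supplied enough to audit.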

The discipline is unromantic, but it is what separates AI economics that survive a model migration from AI economics that survive only until next quarter’s price reset.

Frequently asked questions

What is cost-per-query in AI economics?

Cost-per-query is the total cost of one model-served request that satisfies a defined envelope: input-token range, output-token range, eval-pass guarantee, and latency P95 target. Without all four constraints, “cost-per-query” is a marketing number, not a unit-economics measurement. With them, it is defensible against audit, comparable across vendors, and durable across model upgrades.

How is cost-per-query different from cost-per-action?

Cost-per-query is denominated in model-served requests: one user input produces one model response. Cost-per-action is denominated in customer-facing deliverables (one researched lead, one drafted email), each of which may require multiple queries. The two metrics align in single-shot retrieval systems and diverge in multi-step agent workflows. Pick the unit that matches the revenue model.

What makes cost-per-query “defensible”?

Four named constraints: input-token range, output-token range, eval-pass condition, latency P95 condition. Any cost-per-query without all four is a partial number that hides ambiguity at the boundary. Defensible cost-per-query also breaks the cost into the six standardized lines (input, output, embedding amortized, retrieval, eval amortized, observability amortized) so the buyer can audit each line independently.

When should I use cost-per-query instead of cost-per-action?

Use cost-per-query when the buyer pays per input rather than per outcome, when the system is read-only single-shot retrieval, or when the product is exposed as an internal-developer-platform API where downstream teams own the action-level economics. Use cost-per-action in agent workflows, outcome-priced products, and any system whose call graph is changing more than once a quarter.

Is the cost-per-query number stable across model upgrades?

Only if the envelope holds. When the model price changes, the cost-per-query number recomputes against the same envelope with the same six-line decomposition. The number moves; the unit of account does not. When the model upgrade changes output verbosity, the output-range constraint may need to be reset and the cost-per-query reported against the new envelope, with the previous envelope kept in the historical record.

What are the six lines that should appear in a defensible cost-per-query?

Input tokens, output tokens, embedding amortized over re-index interval, retrieval cost at 95th-percentile chunks, eval amortized over the eval-suite cadence, and observability amortized over query volume. The six-line decomposition is what makes cost-per-query portable across vendors. A vendor quoting cost-per-query without these lines is hiding which line the cost sits in.

How often should the cost-per-query envelope be reset?

Quarterly at minimum, and immediately on any of: model migration, eval-suite expansion, latency-target change, input-distribution shift greater than 15 percent. The reset is a contractual event, not a marketing event. Old envelopes stay in the historical record so trend analysis remains valid.

Can I report both cost-per-query and cost-per-action?

Yes, but only if the product genuinely has two revenue lines (a per-query API and a per-action SaaS feature, for example). For single-product reporting, picking one unit forces the team to make the unit-of-account decision once. Reporting both invites the team to cite whichever number is more flattering this quarter, which is a governance failure.

Key takeaways

Cost-per-query is the most overused AI economics metric because “query” is a vague unit by default. Most cost-per-query numbers in the wild are partial measurements without scope, eval condition, or amortized lines.
A defensible cost-per-query carries four named constraints (input-token range, output-token range, eval-pass guarantee, latency P95) and decomposes into six standardized lines, the same six used in cost-per-action.
Cost-per-query is the right unit when the buyer pays per input, the system is single-shot retrieval, or the product is exposed as an internal developer-platform API. Cost-per-action is the right unit in multi-step agent workflows, outcome-priced products, and systems whose call graph is changing.
Reporting both metrics simultaneously is rarely the right answer. Most products should pick one as the unit of account, document the choice, and treat the other as internal diagnostics.
The discipline that makes cost-per-query survive an audit is the same discipline that makes cost-per-action survive: envelope in version control, six-line decomposition in the dashboard, quarterly reset cadence, and a procurement-grade scorecard for vendor comparison.
