
The AI project model-routing economics: why a router saves 38% of inference spend


Most AI projects route every request to the strongest model on the menu. That was the right strategy in 2023, when the gap between the strongest model and the rest was wide enough that the strong model was usually the safe choice. In 2026 the gap has narrowed, the price spread has widened, and routing the easy 60 percent of requests to a smaller model, while reserving the strong model for the hard 40 percent, saves roughly 38 percent of inference spend without a measurable accuracy penalty. This piece walks through the four routing patterns that work, the eval-anchored breakeven for each, and the four failure modes that turn a well-meaning router into a quiet accuracy regression.

The argument sits inside the AI project economics manifesto: if inference is a pass-through line and observability is COGS, then the routing layer is the highest-leverage place to lower the inference line without compromising the eval bar.

Why routing matters more in 2026

Three structural changes between 2023 and 2026 made routing economics newly attractive.

The price spread widened. Frontier model pricing dropped 80 percent across the board, but the smaller-model price tier dropped further than the frontier tier. The price ratio between a strong frontier model (Claude Opus 4.7, GPT-5, Gemini Ultra-class) and a fast small model (Haiku 4.7-class, GPT-5 Mini, Gemini Flash) is now 8x to 15x. In 2023 it was 3x to 5x. Same accuracy delta, much higher savings on every request the small model can handle.

The accuracy gap narrowed. A request that genuinely cannot be served by a 2026 mid-tier or fast model is a smaller fraction of the request distribution than it was in 2023. On a typical enterprise workload (summarization, classification, simple code edits, structured extraction), the fast model now handles 50 to 70 percent of requests at parity. The strong model is only required for the residual 30 to 50 percent.

Eval tooling matured. Routing without a per-class eval is dangerous. In 2023 you mostly could not afford the eval engineering to know whether the small model held parity on a given request class. By 2026, Promptfoo, Inspect, OpenAI Evals, and Anthropic’s evaluation tooling make per-class accuracy measurement a normal engineering deliverable. The eval ammunition for a defensible router is now affordable.

These three shifts together moved routing from a research curiosity to a default architecture for production AI. Most well-run AI projects in 2026 ship a router on day one or in the first three months post-launch. Projects that do not are leaving 30 to 40 percent of inference spend on the table, sometimes more.

The four routing patterns that work

Not all routers are the same shape. The four patterns below are the ones we have seen save money in production, in roughly increasing order of complexity.

Pattern 1: Static class-based routing. Requests are tagged at ingest with a class (e.g., “classify intent,” “summarize,” “extract structured data,” “draft long-form reply”) and the router maps each class to a pre-decided model. Implementation is a switch statement. Eval is per-class. This is the right starting pattern. Most teams underestimate how much it captures: typically 70 to 80 percent of the savings of more complex patterns at 5 percent of the engineering cost.
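
A minimal sketch of pattern 1, assuming requests arrive pre-tagged with a class; the class names and model identifiers below are illustrative, not real endpoints:

```python
# Pattern 1 sketch: the route table is the entire router. Every entry
# should be backed by a per-class eval before it points at the fast model.
ROUTE_TABLE = {
    "classify_intent": "fast-model",
    "summarize": "fast-model",
    "extract_structured": "fast-model",
    "draft_long_form": "strong-model",
}

def route(request_class: str) -> str:
    # Unknown classes fall back to the strong model: the safe default
    # until an eval clears the fast route for that class.
    return ROUTE_TABLE.get(request_class, "strong-model")
```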

Pattern 2: Heuristic complexity routing. A lightweight pre-classifier (length heuristics, keyword presence, structure detection, sometimes a small embeddings model) estimates request complexity and routes accordingly. This adds nuance to pattern 1 by handling within-class variance. Useful when a single class (e.g., “draft reply”) has a wide complexity distribution.
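
A sketch of the heuristic layer, reusing ROUTE_TABLE from the pattern 1 sketch; the thresholds and cue words are illustrative and should be tuned per workload:

```python
def estimate_complexity(prompt: str) -> str:
    # Cheap signals only: input length, multi-step cues, embedded code.
    long_input = len(prompt) > 4000
    multi_step = any(cue in prompt.lower()
                     for cue in ("step by step", "compare", "justify"))
    has_code = "def " in prompt or "class " in prompt
    return "high" if (long_input or multi_step or has_code) else "low"

def route_with_heuristics(request_class: str, prompt: str) -> str:
    base = ROUTE_TABLE.get(request_class, "strong-model")
    # Within-class escalation: a normally-fast class with a complex
    # instance still goes to the strong model.
    if base == "fast-model" and estimate_complexity(prompt) == "high":
        return "strong-model"
    return base
```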

Pattern 3: Dynamic LLM-judge routing. A small model first reads the request and decides whether to handle it itself or escalate to the strong model. This is the most flexible pattern and the one most likely to be over-engineered. It captures the most value on highly heterogeneous workloads and over-promises on homogeneous ones.
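
A sketch of the self-escalation step; call_fast and call_strong are stand-ins for your model clients, not a real API, and the probe wording is illustrative:

```python
ESCALATE_PROBE = (
    "Decide whether you can answer the following request reliably. "
    "Reply with exactly HANDLE or ESCALATE.\n\nRequest: "
)

def dynamic_route(request: str, call_fast, call_strong) -> str:
    # The fast model spends one cheap call deciding. That call is
    # pre-classifier overhead the savings math has to absorb.
    verdict = call_fast(ESCALATE_PROBE + request).strip().upper()
    return call_strong(request) if verdict == "ESCALATE" else call_fast(request)
```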

Pattern 4: Cascade routing. The fast model attempts the request first; a confidence check or eval-judge step decides whether to retry on the strong model. This pattern is the right answer when accuracy thresholds are high and the cost of a bad answer dominates the cost of double inference. It is overkill for most workloads.
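
A cascade sketch under the same stand-in clients; confident can be a logprob threshold, a schema check, or an eval-judge step:

```python
def cascade_route(request: str, call_fast, call_strong, confident) -> str:
    # Fast model answers first; the worst case pays for both models,
    # so this only wins when a bad answer costs more than double inference.
    answer = call_fast(request)
    return answer if confident(request, answer) else call_strong(request)
```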

The pattern matters more than the implementation framework. Implementing pattern 1 well usually beats implementing pattern 4 poorly.

Where the 38 percent comes from

The 38 percent number is the median realized savings we observe across mid-market enterprise AI projects that ship a router. It is not a marketing claim; it is what the spreadsheets say. The arithmetic that produces it:

Take a representative production workload. Assume 60 percent of requests are routable to a fast model at parity, 40 percent require the strong model. Assume the price ratio between strong and fast is 12x (mid-range for 2026 pairings). Pre-router cost is 100 percent of requests at strong-model pricing. Post-router cost is 40 percent at strong + 60 percent at fast. With a 12x ratio, the 60 percent of requests on the fast model cost only 5 percent of the strong-model bill. Total post-router cost is roughly 45 percent of pre-router cost. Savings: 55 percent.
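
The same arithmetic as a few lines of Python, using the illustrative assumptions above:

```python
strong_share = 0.40   # requests that still need the strong model
fast_share = 0.60     # requests routable to the fast model at parity
price_ratio = 12      # strong-model price / fast-model price

pre_router = 1.0      # baseline: everything at strong-model pricing
post_router = strong_share + fast_share / price_ratio
print(f"post-router cost: {post_router:.0%}")         # -> 45%
print(f"theoretical savings: {1 - post_router:.0%}")  # -> 55%
```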

Why do realized savings sit at 38 percent rather than 55 percent? Because the assumptions above are optimistic: real routers carry a small classifier overhead, not all “routable” requests stay routable as workloads drift, and most teams hold a margin of safety on borderline classes. The 38 percent number is what you get when you discount the theoretical 55 percent by realistic routing accuracy and overhead.
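
One way the discount plays out, as a sketch; the haircut values below are assumptions chosen to illustrate the mechanics, not a measured decomposition:

```python
price_ratio = 12
fast_share = 0.45   # the 60% routable share, minus drift and a safety margin
# Assume a judge-style pre-classifier at 30% of fast-model cost,
# running on every request.
classifier_overhead = 0.30 * (1 / price_ratio)

post_router = (1 - fast_share) + fast_share / price_ratio + classifier_overhead
print(f"realized savings: {1 - post_router:.0%}")  # -> about 39%, near the 38% median
```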

The number is also workload-dependent. On highly heterogeneous workloads (mixed customer intents, varied content types) we see 45 to 55 percent. On highly homogeneous workloads (single-purpose pipelines) it can be 20 to 30 percent. Below 20 percent the routing investment usually does not pay back, and above 55 percent the team is probably over-routing to the fast model and quietly regressing accuracy. We connect this to the broader cost picture in the cost-per-query framework piece.

The eval-anchored breakeven

The router only works if the eval infrastructure exists to prove that small-model routes do not regress accuracy on their classes. Without per-class eval, the router is gambling.

The breakeven calculation is simple but rigorous. For each request class, measure: accuracy on the strong model (As), accuracy on the fast model (Af), inference cost on the strong model (Cs), and inference cost on the fast model (Cf). Route the class to the fast model if (As − Af) is less than the smallest accuracy regression the buyer will accept on that class, AND (Cs − Cf) is materially positive.

For most enterprise workloads the threshold for “smallest acceptable regression” is set by the eval bar in the manifesto: the contracted threshold below which the system fails. If the eval bar is 0.87 and the fast model scores 0.88 on a class while the strong model scores 0.90, the route is defensible because both clear the bar with margin. If the fast model scores 0.83 on a class while the bar is 0.87, the route fails the bar and is not defensible regardless of cost savings.

This is the eval-anchored breakeven: routing is justified only on classes where the fast model clears the eval bar. Routing for cost reasons on classes where the fast model does not clear the bar is a quiet accuracy regression that erodes the eval-threshold contract.
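
The breakeven as a per-class gate, in a sketch; per the paragraph above, bar clearance subsumes the regression check, and min_saving is an assumed materiality threshold:

```python
def route_class_to_fast(acc_fast: float, cost_strong: float,
                        cost_fast: float, eval_bar: float,
                        min_saving: float = 0.0) -> bool:
    # The contracted eval bar, not the strong-vs-fast delta, is the
    # binding constraint; cost only matters once the bar is cleared.
    clears_bar = acc_fast >= eval_bar
    saves_money = (cost_strong - cost_fast) > min_saving
    return clears_bar and saves_money

# The example above: bar 0.87, fast model at 0.88 -> defensible;
# fast model at 0.83 -> fails the bar regardless of cost savings.
```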

The four failure modes

Routers fail in four characteristic ways. Each is preventable with discipline.

Failure 1: Routing without per-class eval. The team ships a router based on an aggregate eval score and assumes “if overall accuracy holds, the router is fine.” This is wrong. Aggregate accuracy can hold while one critical class quietly regresses by 8 points because the easy classes mask the hard ones in the average. Mitigation: run per-class eval, set per-class accuracy bars, gate routing decisions on per-class clearance.
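
A toy illustration of the masking effect, with made-up traffic shares and scores:

```python
# class: (traffic share, fast-model accuracy); the regressed class
# scored 0.89 on the strong model, an 8-point drop.
classes = {
    "summarize": (0.50, 0.90),
    "classify_intent": (0.35, 0.92),
    "contract_review": (0.15, 0.81),
}
EVAL_BAR = 0.87

aggregate = sum(share * acc for share, acc in classes.values())
print(f"aggregate accuracy: {aggregate:.3f}")  # ~0.89, still above the bar
for name, (_, acc) in classes.items():
    print(name, "PASS" if acc >= EVAL_BAR else "FAIL")  # contract_review FAILs
```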

Failure 2: Workload drift after launch. The router was tuned on the launch-month workload. Three months in, the request distribution has shifted (new feature, new customer segment, new failure mode), and 15 percent of requests now route to the wrong model. Mitigation: re-eval the router monthly, watch for drift in the request distribution, retire stale class definitions.
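
One simple drift signal, as a sketch: total variation distance between the launch-month class mix and the current month's, with the re-eval threshold left as a team choice:

```python
def class_mix_drift(baseline: dict[str, float], current: dict[str, float]) -> float:
    # Total variation distance between two class distributions (0 to 1).
    # Crossing a chosen threshold (say 0.10) triggers a router re-eval.
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)
```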

Failure 3: Pre-classifier cost erases the savings. A heuristic or LLM-judge pre-classifier that costs 30 percent of the small-model inference cost erases a meaningful share of the savings. Mitigation: instrument the pre-classifier cost explicitly, hold it under 5 to 10 percent of small-model cost as a target, prefer cheap heuristics over LLM judges where they work.

Failure 4: The “over-routing” trap. A team eager to maximize savings routes 80 percent of requests to the fast model and accepts a small accuracy regression as “worth it.” This usually fails the eval-threshold contract on at least one class and produces a customer-visible quality regression six weeks later. Mitigation: hold the per-class eval bar; do not over-route to chase savings.

We discuss the broader pattern in the AI project budget anti-patterns piece; over-aggressive routing is one of the recurring failure modes across the engagements we have seen.

How to operationalize a router

Three practices turn the theory into production-ready routing.

Ship pattern 1 first; measure for two months; only then layer complexity. Static class-based routing captures most of the savings most of the time. The temptation to skip to pattern 3 or 4 wastes engineering and produces a system the team cannot reason about. Pattern 1 is auditable; patterns 3 and 4 are emergent.

Wire the router into the eval-suite. Every per-class eval runs against both routes. The router’s decisions are visible in observability. When per-class accuracy drifts, the router decision becomes part of the regression-triage discussion, not a hidden variable. We discuss the observability mechanics in the AI agency manifesto.

Re-evaluate the router on every model upgrade. A model upgrade changes the strong-vs-fast accuracy spread and may move classes between routes. Treat this as part of the model-upgrade re-eval cost: three to five times per year, two to four engineering days per upgrade for the router itself.

Frequently asked questions

Why is the realized savings 38 percent rather than 55 percent?

Because real routers carry classifier overhead, not all requests stay in their original class as workloads drift, and most teams hold a margin of safety on borderline classes. The theoretical 55 percent assumes a perfect router on a stable workload, which is rare. 38 percent is the median across realistic deployments.

Does routing work on every workload?

No. Highly homogeneous workloads (single-purpose pipelines on uniform inputs) see smaller savings, often 20 to 30 percent. Routing is most valuable on heterogeneous workloads (mixed intents, varied content types). The decision to invest in a router should follow the workload heterogeneity check.

Which routing pattern should we start with?

Pattern 1: static class-based routing. It captures 70 to 80 percent of the savings of more complex patterns at a fraction of the engineering cost. Move to pattern 2 only when within-class variance is meaningful. Move to pattern 3 only when class boundaries are unstable. Pattern 4 (cascade) is rarely the right starting answer.

Doesn’t an LLM-judge pre-classifier defeat the purpose?

It can. The pre-classifier cost has to stay under 5 to 10 percent of small-model inference cost or it erases the savings. A heuristic pre-classifier is usually cheaper and adequate. LLM-judge pre-classifiers are justified only when class boundaries truly cannot be captured by heuristics.

How do we know if we are over-routing?

Watch the per-class eval scores. If any class fails the contracted eval bar, you are over-routing on that class. The eval bar is the binding constraint, not the aggregate accuracy. Over-routing usually shows up as a slow regression on one or two specific classes that the aggregate score masks.

How does routing interact with reserved capacity?

Carefully. A reservation tied to a single model is incompatible with a router that dynamically picks across models. The clean pattern is multi-model marketplace commits at the cloud vendor layer, which we discuss in the AI project compute strategy piece.

How often should we re-tune the router?

Per model upgrade (three to five times per year) and on any meaningful workload change. The router is not a one-time decision; it is an artifact that lives inside the eval-suite and gets re-evaluated alongside the rest of the system.

Is the 38 percent savings an industry-wide number or an SFAI Labs number?

It is the median we observe across our engagements; published case studies from Anthropic, OpenAI, and major cloud vendors describe similar ranges (25 to 55 percent) on production routing deployments. The 38 percent is not a precise universal constant; it is the right central estimate for planning purposes.

Key takeaways

  • A model router saves roughly 38 percent of inference spend on a typical 2026 enterprise workload: the median across realistic deployments, with a realistic-workload range of 25 to 55 percent.
  • The four routing patterns are static class-based, heuristic complexity, LLM-judge dynamic, and cascade. Start with pattern 1; complexity rarely pays back.
  • The eval-anchored breakeven: route classes to the fast model only where the fast model clears the contracted eval bar.
  • The four failure modes are routing without per-class eval, ignoring workload drift, overpriced pre-classifiers, and over-routing to chase savings.
  • Re-evaluate the router on every model upgrade; the router lives inside the eval-suite, not outside it.

Last Updated: May 9, 2026


Arthur Wandzel
