“Production-ready” is the most overloaded phrase in any AI agency proposal. When a vendor writes that they will deliver a “production-ready AI system” in twelve weeks, that string of words can mean anything from “the demo runs on our laptop” to “the system has an SLO, an on-call rotation, a cost cap, a rollback path, and a runbook.” Those two interpretations differ by roughly an order of magnitude in cost and risk, and the gap is precisely where most AI engagements quietly fail. The fix is not to ban the phrase; the fix is to refuse to accept it as a single claim, decompose it into named axes, and ask the agency to commit to a specific level on each one before the contract is signed.
This piece breaks production-ready into eight axes that any non-trivial LLM or agentic system has to clear: eval coverage, observability, error recovery, cost cap enforcement, security review, on-call rotation, rollback path, and runbook. For each axis I describe what production-ready looks like, the question a buyer should ask in the proposal review, and the shape of a system that fails the test. By the end you should be able to read any AI agency proposal and translate “production-ready” into eight checkboxes the agency either commits to or does not. This sits inside the broader frame of the AI agency manifesto, which argues that 2026 buyers are paying for shipped, eval-gated software rather than slide decks.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Why “production-ready” became a proposal weasel-word
Three things turned a useful phrase into contractual fog. The supply of agencies tripled while the supply that had operated an LLM system in production for more than six months grew slowly; the gap filled with marketing language. The cost of a working demo collapsed; a Cursor seat and a Claude API key produce something that runs end-to-end on a happy path in a weekend, making the demo a near-zero signal for production capability. And buyers stopped asking the right questions because the right questions are unfamiliar. A traditional RFP asks about uptime SLAs, security certifications, and disaster recovery; an AI proposal needs to ask about eval thresholds, cost ceilings, drift detection, and the on-call rotation for a non-deterministic system. The vocabulary is new, the stakes are high, and the agencies have noticed.
The honest definition of production-ready in 2026 is operational, not aspirational. A production-ready AI system is one a reasonable on-call engineer can wake up to at 3 a.m., diagnose, contain, and either roll back or remediate without paging the agency that built it. Every axis below is a clause in that sentence. If the agency cannot point to a specific artifact on each axis, the system is a prototype dressed as a product.
Axis 1: eval coverage
Production-ready means: The system has a written eval suite of 50 to 200 ground-truth cases drawn from real or representative inputs, organized by failure mode, with explicit pass/fail criteria, a numeric threshold, and CI integration that gates every PR. The eval suite is versioned alongside the code, runs on every merge, and produces a pinned number that is reported in every PR description. The threshold is tied to a business-level outcome (a support-deflection rate, an extraction accuracy target, a hallucination ceiling), not an internal vibes score.
Ask in the proposal review: “Show me the eval suite for the last AI system you shipped. How many cases? What threshold? How was it tied to a business metric? What was the eval delta between the first and the final PR?” Any agency that has operated a real AI system in production has these numbers ready in the same way a traditional engineering team has commit counts and uptime numbers. For more on the ladder of eval discipline, see the AI pilot-to-production success rate breakdown.
Non-production looks like: “We do manual testing.” “We use LLM-as-judge with GPT-4 for grading” with no calibration. An eval suite that lives in a Google Sheet rather than the repo. A threshold that was never set, so every release passes. Eval cases that were written once at kickoff and never updated as new failure modes appeared in production.
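As an illustration of what "a numeric threshold with CI integration" can look like, here is a minimal sketch in Python. The names `EvalCase`, `run_eval`, and `gate` are hypothetical, and exact-match grading is an assumption made for brevity; a real suite would grade each failure mode with its own criteria.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    failure_mode: str  # e.g. "extraction", "refusal" (labels are illustrative)
    prompt: str
    expected: str


def run_eval(cases: list[EvalCase], system: Callable[[str], str]) -> dict[str, float]:
    """Return the pass rate overall and per failure mode (exact-match grading)."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for case in cases:
        ok = system(case.prompt).strip() == case.expected
        for key in ("overall", case.failure_mode):
            totals[key] = totals.get(key, 0) + 1
            passes[key] = passes.get(key, 0) + int(ok)
    return {key: passes[key] / totals[key] for key in totals}


def gate(scores: dict[str, float], threshold: float) -> int:
    """Exit code for CI: 0 when the overall pass rate meets the threshold, else 1."""
    return 0 if scores.get("overall", 0.0) >= threshold else 1
```

In a pipeline, the value returned by `gate` would be passed to `sys.exit` so that a score below the threshold fails the build, and the `scores` dictionary is the pinned number that goes into the PR description.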
Axis 2: observability
Production-ready means: Every LLM call is logged with the full prompt, full response, model version, latency, token counts, cost, trace ID, user ID, and request ID. Logs are queryable in production within five minutes of the call. There are dashboards for p50/p95/p99 latency, cost per request, error rate, retry rate, fallback rate, and eval-score-on-production-traffic. Critically, there is a sampled stream of real production traffic flowing back into the eval suite so the team can detect drift and silently regressed behavior. The agency can show you the dashboards on their last engagement, with real numbers, not screenshots from a vendor demo.
Ask in the proposal review: “Walk me through the dashboards for the last three systems you shipped. Show me one production incident you debugged from those traces, and what changed in the system as a result.” For the operating-cost depth on this, the AI monitoring and observability guide covers the stack choices that work and the ones that fail at scale.
Non-production looks like: “We use the provider’s dashboard.” LangSmith or Langfuse mentioned in the proposal but never integrated with alerting. No trace IDs propagated across the system, so a failure in retrieval cannot be correlated with the failure in the model call. Dashboards that exist but are never looked at because no one is on call.
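The per-call log record described in this axis can be sketched as a thin wrapper around the model call. Everything here is illustrative: `call_model` is a hypothetical callable returning the response text plus input and output token counts, the per-million-token prices are placeholders supplied by the caller, and `sink` stands in for whatever log pipeline is in use. Redaction of PII, which belongs before the record is written, is omitted from this sketch.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class LLMCallRecord:
    trace_id: str
    request_id: str
    user_id: str
    model_version: str
    prompt: str
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def logged_call(
    call_model: Callable[[str], tuple[str, int, int]],  # hypothetical provider wrapper
    prompt: str,
    *,
    trace_id: str,  # propagated from the caller so retrieval, model, and tool spans correlate
    user_id: str,
    model_version: str,
    usd_per_mtok_in: float,  # placeholder prices, not real provider rates
    usd_per_mtok_out: float,
    sink: Callable[[str], None],
) -> str:
    """Call the model and emit one JSON log line carrying every field above."""
    start = time.perf_counter()
    response, input_tokens, output_tokens = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    cost = (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
    record = LLMCallRecord(
        trace_id=trace_id,
        request_id=uuid.uuid4().hex,
        user_id=user_id,
        model_version=model_version,
        prompt=prompt,
        response=response,
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=cost,
    )
    sink(json.dumps(asdict(record)))
    return response
```

The design choice worth noticing is that the trace ID is an input, not something generated inside the wrapper; that is what lets a retrieval failure be joined to the model call it affected.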
Axis 3: error recovery
Production-ready means: Every LLM call has a defined behavior for the four real failure modes: timeout, rate limit, malformed output, and provider outage. Timeouts retry with exponential backoff up to a defined ceiling; rate limits route to a fallback provider or queue; malformed outputs are validated against a schema and either repaired with a constrained second call or surfaced as a typed error; provider outages route to a backup model with degraded but non-empty behavior. The recovery logic is in code, not in a prompt; it is tested with chaos cases in the eval suite; and the metrics are visible in the observability dashboard.
Ask in the proposal review: “What happens when Anthropic’s API has a 30-minute outage in the middle of your busiest hour? Walk me through the code path.” A real agency answers this in 90 seconds with specific provider and model names; a dressed-up prototype answers it with “we’ll add monitoring.”
Non-production looks like: A bare try/except: return None around every LLM call. No timeout configured, so a slow upstream call stalls the entire request thread. Fallback to a smaller model that has never been evaluated against the same eval suite. The phrase “we will add this in phase 2” without a specific date.
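For contrast with the bare try/except pattern, the four failure modes can be given explicit code paths. The sketch below is one possible shape, not a reference implementation: the exception classes and the `primary` and `backup` callables are hypothetical stand-ins for provider clients, the schema check is reduced to required JSON keys, malformed output is surfaced as a typed error rather than repaired, and routing to the backup after the retry ceiling is an assumption.

```python
import json
import time
from typing import Callable


class ProviderTimeout(Exception): ...
class RateLimited(Exception): ...
class ProviderDown(Exception): ...


class MalformedOutput(Exception):
    """Typed error surfaced to the caller when output fails validation."""


def validate(raw: str, required_keys: set[str]) -> dict:
    """Parse the model output and check that the required keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedOutput("response is not valid JSON") from exc
    if not isinstance(data, dict):
        raise MalformedOutput("response is not a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise MalformedOutput(f"missing keys: {sorted(missing)}")
    return data


def call_with_recovery(
    primary: Callable[[str], str],
    backup: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    max_retries: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict, str]:
    """Return (validated output, which route served it)."""
    for attempt in range(max_retries):
        try:
            return validate(primary(prompt), required_keys), "primary"
        except ProviderTimeout:
            # Exponential backoff: 0.5s, 1s, 2s, ... up to the retry ceiling.
            if attempt < max_retries - 1:
                sleep(base_delay_s * 2**attempt)
        except (RateLimited, ProviderDown):
            break  # do not hammer a throttled or down provider; go to the backup
    return validate(backup(prompt), required_keys), "backup"
```

Returning the route alongside the result is deliberate: it is the value a fallback-rate dashboard would count. Injecting `sleep` makes the backoff schedule testable without real delays, which is what a chaos case in the eval suite needs.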
Axis 4: cost cap enforcement
Production-ready means: Every request has a cost ceiling enforced in code, not just monitored after the fact. The system rejects or degrades requests that exceed the ceiling by truncating context, dropping to a cheaper model, or refusing to retry indefinitely. There is a per-tenant or per-user budget, a daily and monthly cap, and an alert that fires before the cap, not after the bill arrives. The agency can show you the cost-per-request distribution on their last system and the actions that were triggered when the distribution drifted.
Ask in the proposal review: “What is the maximum amount one rogue user can spend in a day? Show me the code that enforces it.” This question is a single-line acid test for whether the agency has ever operated a real LLM system. Every team that has been through one runaway-cost incident has the answer ready; teams that have not built a cost cap have not been to production.
Non-production looks like: Monitoring only; alerts that fire on the bill, not on the request. No per-tenant accounting, so a single user can drain the company’s budget through a malicious or buggy retry loop. A model router that defaults to the most expensive model “for quality” with no budget feedback loop.
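To make "enforced in code" concrete, here is a minimal sketch of a budget guard that is consulted before each call. The class and method names are hypothetical, cost estimates are assumed to be computed by the caller, and the daily reset and monthly cap are left out to keep the example short.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when no candidate model fits the remaining budget."""


@dataclass
class BudgetGuard:
    per_request_cap_usd: float
    per_user_daily_cap_usd: float
    alert_fraction: float = 0.8  # alert before the cap, not at it
    spent_today: dict[str, float] = field(default_factory=dict)

    def authorize(self, user_id: str, estimates: dict[str, float]) -> str:
        """Pick the first model, in preference order, whose estimate fits both caps.

        `estimates` maps model name to estimated request cost in USD, listed
        from most to least preferred (dicts preserve insertion order).
        """
        remaining = self.per_user_daily_cap_usd - self.spent_today.get(user_id, 0.0)
        for model, estimate in estimates.items():
            if estimate <= self.per_request_cap_usd and estimate <= remaining:
                return model
        raise BudgetExceeded(f"user {user_id} has ${remaining:.2f} left today")

    def record(self, user_id: str, actual_cost_usd: float) -> bool:
        """Record actual spend; return True once the alert threshold is crossed."""
        total = self.spent_today.get(user_id, 0.0) + actual_cost_usd
        self.spent_today[user_id] = total
        return total >= self.alert_fraction * self.per_user_daily_cap_usd
```

With this shape, the answer to "what is the maximum one rogue user can spend in a day" is approximately `per_user_daily_cap_usd`, overshooting by at most the gap between one request's estimate and its actual cost. Degrading to a cheaper model falls out of the preference order rather than needing a separate code path.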
Axis 5: security review
Production-ready means: The system has been threat-modeled for the LLM-specific risks: prompt injection, data exfiltration through tool calls, leakage of system prompts, jailbreaking of safety policies, and PII in logs. There is a written threat model in the repo. Tools that the model can call have explicit allowlists, input validation, and rate limits per tool. PII is redacted before it lands in logs. Secrets and keys are held by the client, not the agency, and rotated on a defined schedule. The agency has written a remediation playbook for the top three injection vectors.
Ask in the proposal review: “Show me your prompt-injection threat model and the test cases you run for it. Who has the API keys today, and how often are they rotated?” Reasonable answers cite specific OWASP-style frameworks and name specific test inputs; bad answers wave at “guardrails.”
Non-production looks like: No threat model. Tool calls that accept arbitrary string arguments and pass them to an interpreter or shell. The agency holding the production API keys “for convenience.” Logs that contain raw user input including SSNs, emails, or proprietary documents because no redaction was wired in. A system prompt that contains the company’s competitive secrets and is one cleverly crafted user message away from being printed.
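Two of the controls named in this axis, a tool allowlist with input validation and redaction before logging, are small enough to sketch. The tool name `lookup_order`, its argument format, and the two regular expressions are assumptions for illustration only; real redaction needs broader coverage than SSNs and emails, and per-tool rate limits are omitted here.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text: str) -> str:
    """Mask US SSNs and email addresses before text reaches a log line."""
    text = SSN_RE.sub("[SSN]", text)
    return EMAIL_RE.sub("[EMAIL]", text)


class ToolCallRejected(Exception):
    """Raised when the model requests a tool or argument that is not permitted."""


# Allowlist: tool name -> pattern its single string argument must fully match.
# The tool and its ID format are hypothetical.
ALLOWED_TOOLS: dict[str, re.Pattern[str]] = {
    "lookup_order": re.compile(r"[A-Z]{2}\d{6}"),
}


def check_tool_call(name: str, argument: str) -> None:
    """Reject any tool not on the allowlist and any argument that fails validation."""
    pattern = ALLOWED_TOOLS.get(name)
    if pattern is None:
        raise ToolCallRejected(f"tool not on allowlist: {name!r}")
    if not pattern.fullmatch(argument):
        # Redact before echoing, since the argument came from untrusted input.
        raise ToolCallRejected(f"invalid argument for {name}: {redact(argument)!r}")
```

The point of validating against a narrow pattern instead of filtering out "dangerous" characters is that an injected instruction has nothing to work with: an argument is either a well-formed order ID or it is rejected.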
Axis 6: on-call rotation
Production-ready means: A named human, with a phone, is on call for the system at all hours that the system serves traffic. The on-call has access to the dashboards, the runbook, the kill switch, and the escalation path. There is a published rotation: agency engineers covering the first 90 days, transitioning to client engineers as the system matures, with the agency on backup. Pages are routed through a real incident system (PagerDuty, Opsgenie, or equivalent), not a Slack channel that no one watches at 2 a.m. The agency reports the average page volume and mean time to acknowledge from their last engagement.
Ask in the proposal review: “Who is the on-call engineer for the system you shipped last quarter? Show me the rotation calendar.” The presence or absence of a real calendar is determinative. For the broader signal on this, the AI agency trust ladder piece covers on-call as one of six trust markers.
Non-production looks like: “We will respond to issues during business hours.” A shared inbox monitored asynchronously. The agency rotation ending at the contract deadline with no transition plan, leaving the client with a system they cannot operate. A statement that “the system is reliable, so we don’t need on-call.”
Axis 7: rollback path
Production-ready means: Every deploy is reversible by a single command in under five minutes. Model version, prompt version, retrieval index version, and code version are all individually pinned and rollable. There is a written runbook entry for the three most common rollback triggers (eval regression, cost spike, customer escalation). The team has rehearsed a rollback in staging within the last 30 days and the rehearsal is logged. Changes that are not rollable (schema migrations, fine-tunes, embeddings re-indexes) have explicit migration plans.
Ask in the proposal review: “Walk me through the last rollback you executed in production. How long did it take from page to recovered? What did you learn that changed your deploy process?” Agencies that have been through one production rollback talk about it in surprising detail; agencies that have not invent a generic answer. For the deploy-pipeline mechanics, the AI model deployment staging-to-production guide covers what gates have to be in place.
Non-production looks like: Deploys that require re-running an indexing job that takes two hours. Prompts and model versions hardcoded in the deployed binary so they cannot be rolled back independently. No staging environment that exercises the same model and the same retrieval index as production. A “rollback plan” that exists as a paragraph in a doc no one has read.
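"Individually pinned and rollable" can be pictured as a release manifest with one field per versioned component. The sketch below is hypothetical throughout (class names, version labels, and the choice to record a rollback as a new release are all assumptions), and it models only the bookkeeping, not the deploy itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Release:
    code: str
    model: str
    prompt: str
    retrieval_index: str


class ReleaseHistory:
    def __init__(self, initial: Release) -> None:
        self._history = [initial]

    @property
    def current(self) -> Release:
        return self._history[-1]

    def deploy(self, **changes: str) -> Release:
        """Pin new versions for the named components; the others stay as they are."""
        new = replace(self.current, **changes)
        self._history.append(new)
        return new

    def rollback(self, component: str) -> Release:
        """Restore one component to its most recent earlier version.

        The rollback is recorded as a new release so the history stays append-only.
        """
        now = getattr(self.current, component)
        for past in reversed(self._history[:-1]):
            previous = getattr(past, component)
            if previous != now:
                return self.deploy(**{component: previous})
        raise LookupError(f"no earlier version of {component}")
```

For example, after deploying a new prompt and then a new code version, `rollback("prompt")` restores the old prompt while leaving the newer code in place. That independence is exactly what is lost when prompts and model versions are hardcoded into the deployed binary.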
Axis 8: runbook
Production-ready means: A markdown document in the repo, between 5 and 30 pages, that the on-call uses at 3 a.m. to diagnose and contain incidents. It contains: the architecture diagram, the location of each dashboard, the kill switch, the cost cap controls, the contact for each upstream provider, the standard remediations for the top 10 incident types, the rollback procedure, and the postmortem template. It is updated after every incident. The agency commits to the runbook as a deliverable, not as documentation written after the fact.
Ask in the proposal review: “May I see the runbook from one of your shipped systems with the client name redacted?” An agency that has shipped real systems can produce three runbooks in 24 hours. An agency that has not will scramble to write one for the proposal.
Non-production looks like: “Documentation will be delivered at the end of the engagement.” A runbook written by the technical writer, not by the engineer who built the system. A wiki page that is six months stale because no one has updated it after the last three incidents. A runbook that does not include a kill switch, because the system was never designed to be turned off.
How to read a proposal after this
Print the eight axes. For each one, find the sentence in the proposal that addresses it. If a sentence does not exist, write “missing” next to the axis. If a sentence exists but is vague (“we will provide observability”), write “vague” and rewrite the sentence into a specific commitment: “We will provide a Langfuse instance with p50/p95/p99 latency, cost-per-request, and eval-score-on-production-traffic dashboards, with a 5-minute log latency and a defined alerting threshold on each metric, deployed by week 4.” Send the rewritten version back to the agency and ask them to commit to it as a contract appendix.
Two outcomes are possible. The agency commits and the proposal is now precise; the engagement starts with a shared definition of done that survives the first incident. Or the agency declines and the proposal is now honest; the buyer knows that “production-ready” meant “demo-ready,” and can renegotiate scope, price, or timeline accordingly.
The deeper point is that production-readiness is a system property, not a feature. It cannot be bolted on in the last sprint. Each of the eight axes is a habit the team builds in week one or never builds at all. Eval coverage that starts on day 30 has missed the failure modes from days 1–29; observability that starts in month 3 cannot debug the incidents in months 1–2. The agency that talks about production-readiness as something to be added “in the hardening phase” does not yet have the habit; the agency that talks about it as the substrate of every PR from day 1 does. The proposal review is where you tell them apart.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has reviewed more than 80 AI agency proposals and seen the production-ready claim hold and break across two dozen post-deploy retrospectives.
Frequently Asked Questions
What does “production-ready” mean for an AI system in 2026?
Production-ready is operational, not aspirational. A production-ready AI system is one a reasonable on-call engineer can wake up to at 3 a.m., diagnose, contain, and either roll back or remediate without paging the agency that built it. Concretely, that decomposes into eight axes: eval coverage with a CI-gated suite, observability with queryable traces, defined error recovery for timeouts and outages, cost caps enforced in code, an LLM-specific threat model, a real on-call rotation, a five-minute rollback path, and a runbook the on-call uses at 3 a.m. If the agency cannot point to a specific artifact on each axis, the system is a prototype dressed as a product.
How is production-ready AI different from a working AI MVP or demo?
An MVP runs the happy path on a laptop or in a staging environment, usually without real load, real failure modes, or real adversarial inputs. A production-ready system has been hardened against the four real LLM failure modes (timeout, rate limit, malformed output, provider outage), has a cost ceiling enforced per request, has logged and queryable traces for every call, and has a documented rollback path that has been rehearsed in the last 30 days. The cost of a working demo collapsed in 2024-2025; the cost of operating one in production did not. The gap between the two is where most AI engagements quietly fail.
What questions should a buyer ask to verify an AI agency’s production-readiness claim?
Eight questions, one per axis. Show me the eval suite from your last shipped system, with the threshold and the eval delta between first and final PR. Walk me through the observability dashboards you built and one incident you debugged from those traces. What happens when Anthropic has a 30-minute outage in your busiest hour? What is the maximum a rogue user can spend in a day, and where is the code that enforces it? Show me your prompt-injection threat model. Who is on call for the system you shipped last quarter? Walk me through the last rollback you executed in production. May I see one of your runbooks with the client name redacted? Real agencies answer all eight in 60 seconds each.
Why is eval coverage the foundation of production-readiness?
Without an eval suite, every other axis becomes opinion-trading. A production eval suite has 50 to 200 ground-truth cases drawn from real or representative inputs, organized by failure mode, with explicit pass/fail criteria, a numeric threshold tied to a business outcome, and CI integration that gates every PR. Rollback decisions, observability alert thresholds, regression detection, and on-call triage all reference the eval number. Eval coverage that starts on day 30 has missed the failure modes from days 1-29 and cannot be retrofitted, which is why it is the first axis a serious agency builds.
What does a production-grade observability stack for an LLM system look like?
Every LLM call is logged with full prompt, full response, model version, latency, token counts, cost, trace ID, user ID, and request ID, with logs queryable in production within five minutes. Dashboards cover p50/p95/p99 latency, cost per request, error rate, retry rate, fallback rate, and eval-score-on-production-traffic. A sampled stream of real traffic flows back into the eval suite to detect drift. Trace IDs propagate across retrieval, model, and tool boundaries so failures correlate end-to-end. Common stacks combine Langfuse or LangSmith with Datadog or Grafana, but the stack matters less than whether anyone is on call to look at it.
How should error recovery be designed in a production AI system?
Every LLM call needs defined behavior for four real failure modes. Timeouts retry with exponential backoff up to a defined ceiling. Rate limits route to a fallback provider or queue. Malformed outputs are validated against a schema and either repaired with a constrained second call or surfaced as a typed error. Provider outages route to a backup model with degraded but non-empty behavior. The recovery logic lives in code, not in a prompt; it is exercised with chaos cases in the eval suite; and the metrics show up in observability dashboards. A bare try/except around every LLM call is the canonical non-production pattern.
How should an AI agency enforce cost caps for a production system?
In code, not in monitoring. Every request has a cost ceiling enforced at call time. The system rejects or degrades requests that exceed the ceiling by truncating context, dropping to a cheaper model, or refusing to retry indefinitely. There is per-tenant or per-user budgeting, daily and monthly caps, and an alert that fires before the cap, not after the bill arrives. The acid test is the question ‘what is the maximum one rogue user can spend in a day, and where is the code that enforces it?’ Agencies that have been through one runaway-cost incident answer this immediately; agencies that have not built a cost cap have not been to production.
What should an AI security review cover before going to production?
An LLM-specific threat model addressing prompt injection, data exfiltration through tool calls, system-prompt leakage, jailbreaking of safety policies, and PII in logs. Tools the model can call have explicit allowlists, input validation, and per-tool rate limits. PII is redacted before it lands in logs. Production API keys are held by the client, not the agency, and rotated on a defined schedule. The agency has written a remediation playbook for the top three injection vectors and runs test cases for each. A ‘we will add guardrails’ answer without a threat model is not a security review.
Is on-call rotation necessary for an AI system, or can it be monitored asynchronously?
If the system serves user traffic, it needs on-call. Non-deterministic systems fail in ways deterministic systems do not; a model update from a provider can silently regress accuracy, a prompt-injection attempt can chain into a tool call, a cost spike can drain a budget in hours. Pages route through a real incident system like PagerDuty or Opsgenie, not a Slack channel watched at office hours. The agency covers the first 90 days while the client engineering team transitions on. An agency proposal that ends on-call coverage at the contract deadline with no transition plan leaves the client with a system they cannot operate.
What is in a production runbook for an AI system, and who writes it?
Five to thirty pages of markdown in the repo, written by the engineer who built the system, used by the on-call at 3 a.m. It contains the architecture diagram, the location of each dashboard, the kill switch, the cost cap controls, contacts for each upstream provider, standard remediations for the top 10 incident types, the rollback procedure, and a postmortem template. It is updated after every incident. A runbook delivered as a Word document at the end of the engagement, written by a technical writer, is a documentation deliverable, not an operational artifact. The test of a real runbook is whether the agency can produce three of them, redacted, within 24 hours of the proposal review.