The AI agency observability stack is a day-one install, not a month-three retrofit. A system that ships its first PR before its observability stack is in place is a system whose failure modes will be discovered by users rather than by telemetry, and the cost of that discovery is asymmetric: every silent regression that ships compounds into a debugging surface that grows faster than the team. The seven components below are the named pieces I install before the first feature PR ships, with concrete tool options for each, the failure mode it catches, and why it cannot wait.
The argument is structural. AI systems have three properties traditional software does not: non-deterministic outputs, per-request cost variability, and a model layer that can change silently. Standard observability (APM, log aggregation, error tracking) covers none of those properties. The seven components below are the additional surface that observability has to grow to handle them. Without all seven in place, the team is operating partially blind, and partial blindness in AI systems is the regime where incidents become silent quality drift that the eval suite catches a month late.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why day one, not later
- Component 1: trace store
- Component 2: cost telemetry
- Component 3: eval CI
- Component 4: prompt registry
- Component 5: error and exception telemetry
- Component 6: latency P95/P99 tracking
- Component 7: regression alerts
- How the seven components interact
- Tool selection: build, buy, or open source
- The cost of installing late
- FAQ
Why day one, not later
The conventional sequence (ship a prototype, then add observability when it goes to production) is the wrong sequence for AI systems. The reason is that AI systems are not “non-observable” before observability is installed; they are “opaquely observable,” which is worse. Without a trace store, every model interaction is a black box. Without cost telemetry, every prompt change is a budgetary surprise. Without an eval CI, every PR is opinion-traded rather than measured. The team that ships its first feature PR without these in place spends the next two weeks reverse-engineering its own system.
The day-one install also serves a structural purpose: it forces the architectural decisions that the observability stack expects. A team that has to think on day one about where prompts live (registry), where traces land (trace store), and how cost is measured (telemetry) has a system architecture that is observable by design. A team that adds observability in month three has to retrofit those decisions, and the retrofit usually misses cases.
For the kickoff session where these components are committed to the architecture decision record, see anatomy of a great AI agency kickoff.
Component 1: trace store
What it is. A persistent store of every model interaction with full request, full response, token counts, latency, model identifier, prompt version, and any tool calls. Trace records are immutable and queryable.
Why day one. Without a trace store, no debugging session is reproducible. A user reports a bad output; the trace shows the exact request that produced it; the team fixes it. Without the trace, the team is reasoning from logs that show “request happened” without showing what request happened. Trace stores are also the source for eval-suite expansion; production cases that surface failure modes are pulled from the trace store and added as eval cases.
Tool options. Langfuse (open source, self-hostable, generous free tier), Helicone (managed, fast setup, good cost telemetry), Arize Phoenix (open source, OpenTelemetry-native), or a vendor-neutral OpenTelemetry-GenAI pipeline routing to an existing observability backend. The choice depends on procurement constraints and existing observability stack; a team already on Honeycomb or Datadog should route OpenTelemetry-GenAI spans to it; a team starting fresh should pick Langfuse or Helicone for time-to-value.
What it catches. Silent prompt drift, retrieval failures that fall back to weak responses, tool calls that succeed at the API level but fail semantically, model swaps that change behavior without changing the diff. None of these are visible without the trace.
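As a rough illustration of the record described above, here is a minimal sketch of an immutable trace record using only the Python standard library. The class name `TraceRecord`, its field names, the model identifier, and the `summarize@1.2.0` prompt identifier are all hypothetical; hosted trace stores define their own schemas, so treat this as a shape to compare against rather than any product's data model.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class TraceRecord:
    request: str
    response: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    tool_calls: tuple = ()
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize to one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = TraceRecord(
    request="Summarize the ticket.",
    response="Customer reports a login failure.",
    model="example-model-v1",           # hypothetical model identifier
    prompt_version="summarize@1.2.0",   # hypothetical registry identifier
    input_tokens=412,
    output_tokens=38,
    latency_ms=930.5,
)
print(record.to_json())
```

Freezing the dataclass only prevents accidental mutation in process; durable immutability still depends on the storage layer (for example, an append-only table).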
Component 2: cost telemetry
What it is. Per-call and per-feature cost tracking that emits a dollar figure for every model interaction. The figure is queryable by feature, by user segment, by prompt version, by model, by time window. Cost is reported in dollars at provider rates, not in token counts.
Why day one. AI cost scales directly with traffic, and traffic can ramp quickly. A feature that costs $0.04 per request at 1,000 requests/day costs $1,200/month (assuming a 30-day month); the same feature at 80,000 requests/day costs $96,000/month. Without cost telemetry installed before traffic ramps, the first invoice surprises everyone. With it, cost is a first-class metric that the team optimizes against during development.
Tool options. Most trace stores include cost telemetry (Langfuse, Helicone both do). For teams running a custom stack, the OpenTelemetry GenAI semantic conventions define token-usage attributes from which cost can be derived; a small wrapper around the provider SDK that converts those counts to dollars and emits the result as a metric is two days of work. The PR review check for cost delta in prompt-bearing PRs depends on this component existing.
What it catches. Prompt changes that triple per-request cost, model defaults that silently shift to more expensive variants, retrieval changes that load 4x context per request, runaway loops in agent code, cost regressions at the long tail of input distributions.
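The arithmetic behind per-call cost and the monthly projection can be sketched in a few lines. The rate table below is hypothetical (real rates come from the provider's price sheet), the function names are illustrative, and the projection assumes a flat per-request cost and a 30-day month.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_million: float   # dollars per 1M input tokens
    output_per_million: float  # dollars per 1M output tokens


# Hypothetical rates for illustration only.
RATES = {"example-model-v1": Rate(input_per_million=3.00, output_per_million=15.00)}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the configured provider rates."""
    rate = RATES[model]
    return (
        input_tokens * rate.input_per_million
        + output_tokens * rate.output_per_million
    ) / 1_000_000


def monthly_cost_usd(cost_per_request: float, requests_per_day: int, days: int = 30) -> float:
    """Projected monthly spend, assuming flat per-request cost."""
    return cost_per_request * requests_per_day * days


for volume in (1_000, 80_000):
    print(f"{volume:>6,} req/day -> ${monthly_cost_usd(0.04, volume):,.0f}/month")
```

Emitting `call_cost_usd` alongside the feature name and prompt version on each call is what makes the per-feature and per-version breakdowns possible later.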
Component 3: eval CI
What it is. A CI integration that runs the eval suite on every PR, computes the eval delta against the baseline, and posts the delta as a PR comment within the same review cycle as unit tests. The eval suite is versioned alongside the code; the baseline is recomputed on merges to main.
Why day one. The eval CI is the structural anchor of the 10-check prompt-bearing PR review standard. Without it, eval runs are manual, manual runs get skipped, and the discipline collapses by week three. With it, the eval is automatic, the delta is in every PR, and the conversation is structured around evidence.
Tool options. Promptfoo (config-driven, easy CI integration, good for parallel eval), Inspect (the UK AI Security Institute’s eval framework, programmatic, good for complex multi-turn cases), Langfuse evals (built into trace store), or a custom test runner. For most engagements Promptfoo is the right starting point; it integrates with GitHub Actions in under an hour and produces the per-PR delta in the right shape for the review standard.
What it catches. Prompt regressions caught at PR time rather than production time, eval coverage gaps surfaced by the eval owner, model swaps tested against the full suite before merge, drift between what the team thinks the system does and what it does.
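For teams taking the custom-test-runner route, the delta computation itself is small. The sketch below assumes each eval case yields a score between 0 and 1, keyed by a case name; the case names, the 0.02 tolerance, and the comment layout are hypothetical choices, not the output format of Promptfoo or any other tool.

```python
from typing import Dict


def eval_delta(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-case score change (candidate minus baseline) for cases present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {case: candidate[case] - baseline[case] for case in shared}


def format_pr_comment(baseline: Dict[str, float], candidate: Dict[str, float],
                      tolerance: float = 0.02) -> str:
    """Render the delta as plain text that a CI job could post on the PR."""
    deltas = eval_delta(baseline, candidate)
    mean_delta = sum(deltas.values()) / len(deltas) if deltas else 0.0
    lines = [f"Eval delta vs baseline: {mean_delta:+.3f} (mean over {len(deltas)} cases)"]
    for case, delta in deltas.items():
        if delta < -tolerance:  # only drops beyond the tolerance are flagged
            lines.append(f"- REGRESSION {case}: {delta:+.3f}")
    new_cases = sorted(candidate.keys() - baseline.keys())
    if new_cases:
        lines.append("New cases without a baseline: " + ", ".join(new_cases))
    return "\n".join(lines)


baseline = {"refund_policy": 0.90, "tone": 0.80}
candidate = {"refund_policy": 0.70, "tone": 0.85, "multilingual": 0.60}
print(format_pr_comment(baseline, candidate))
```

Reporting per-case regressions next to the mean matters because a flat mean can hide one case that dropped sharply while another improved.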
Component 4: prompt registry
What it is. A versioned, immutable store of every prompt the system uses, with a stable identifier (typically a semantic version) per prompt. Code references prompts by identifier, not by inline string. The registry is the source of truth; the inline reference is a pointer.
Why day one. Without a prompt registry, prompts live inline in code, get edited inline, and lose their version history the moment they change. Debugging a regression three weeks later is then archaeology: what was the prompt on the day the bad output happened? With a registry, the trace store records the prompt version, the registry returns the body, and the answer is in seconds.
Tool options. Langfuse prompt management (versioned, self-hostable, integrates with their trace store), Helicone prompts (managed, similar shape), PromptLayer (focused product), or a homemade registry: a prompts/ directory in the repo with semver tags and a thin loader is acceptable for small teams. The build-vs-buy decision turns on team size and whether non-engineers (eval owners, domain experts) need to edit prompts directly. If yes, buy; if no, the homemade registry is fine.
What it catches. Prompt drift across the codebase, untraceable changes in prompt body, prompts that say one thing in the eval suite and another thing in production. It also enables A/B testing of prompt variants without code changes and rapid rollback of bad prompt versions without a deploy.
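The homemade option, a prompts/ directory with semver names and a thin loader, might look like the following sketch. The layout `prompts/<name>/<version>.txt`, the class name `PromptRegistry`, and its methods are assumptions made for illustration; the only firm requirements from the text above are stable identifiers and immutability of published versions.

```python
from pathlib import Path
import tempfile


class PromptRegistry:
    """Thin loader over a directory laid out as <root>/<name>/<version>.txt."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, name: str, version: str) -> Path:
        return self.root / name / f"{version}.txt"

    def publish(self, name: str, version: str, body: str) -> None:
        path = self._path(name, version)
        if path.exists():
            # Published versions are immutable: a change needs a new version.
            raise FileExistsError(f"{name}@{version} already exists")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body, encoding="utf-8")

    def get(self, name: str, version: str) -> str:
        path = self._path(name, version)
        if not path.is_file():
            raise KeyError(f"unknown prompt {name}@{version}")
        return path.read_text(encoding="utf-8")

    def latest(self, name: str) -> str:
        """Highest version by numeric comparison (1.10.0 sorts above 1.2.0)."""
        versions = [p.stem for p in (self.root / name).glob("*.txt")]
        if not versions:
            raise KeyError(f"no versions published for {name}")
        return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


with tempfile.TemporaryDirectory() as tmp:
    registry = PromptRegistry(tmp)
    registry.publish("summarize", "1.2.0", "Summarize the ticket.")
    registry.publish("summarize", "1.10.0", "Summarize the ticket in one sentence.")
    print(registry.latest("summarize"))
```

Comparing versions as integer tuples rather than strings avoids the common bug where "1.10.0" sorts below "1.2.0". This sketch handles plain MAJOR.MINOR.PATCH only, not pre-release tags.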
Component 5: error and exception telemetry
What it is. Standard error tracking (Sentry, Rollbar, Datadog APM) extended with AI-specific context: prompt version on the failing call, model identifier, token counts, retry count, fallback path taken. Errors are categorized by AI failure mode (rate-limit, timeout, content-filter, malformed response, tool-call failure) in addition to standard exception categories.
Why day one. AI failure modes do not look like traditional exceptions. A rate-limit cascade can manifest as a slow degradation rather than an error; a content-filter rejection can manifest as an empty response; a malformed JSON response from the model can manifest downstream as a type error far from its source. Without AI-specific context on every error, debugging traces back through three layers of stack to find the model interaction that started the failure.
Tool options. Sentry (most teams already have it), Rollbar, Datadog APM, or the AI-extension features of trace stores like Langfuse. The right shape is to extend the existing error tracking rather than introduce a parallel stack; the on-call engineer should see AI errors in the same dashboard as application errors.
What it catches. Provider rate-limit cascades, content-filter regressions after model updates, malformed-response patterns that shift over time, tool-call failures that succeed at the API level but fail semantically, retry storms that hide the root cause.
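One way to picture the categorization is a small classifier that maps raw call signals to the failure modes named above and builds the context to attach to the error report. Everything here is a sketch under assumptions: the function names and tag keys are hypothetical, an empty body is treated as a content-filter rejection only because the text above describes that symptom, and the feature is assumed to expect JSON output. Real providers signal these conditions in provider-specific ways.

```python
import json
from enum import Enum
from typing import Optional


class AIFailureMode(Enum):
    RATE_LIMIT = "rate-limit"
    TIMEOUT = "timeout"
    CONTENT_FILTER = "content-filter"
    MALFORMED_RESPONSE = "malformed-response"
    TOOL_CALL_FAILURE = "tool-call-failure"


def classify(status_code: Optional[int], body: Optional[str],
             timed_out: bool = False, tool_error: bool = False) -> Optional[AIFailureMode]:
    """Map raw call signals to a failure mode, or None when the call looks healthy."""
    if timed_out:
        return AIFailureMode.TIMEOUT
    if status_code == 429:  # HTTP 429 Too Many Requests
        return AIFailureMode.RATE_LIMIT
    if tool_error:
        return AIFailureMode.TOOL_CALL_FAILURE
    if not body or not body.strip():
        # Assumption: an empty body is treated as a content-filter rejection.
        return AIFailureMode.CONTENT_FILTER
    try:
        json.loads(body)  # assumption: this feature expects JSON output
    except json.JSONDecodeError:
        return AIFailureMode.MALFORMED_RESPONSE
    return None


def error_context(mode: AIFailureMode, *, prompt_version: str, model: str,
                  input_tokens: int, output_tokens: int,
                  retry_count: int, fallback_path: str) -> dict:
    """Tags to attach to the error report in the existing error tracker."""
    return {
        "ai.failure_mode": mode.value,
        "ai.prompt_version": prompt_version,
        "ai.model": model,
        "ai.input_tokens": input_tokens,
        "ai.output_tokens": output_tokens,
        "ai.retry_count": retry_count,
        "ai.fallback_path": fallback_path,
    }


mode = classify(200, "{not valid json")
print(mode)
```

Classifying at the call site, rather than downstream where the type error surfaces, is what keeps the report attached to the model interaction that caused it.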
Component 6: latency P95/P99 tracking
What it is. Latency tracking at percentiles (P50, the median; P95; P99), broken down by feature, by model, by prompt version. P95 and P99 are emitted as time-series, not just snapshots, so trends are visible.
Why day one. AI latency is bimodal. P50 is usually fast; P99 can be 10x P50 because of streaming-completion stalls, retry cascades, or long-context regenerations. Mean latency hides this bimodality. The contractual latency SLA from the success criteria sign-off is named at P95, not at mean; without P95 tracking, SLA compliance cannot be measured.
Tool options. Standard APM (Datadog, New Relic, Honeycomb) with the latency dimension extended to include model and prompt version. Trace stores expose this natively; Langfuse and Helicone both compute latency percentiles and break them down by model. For teams already on an APM stack, extending it is preferable to introducing a parallel latency tool.
What it catches. P99 spikes that mean latency hides, model-specific latency regressions (one provider degrades while another holds), prompt-version latency regressions (a prompt change that adds 2,000 tokens of context adds 800ms of latency), streaming-completion stalls that show up as long P99 but normal P50.
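To make the mean-versus-percentile point concrete, here is a small sketch that computes nearest-rank percentiles and groups them by feature, model, and prompt version. The sample latencies are fabricated for illustration, and nearest-rank is one of several percentile definitions; APM tools and trace stores may interpolate differently.

```python
import math
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def percentiles_by_key(records):
    """records: iterable of (feature, model, prompt_version, latency_ms) tuples."""
    groups = defaultdict(list)
    for feature, model, prompt_version, latency_ms in records:
        groups[(feature, model, prompt_version)].append(latency_ms)
    return {
        key: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for key, values in groups.items()
    }


# Illustrative sample: 94 fast calls, 4 slow calls, 2 stalled calls (milliseconds).
latencies = [400] * 94 + [1200] * 4 + [4000] * 2
mean_ms = sum(latencies) / len(latencies)
print(f"mean={mean_ms:.0f}ms p50={percentile(latencies, 50)}ms "
      f"p95={percentile(latencies, 95)}ms p99={percentile(latencies, 99)}ms")
```

In this sample the mean is 504 ms while P99 is 4,000 ms, ten times the P50 of 400 ms, which is the gap a mean-only dashboard would not show.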
Component 7: regression alerts
What it is. Automated alerts on shifts in the metrics that matter: eval-suite score on canary traffic, cost-per-request, P95 latency, error rate. Alerts fire on threshold breach, sustained breach over a window, and anomalous-rate-of-change. Alerts route to the on-call engineer with the offending PR linked when correlation is possible.
Why day one. Without automated alerts, regressions are caught by users or by the team’s weekly demo, both of which are too late. The alert is the structural acknowledgment that AI systems can degrade between demos, and the team needs to know within minutes, not days.
Tool options. Most observability stacks include alerting (Datadog Monitors, Grafana Alerts, Honeycomb Triggers). Trace stores like Langfuse offer eval-specific alerting. The shape that works is: alerts on the same dashboard as the underlying metrics, with the on-call rotation transferred from the stakeholder cartography session.
What it catches. Silent prompt regressions that drop eval scores in production, cost spikes from prompt-cache misses, latency degradations from provider-side issues, error-rate jumps from content-filter changes, sustained breaches that would otherwise wait until the weekly review.
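The three firing conditions named above (threshold breach, sustained breach over a window, anomalous rate of change) reduce to simple checks over a metric time-series. The sketch below assumes a "higher is worse" metric such as P95 latency or cost per request; the function names, limit, window, and ratio are hypothetical and would be tuned per metric. For an eval score, where lower is worse, the comparisons would be inverted.

```python
from typing import Sequence


def threshold_breach(value: float, limit: float) -> bool:
    """Fires when the latest value is above the limit."""
    return value > limit


def sustained_breach(series: Sequence[float], limit: float, window: int) -> bool:
    """Fires when the last `window` points are all above the limit."""
    recent = series[-window:]
    return len(recent) == window and all(v > limit for v in recent)


def rate_of_change_breach(series: Sequence[float], max_ratio: float) -> bool:
    """Fires when the latest point exceeds the previous one by more than max_ratio."""
    if len(series) < 2 or series[-2] <= 0:
        return False
    return series[-1] / series[-2] > max_ratio


def evaluate(series: Sequence[float], limit: float, window: int, max_ratio: float) -> list:
    """Names of the alert conditions that currently fire for this series."""
    fired = []
    if series and threshold_breach(series[-1], limit):
        fired.append("threshold")
    if sustained_breach(series, limit, window):
        fired.append("sustained")
    if rate_of_change_breach(series, max_ratio):
        fired.append("rate-of-change")
    return fired


p95_latency_ms = [900, 950, 2100, 2200, 2300]  # illustrative series
print(evaluate(p95_latency_ms, limit=2000, window=3, max_ratio=1.5))
print(evaluate(p95_latency_ms[:3], limit=2000, window=3, max_ratio=1.5))
```

Separating the three conditions lets routing differ by severity: a single threshold breach might only annotate the dashboard, while a sustained breach pages the on-call engineer.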
How the seven components interact
The seven components are not seven independent installs. They form a single operational fabric. The trace store is the source of record; the cost and latency components emit metrics derived from traces; the eval CI runs against the same traces during PR review and against canary traces in production; the prompt registry is referenced by every trace; the error telemetry overlays the trace with failure context; the regression alerts trigger off the same metrics with on-call routing.
This integration is what makes the install a one-time cost. Once the seven are in place and connected, every new feature ships into the existing fabric without additional plumbing. A team that installs them piecemeal (trace store now, cost telemetry next month, eval CI in week eight) pays integration cost three times. A team that installs them together pays integration cost once.
Tool selection: build, buy, or open source
The build-vs-buy axis is real but not as decisive as the integration axis. Teams that already run Datadog or Honeycomb should extend their existing stack with OpenTelemetry-GenAI rather than introduce a parallel observability tool. Teams without an existing stack should pick Langfuse or Helicone for time-to-value; both can be installed in a day and cover trace store, cost telemetry, and prompt registry from a single product.
The components most often built rather than bought are the eval CI (Promptfoo or a custom test runner) and the regression alerts (extending the existing alerting stack). The components most often bought are the trace store and prompt registry; building these from scratch is typically not worth the engineering time on a 90-day Q1 mandate.
The components that should never be skipped, regardless of build-vs-buy: trace store, eval CI, prompt registry. Those three are the irreducible minimum; the other four are recommended and routinely treated as optional. They are not optional in production-ready engagements.
The cost of installing late
A team that ships first and observes later pays three costs. First, the retrofit: installing observability over an existing system requires touching every model interaction site, which is more code than the original install. Second, the data gap: the period between first PR and observability install is unobservable and produces incidents that have no diagnostic surface. Third, the discipline gap: teams that operate without observability for three months internalize habits (manual eval runs, inline prompts, vibes-based cost intuition) that are hard to unwind once observability arrives.
The day-one install costs roughly two engineer-days of work spread across the first two weeks. The retrofit at month three costs roughly two engineer-weeks plus the unobservable-period incidents. The math is not close.
FAQ
The seven-component observability stack is not aspirational. Every piece exists because of a production incident I have either watched or shipped, and every install I have skipped on day one I have regretted by day forty. The day-one install is the highest-leverage two engineer-days of any AI engagement; the retrofit at month three is the highest-friction two engineer-weeks. The math is unambiguous, and the discipline pays back in the first quarter.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has installed the day-one observability stack across more than 20 client engagements.
Frequently Asked Questions
Why install AI observability on day one rather than later?
Because AI systems are not ‘non-observable’ before observability is installed; they are ‘opaquely observable,’ which is worse. Without a trace store, every model interaction is a black box. Without cost telemetry, every prompt change is a budgetary surprise. Without eval CI, every PR is opinion-traded rather than measured. The day-one install costs roughly two engineer-days; the retrofit at month three costs roughly two engineer-weeks plus the incidents that occurred during the unobservable period. The math is not close.
What are the seven components of the day-one observability stack?
Trace store (every model interaction with full request, response, tokens, latency, prompt version), cost telemetry (per-call and per-feature dollar tracking), eval CI (suite running on every PR with delta posted), prompt registry (versioned immutable prompt store), error and exception telemetry (standard error tracking extended with AI context), latency P95/P99 tracking (percentile-based, broken down by feature and model), and regression alerts (automated alerts on eval, cost, latency, and error metrics with on-call routing).
Which components are non-negotiable versus recommended?
The irreducible minimum is trace store, eval CI, and prompt registry. Those three cannot be skipped on a production-ready engagement. The other four (cost telemetry, error telemetry, latency P95/P99, and regression alerts) are recommended and routinely treated as optional, but they are not optional in production-ready engagements. Skipping them means the team is operating partially blind, and partial blindness in AI systems is the regime where incidents become silent quality drift that the eval suite catches a month late.
Which trace store should an AI agency install?
Langfuse for self-hosted open source with a generous free tier and integrated cost and prompt management; Helicone for managed and fastest setup with strong cost telemetry; Arize Phoenix for OpenTelemetry-native open source; or a vendor-neutral OpenTelemetry-GenAI pipeline routing to an existing observability backend like Honeycomb or Datadog. The choice depends on procurement constraints and the existing stack; teams already on a major observability platform should route OpenTelemetry-GenAI spans to it; teams starting fresh should pick Langfuse or Helicone for time-to-value.
Why is cost telemetry critical from day one?
AI cost scales directly with traffic, and traffic can ramp quickly. A feature that costs $0.04 per request at 1,000 requests/day costs $1,200/month (assuming a 30-day month); the same feature at 80,000 requests/day costs $96,000/month. Without cost telemetry installed before traffic ramps, the first invoice surprises everyone. With it, cost is a first-class metric the team optimizes against during development. Cost telemetry also catches prompt changes that triple per-request cost, model defaults that silently shift to more expensive variants, retrieval changes that load 4x context per request, and runaway loops in agent code.
What does eval CI look like in practice?
A CI integration that runs the eval suite on every PR, computes the eval delta against the baseline, and posts the delta as a PR comment within the same review cycle as unit tests. The eval suite is versioned alongside the code; the baseline is recomputed on merges to main. Promptfoo is the most common starting point; it integrates with GitHub Actions in under an hour and produces the per-PR delta in the right shape. Alternatives include Inspect (the UK AI Security Institute’s framework, programmatic) and Langfuse evals (built into the trace store). Custom test runners also work for small teams.
Why do prompt-bearing PRs need a prompt registry?
Without a prompt registry, prompts live inline in code, get edited inline, and lose their version history the moment they change. Debugging a regression three weeks later is then archaeology: what was the prompt on the day the bad output happened? With a registry, the trace store records the prompt version, the registry returns the body, and the answer is in seconds. The registry also enables A/B testing of prompt variants without code changes and rapid rollback of bad prompt versions without a deploy.
Why track P95 and P99 latency rather than mean?
AI latency is bimodal. P50 is usually fast; P99 can be 10x P50 because of streaming-completion stalls, retry cascades, or long-context regenerations. Mean latency hides this bimodality. The contractual latency SLA from the engagement success criteria is named at P95, not at mean; without P95 tracking, SLA compliance cannot be measured. P95 and P99 tracking also catches model-specific latency regressions (one provider degrades while another holds) and prompt-version regressions (a prompt change that adds 2,000 tokens of context adds 800ms of latency).
How do the seven components interact as a single system?
They form a single operational fabric. The trace store is the source of record; the cost and latency components emit metrics derived from traces; the eval CI runs against the same traces during PR review and against canary traces in production; the prompt registry is referenced by every trace; the error telemetry overlays the trace with failure context; the regression alerts trigger off the same metrics with on-call routing. This integration is what makes the install a one-time cost. Once the seven are in place and connected, every new feature ships into the existing fabric without additional plumbing.
Can existing APM stacks like Datadog or Honeycomb be reused?
Yes, and they should be. Teams that already run Datadog or Honeycomb should extend their existing stack with OpenTelemetry-GenAI rather than introduce a parallel observability tool. The right shape is to extend the existing error tracking and latency tooling rather than introduce a parallel stack; the on-call engineer should see AI errors in the same dashboard as application errors. The components most often added on top are trace store and prompt registry; the components most often built rather than bought are eval CI and regression alerts.