The unglamorous truth about a 2026 AI development studio: most of the margin is in what gets reused between engagements, not what gets built inside one. A reference architecture (named tools, named templates, named rituals, named artifacts) is what turns the second engagement into a 30-percent-cheaper version of the first, and the tenth into a 60-percent-cheaper version. Without one, every project re-buys the same eval harness, re-writes the same retry logic, and re-debates the same Friday-demo format. Standardization compounds margin; ad-hoc work does not.
This piece is the implementation depth of Decoding the AI Agency Stack. Where that article names the layers (Roles, Rituals, Review cadences, Reusable artifacts), this one names the products: which LLM gateway, which eval framework, which observability stack, which deploy platform, which document templates, which weekly cadence, which shared libraries. It is the spoke of The AI Agency Manifesto that a buyer can verify by asking to see a package.json.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Why a reference architecture, not a best-practices list
A best-practices list says “use evals.” A reference architecture says: the default eval framework is Promptfoo for declarative suites and a thin in-house Python harness for stateful agent runs; results land in Langfuse; the CI gate is a GitHub Actions job named evals-required; the threshold file is checked into evals/thresholds.yaml. The first version is dinner-table conversation; the second is the difference between a project starting on day one and a project starting on day twelve.
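To make the threshold gate above concrete, here is a minimal sketch of the comparison an evals-required job could run. It is an illustration under stated assumptions, not the studio's actual implementation: the `SuiteResult` shape, the suite names, and the threshold values are all hypothetical, and the `THRESHOLDS` dict stands in for the parsed contents of evals/thresholds.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Aggregate outcome of one eval suite (hypothetical shape)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def gate(results: list[SuiteResult], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per suite that is below its threshold.

    A suite with no threshold entry fails closed, so a new suite cannot
    slip past the gate unconfigured.
    """
    failures = []
    for r in results:
        minimum = thresholds.get(r.name)
        if minimum is None:
            failures.append(f"{r.name}: no threshold configured")
        elif r.pass_rate < minimum:
            failures.append(f"{r.name}: {r.pass_rate:.2%} < {minimum:.2%}")
    return failures


# Stand-in for the parsed evals/thresholds.yaml (values are made up).
THRESHOLDS = {"regulatory-qa": 0.95, "structured-extraction": 0.90}

RESULTS = [
    SuiteResult("regulatory-qa", passed=97, total=100),
    SuiteResult("structured-extraction", passed=44, total=50),
]

FAILURES = gate(RESULTS, THRESHOLDS)
EXIT_CODE = 1 if FAILURES else 0  # a CI job would exit with this code
```

Failing closed on a missing threshold is a design choice of this sketch; it keeps the threshold file, rather than the code, as the single place where a suite's bar is negotiated.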
The four axes of the reference architecture map one-to-one onto the four layers in Decoding the AI Agency Stack:
| Axis | What it standardizes | What it eliminates |
|---|---|---|
| Tools | The default product for each engineering capability. | Per-project tool debates; vendor-lock-in surprises. |
| Templates | The default document for each engagement artifact. | Bespoke charters; idiosyncratic ADR formats; demo-deck reinvention. |
| Rituals | The default week on the studio calendar. | Ritual drift; per-engagement cadence negotiation. |
| Reusable artifacts | The default internal packages and configs. | Re-implementing retries, routers, eval harnesses, prompt registries. |
This piece does not re-cover the Roles axis from the sibling stack-decomposition article; the people are the consumers of this architecture. Below, each axis names specific products and explains why that product, not its alternatives.
Axis 1: Tools (the default stack)
The studio’s defaults across the engineering capabilities. New engagements deviate only with a recorded ADR.
| Capability | Default | Alternative we evaluate against | Why this default |
|---|---|---|---|
| LLM gateway / model router | LiteLLM (self-hosted) wrapped by an internal model-router package | OpenRouter; bespoke fetch wrappers | Single OpenAI-compatible interface across Anthropic, OpenAI, Google, Bedrock, and OSS endpoints. Adds retries, fallbacks, per-call cost telemetry, and a model-alias indirection layer. Vendor deprecations become a config commit. |
| Code-gen agent (engineer-side) | Claude Code as primary, Cursor as secondary | Aider; Cline; bare IDE | Claude Code lives in the terminal where the codebase already does, runs as an agent, and edits in place. Cursor wins for in-editor flow and inline acceptance. Both are paid for every engineer; we do not pretend “free tier” is a position. |
| Eval framework | Promptfoo for declarative suites; in-house evalkit for agentic runs | DeepEval; LangSmith evals only; ad-hoc pytest | Promptfoo has the cleanest YAML surface for golden-input regression suites and assertion DSL. evalkit handles multi-step agent rollouts where Promptfoo’s flat shape doesn’t fit. |
| Observability + tracing | Langfuse (self-hosted Cloud) + OpenTelemetry instrumentation in the SDK layer | LangSmith; Helicone; Datadog APM only | Langfuse gives prompt-level traces with cost and latency baked in. OTel instrumentation means traces flow into both Langfuse and the client’s existing APM (Datadog, Honeycomb) without re-instrumentation. |
| Prompt registry | Langfuse Prompts (versioned, environment-tagged) | A repo folder of .md files; Helicone Prompts | Versioned prompts decoupled from deploys mean a prompt rollback is a button, not a hotfix. The repo-folder pattern dies the first time a non-engineer needs to ship a wording change. |
| Agent / orchestration | Anthropic SDK or OpenAI SDK directly for ≤2-step flows; LangGraph for multi-step graphs; Inngest / Temporal for durable workflows | LangChain (rejected for production); raw asyncio | Direct SDK calls are the right tool for 70 percent of features. LangGraph’s graph model maps cleanly to the agent loops we ship. LangChain’s abstraction tax outweighs its convenience past prototype. |
| Vector store + RAG | pgvector on Postgres for ≤10M vectors; Pinecone for production multi-tenant; Turbopuffer for cost-sensitive scale | Weaviate; Chroma in production; Qdrant | pgvector keeps the data in the same database the app already uses; “one less system” beats marginal recall gains for almost every engagement. |
| CI / CD | GitHub Actions with evals-required and cost-budget-check as required checks | CircleCI; Buildkite | GHA is where the code already lives. The two required checks (eval threshold gate and per-feature cost budget) are studio-standard, not per-project. |
| Deploy platform | Vercel for Next.js apps; Fly.io for long-running agents and queues | AWS ECS; Render; Railway | Vercel + Fly covers 90 percent of the web-app + worker pattern with zero per-engagement infra-debate. We move to AWS only when compliance forces it. |
| Secrets + config | Doppler with environment-scoped projects | Vault; AWS Secrets Manager only | Doppler ships per-environment config to local dev, CI, and runtime through one CLI. Vault is for compliance-forced engagements. |
| Error tracking | Sentry on the app layer; Langfuse on the LLM layer | Datadog Errors; Rollbar | Two layers, two tools, two failure-mode scopes. Conflating them (“just put LLM errors in Sentry”) loses the prompt context that makes an LLM error actionable. |
| Documentation surface | Mintlify for client-facing API docs; Notion for engagement workspace; the repo’s /docs for ADRs | Confluence; ReadMe; raw GitHub Pages | Mintlify renders OpenAPI cleanly; Notion is where clients already live; ADRs belong in the repo because that is where they are read. |
Three rationale notes that buyers should push on:
- The model router is non-negotiable. Frontier-model deprecations now arrive on a sub-quarterly cadence; the eighteen months ending Q2 2026 saw GPT-5, GPT-5.4, Claude Opus 4 through 4.6, Gemini 3.1 Pro, and the open-weights frontier moving from Llama 3 to Llama 4 Scout. A studio without a router pays a one-to-two-sprint tax every quarter; a studio with one absorbs the change in a config commit.
- LangChain in production is a liability. The framework’s abstraction surface changes faster than its maintainers update the docs, and the indirection makes traces nearly unreadable. We use LangGraph (the same team’s graph runtime) where graph semantics genuinely help, and direct SDK calls everywhere else. This is the single most consequential opinion in our default stack.
- Self-hosted Langfuse, not LangSmith. LangSmith is excellent in the LangChain ecosystem; Langfuse is provider-neutral, OTel-native, and self-hostable on Postgres, which matters in roughly half our engagements where data residency is a constraint.
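The model-alias indirection and fallback chain described in the first note can be sketched in a few lines. This is a simplified illustration, not the model-router package itself: the alias table, the placeholder model names, and the `fake_call` stand-in for a real gateway call are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical alias table: feature code asks for "smart" or "fast",
# never for a vendor model ID. Model names are placeholders.
ALIASES: dict[str, list[str]] = {
    "smart": ["vendor-a/large-v2", "vendor-b/large-v1"],
    "fast": ["vendor-a/small-v2"],
}


class AllModelsFailed(RuntimeError):
    """Raised when every model in an alias's fallback chain errors."""


def complete(
    alias: str, prompt: str, call_model: Callable[[str, str], str]
) -> tuple[str, str]:
    """Try each model behind `alias` in order; return (model_used, text)."""
    errors = []
    for model in ALIASES[alias]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real router would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


def fake_call(model: str, prompt: str) -> str:
    """Stand-in for a gateway call; simulates a retired primary model."""
    if model == "vendor-a/large-v2":
        raise RuntimeError("model deprecated")
    return f"[{model}] {prompt.upper()}"


used, text = complete("smart", "hello", fake_call)
print(used, text)
```

Because callers only ever name the alias, retiring a model in this sketch means editing the `ALIASES` table, which is the "config commit" the note refers to.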
For the deeper engineering case behind the defaults, see Evaluating LLM Development Companies and LLM Integration Pricing Guide for Enterprise.
Axis 2: Templates (the document library)
Engagements ship in documents almost as much as in code. Each template eliminates a class of recurring work; each is committed to the repo as the default starting point.
| Template | Sections | Why this format |
|---|---|---|
| Engagement charter (1 page) | Goal, two-week milestones, eval threshold, demo cadence, decision log link, exit criteria. | Single page forces clarity. The eval threshold and exit criteria fields are the two that bespoke charters usually omit and AI engagements usually need. |
| Architecture decision record (ADR) | Context, decision, alternatives considered, consequences, deprecation date if relevant. | The “alternatives considered” field is what makes future-you trust past-you. The deprecation date acknowledges that 2026 architecture decisions have shorter half-lives than 2018 ones. |
| Eval test set spec | Domain, golden inputs, expected outputs or assertions, threshold, owner, regeneration cadence. | A non-engineer can read this and know whether the eval matches the contract. The regeneration-cadence field forces a conversation about test-set drift before it becomes a problem. |
| Weekly demo deck (5 slides) | What shipped, what passed evals, what cost more or less, what’s next, one open question. | Five slides, no exceptions. The “cost” slide is the one that surfaces token-arbitrage drift before a finance conversation does. |
| Postmortem (one page) | Trigger, blast radius, root cause, contributing factors, fix, what we change in the playbook. | The “what we change in the playbook” field is what turns a postmortem into compounding judgment instead of a private apology. |
| Eval review log (running) | Date, failing inputs, owner, root cause, fix, regression test added (Y/N). | A daily artifact that becomes a quarterly leading indicator of where the system is weakest. |
| Onboarding playbook (client-side, 2 pages) | Codebase access checklist, eval threshold negotiation script, demo-cadence agreement, decision-log conventions, escalation path. | The first 14 days are what the engagement is judged on. A playbook makes them predictable; see Anatomy of an AI Agency Engagement. |
| Cost budget memo (per feature) | Feature, model mix, expected per-call cost, monthly volume estimate, alarm threshold. | A document, not a dashboard, because a number you have to write is a number you have to defend. |
Two notes on how the templates connect:
- Charter → eval spec → cost memo → demo deck is the primary loop. The charter sets the eval threshold; the eval spec encodes it; the cost memo bounds it economically; the demo deck reports against all three weekly. Engagements that skip any one of the four start drifting in week three and start fighting in week eight.
- ADRs are the only template that lives in the repo. Every other template lives in Notion. Architecture decisions live next to the code they govern, because that is where the next engineer to make a related decision will read them.
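The arithmetic behind a per-feature cost budget memo is simple enough to write down. The sketch below is illustrative only: the `CostMemo` class mirrors the memo's fields (feature, model mix, expected per-call cost, monthly volume estimate, alarm threshold), but the class itself, the aliases, and every number are hypothetical planning values rather than real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostMemo:
    """Fields of the per-feature cost budget memo (hypothetical encoding)."""
    feature: str
    model_mix: dict[str, float]      # model alias -> share of calls (sums to 1.0)
    cost_per_call: dict[str, float]  # model alias -> expected USD per call
    monthly_calls: int
    alarm_threshold_usd: float

    def expected_per_call(self) -> float:
        """Blended per-call cost: each alias's cost weighted by its share."""
        return sum(
            share * self.cost_per_call[alias]
            for alias, share in self.model_mix.items()
        )

    def expected_monthly(self) -> float:
        return self.expected_per_call() * self.monthly_calls

    def over_alarm(self, actual_monthly_usd: float) -> bool:
        return actual_monthly_usd > self.alarm_threshold_usd


# Illustrative numbers only.
memo = CostMemo(
    feature="contract-summarizer",
    model_mix={"fast": 0.8, "smart": 0.2},
    cost_per_call={"fast": 0.002, "smart": 0.04},
    monthly_calls=50_000,
    alarm_threshold_usd=600.0,
)

# 0.8 * 0.002 + 0.2 * 0.04 = 0.0096 USD per call, about 480 USD per month
print(f"{memo.expected_per_call():.4f} USD/call, {memo.expected_monthly():.0f} USD/month")
```

With these made-up inputs the expected spend sits under the alarm threshold, so the weekly demo deck's cost slide would only need to flag the feature if actual spend passed 600 USD.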
Axis 3: Rituals (the week on the calendar)
Rituals are the time axis. The defaults are spelled out in Decoding the AI Agency Stack; the specifics here are the products and templates each ritual consumes.
| Ritual | Cadence | Tool | Template consumed | Template produced |
|---|---|---|---|---|
| Async standup | Daily, before 10am local | Slack thread (no video) | Yesterday / today / blockers | None |
| Plan-and-demo (Monday) | Weekly | Zoom + repo + Notion | Engagement charter; last week’s demo deck | Week’s spec-review notes |
| Eval review | Daily, 15 min | Langfuse + Promptfoo dashboard | Eval review log (running) | Updated eval review log; Linear tickets for failures |
| Wed eval architecture review | Weekly, 30 min | Repo + Langfuse | Eval test set spec; eval review log | New eval suites; threshold deltas |
| Friday client demo | Weekly, 30 min | Zoom + screen share | Weekly demo deck | Same deck, archived to Notion |
| Monthly architecture review | Monthly, 3 hr | Repo + Notion | Open ADRs; deprecation calendar; cost memos | New ADRs; closed deprecations |
| Postmortem (on incident) | Within 72 hr of an incident | Repo + Notion | Incident timeline | Postmortem doc; playbook delta |
Three notes:
- Monday plan-and-demo, not Monday plan + Friday demo only. The Monday demo of what we shipped last Friday after the call compresses the feedback loop that makes Friday demos honest. Without it, Friday becomes a stage and Monday becomes a re-litigation of last week’s scope.
- Wednesday eval architecture review is the ritual most studios skip. It is not the daily eval-failure triage; it is a 30-minute weekly look at the eval suite itself: what coverage is missing, what threshold needs to move, what golden-input set has drifted out of representativeness. A daily review without a weekly architecture review compounds blind spots.
- Monthly architecture review consumes ADRs and produces ADRs. That recursive consumption is what makes the architecture self-correcting on a quarterly horizon. The same loop is described in operational terms in Inside the AI Agency Operating System.
Axis 4: Reusable artifacts (the internal package shelf)
The shelf of internal packages, configs, and harnesses that pre-date every engagement. Each is versioned, documented, and owned.
| Artifact | Implementation | What it replaces in every project |
|---|---|---|
| model-router (private package) | TypeScript wrapper around LiteLLM with retry policy, fallback chain, per-call cost emission, and model-alias indirection. | A custom fetch + retry + cost-counting layer in every codebase. |
| evalkit (private package) | Python harness for stateful agent evals; emits Langfuse-compatible traces; reads thresholds from evals/thresholds.yaml. | A bespoke Pytest harness in each repo; ad-hoc threshold negotiation. |
| prompt-registry-cli | CLI wrapping Langfuse Prompts: prompt sync, prompt rollback, prompt diff. | Engineers SSHing into prod to revert a prompt. |
| retry-and-budget (TS + Python) | Token-bucket budget enforcer that combines latency-based retry with a per-feature monthly cost cap. Throws a typed BudgetExceeded rather than 500-ing. | The “we forgot we were retrying with exponential backoff on a $0.04-per-call model” postmortem. |
| rag-starter | Forked baseline (Next.js + pgvector + ingest pipeline + retrieval evals) with chunking strategy parameterized. | Three sprints of “where do we put the embeddings.” |
| obs-dashboards-as-code | Terraform/Pulumi modules for Langfuse projects, Sentry projects, Datadog dashboards, and per-feature cost alarms. | Click-ops in three SaaS dashboards on day one of a project. |
| eval-template-library | 40+ eval suites by domain (regulatory Q&A, code-search, financial extraction, medical triage handoff, structured extraction). | A custom eval set built from scratch in week three. |
| prompt-pack | Battle-tested system prompts by task (summarization, structured extraction, tool-use loops, RAG synthesis). | The first month of prompt iteration. |
| engagement-template (repo skeleton) | A degit-able starter with the model-router, evalkit, GHA workflows, ADR folder, and evals/ structure pre-configured. | Day-one setup time on every engagement. |
| postmortem-library (private wiki) | Anonymized writeups of past production failures and the playbook delta each generated. | Senior judgment locked in tribal knowledge. |
| hiring-rubric + take-home | Calibrated take-home (build a small agent; ship it with one eval) and on-site loop. | Hiring drift as the team grows. |
Two non-obvious choices a buyer should ask about:
- retry-and-budget is shared infrastructure, not a copy-paste pattern. The first time an engineer copy-pastes the wrong exponential-backoff constant is the day cost-of-goods-sold gets a 4× outlier in a monthly review. The package version-pins the policy.
- engagement-template is what makes day-one a real day-one. A degit of the template scaffolds the repo, the GHA workflows, the eval folder, and the model-router config in twenty minutes. Without it, a new engagement loses its first three days to setup that adds zero client-visible value.
The artifacts compound: each engagement either consumes from the shelf or contributes back to it. An engagement that does neither is a yellow flag in retro.
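To show why retries and a cost cap belong in one shared package, here is a minimal sketch of the idea behind retry-and-budget. It deliberately simplifies: a plain running total replaces the token bucket, retries are triggered by a hypothetical `TransientError` rather than latency, and the cap is tiny so the example trips quickly. The class and function names other than BudgetExceeded are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 429s, 5xx)."""


class BudgetExceeded(Exception):
    """Typed error raised instead of letting the request fail generically."""

    def __init__(self, feature: str, spent_usd: float, cap_usd: float):
        super().__init__(f"{feature}: ${spent_usd:.2f} spent of ${cap_usd:.2f} cap")
        self.feature = feature
        self.spent_usd = spent_usd
        self.cap_usd = cap_usd


class FeatureBudget:
    """Running monthly spend for one feature (in-memory, for illustration)."""

    def __init__(self, feature: str, monthly_cap_usd: float):
        self.feature = feature
        self.monthly_cap_usd = monthly_cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            raise BudgetExceeded(self.feature, self.spent_usd, self.monthly_cap_usd)
        self.spent_usd += cost_usd


def call_with_retry(
    fn: Callable[[], T],
    budget: FeatureBudget,
    cost_per_call_usd: float,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Retry with exponential backoff, charging the budget on every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        budget.charge(cost_per_call_usd)  # retries are billed too
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")


def always_flaky() -> str:
    raise TransientError("upstream timeout")


budget = FeatureBudget("summarize", monthly_cap_usd=0.10)
delays: list[float] = []  # record backoff delays instead of really sleeping
try:
    call_with_retry(always_flaky, budget, cost_per_call_usd=0.04,
                    max_attempts=5, sleep=delays.append)
    outcome = "ok"
except BudgetExceeded as exc:
    outcome = f"budget exceeded: {exc}"
print(outcome, delays)
```

The point of the sketch is the ordering: the budget is charged before each attempt, so a misconfigured retry loop hits a typed error long before it shows up as an outlier in a monthly cost review.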
How standardization compounds margin
The economic argument is mechanical. A 2026 engagement carries roughly the same fixed setup scope (eval scaffolding, model-routing logic, observability dashboards, retry policy, prompt registry, demo-deck format, charter negotiation) whether the studio re-builds it or pulls it from the shelf. Setup cost in our books is roughly twelve engineer-days when re-built, and roughly two when pulled from the shelf. Across ten engagements per year, that is one hundred engineer-days of recovered capacity, about half a senior engineer’s annual output, applied directly to client-visible work or studio R&D.
The compounding part is the second derivative. Every engagement that consumes from the shelf also contributes a delta back: a new eval template, a postmortem, a prompt-pack entry, a model-router edge case. The shelf grows on a schedule the competition does not. Five years in, the gap between a studio that standardized in year one and one that did not is not a process gap; it is a cost-of-goods-sold gap of forty to sixty percent on comparable scope. The same compounding shows up on the other side of the ledger: setup cost per engagement halves roughly every two years as the shelf matures, while the staff-aug shop’s setup cost stays flat.
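The figures above reduce to a few lines of arithmetic. The sketch below uses the article's own planning numbers (twelve days re-built, two from the shelf, ten engagements per year, halving roughly every two years); the 200 working days per engineer-year and the twelve-day starting point for the halving curve are assumptions added purely for illustration.

```python
REBUILT_DAYS = 12           # setup cost when re-built, per the text
SHELF_DAYS = 2              # setup cost when pulled from the shelf
ENGAGEMENTS_PER_YEAR = 10
WORKING_DAYS_PER_YEAR = 200  # assumption, not from the article

# (12 - 2) * 10 = 100 engineer-days recovered per year
recovered_days = (REBUILT_DAYS - SHELF_DAYS) * ENGAGEMENTS_PER_YEAR
engineer_years = recovered_days / WORKING_DAYS_PER_YEAR


def setup_days(initial_days: float, years: float,
               halving_period_years: float = 2.0) -> float:
    """Setup cost after `years`, if it halves every `halving_period_years`."""
    return initial_days * 0.5 ** (years / halving_period_years)


# Illustrative curve starting from an assumed 12 engineer-days.
curve = {year: setup_days(12, year) for year in (0, 2, 4)}

print(recovered_days, engineer_years, curve)
```

Under these assumptions the recovered capacity works out to half an engineer-year, which matches the "about half a senior engineer" framing; a different working-days figure shifts that fraction but not the direction of the argument.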
Two failure modes to watch:
- Standardization without ownership is brittle. Each artifact needs a named owner, usually the role that uses it most. The model-router is owned by the founding engineer; the eval-template library is owned by the eval engineer; the postmortem library is owned by the agent SRE. Unowned artifacts rot fastest.
- Standardization that fights the engagement loses the engagement. Defaults are defaults, not laws. A regulated engagement may need a different vector store; a long-running agent may need Temporal instead of Inngest. The ADR is the mechanism that lets the deviation happen and stay legible.
What a buyer can verify
A reference architecture is a thing on disk, not a thing on a slide. In a procurement call, ten minutes of pointed questions confirm whether the architecture is real:
- “Show me your model-router package and its README.” A real studio has the repo open in 30 seconds, with retries, fallbacks, and per-call cost emission visible in the code. A staff-aug shop pivots to a slide.
- “What’s the name of your eval framework, and where does the threshold file live?” The answer should be a single sentence: Promptfoo, in evals/thresholds.yaml, gated by the evals-required GHA check. Anything vaguer is improv.
- “Pull up your eval-template-library. What’s there for my domain?” A 30-second tour of the directory beats any case study. If the library is empty for the buyer’s domain, the studio should say so and price the gap.
- “What was the last ADR you wrote?” A real studio has it in the repo. The contents matter less than the cadence.
- “What did your last weekly demo deck look like?” The five-slide template either looks lived-in or it does not.
If the studio cannot pass these in ten minutes, the operating model is being narrated, not lived. The same vetting frame extended to roles, rituals, and reviews is in Decoding the AI Agency Stack. And the broader case for why a senior-density studio is the right shape to wield this architecture is in Inside the AI Agency Operating System.
The reference architecture is not an aesthetic preference. It is the operating leverage that lets a 12-person studio out-ship a 50-person agency, year after year, on margin that compounds.
Frequently asked questions
What is an AI agency reference architecture?
A reference architecture is the studio’s named defaults across four axes: tools (the default product for each engineering capability: LLM gateway, eval framework, observability, deploy platform), templates (the default document for each engagement artifact: charter, ADR, eval spec, demo deck), rituals (the default week on the calendar: Monday plan-and-demo, daily eval review, Friday client demo, monthly architecture review), and reusable artifacts (the default internal packages: model-router, evalkit, prompt-registry, retry-and-budget). It exists so that the second engagement is meaningfully cheaper than the first, and the tenth is dramatically cheaper.
Why does standardization matter more for AI work than for traditional web work?
AI workloads have failure modes (silent regressions, cost drift, model deprecations, prompt rot) that traditional web work does not. Each failure mode needs a named tool and a named ritual to catch it: an eval framework, a cost telemetry layer, a model router, a prompt registry. A studio that re-invents this surface on every engagement burns the first sprint of every project on plumbing that the second sprint of the next project will burn again. Standardization here is not preference; it is the only way to ship AI work at agency margins.
What LLM gateway should an AI agency standardize on?
We default to LiteLLM (self-hosted) wrapped by an internal model-router package, evaluated against OpenRouter and bespoke fetch wrappers. LiteLLM gives a single OpenAI-compatible interface across Anthropic, OpenAI, Google, Bedrock, and OSS endpoints; the wrapper adds retries, fallbacks, per-call cost telemetry, and a model-alias indirection layer that turns a vendor deprecation into a config commit. Without this layer, every model migration is a one-to-two-sprint tax.
What eval framework should an AI agency use?
Promptfoo for declarative golden-input regression suites and a thin in-house evalkit for stateful, multi-step agent runs. Promptfoo’s YAML surface is the cleanest available for assertion DSLs and threshold gating; agentic runs need richer instrumentation than its flat shape supports. Both pipe results into Langfuse so traces and eval failures live in one place. The evals-required GitHub Actions check is the studio-standard CI gate.
Why not LangChain in production?
LangChain’s abstraction surface changes faster than its docs, and the indirection makes traces nearly unreadable when something goes wrong at 2am. We use LangGraph (the same team’s graph runtime) where graph semantics help, and we make direct SDK calls (Anthropic SDK, OpenAI SDK) for the 70 percent of features where they are simpler. This is the single most consequential opinion in the default stack and the one a buyer should ask the most pointed questions about.
What templates does an AI agency need beyond a statement of work?
Eight, in priority order: a one-page engagement charter with eval threshold and exit criteria, the ADR format with an alternatives-considered section and a deprecation date, an eval test set spec with regeneration cadence, a five-slide weekly demo deck, a one-page postmortem, a running eval review log, a two-page client-side onboarding playbook, and a per-feature cost budget memo. The charter, eval spec, cost memo, and demo deck form a connected loop; the others handle incidents and the first 14 days.
What rituals does an AI agency reference architecture include?
Seven: an async daily standup, Monday plan-and-demo, daily 15-minute eval review, weekly 30-minute Wednesday eval architecture review, Friday client demo, monthly 3-hour architecture review, and on-incident postmortem within 72 hours. The Wednesday eval architecture review, distinct from daily eval triage, is the one most studios skip and pay for in accumulated eval blind spots.
What internal packages should an AI agency build?
Eleven: model-router, evalkit, prompt-registry-cli, retry-and-budget, rag-starter, obs-dashboards-as-code, an eval-template-library, a prompt-pack, an engagement-template repo skeleton, a postmortem-library wiki, and a calibrated hiring-rubric plus take-home. Each is versioned, documented, and owned by the role that uses it most. Together they collapse engagement setup time from roughly twelve engineer-days to roughly two.
How does a standardized stack compound margin over time?
Setup cost per engagement falls from roughly twelve engineer-days to roughly two when artifacts are pulled from the shelf rather than re-built. Across ten engagements per year that is one hundred engineer-days of recovered capacity, about half a senior engineer’s annual output. The shelf also grows on a schedule: every engagement contributes a postmortem, an eval template, or a prompt-pack entry. Five years in, a studio that standardized in year one runs at a 40–60 percent cost-of-goods-sold advantage over a comparable shop that did not.
How can a buyer verify a vendor’s reference architecture is real?
Ten minutes of pointed questions on a procurement call. Ask to see the model-router repo and its README; ask the name of the eval framework and where the threshold file lives; ask for a directory tour of the eval-template library; ask for the last ADR they wrote; ask for the last weekly demo deck. A real studio passes all five inside ten minutes with screen-shared evidence. A staff-aug shop pivots to a slide deck on at least three of the five, which is the same theatre signal described across The AI Agency Manifesto.
Arthur Wandzel