Enterprise Software 21 min read

The AI Agency Reference Architecture: Tools, Templates, and Rituals We Standardize On

The unglamorous truth about a 2026 AI development studio: most of the margin is in what gets reused between engagements, not what gets built inside one. A reference architecture (named tools, named templates, named rituals, named artifacts) is what turns the second engagement into a 30-percent-cheaper version of the first, and the tenth into a 60-percent-cheaper version. Without one, every project re-buys the same eval harness, re-writes the same retry logic, and re-debates the same Friday-demo format. Standardization compounds margin; ad-hoc work does not.

This piece is the implementation depth of Decoding the AI Agency Stack. Where that article names the layers (Roles, Rituals, Review cadences, Reusable artifacts), this one names the products: which LLM gateway, which eval framework, which observability stack, which deploy platform, which document templates, which weekly cadence, which shared libraries. It is the spoke of The AI Agency Manifesto that a buyer can verify by asking to see a package.json.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why a reference architecture, not a best-practices list

A best-practices list says “use evals.” A reference architecture says: the default eval framework is Promptfoo for declarative suites and a thin in-house Python harness for stateful agent runs; results land in Langfuse; the CI gate is a GitHub Actions job named evals-required; the threshold file is checked into evals/thresholds.yaml. The first version is dinner-table conversation; the second is the difference between a project starting on day one and a project starting on day twelve.
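To make the gate concrete, here is a minimal sketch of the check a job like evals-required might run. It is an illustration under stated assumptions, not the studio's actual harness: the suite names, the threshold values, and the function names are hypothetical, and the thresholds are inlined as a dict rather than parsed from evals/thresholds.yaml so the example stays dependency-free.

```python
import sys

# Hypothetical thresholds, standing in for the contents of evals/thresholds.yaml.
THRESHOLDS = {"regulatory_qa": 0.90, "structured_extraction": 0.95}


def failing_suites(pass_rates, thresholds):
    """Return sorted names of suites that are missing or below their threshold."""
    failures = []
    for suite, minimum in thresholds.items():
        rate = pass_rates.get(suite)
        if rate is None or rate < minimum:
            failures.append(suite)
    return sorted(failures)


def gate(pass_rates, thresholds=THRESHOLDS):
    """Exit code for the CI job: 0 when every suite clears its threshold, else 1."""
    failures = failing_suites(pass_rates, thresholds)
    for suite in failures:
        print(f"FAIL {suite}: {pass_rates.get(suite)} < {thresholds[suite]}",
              file=sys.stderr)
    return 1 if failures else 0
```

A CI step would end with something like sys.exit(gate(pass_rates)), where pass_rates comes from the eval run. A suite with no reported result counts as a failure here, so a silently skipped suite cannot pass the gate.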

The four axes of the reference architecture map one-to-one onto the four layers in Decoding the AI Agency Stack:

| Axis | What it standardizes | What it eliminates |
| --- | --- | --- |
| Tools | The default product for each engineering capability. | Per-project tool debates; vendor-lock-in surprises. |
| Templates | The default document for each engagement artifact. | Bespoke charters; idiosyncratic ADR formats; demo-deck reinvention. |
| Rituals | The default week on the studio calendar. | Ritual drift; per-engagement cadence negotiation. |
| Reusable artifacts | The default internal packages and configs. | Re-implementing retries, routers, eval harnesses, prompt registries. |

This piece does not re-cover the Roles axis from the sibling stack-decomposition article; the people are the consumers of this architecture. Below, each axis names specific products and explains why that product, not its alternatives.

Axis 1: Tools (the default stack)

The studio’s defaults across the engineering capabilities. New engagements deviate only with a recorded ADR.

| Capability | Default | Alternative we evaluate against | Why this default |
| --- | --- | --- | --- |
| LLM gateway / model router | LiteLLM (self-hosted) wrapped by an internal model-router package | OpenRouter; bespoke fetch wrappers | Single OpenAI-compatible interface across Anthropic, OpenAI, Google, Bedrock, and OSS endpoints. Adds retries, fallbacks, per-call cost telemetry, and a model-alias indirection layer. Vendor deprecations become a config commit. |
| Code-gen agent (engineer-side) | Claude Code as primary, Cursor as secondary | Aider; Cline; bare IDE | Claude Code lives in the terminal where the codebase already does, runs as an agent, and edits in place. Cursor wins for in-editor flow and inline acceptance. Both are paid for every engineer; we do not pretend “free tier” is a position. |
| Eval framework | Promptfoo for declarative suites; in-house evalkit for agentic runs | DeepEval; LangSmith evals only; ad-hoc pytest | Promptfoo has the cleanest YAML surface for golden-input regression suites and assertion DSL. evalkit handles multi-step agent rollouts where Promptfoo’s flat shape doesn’t fit. |
| Observability + tracing | Langfuse (self-hosted Cloud) + OpenTelemetry instrumentation in the SDK layer | LangSmith; Helicone; Datadog APM only | Langfuse gives prompt-level traces with cost and latency baked in. OTel instrumentation means traces flow into both Langfuse and the client’s existing APM (Datadog, Honeycomb) without re-instrumentation. |
| Prompt registry | Langfuse Prompts (versioned, environment-tagged) | A repo folder of .md files; Helicone Prompts | Versioned prompts decoupled from deploys mean a prompt rollback is a button, not a hotfix. The repo-folder pattern dies the first time a non-engineer needs to ship a wording change. |
| Agent / orchestration | Anthropic SDK or OpenAI SDK directly for ≤2-step flows; LangGraph for multi-step graphs; Inngest / Temporal for durable workflows | LangChain (rejected for production); raw asyncio | Direct SDK calls are the right tool for 70 percent of features. LangGraph’s graph model maps cleanly to the agent loops we ship. LangChain’s abstraction tax outweighs its convenience past prototype. |
| Vector store + RAG | pgvector on Postgres for ≤10M vectors; Pinecone for production multi-tenant; Turbopuffer for cost-sensitive scale | Weaviate; Chroma in production; Qdrant | pgvector keeps the data in the same database the app already uses; “one less system” beats marginal recall gains for almost every engagement. |
| CI / CD | GitHub Actions with evals-required and cost-budget-check as required checks | CircleCI; Buildkite | GHA is where the code already lives. The two required checks (eval threshold gate and per-feature cost budget) are studio-standard, not per-project. |
| Deploy platform | Vercel for Next.js apps; Fly.io for long-running agents and queues | AWS ECS; Render; Railway | Vercel + Fly covers 90 percent of the web-app + worker pattern with zero per-engagement infra-debate. We move to AWS only when compliance forces it. |
| Secrets + config | Doppler with environment-scoped projects | Vault; AWS Secrets Manager only | Doppler ships per-environment config to local dev, CI, and runtime through one CLI. Vault is for compliance-forced engagements. |
| Error tracking | Sentry on the app layer; Langfuse on the LLM layer | Datadog Errors; Rollbar | Two layers, two tools, two failure-mode scopes. Conflating them (“just put LLM errors in Sentry”) loses the prompt context that makes an LLM error actionable. |
| Documentation surface | Mintlify for client-facing API docs; Notion for engagement workspace; the repo’s /docs for ADRs | Confluence; ReadMe; raw GitHub Pages | Mintlify renders OpenAPI cleanly; Notion is where clients already live; ADRs belong in the repo because that is where they are read. |

Three rationale notes that buyers should push on:

  • The model router is non-negotiable. Frontier-model deprecations now arrive on a sub-quarterly cadence; the eighteen months ending Q2 2026 saw GPT-5, GPT-5.4, Claude Opus 4 through 4.6, Gemini 3.1 Pro, and the open-weights frontier moving from Llama 3 to Llama 4 Scout. A studio without a router pays a one-to-two-sprint tax every quarter; a studio with one absorbs the change in a config commit.
  • LangChain in production is a liability. The framework’s abstraction surface changes faster than its maintainers update the docs, and the indirection makes traces nearly unreadable. We use LangGraph (the same team’s graph runtime) where graph semantics genuinely help, and direct SDK calls everywhere else. This is the single most consequential opinion in our default stack.
  • Self-hosted Langfuse, not LangSmith. LangSmith is excellent in the LangChain ecosystem; Langfuse is provider-neutral, OTel-native, and self-hostable on Postgres, which matters in roughly half our engagements where data residency is a constraint.
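The alias indirection and fallback chain behind the "config commit" claim can be sketched in a few lines. This is a hedged illustration, not the internal model-router package: the alias table, the placeholder model IDs, the route function, and the injected call_model callable are all assumptions made for the example, and no real gateway client is invoked.

```python
import time

# Hypothetical alias table: application code asks for an alias, never a vendor
# model ID. Swapping a deprecated model means editing this table only.
MODEL_ALIASES = {
    "fast": ["vendor-a/small-model", "vendor-b/small-model"],
    "smart": ["vendor-a/large-model", "vendor-b/large-model"],
}


class AllModelsFailed(Exception):
    """Raised when every model in an alias's fallback chain has failed."""


def route(alias, prompt, call_model, retries=2, backoff_s=0.0):
    """Try each model behind `alias` in order, retrying each before falling back.

    `call_model(model_id, prompt)` is an injected function (an assumption here);
    in practice it would wrap whatever gateway client is in use.
    """
    errors = []
    for model_id in MODEL_ALIASES[alias]:
        for attempt in range(retries):
            try:
                return model_id, call_model(model_id, prompt)
            except Exception as exc:  # a real router would catch narrower errors
                errors.append((model_id, attempt, repr(exc)))
                time.sleep(backoff_s * (2 ** attempt))
    raise AllModelsFailed(errors)
```

Because callers only ever name "fast" or "smart", a vendor deprecation touches the alias table and nothing else in the codebase.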

For the deeper engineering case behind the defaults, see Evaluating LLM Development Companies and LLM Integration Pricing Guide for Enterprise.

Axis 2: Templates (the document library)

Engagements ship in documents almost as much as in code. Each template eliminates a class of recurring work; each is committed to the repo as the default starting point.

| Template | Sections | Why this format |
| --- | --- | --- |
| Engagement charter (1 page) | Goal, two-week milestones, eval threshold, demo cadence, decision log link, exit criteria. | Single page forces clarity. The eval threshold and exit criteria fields are the two that bespoke charters usually omit and AI engagements usually need. |
| Architecture decision record (ADR) | Context, decision, alternatives considered, consequences, deprecation date if relevant. | The “alternatives considered” field is what makes future-you trust past-you. The deprecation date acknowledges that 2026 architecture decisions have shorter half-lives than 2018 ones. |
| Eval test set spec | Domain, golden inputs, expected outputs or assertions, threshold, owner, regeneration cadence. | A non-engineer can read this and know whether the eval matches the contract. The regeneration-cadence field forces a conversation about test-set drift before it becomes a problem. |
| Weekly demo deck (5 slides) | What shipped, what passed evals, what cost more or less, what’s next, one open question. | Five slides, no exceptions. The “cost” slide is the one that surfaces token-arbitrage drift before a finance conversation does. |
| Postmortem (one page) | Trigger, blast radius, root cause, contributing factors, fix, what we change in the playbook. | The “what we change in the playbook” field is what turns a postmortem into compounding judgment instead of a private apology. |
| Eval review log (running) | Date, failing inputs, owner, root cause, fix, regression test added (Y/N). | A daily artifact that becomes a quarterly leading indicator of where the system is weakest. |
| Onboarding playbook (client-side, 2 pages) | Codebase access checklist, eval threshold negotiation script, demo-cadence agreement, decision-log conventions, escalation path. | The first 14 days are what the engagement is judged on. A playbook makes them predictable; see Anatomy of an AI Agency Engagement. |
| Cost budget memo (per feature) | Feature, model mix, expected per-call cost, monthly volume estimate, alarm threshold. | A document, not a dashboard, because a number you have to write is a number you have to defend. |

Two notes on how the templates connect:

  • Charter → eval spec → cost memo → demo deck is the primary loop. The charter sets the eval threshold; the eval spec encodes it; the cost memo bounds it economically; the demo deck reports against all three weekly. Engagements that skip any one of the four start drifting in week three and start fighting in week eight.
  • ADRs are the only template that lives in the repo. Every other template lives in Notion. Architecture decisions live next to the code they govern, because that is where the next engineer to make a related decision will read them.
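The cost budget memo is ultimately a small piece of arithmetic, which a short sketch can make explicit. The field names mirror the memo's sections; the class name, the traffic split, and the dollar figures in the usage note are illustrative assumptions, not numbers from a real engagement.

```python
from dataclasses import dataclass


@dataclass
class CostBudgetMemo:
    feature: str
    model_mix: dict            # model alias -> share of calls (shares sum to 1.0)
    per_call_cost_usd: dict    # model alias -> expected cost per call, in USD
    monthly_calls: int         # monthly volume estimate
    alarm_threshold_usd: float

    def expected_monthly_cost(self):
        """Blend per-call cost by traffic share, then scale by monthly volume."""
        blended = sum(share * self.per_call_cost_usd[alias]
                      for alias, share in self.model_mix.items())
        return blended * self.monthly_calls

    def breaches_alarm(self, actual_monthly_cost_usd):
        """True when observed spend is above the memo's alarm threshold."""
        return actual_monthly_cost_usd > self.alarm_threshold_usd
```

For example, a feature that sends 80 percent of 50,000 monthly calls to a $0.002-per-call model and 20 percent to a $0.04-per-call model has a blended cost of $0.0096 per call, or roughly $480 a month. Writing that number down next to a $600 alarm threshold is what makes the memo defensible.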

Axis 3: Rituals (the week on the calendar)

Rituals are the time axis. The defaults are spelled out in Decoding the AI Agency Stack; the specifics here are the products and templates each ritual consumes.

| Ritual | Cadence | Tool | Template consumed | Template produced |
| --- | --- | --- | --- | --- |
| Async standup | Daily, before 10am local | Slack thread (no video) | Yesterday / today / blockers | None |
| Plan-and-demo (Monday) | Weekly | Zoom + repo + Notion | Engagement charter; last week’s demo deck | Week’s spec-review notes |
| Eval review | Daily, 15 min | Langfuse + Promptfoo dashboard | Eval review log (running) | Updated eval review log; Linear tickets for failures |
| Wed eval architecture review | Weekly, 30 min | Repo + Langfuse | Eval test set spec; eval review log | New eval suites; threshold deltas |
| Friday client demo | Weekly, 30 min | Zoom + screen share | Weekly demo deck | Same deck, archived to Notion |
| Monthly architecture review | Monthly, 3 hr | Repo + Notion | Open ADRs; deprecation calendar; cost memos | New ADRs; closed deprecations |
| Postmortem (on incident) | Within 72 hr of an incident | Repo + Notion | Incident timeline | Postmortem doc; playbook delta |

Three notes:

  • Monday plan-and-demo, not Monday plan + Friday demo only. Demoing on Monday what shipped after last Friday’s call compresses the feedback loop that keeps Friday demos honest. Without it, Friday becomes a stage and Monday becomes a re-litigation of last week’s scope.
  • Wednesday eval architecture review is the ritual most studios skip. It is not the daily eval-failure triage; it is a 30-minute weekly look at the eval suite itself: what coverage is missing, what threshold needs to move, what golden-input set has drifted out of representativeness. A daily review without a weekly architecture review compounds blindspots.
  • Monthly architecture review consumes ADRs and produces ADRs. That recursive consumption is what makes the architecture self-correcting on a quarterly horizon. The same loop is described in operational terms in Inside the AI Agency Operating System.

Axis 4: Reusable artifacts (the internal package shelf)

The shelf of internal packages, configs, and harnesses that pre-date every engagement. Each is versioned, documented, and owned.

| Artifact | Implementation | What it replaces in every project |
| --- | --- | --- |
| model-router (private package) | TypeScript wrapper around LiteLLM with retry policy, fallback chain, per-call cost emission, and model-alias indirection. | A custom fetch + retry + cost-counting layer in every codebase. |
| evalkit (private package) | Python harness for stateful agent evals; emits Langfuse-compatible traces; reads thresholds from evals/thresholds.yaml. | A bespoke Pytest harness in each repo; ad-hoc threshold negotiation. |
| prompt-registry-cli | CLI wrapping Langfuse Prompts: prompt sync, prompt rollback, prompt diff. | Engineers SSHing into prod to revert a prompt. |
| retry-and-budget (TS + Python) | Token-bucket budget enforcer that combines latency-based retry with a per-feature monthly cost cap. Throws a typed BudgetExceeded rather than 500-ing. | The “we forgot we were retrying with exponential backoff on a $0.04-per-call model” postmortem. |
| rag-starter | Forked baseline (Next.js + pgvector + ingest pipeline + retrieval evals) with chunking strategy parameterized. | Three sprints of “where do we put the embeddings.” |
| obs-dashboards-as-code | Terraform/Pulumi modules for Langfuse projects, Sentry projects, Datadog dashboards, and per-feature cost alarms. | Click-ops in three SaaS dashboards on day one of a project. |
| eval-template-library | 40+ eval suites by domain (regulatory Q&A, code-search, financial extraction, medical triage handoff, structured extraction). | A custom eval set built from scratch in week three. |
| prompt-pack | Battle-tested system prompts by task (summarization, structured extraction, tool-use loops, RAG synthesis). | The first month of prompt iteration. |
| engagement-template (repo skeleton) | A degit-able starter with the model-router, evalkit, GHA workflows, ADR folder, and evals/ structure pre-configured. | Day-one setup time on every engagement. |
| postmortem-library (private wiki) | Anonymized writeups of past production failures and the playbook delta each generated. | Senior judgment locked in tribal knowledge. |
| hiring-rubric + take-home | Calibrated take-home (build a small agent; ship it with one eval) and on-site loop. | Hiring drift as the team grows. |

Two non-obvious choices a buyer should ask about:

  • retry-and-budget is shared infrastructure, not a copy-paste pattern. The first time an engineer copy-pastes the wrong exponential-backoff constant is the day cost-of-goods-sold gets a 4× outlier in a monthly review. The package version-pins the policy.
  • engagement-template is what makes day-one a real day-one. A degit of the template scaffolds the repo, the GHA workflows, the eval folder, and the model-router config in twenty minutes. Without it, a new engagement loses its first three days to setup that adds zero client-visible value.
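The core idea of retry-and-budget, that retries spend money and must be charged against the same cap, can be shown in a simplified sketch. This is not the package itself: it uses a plain running total rather than a token bucket, omits backoff timing, and every name other than BudgetExceeded (FeatureBudget, call_with_budget, the parameters) is hypothetical.

```python
class BudgetExceeded(Exception):
    """Typed error raised when a call would push a feature past its monthly cap."""


class FeatureBudget:
    def __init__(self, feature, monthly_cap_usd):
        self.feature = feature
        self.monthly_cap_usd = monthly_cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Record spend for one attempt, refusing it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            raise BudgetExceeded(
                f"{self.feature}: cap of ${self.monthly_cap_usd:.2f} reached")
        self.spent_usd += cost_usd


def call_with_budget(budget, fn, cost_per_call_usd, max_attempts=3):
    """Run `fn`, charging the budget for every attempt, retries included."""
    last_error = None
    for _ in range(max_attempts):
        # Charge first: BudgetExceeded propagates before any further spend.
        budget.charge(cost_per_call_usd)
        try:
            return fn()
        except Exception as exc:  # a real policy would retry only transient errors
            last_error = exc
    raise last_error
```

With a $0.10 cap and a $0.04-per-call model, a call that keeps timing out is stopped on its third attempt by BudgetExceeded instead of quietly retrying past the cap. The caller gets a typed error it can handle, not a generic 500.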

The artifacts compound: each engagement either consumes from the shelf or contributes back to it. An engagement that does neither is a yellow flag in retro.

How standardization compounds margin

The economic argument is mechanical. A 2026 engagement carries roughly the same fixed setup work (eval scaffolding, model-routing logic, observability dashboards, retry policy, prompt registry, demo-deck format, charter negotiation) whether the studio re-builds it or pulls it from the shelf; what differs is the cost. Setup cost in our books is roughly twelve engineer-days when re-built, and roughly two when pulled from the shelf. Across ten engagements per year, the ten-day difference adds up to one hundred engineer-days of recovered capacity, about half a senior engineer’s annual output, applied directly to client-visible work or studio R&D.

The compounding part is the second derivative. Every engagement that consumes from the shelf also contributes a delta back: a new eval template, a postmortem, a prompt-pack entry, a model-router edge case. The shelf grows on a schedule the competition does not. Five years in, the gap between a studio that standardized in year one and one that did not is not a process gap; it is a cost-of-goods-sold gap of forty to sixty percent on comparable scope. The same compounding shows up on the other side of the ledger: setup cost per engagement halves roughly every two years as the shelf matures, while the staff-aug shop’s setup cost stays flat.
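The arithmetic behind these figures is simple enough to write down. The sketch below uses the planning numbers from the text (twelve engineer-days re-built, two from the shelf, ten engagements a year); the function names are hypothetical, and the starting value passed to the halving function is an assumption, since the text does not say which baseline the halving applies to.

```python
REBUILT_DAYS = 12          # engineer-days of setup when re-built (from the text)
SHELF_DAYS = 2             # engineer-days of setup when pulled from the shelf
ENGAGEMENTS_PER_YEAR = 10


def recovered_days(engagements, rebuilt=REBUILT_DAYS, shelf=SHELF_DAYS):
    """Engineer-days recovered per year by pulling setup from the shelf."""
    return (rebuilt - shelf) * engagements


def setup_days_after(years, start_days, halving_period_years=2.0):
    """Setup cost after `years`, if it halves every `halving_period_years`."""
    return start_days * 0.5 ** (years / halving_period_years)
```

Here recovered_days(ENGAGEMENTS_PER_YEAR) gives the one hundred engineer-days quoted above. As an illustration of the halving claim only, setup_days_after(4, 12) gives 3.0: a twelve-day setup cost that halves every two years is down to three days after four years.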

Two failure modes to watch:

  • Standardization without ownership is brittle. Each artifact needs a named owner, usually the role that uses it most. The model-router is owned by the founding engineer; the eval-template library is owned by the eval engineer; the postmortem library is owned by the agent SRE. Unowned artifacts rot fastest.
  • Standardization that fights the engagement loses the engagement. Defaults are defaults, not laws. A regulated engagement may need a different vector store; a long-running agent may need Temporal instead of Inngest. The ADR is the mechanism that lets the deviation happen and stay legible.

What a buyer can verify

A reference architecture is a thing on disk, not a thing on a slide. In a procurement call, ten minutes of pointed questions confirm whether the architecture is real:

  1. “Show me your model-router package and its README.” A real studio has the repo open in 30 seconds, with retries, fallbacks, and per-call cost emission visible in the code. A staff-aug shop pivots to a slide.
  2. “What’s the name of your eval framework, and where does the threshold file live?” The answer should be a single sentence: Promptfoo, in evals/thresholds.yaml, gated by the evals-required GHA check. Anything vaguer is improv.
  3. “Pull up your eval-template-library. What’s there for my domain?” A 30-second tour of the directory beats any case study. If the library is empty for the buyer’s domain, the studio should say so and price the gap.
  4. “What was the last ADR you wrote?” A real studio has it in the repo. The contents matter less than the cadence.
  5. “What did your last weekly demo deck look like?” The five-slide template either looks lived-in or it does not.

If the studio cannot pass these in ten minutes, the operating model is being narrated, not lived. The same vetting frame extended to roles, rituals, and reviews is in Decoding the AI Agency Stack. And the broader case for why a senior-density studio is the right shape to wield this architecture is in Inside the AI Agency Operating System.

The reference architecture is not an aesthetic preference. It is the operating leverage that lets a 12-person studio out-ship a 50-person agency, year after year, on margin that compounds.

Frequently asked questions

What is an AI agency reference architecture?

A reference architecture is the studio’s named defaults across four axes: tools (the default product for each engineering capability: LLM gateway, eval framework, observability, deploy platform), templates (the default document for each engagement artifact: charter, ADR, eval spec, demo deck), rituals (the default week on the calendar: Monday plan-and-demo, daily eval review, Friday client demo, monthly architecture review), and reusable artifacts (the default internal packages: model-router, evalkit, prompt-registry, retry-and-budget). It exists so that the second engagement is meaningfully cheaper than the first, and the tenth is dramatically cheaper.

Why does standardization matter more for AI work than for traditional web work?

AI workloads have failure modes (silent regressions, cost drift, model deprecations, prompt rot) that traditional web work does not. Each failure mode needs a named tool and a named ritual to catch it: an eval framework, a cost telemetry layer, a model router, a prompt registry. A studio that re-invents this surface on every engagement burns the first sprint of every project on plumbing that the second sprint of the next project will burn again. Standardization here is not preference; it is the only way to ship AI work at agency margins.

What LLM gateway should an AI agency standardize on?

We default to LiteLLM (self-hosted) wrapped by an internal model-router package, evaluated against OpenRouter and bespoke fetch wrappers. LiteLLM gives a single OpenAI-compatible interface across Anthropic, OpenAI, Google, Bedrock, and OSS endpoints; the wrapper adds retries, fallbacks, per-call cost telemetry, and a model-alias indirection layer that turns a vendor deprecation into a config commit. Without this layer, every model migration is a one-to-two-sprint tax.

What eval framework should an AI agency use?

Promptfoo for declarative golden-input regression suites and a thin in-house evalkit for stateful, multi-step agent runs. Promptfoo’s YAML surface is the cleanest available for assertion DSLs and threshold gating; agentic runs need richer instrumentation than its flat shape supports. Both pipe results into Langfuse so traces and eval failures live in one place. The evals-required GitHub Actions check is the studio-standard CI gate.

Why not LangChain in production?

LangChain’s abstraction surface changes faster than its docs, and the indirection makes traces nearly unreadable when something goes wrong at 2am. We use LangGraph, the same team’s graph runtime, where graph semantics help, and we make direct SDK calls (Anthropic SDK, OpenAI SDK) for the 70 percent of features where they are simpler. This is the single most consequential opinion in the default stack and the one a buyer should ask the most pointed questions about.

What templates does an AI agency need beyond a statement of work?

Eight, in priority order: a one-page engagement charter with eval threshold and exit criteria, the ADR format with an alternatives-considered section and a deprecation date, an eval test set spec with regeneration cadence, a five-slide weekly demo deck, a one-page postmortem, a running eval review log, a two-page client-side onboarding playbook, and a per-feature cost budget memo. The charter, eval spec, cost memo, and demo deck form a connected loop; the others handle incidents and the first 14 days.

What rituals does an AI agency reference architecture include?

Seven: an async daily standup, Monday plan-and-demo, daily 15-minute eval review, weekly 30-minute Wednesday eval architecture review, Friday client demo, monthly 3-hour architecture review, and on-incident postmortem within 72 hours. The Wednesday eval architecture review, distinct from daily eval triage, is the one most studios skip and pay for in eval blindspot accumulation.

What internal packages should an AI agency build?

Eleven: model-router, evalkit, prompt-registry-cli, retry-and-budget, rag-starter, obs-dashboards-as-code, an eval-template-library, a prompt-pack, an engagement-template repo skeleton, a postmortem-library wiki, and a calibrated hiring-rubric plus take-home. Each is versioned, documented, and owned by the role that uses it most. Together they collapse engagement setup time from roughly twelve engineer-days to roughly two.

How does a standardized stack compound margin over time?

Setup cost per engagement falls from roughly twelve engineer-days to roughly two when artifacts are pulled from the shelf rather than re-built. Across ten engagements per year that is one hundred engineer-days of recovered capacity, about half a senior engineer’s annual output. The shelf also grows on a schedule: every engagement contributes a postmortem, an eval template, or a prompt-pack entry. Five years in, a studio that standardized in year one runs at a 40–60 percent cost-of-goods-sold advantage over a comparable shop that did not.

How can a buyer verify a vendor’s reference architecture is real?

Ten minutes of pointed questions on a procurement call. Ask to see the model-router repo and its README; ask the name of the eval framework and where the threshold file lives; ask for a directory tour of the eval-template library; ask for the last ADR they wrote; ask for the last weekly demo deck. A real studio passes all five inside ten minutes with screen-shared evidence. A staff-aug shop pivots to a slide deck on at least three of the five, which is the same theatre signal described across The AI Agency Manifesto.

Last Updated: May 23, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
