The unit of progress in the first 14 days of an AI agency engagement is the artifact, not the meeting. Every onboarding period has the same problem to solve: a new agency has to absorb a client’s domain, codebase, data, constraints, and political topology while simultaneously producing measurable work that justifies the spend. The agencies that solve it well do so by producing a small, named set of artifacts on a tight cadence. The agencies that solve it badly produce decks. This is the playbook in nine artifacts: what each is, who owns it, what it looks like, and how it gets reused for the rest of the engagement.
The frame here sits inside the AI agency manifesto: an AI dev partner in 2026 is forward-deployed, eval-disciplined, and accountable for production behavior. The artifacts in this playbook are the operational implementation of that stance. None of them are exotic; all of them are checked into the client’s repo, live next to the code, and are revisited every week. The deliberate choice across the nine is that the artifact is the meeting’s output, not its agenda. If the meeting did not produce the artifact, the meeting did not happen.
Each artifact below has the same structure: name and file path, purpose, who owns it, what it looks like, and how it is reused once the engagement is past day 14. The list is opinionated and it is enough. Agencies that ship 11 artifacts in 14 days are over-engineering the onboarding; agencies that ship six are missing structure. Nine, in our experience across forward-deployed engagements, is the right shape.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Artifact 1: engagement-charter.md
Purpose. Names the problem, the user, the success metric, the eval threshold, and the on-call person on each side. Forces the kickoff conversation to converge on specifics rather than dissolving into intent statements.
Owner. Agency tech lead drafts; client product owner countersigns.
What it looks like. One page, in Markdown, in the client’s repo at docs/engagement-charter.md. Sections: Problem (3 sentences), User (named role and workflow), Success metric (numeric, tied to a business outcome), Eval threshold (the number we cross to call this done), On-call (one named person each side, with backup), Cadence (demo, retro, and PR review schedule), Out-of-scope (explicit list of what we are not solving).
Reuse. Referenced every week in the demo and retro. Updated as the success metric is sharpened. Cited explicitly when scope changes are proposed; if a change does not move the success metric, it does not get done. The charter is the ground truth that keeps the engagement honest in month four.
Artifact 2: eval-baseline.md
Purpose. Records the 20–50 ground-truth eval cases, the pass/fail rubric, the threshold tied to a business outcome, the baseline number against the current system or stub, and the CI integration that runs the suite per-PR.
Owner. Agency engineer drafts the rubric; client domain expert ratifies the ground-truth cases.
What it looks like. Markdown plus code. The Markdown lives at docs/eval-baseline.md and explains the rubric; the eval cases and runner live at evals/ in the repo. Each case has an input, an expected output (or expected property), and a pass/fail criterion. The baseline number, meaning the score the current system or stub gets against the suite, is recorded with a date and a commit SHA.
Reuse. Every PR’s description includes the eval delta against this baseline. Every model upgrade is gated by it. Every new failure mode discovered in production becomes a new eval case appended to the suite. The baseline number on day 2 is the starting point of the curve that ends with the shipped feature on day 14, and it remains the curve for the life of the agent. The discipline is described further in the AI agency operating system.
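A minimal sketch of how the eval delta cited in a PR description might be computed. It assumes the simplest possible rubric (exact match after trimming whitespace); a real suite may check a property of the output instead. The names EvalCase, run_suite, and eval_delta, along with the cases and the two toy systems, are hypothetical illustrations and not part of any specific framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    """One ground-truth case: an input and the output the rubric expects."""
    case_id: str
    input_text: str
    expected: str


def run_suite(system: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the pass rate (0.0 to 1.0) of `system` over `cases`.

    Assumption: a case passes on exact match after trimming whitespace.
    """
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for c in cases if system(c.input_text).strip() == c.expected)
    return passed / len(cases)


def eval_delta(baseline: float, current: float) -> str:
    """Format the one-line delta a PR description could include."""
    delta = current - baseline
    return f"eval: {current:.0%} (baseline {baseline:.0%}, delta {delta:+.0%})"


# Hypothetical cases, for illustration only.
CASES = [
    EvalCase("c1", "2+2", "4"),
    EvalCase("c2", "3+3", "6"),
    EvalCase("c3", "10-7", "3"),
    EvalCase("c4", "5*5", "25"),
]


def stub_system(prompt: str) -> str:
    # Day-2 stub: always answers "4", so it passes only case c1.
    return "4"


def candidate_system(prompt: str) -> str:
    # Candidate under review: handles three of the four cases.
    answers = {"2+2": "4", "3+3": "6", "10-7": "3"}
    return answers.get(prompt, "")


baseline_score = run_suite(stub_system, CASES)
candidate_score = run_suite(candidate_system, CASES)
print(eval_delta(baseline_score, candidate_score))
```

In practice the baseline score would be read from the recorded value (with its date and commit SHA) rather than recomputed, so the delta is always measured against the same reference point.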
Artifact 3: data-audit.md
Purpose. Characterizes every data source the system will touch: source of truth, schema, freshness, volume, access pattern, PII boundary, retention policy, quality. Five sources characterized sharply beat 25 listed superficially.
Owner. Agency engineer with the client data engineer.
What it looks like. A Markdown table at docs/data-audit.md, one row per source, with columns: source name, system of record, schema reference, freshness SLA, volume (per day), access mechanism, PII flag, retention period, known quality issues, owner. Rows that cannot be filled in are flagged in red and become the highest-priority follow-ups for week 2.
Reuse. Read every time a new feature touches a new data source. Updated when a source is added or its SLA changes. The single most predictive document of whether the system will survive month 2 in production: most production incidents trace back to a row in this table whose PII flag, freshness SLA, or quality column was either wrong or empty.
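The "rows that cannot be filled in" rule can be checked mechanically. The sketch below models one audit row with the columns listed above and reports which cells are still empty per source. The AuditRow class, the helper names, and the two example sources are hypothetical; it assumes an empty cell is represented as None, so a PII flag of False counts as filled in.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class AuditRow:
    """One row of the data audit; None marks a cell nobody could fill in."""
    source_name: str
    system_of_record: Optional[str] = None
    schema_reference: Optional[str] = None
    freshness_sla: Optional[str] = None
    volume_per_day: Optional[int] = None
    access_mechanism: Optional[str] = None
    pii_flag: Optional[bool] = None
    retention_period: Optional[str] = None
    known_quality_issues: Optional[str] = None
    owner: Optional[str] = None


def missing_cells(row: AuditRow) -> List[str]:
    """Return the column names that are still empty for this source."""
    return [f.name for f in fields(row) if getattr(row, f.name) is None]


def week2_followups(rows: List[AuditRow]) -> Dict[str, List[str]]:
    """Map each incomplete source to its empty columns (the flagged rows)."""
    return {r.source_name: missing_cells(r) for r in rows if missing_cells(r)}


# Hypothetical sources, for illustration only.
rows = [
    AuditRow("orders_db", "Postgres", "schemas/orders.sql", "15 min", 120000,
             "read replica", True, "7 years", "none known", "data-eng"),
    AuditRow("support_tickets", system_of_record="Helpdesk export", pii_flag=True),
]

followups = week2_followups(rows)
for source, columns in followups.items():
    print(f"{source}: missing {', '.join(columns)}")
```

A check like this only catches empty cells; a cell that is filled in but wrong still needs the client data engineer's review.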
Artifact 4: problem-narrative.md
Purpose. A 600–1000-word narrative that names the user, the workflow, the existing system, the gap, the constraints, and the criterion under which the project is unambiguously a success. Not a PRD, not a spec; a narrative.
Owner. Agency tech lead, with a 30-minute review by the client product owner.
What it looks like. Prose at docs/problem-narrative.md. Six sections: Who, What they do today, What is broken, Why it is hard, What we are building, How we know we are done. No bullet points; the prose is a forcing function for understanding the problem. A narrative that reads like a marketing one-pager is wrong; a narrative that reads like an engineer briefing another engineer is right.
Reuse. Read by every engineer joining the project. Cited by senior reviewers when an architecture decision feels off; the narrative is the source of truth for “why are we even building this.” Updated when the team’s understanding of the problem shifts, with a date and a one-line change log at the top.
Artifact 5: ADR-001-stack-decision.md
Purpose. A single architecture decision record covering the model selection, the routing/abstraction layer, the retrieval strategy, the tool-call boundary, the caching strategy, the fallback strategy, the observability stack, and the cost ceiling. Every choice is named and justified in two sentences.
Owner. Agency tech lead, reviewed by client engineering lead in a 60-minute architecture session.
What it looks like. Markdown at docs/adr/0001-stack-decision.md. Each decision is a section: Decision (what), Status (proposed / accepted), Context (why this decision is being made now), Alternatives considered (with one line on why each was rejected), Consequences (what this constrains downstream). Subsequent ADRs (0002, 0003) extend or supersede; 0001 is never silently edited.
Reuse. Cited every time someone proposes a stack change. Updated by writing a new ADR that supersedes the relevant section of 0001. Read by every engineer joining the project. The corpus of ADRs over time becomes the project’s architectural memory and is the closest thing to onboarding documentation that engineers read.
Artifact 6: repo-access-matrix.md
Purpose. Names every system the agency needs access to, the access level, who provisioned it, when, and the deprovisioning trigger. Prevents the standard failure mode of an engagement ending and the agency still having access to the staging environment six months later.
Owner. Agency project lead, countersigned by client engineering lead.
What it looks like. A Markdown table at docs/repo-access-matrix.md, one row per system. Columns: system, access level, who has it (named), provisioned date, provisioned by, deprovisioning trigger (engagement end, role change, etc.). Includes the model API keys, CI secrets, staging environment, production read-only access if applicable, monitoring tools, the data warehouse, and any third-party tools (Linear, Slack, Notion).
Reuse. Reviewed at the end of each engagement increment. Used as the deprovisioning checklist when the engagement ends or a team member rotates off. Audited quarterly. Mapped to the AI agency exit clause so deprovisioning is contractually and operationally aligned.
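One way to turn the matrix into a deprovisioning checklist is to filter its rows by trigger. The sketch below is an illustration under simplifying assumptions: each grant carries exactly one trigger string, and matching is by exact string comparison. The AccessGrant class, the deprovisioning_checklist helper, and the people and dates in the example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AccessGrant:
    """One row of the access matrix."""
    system: str
    access_level: str
    holder: str
    provisioned_on: date
    provisioned_by: str
    deprovisioning_trigger: str  # e.g. "engagement end", "role change"


def deprovisioning_checklist(
    grants: List[AccessGrant],
    trigger: str,
    holder: Optional[str] = None,
) -> List[AccessGrant]:
    """Return the grants to revoke when `trigger` fires, sorted by system.

    If `holder` is given, only that person's grants are returned.
    """
    matched = [
        g for g in grants
        if g.deprovisioning_trigger == trigger
        and (holder is None or g.holder == holder)
    ]
    return sorted(matched, key=lambda g: g.system)


# Hypothetical grants, for illustration only.
grants = [
    AccessGrant("staging environment", "deploy", "Engineer A",
                date(2026, 1, 5), "Client Eng Lead", "engagement end"),
    AccessGrant("model API keys", "read/write", "Engineer A",
                date(2026, 1, 5), "Client Eng Lead", "engagement end"),
    AccessGrant("data warehouse", "read-only", "Engineer B",
                date(2026, 1, 8), "Client Data Eng", "role change"),
    AccessGrant("CI secrets", "admin", "Engineer B",
                date(2026, 1, 6), "Client Eng Lead", "engagement end"),
]

for g in deprovisioning_checklist(grants, "engagement end"):
    print(f"revoke {g.access_level} on {g.system} from {g.holder}")
```

Passing holder narrows the list to one person, which covers the rotation case: deprovisioning_checklist(grants, "role change", holder="Engineer B") returns only the data warehouse grant.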
Artifact 7: escalation-tree.md
Purpose. Names who gets called when something breaks, in what order, with what response-time expectation. Distinguishes “agent producing wrong outputs” from “agent down” from “cost spiking” from “PII leaking”; each has a different tree.
Owner. Agency tech lead with client engineering lead.
What it looks like. Markdown at docs/escalation-tree.md. Sections by incident class: Wrong-output incident (severity scale, who triages, response SLA, customer-comms protocol), Outage (provider outage versus our outage, fallback path, status page), Cost runaway (alert threshold, who has authority to kill traffic, how the bill stops growing), PII leak (legal notification path, evidence preservation, customer comms). Includes phone numbers, paging tooling, and the named-substitute on each side.
Reuse. Drilled in week 3 with a tabletop exercise. Updated after every real incident with what worked and what did not. Read at the start of every retro. The escalation tree is the artifact that turns the post-mortem culture from individual heroics into operational discipline.
Artifact 8: demo-cadence-calendar.md
Purpose. Names every demo, retro, architecture review, and stakeholder check-in across the first 12 weeks of the engagement, on a shared calendar, with the named attendees and the demo content. Forces the cadence to be visible rather than negotiated in Slack each week.
Owner. Agency project lead.
What it looks like. Markdown at docs/demo-cadence-calendar.md with a table: date, type (demo / retro / architecture review / stakeholder check-in), attendees (named), demo content (what we will show), preparation deliverable (what must be in the repo by 24 hours before). Mirrored to the actual calendar (Google Calendar, Outlook) with the same titles so the artifact and the calendar do not diverge.
Reuse. Updated each week as the cadence sharpens. The demo-cadence calendar is the document that keeps the engagement on a heartbeat; an engagement without one drifts into ad-hoc meetings within three weeks. Tied to the rhythm described in “Inside the SFAI Labs operating cadence.”
Artifact 9: pricing-passthrough-baseline.md
Purpose. Records the model API costs, the tool API costs (search providers, embedding providers, vector store), the infra costs (hosting, observability), and the per-call unit economics at the start of the engagement. Becomes the baseline for cost-related conversations across the rest of the engagement.
Owner. Agency project lead, with the client finance partner.
What it looks like. A Markdown table at docs/pricing-passthrough-baseline.md. Columns: line item, monthly estimate, unit economic (cost per call, cost per query, cost per session), source of estimate, who pays (client direct, agency markup, agency passthrough). Includes the cost ceiling from the engagement charter as a row at the bottom and a “headroom” calculation between the current run-rate and the ceiling.
Reuse. Updated monthly as actual costs land. Used as the input to every model-upgrade go/no-go decision. Cited in every change-order conversation that has cost implications. The pricing-passthrough baseline prevents the standard failure mode where the agency and the client discover they have different mental models of the unit economics in month 4.
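The headroom row is simple arithmetic: cost ceiling minus current monthly run-rate. The sketch below works one example through. All figures are made-up planning numbers rather than real prices, and the LineItem class and helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineItem:
    """One row of the pricing baseline."""
    name: str
    monthly_estimate: float  # currency units per month
    who_pays: str            # "client direct", "agency markup", or "agency passthrough"


def run_rate(items: List[LineItem]) -> float:
    """Total estimated monthly spend across all line items."""
    return sum(item.monthly_estimate for item in items)


def headroom(items: List[LineItem], cost_ceiling: float) -> float:
    """Ceiling minus run-rate; a negative value means the ceiling is breached."""
    return cost_ceiling - run_rate(items)


def cost_per_call(items: List[LineItem], calls_per_month: int) -> float:
    """Blended unit economic: monthly run-rate divided by monthly call volume."""
    if calls_per_month <= 0:
        raise ValueError("calls_per_month must be positive")
    return run_rate(items) / calls_per_month


# Illustrative figures only; not real prices.
items = [
    LineItem("model API", 3000.0, "agency passthrough"),
    LineItem("embedding provider", 400.0, "agency passthrough"),
    LineItem("vector store", 600.0, "client direct"),
    LineItem("hosting and observability", 1000.0, "client direct"),
]
COST_CEILING = 8000.0      # assumed to come from the engagement charter
CALLS_PER_MONTH = 100_000  # assumed traffic estimate

print(f"run-rate:      {run_rate(items):.2f}")
print(f"headroom:      {headroom(items, COST_CEILING):.2f}")
print(f"cost per call: {cost_per_call(items, CALLS_PER_MONTH):.4f}")
```

With these numbers the run-rate is 5000, the headroom is 3000, and the blended cost per call is 0.05. Re-running the same calculation with actuals each month shows whether headroom is shrinking before a model-upgrade decision is made.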
How the nine compose
The nine artifacts are not independent. They form a system: the engagement charter sets the goal, the eval baseline operationalizes the goal, the data audit characterizes the inputs, the problem narrative makes the problem legible, the ADR commits to a structure, the access matrix and escalation tree handle the operational substrate, the demo-cadence calendar enforces the rhythm, and the pricing-passthrough baseline keeps the unit economics honest. Skip any one and the engagement leans on improvisation in the area the missing artifact would have covered.
The temporal layout is also non-trivial. Days 1–2 produce charter and eval baseline. Days 3–5 produce data audit, problem narrative, and ADR. Days 6–8 produce access matrix, escalation tree, and demo-cadence calendar. Days 9–14 produce the pricing-passthrough baseline alongside the first eval-gated PR (which is a tenth artifact, but one that is implicit in the eval baseline rather than separately named). The full cadence is laid out in the anatomy of an AI agency engagement.
What the nine artifacts replace
Each artifact replaces a category of meeting that legacy consulting engagements rely on. The charter replaces the kickoff workshop. The eval baseline replaces the “what does success look like” workshop. The data audit replaces the data deep-dive. The problem narrative replaces the discovery deck. The ADR replaces the architecture review series. The access matrix replaces the operations handoff. The escalation tree replaces the runbook workshop. The demo-cadence calendar replaces the project-management Gantt chart. The pricing-passthrough baseline replaces the budget review.
The substitution is the point. Meetings without artifacts produce the illusion of progress; artifacts without meetings produce progress that everyone can see. The nine artifacts make the onboarding period observable to every stakeholder (engineering, product, finance, legal, the client’s CEO) without those stakeholders having to attend the meetings. Onboarding becomes a portfolio of nine documents in a folder, each one objectively present or absent, each one referenceable in any subsequent decision.
The honest limit of the playbook
The nine-artifact playbook is a starting structure, not a doctrine. Specific engagements add: a security threat model when the agent is touching regulated data, a localization plan when the agent is shipping in multiple languages, a model-cost-attribution document when the client is reselling the agent. The nine are the floor. The agencies that go below the floor are running an unstructured onboarding; the agencies that go above are tuning the floor to the engagement.
The other honest limit is that the artifacts only matter if they are read. An engagement charter that lives in docs/ and is never opened after day 1 is a vanity document. The discipline that makes the playbook work is the practice of citing the artifacts in PR descriptions, demos, and retros, turning them from documents into operating tools. Agencies that build the discipline alongside the artifacts pull away. Agencies that ship the documents and stop using them are running theater. Nine artifacts is the shape; the practice of using them is the substance.
Arthur Wandzel is the founder of SFAI Labs. The nine-artifact playbook is the standard onboarding shape for forward-deployed AI engagements at SFAI and is shipped on every engagement, regardless of size or domain.
Frequently Asked Questions
What are the 9 artifacts an AI agency should produce during client onboarding?
Engagement charter, eval baseline, data audit, problem narrative, ADR-001 stack decision, repo access matrix, escalation tree, demo-cadence calendar, and pricing-passthrough baseline. All nine are produced inside the first 14 days, all live in the client’s repo as Markdown next to the code, and all are referenced in subsequent decisions across the engagement. Together they replace the kickoff workshops, discovery decks, and runbook workshops typical of legacy consulting engagements with documents that are objectively present or absent.
Why should onboarding artifacts live in the client’s repo rather than a Google Doc or Notion?
Because artifacts that live next to the code get cited every time a trade-off has to be made. Artifacts in Google Docs or Notion get linked once, forgotten by week 3, and become orphan documents the team eventually rediscovers in month 6. Repo-resident artifacts are diffable, versioned, reviewable through the same PR process as code, and visible to every engineer joining the project. The constraint is not technical; it is operational: the artifact is in use only if the engineering team encounters it during normal work.
Who owns the engagement charter: the agency or the client?
The agency tech lead drafts it, the client product owner countersigns. This split matters because the charter is a forcing function for the kickoff conversation: drafting forces the agency to articulate what they think they were hired to do, and countersigning forces the client to either ratify or push back. A charter drafted by the agency alone risks misunderstanding the client’s priorities; a charter drafted by the client alone risks describing a problem the agency cannot solve. Joint authorship with named ownership is the right compromise.
What goes into the eval baseline document specifically?
Twenty to fifty ground-truth eval cases drawn from real production data, a written pass/fail rubric, a numeric threshold tied to a business outcome, the baseline number against the current system or stub recorded with a date and commit SHA, and a CI integration that runs the suite per-PR. The Markdown explanation lives at docs/eval-baseline.md; the cases and runner live at evals/ in the repo. Every PR’s description includes the eval delta against this baseline for the life of the agent.
How does the data audit prevent production incidents later?
By forcing the team to characterize each data source’s source of truth, schema, freshness, volume, access pattern, PII boundary, retention policy, and quality before the system is built. Most production incidents trace back to a row in this table whose PII flag, freshness SLA, or quality column was either wrong or empty. Five sources characterized at this depth beat 25 sources listed superficially. Rows that cannot be filled in are flagged red and become highest-priority follow-ups for week 2.
What is the difference between a problem narrative and a PRD?
A PRD is a specification: feature list, requirements, acceptance criteria. A problem narrative is prose: who the user is, what they do today, what is broken, why it is hard, what we are building, how we will know we are done. The narrative format is a forcing function for understanding the problem rather than enumerating features. Engineers joining the project read the narrative to remember why a trade-off is being made; they would not get the same context from a PRD. Both can exist, but the narrative is the day-1 deliverable.
Why ship an architecture decision record on day 6, not later?
Because by day 6 the team has enough context (eval baseline, data audit, problem narrative) to commit to a structure. Committing earlier would be premature; committing later means architecture is decided implicitly through code rather than explicitly through review. ADR-001 covers model selection, abstraction layer, retrieval strategy, tool-call boundary, caching, fallback, observability, and cost ceiling. Each choice is named and justified in two sentences and reviewed in a 60-minute architecture session with the client engineering lead.
What does the repo access matrix track?
Every system the agency needs access to, the access level, who has it (named), the provisioned date and provisioner, and the deprovisioning trigger. Includes model API keys, CI secrets, staging environment, production read-only access, monitoring tools, the data warehouse, and third-party tools like Linear or Slack. The matrix prevents the standard failure mode of an engagement ending and the agency still having staging access six months later. It also serves as the deprovisioning checklist when the engagement closes or a team member rotates off.
What incident classes does the escalation tree need to cover?
At minimum four: wrong-output incidents (the agent producing incorrect outputs), outages (the agent unable to respond), cost runaway (spend exceeding the alert threshold), and PII leakage (sensitive data appearing in outputs or logs). Each class has a different severity scale, response SLA, customer-comms protocol, and named on-call. Phone numbers, paging tooling, and named substitutes on each side are part of the artifact. The escalation tree is drilled in week 3 with a tabletop exercise and updated after every real incident.
How is the pricing-passthrough baseline different from a normal project budget?
It tracks unit economics, not just total spend. Each line item (model APIs, tool APIs, infra, observability) is recorded with a unit economic (cost per call, query, or session), the source of the estimate, and who pays (client direct, agency markup, or agency passthrough). The cost ceiling from the engagement charter is included with a headroom calculation. Updated monthly as actual costs land, the baseline is the input to every model-upgrade go/no-go decision and prevents the standard failure mode where the agency and the client discover in month 4 that they have different mental models of the unit economics.