The AI agency engagement status report should fit on one page, take fifteen to thirty minutes to write, and replace the 30-slide weekly deck the engagement inherited from a different era of consulting. A 30-slide deck is an artifact of a delivery model where the agency’s job was to demonstrate effort and the buyer’s job was to interpret it. A 2026 AI engagement inverts the relationship: effort is cheap (Cursor, Claude Code, Codex CLI), trust is the bottleneck, and the only signal the buyer needs each week is whether the system is meeting its eval thresholds at predictable cost. Everything else is noise the buyer’s executives will not read and the buyer’s engineers will not act on.
This piece presents a seven-section weekly status template, each section prescribed in detail, with formatting rules that constrain the agency to honesty. The template is downstream of the AI agency manifesto’s commitment to evals as the contract and is shaped by the operating cadence described in the AI agency operating system. Use it verbatim, modify on the margin, but do not let it grow past one page. The page constraint is the discipline.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why the 30-slide weekly deck is the wrong artifact for AI work
- The seven sections of the one-page status report
- Section 1: shipped (PRs merged, with eval delta)
- Section 2: eval delta vs prior week
- Section 3: cost-per-call delta
- Section 4: blockers, named with owner
- Section 5: risks, with mitigation
- Section 6: next-week plan
- Section 7: link to full traces
- Cadence, distribution, and read-time
Why the 30-slide weekly deck is the wrong artifact for AI work
The 30-slide weekly status deck was an artifact of large-firm consulting in an era when the buyer was paying for analysis and the analysis itself had to be visualized for a non-technical sponsor. The structure makes sense in that context: an executive summary, a workstream tracker, a stakeholder map, a risk register, a financial dashboard, a Gantt chart, an appendix with detailed analysis. The deck demonstrated effort, framed the work for a sponsor who could not read code, and produced a paper trail.
Two things changed for AI engagements. First, the work product is software, and the executive summary of software is a passing or failing eval suite: a single number, with a delta from the prior week. The deck was structured to compress complexity into visualization, but the eval delta has already done the compression. There is nothing for the deck to add. Second, the buyer’s sponsor is no longer a non-technical executive who needs the work translated. The 2026 buyer sponsor, by procurement default, is an engineering leader (CTO, head of platform, head of AI/ML) who can read code and read traces. Translating for them is a cost the engagement is paying without earning.
The compounding effect is that the 30-slide deck consumes 8–12 hours of agency time per week to produce, usually pulling a senior engineer off the actual work to populate slides, and earns 5 minutes of the buyer’s time, mostly spent on the title slide and the risk register. The signal-to-effort ratio is catastrophic. A one-page status report consumes 15–30 minutes of agency time per week and earns 3–5 minutes of buyer attention because the page is dense, scannable, and contains only signal. The buyer reads it. The engineers stay on the work.
The seven sections of the one-page status report
The template has exactly seven sections, in this order, and nothing else.
- Section 1: shipped. PRs merged this week, with eval-delta per PR.
- Section 2: eval delta vs prior week. Aggregate eval-suite pass-rate change, by suite.
- Section 3: cost-per-call delta. Mean and p95 cost-per-call this week vs prior week, by feature.
- Section 4: blockers. Named blockers with named owner and target unblock date.
- Section 5: risks. Risks with mitigation, severity, and owner.
- Section 6: next-week plan. Three to five concrete commits or eval-threshold targets.
- Section 7: link to full traces. A single hyperlink to the engagement’s tracing dashboard for the buyer to drill into.
No section eight. No “executive summary,” no “team highlights,” no “client testimonial.” The constraint is the discipline.
Formatting rules.
- One page, period. If the report runs to a second page, the agency cuts content until it fits. Discipline is the value.
- Tables and lists, not prose. Sections one through three are tables. Sections four through six are bullet lists, not narrative paragraphs; blockers and risks name their owners, and blockers carry dates.
- Numbers, not adjectives. No “significant improvement” or “minor regression”; every claim is a number with a delta.
- Owner names, not initials. The buyer should know who to ping.
- Live links, not screenshots. The buyer should be able to click into the traces, the eval-log, the PR, the cost dashboard.
Section 1: shipped (PRs merged, with eval delta)
A table with five columns: PR title, PR ID, author (named engineer), eval suite touched, eval-delta on that suite. Maximum eight rows. If the team merged more than eight PRs, list the eight most consequential and link to a complete log in section seven.
The “eval-delta on that suite” column is the load-bearing column. For each merged PR, what did it do to the eval suite? Three possible entries: a number (e.g., “+0.012 faithfulness”), “no change” (PR was infrastructure or refactor), or “regression caught” (PR was reverted or hot-fixed). The eval-delta column is the engagement’s evidence that the PRs merged this week were value-additive rather than effort-additive. A PR with no eval-delta either does not need one (refactor) or quietly indicates that the engagement merged code without proof of quality improvement, which is itself a signal worth surfacing.
A useful row from a real engagement: “PR #847 / Switch retrieval to Voyage-3-Lite for L1 cache / [Engineer Name] / retrieval-eval-suite / +0.027 cosine similarity, −18% inference cost”. Five columns, one line, complete signal: the engineer shipped a model swap that improved retrieval quality and cut cost. The buyer’s CTO can read that line in 4 seconds and ask zero follow-up questions.
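The section-1 table can be sketched as a small formatter over the merged-PR list. This is a minimal illustration, not a prescribed implementation: the `MergedPR` record, its field names, and the rule for ranking "most consequential" (reverted PRs first, then largest absolute delta) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ROWS = 8  # the template caps section 1 at eight rows


@dataclass
class MergedPR:  # hypothetical record; field names are illustrative
    pr_id: int
    title: str
    author: str
    suite: str
    eval_delta: Optional[float] = None  # signed change; None means not measured or unchanged
    metric: str = ""
    reverted: bool = False


def delta_cell(pr: MergedPR) -> str:
    """Return one of the three allowed entries for the eval-delta column."""
    if pr.reverted:
        return "regression caught"
    if pr.eval_delta is None:
        return "no change"
    return f"{pr.eval_delta:+.3f} {pr.metric}".strip()


def shipped_rows(prs: list[MergedPR]) -> list[str]:
    """Keep at most eight rows, ranked by an assumed notion of consequence."""
    ranked = sorted(
        prs,
        key=lambda p: (p.reverted, abs(p.eval_delta or 0.0)),
        reverse=True,
    )
    return [
        f"#{p.pr_id} | {p.title} | {p.author} | {p.suite} | {delta_cell(p)}"
        for p in ranked[:MAX_ROWS]
    ]
```

Anything beyond the eighth row is dropped from the table and belongs in the complete log linked from section seven.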
Section 2: eval delta vs prior week
A table with four columns: eval suite name, threshold, this week’s pass rate, delta vs prior week. One row per active eval suite (typically 3–6 suites in a mature engagement). The delta column is signed and sized: “+0.4pp” or “−1.2pp”, not “improved” or “stable.”
The section is the engagement’s bottom line for any week. If every suite is above threshold and the delta is non-negative, the engagement is on track and the executive summary writes itself. If a suite is below threshold or the delta is meaningfully negative, sections four and five (blockers and risks) explain why and what is being done about it.
Two rules. The threshold column shows the contractual threshold from the SOW, not the agency’s internal target. The buyer agreed to this number; the agency is held to it. The pass-rate column is the pass rate of a specific eval run, with the run ID linked. Every eval-suite report is reproducible: the buyer can re-run the same suite against the same commit and reach the same number.
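The arithmetic behind a section-2 row is small enough to sketch. The example below assumes a hypothetical `SuiteRun` record holding pass counts for one linked run; the field names and the output dictionary shape are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class SuiteRun:  # hypothetical record for one eval-suite run
    suite: str
    threshold: float  # contractual threshold from the SOW, as a fraction
    run_id: str       # linked in the real report so the run can be reproduced
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def eval_row(this_week: SuiteRun, prior_week: SuiteRun) -> dict:
    """Build one table row with a signed, sized delta in percentage points."""
    delta_pp = (this_week.pass_rate - prior_week.pass_rate) * 100
    return {
        "suite": this_week.suite,
        "threshold": f"{this_week.threshold:.1%}",
        "pass_rate": f"{this_week.pass_rate:.1%} (run {this_week.run_id})",
        "delta": f"{delta_pp:+.1f}pp",
        "below_threshold": this_week.pass_rate < this_week.threshold,
    }
```

The `below_threshold` flag is not a table column; it is a convenience for deciding whether the blockers and risks sections owe the buyer an explanation.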
Section 3: cost-per-call delta
A table with four columns: feature, mean cost-per-call, p95 cost-per-call, weekly delta. The cost data comes from the model-provider usage exports (Anthropic, OpenAI, Google) tagged by feature via project tags or metadata, rolled up weekly. One row per feature; typically 4–8 features per engagement.
The section’s role is to surface cost regressions before they become invoice surprises. A retrieval-prompt rewrite that improves quality by 0.04 faithfulness but doubles cost-per-call may not be a net win for the buyer; the cost line surfaces the trade. If a feature’s cost drifted upward over three weeks without an offsetting eval improvement, the section is the buyer’s notice that something needs investigation.
The section also disciplines the agency’s engineering choices. If every section-1 PR ships an eval improvement but the cost-per-call drifts upward week-over-week, the agency is buying quality with money; sometimes worth it, sometimes not. The buyer should be the one deciding when it is worth it. The section makes the trade visible.
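A section-3 row reduces to a mean, a p95, and a week-over-week change. The sketch below assumes per-call costs in USD have already been extracted from the provider usage export and grouped by feature; the function names are illustrative, and nearest-rank is only one of several percentile definitions a team might choose.

```python
import math
from statistics import mean


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: smallest value with at least 95% of calls at or below it."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def cost_row(feature: str, this_week: list[float], prior_week: list[float]) -> dict:
    """Summarize one feature's per-call costs and the change in mean cost."""
    mean_now, mean_prev = mean(this_week), mean(prior_week)
    return {
        "feature": feature,
        "mean": round(mean_now, 4),
        "p95": round(p95(this_week), 4),
        "weekly_delta_pct": round((mean_now - mean_prev) / mean_prev * 100, 1),
    }
```

Reporting p95 alongside the mean matters because a handful of expensive calls can leave the mean almost unchanged while the tail, and the invoice, grows.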
Section 4: blockers, named with owner
A short bullet list (zero to four items, ideally) of active blockers preventing the engagement from advancing. Each bullet has four parts.
- What’s blocked. A specific deliverable or eval-threshold pursuit.
- Owner. A named human, agency-side or buyer-side.
- What unblocks it. A concrete action, not a vague request.
- Target unblock date. A date, not “ASAP.”
A blocker without all four parts is a complaint, not a blocker. The agency’s job is to convert ambient complaints into action items with owners and dates. If the buyer’s data team owes the agency access to a production database, the bullet says: “Eval-suite extension to live ticketing data / Owner: [Buyer-side data lead] / Unblocks: read-only Postgres credential issued to agency service account / Target: [Date 5 business days out].”
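The four-part rule is easy to enforce mechanically before a bullet reaches the page. This is a minimal sketch assuming a hypothetical `Blocker` record; in practice the fields would come from the engagement's tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Blocker:  # hypothetical record; field names are illustrative
    blocked: str = ""              # deliverable or eval-threshold pursuit
    owner: str = ""                # a named human, agency-side or buyer-side
    unblocks: str = ""             # the concrete unblocking action
    target: Optional[date] = None  # a real date, never "ASAP"


def missing_parts(b: Blocker) -> list[str]:
    """Return the names of any of the four required parts that are absent."""
    missing = [
        name for name in ("blocked", "owner", "unblocks")
        if not getattr(b, name).strip()
    ]
    if b.target is None:
        missing.append("target")
    return missing


def render(b: Blocker) -> str:
    """Format a complete blocker, or refuse if it is still only a complaint."""
    gaps = missing_parts(b)
    if gaps:
        raise ValueError(f"not a blocker yet, missing: {', '.join(gaps)}")
    return (
        f"{b.blocked} / Owner: {b.owner} / Unblocks: {b.unblocks} "
        f"/ Target: {b.target.isoformat()}"
    )
```

Typing the target as a `date` rather than a string is the design choice that rules out "ASAP" at the data level.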
The blocker section should be honest about agency-side blockers as well. If an agency engineer is on PTO and the eval-threshold work has stalled, the bullet says so. The buyer’s tolerance for honest blockers is high; the buyer’s tolerance for surprise stalls discovered via missed milestones is low.
Section 5: risks, with mitigation
A short bullet list (zero to four items) of risks that are not yet blockers but might become blockers. Each bullet has four parts.
- Risk. What might go wrong.
- Severity. Low, medium, or high, with a one-line justification.
- Mitigation. What is being done now to prevent it.
- Owner. Who owns the mitigation.
The section’s purpose is to give the buyer’s leadership lead time. Risks named in week 6 with low severity are easy to mitigate; the same risks discovered in week 11 with high severity are engagement-threatening. The discipline is to name risks early and proactively, even when they feel speculative. The cost of a named risk that did not materialize is a single bullet line that stops appearing on the report; the cost of an unnamed risk that materialized is a postmortem.
A useful framing: any model deprecation with a published end-of-life date in the next six months is automatically a medium-severity risk on the engagement’s status report until the migration is shipped. Provider deprecation timelines are public and the engagement should be tracking them. The same pattern applies to deprecation of agent frameworks, embedding models, or vector database SDKs.
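The deprecation rule can be sketched as a date filter. The example assumes the engagement maintains its own mapping of dependency names to published end-of-life dates; the names used in any call are hypothetical, and "six months" is interpreted here as six calendar months from today.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def deprecation_risks(
    eol_dates: dict[str, date], migrated: set[str], today: date
) -> list[str]:
    """List unmigrated dependencies whose end-of-life falls within six months."""
    cutoff = add_months(today, 6)
    return [
        f"Risk: {name} reaches end-of-life on {eol.isoformat()} / Severity: medium"
        for name, eol in sorted(eol_dates.items())
        if name not in migrated and today <= eol <= cutoff
    ]
```

A dependency drops off the list once it is added to `migrated`, which matches the rule that the risk stays on the report until the migration is shipped. Mitigation and owner still have to be authored by a human.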
Section 6: next-week plan
A bullet list of three to five concrete commitments for the coming week. Each bullet is a specific PR target (“ship PR #N: faithfulness improvement on edge-case suite, target +0.02”), a specific eval-threshold target (“clear ≥0.90 relevance on customer-support suite v3”), or a specific operational target (“stand up tracing for the new feature in production, p95 propagation latency under 2 seconds”).
The rule is that each bullet has to be falsifiable by next week’s status report. If the bullet says “make progress on the retrieval system,” next week’s report cannot tell the buyer whether the bullet was met. If the bullet says “ship PR #N with +0.02 faithfulness on the retrieval eval suite,” next week’s report can. The falsifiability rule does the work of converting agency intent into agency commitment, and converts the next-week plan into a tool the buyer can use to hold the engagement accountable.
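One way to make the falsifiability rule concrete is to store each commitment with a numeric target, so that next week's report can grade it without interpretation. The `Commitment` record and `review` function below are illustrative assumptions, and they cover only the eval-delta style of commitment.

```python
from dataclasses import dataclass


@dataclass
class Commitment:  # hypothetical record for one next-week bullet
    statement: str       # the bullet as written in the plan
    suite: str           # eval suite the target is measured on
    metric: str
    target_delta: float  # minimum signed improvement promised


def review(c: Commitment, observed_delta: float) -> str:
    """Turn last week's bullet into a met or missed line for this week's report."""
    status = "met" if observed_delta >= c.target_delta else "missed"
    return (
        f"{c.statement}: {status} "
        f"(target {c.target_delta:+.2f}, observed {observed_delta:+.2f} {c.metric})"
    )
```

A bullet such as "make progress on the retrieval system" cannot be expressed in this record at all, because it has no `target_delta` to compare against. That is the point of the rule.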
Section 7: link to full traces
A single line at the bottom of the page, with a hyperlink to the engagement’s tracing dashboard (LangSmith, Langfuse, Helicone, Arize, or a custom platform). The link is scoped to the buyer; they can click it and see every model call from the production system this week, with inputs, outputs, latency, cost, and the eval-suite annotation.
The link’s purpose is to give the buyer’s engineers an escape hatch from the one-page report. If a buyer-side engineer wants to understand a specific PR’s behavior in production, they click the link, filter by feature and time range, and see the actual traces. The agency does not need to anticipate every drill-down the buyer might want; the agency provides the link and gets out of the way.
The link is also a transparency commitment. An agency that is reluctant to share live trace access with the buyer is signaling something the buyer should investigate.
Cadence, distribution, and read-time
The status report is generated and distributed every Friday by 17:00 in the buyer’s timezone. Distribution is to a fixed list of buyer-side recipients (engineering sponsor, business sponsor, two named engineers) and the agency-side leads. The report is a single shared document (a Notion page, Markdown in the buyer’s repository, or a PDF rendered from the same), not a slide deck and not an email body that buries the format.
Buyer read-time targets four minutes. If the buyer’s engineering sponsor cannot read the report in four minutes, the page is too dense or the sections are misordered. Agency authoring time targets 30 minutes. If the report takes longer to author than that, the team is either using the report as a reflection exercise (good but not the report’s purpose) or the underlying data is not yet instrumented (which is itself a finding worth fixing).
A discipline this template enforces: every section is generated from the engagement’s source-of-truth systems, not authored from memory. Section 1 is a query against the merged-PR list. Section 2 is a query against the eval-log database. Section 3 is a query against the model-provider usage export. Sections 4–6 are authored, but they reference live tickets in the engagement’s tracker. Section 7 is a static link. When the report is generated rather than written, the agency’s authoring time drops to validation rather than composition, and the report’s accuracy increases because there is no transcription step that introduces errors.
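The generate-then-validate workflow can be sketched as an assembler that pulls each section from a callable and refuses to emit a report that breaks the template. The `build_report` function and the 55-line budget are assumptions for illustration; a line count is only a rough proxy for one rendered page and should be tuned to the actual renderer.

```python
from typing import Callable

PAGE_LINE_BUDGET = 55  # hypothetical proxy for "one page"

# Each section source is a zero-argument callable returning that section's lines,
# for example a query against the merged-PR list or the eval-log database.
SectionSource = Callable[[], list[str]]


def build_report(sections: list[tuple[str, SectionSource]]) -> str:
    """Assemble the seven sections in order and enforce the page constraint."""
    if len(sections) != 7:
        raise ValueError("the template has exactly seven sections")
    lines: list[str] = []
    for title, source in sections:
        lines.append(title)
        lines.extend(source())
    if len(lines) > PAGE_LINE_BUDGET:
        raise ValueError(f"report is {len(lines)} lines; cut content to fit one page")
    return "\n".join(lines)
```

Failing loudly on an over-long report, rather than shrinking the font or spilling to a second page, keeps the cutting decision with a human.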
The one-page status report is, in the end, an assertion about what is worth saying weekly in a 2026 AI engagement. If a fact does not fit one of the seven sections, it is either not worth saying weekly or it belongs somewhere else (a postmortem, an architectural note, the structured staffing rhythm of the engagement). Constrain the report. The constraint is the value.
Frequently asked questions
Why should an AI agency status report fit on one page?
Because the page constraint forces the agency to surface only signal. A 30-slide deck demonstrates effort; a one-page report communicates state. The buyer’s sponsor reads one page in four minutes and the deck in zero.
What are the seven sections of the one-page status report?
Shipped (PRs with eval delta), eval delta vs prior week, cost-per-call delta, blockers (named with owner), risks (with mitigation), next-week plan (falsifiable), and a link to full traces.
Why is the eval-delta column the load-bearing column of the shipped section?
Because it provides evidence that merged PRs were value-additive rather than effort-additive. A PR without an eval-delta either does not need one (refactor) or quietly indicates code merged without proof of quality improvement, which is itself a signal.
What replaces the executive summary in the one-page format?
The eval-delta table in section two. If every suite is above threshold and the delta is non-negative, the engagement is on track and the summary writes itself. If not, the blockers and risks sections explain why.
How often is the status report distributed?
Every Friday by 17:00 buyer-timezone. Distribution is to a fixed list (engineering sponsor, business sponsor, two named engineers) and agency-side leads, as a shared document, not an email body or slide deck.
How long should the report take the agency to write?
Thirty minutes if the underlying systems are well-instrumented. The three tabular sections are queries against the merged-PR list, the eval log, and the cost dashboard; only blockers, risks, and next-week plan are authored manually.
What is the falsifiability rule for the next-week plan?
Each bullet must be checkable by next week’s report. “Make progress on retrieval” is not falsifiable; “ship PR #N with +0.02 faithfulness on the retrieval eval suite” is. Falsifiability converts agency intent into agency commitment.
Why does the report include a link to live traces?
Because no one-page summary can anticipate every drill-down the buyer’s engineers might want. The link is the buyer’s escape hatch; they click into LangSmith, Langfuse, Helicone, or Arize and see every production call with inputs, outputs, latency, cost, and eval annotation.
What blockers and risks belong on the report?
Active blockers with a named owner, target unblock date, and concrete unblocking action. Risks with severity, mitigation, and owner. Anything without those four parts is a complaint, not a blocker, and should not appear.
Where does this template fit in the AI agency manifesto framework?
It is the weekly operational expression of the manifesto’s eval-as-the-contract commitment. The eval thresholds in section two are the contract; the merged-PR eval-deltas in section one are the receipts; sections three through seven keep the engagement honest about cost, blockers, and risk.
Arthur Wandzel