Inside the SFAI Labs Operating Cadence: Weekly Demos, Evals, and Roadmap Reviews

Clients of an AI dev studio rarely see the operating cadence behind their engagement. They see Friday demos and merged PRs; the rest is invisible, because a working studio runs on rituals so reliable they recede into the background. But for any CTO evaluating an AI development partner, the cadence is the single most diagnostic artifact. It tells you what gets reviewed, what gets shipped, and what falls through the cracks.

This is a walk through the operating cadence SFAI Labs runs, from Monday plan-and-demo through Friday client Loom, described as a reusable studio operating model. The shape is portable: any senior-heavy AI team can adopt it.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why cadence is the load-bearing layer

A studio’s operating system has four layers: rituals, roles, artifacts, and tools. Roles are fixed by hiring; tools by the stack. Artifacts are downstream of rituals: the Loom demo exists because Friday demands it. The cadence is where studios compound or quietly decay.

Decay looks specific. A team skips eval review for three weeks; on the fourth, a model upgrade ships a 9-point quality drop and nobody notices for a sprint. A team replaces architecture review with Slack threads; six months later the codebase has three different LLM-call wrappers and no decision memo explaining why. A team folds the Friday demo into a status email; the client stops forwarding it internally and the engagement loses its evidentiary spine.

A working cadence is a forcing function for the artifacts a serious client should expect. The week below is the version SFAI runs: five days, five rituals, and an explicit rule about what is not on the calendar.

Monday: plan-and-demo

Cadence: weekly, 60 minutes, on-camera, full team plus founder. Attendees: every engineer on the engagement, the founder or a senior delivery lead, and (optionally) one client representative as a fly-on-the-wall. Artifacts produced: a plan-and-demo memo committed to the repo, links to merged PRs from the prior week, and a written list of this week’s PR commitments.

Monday is a single meeting that serves two functions back-to-back. First fifteen minutes: the prior week’s shipped PRs, demoed in code, not slides. Each engineer who shipped a feature pulls it up live, runs it against a staging eval, and shows the diff in the eval dashboard. Anything that did not ship gets one sentence and rolls into the current week.

The next forty-five minutes are this week’s plan. Each engineer commits, out loud and on the recording, to the specific PRs they expect to land by Friday. The plan is PR-shaped, not story-point-shaped. PRs are countable; story points are vibes. The committed list goes into a plan-2026-W19.md file in the repo, alongside the prior week’s plan-2026-W18.md for diff-level accountability.
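The plan-2026-W19.md naming reads like an ISO 8601 year and week number. Assuming that is the convention (the article does not say so explicitly), a minimal Python sketch can derive this week's and last week's file names from the Monday date. The function names are illustrative, not part of any SFAI tooling.

```python
from datetime import date, timedelta


def plan_filename(day: date) -> str:
    """Return the plan memo name for the ISO week that contains `day`."""
    # isocalendar() yields (ISO year, ISO week, ISO weekday).
    iso_year, iso_week, _ = day.isocalendar()
    return f"plan-{iso_year}-W{iso_week:02d}.md"


def plan_pair(monday: date) -> tuple[str, str]:
    """Return (last week's plan, this week's plan) so the two can be diffed."""
    return plan_filename(monday - timedelta(days=7)), plan_filename(monday)


# Monday 4 May 2026 falls in ISO week 19 of 2026.
print(plan_pair(date(2026, 5, 4)))
# ('plan-2026-W18.md', 'plan-2026-W19.md')
```

Using the ISO year rather than the calendar year keeps names consistent around New Year, when the first days of January can belong to the last ISO week of the previous year and vice versa.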

Two rules make Monday work. Plan-and-demo is a single meeting; splitting it into “retro” and “planning” loses the connective tissue. And the founder attends; founder presence is the signal that the cadence is not theatre.

What success looks like: by minute sixty, every engineer knows what they own, every shipped feature has been seen by every teammate, and the client (if present) has heard the weekly story in the team’s own voice.

Tuesday: eval review

Cadence: weekly, 45 minutes, on-camera, engineers shipping AI features plus the eval owner. Attendees: every engineer whose code touched an LLM call in the prior week, plus the team member who owns the eval test sets. Artifacts produced: an eval-review memo summarizing regressions, threshold-drift notes per test set, and a triage list of regressions to fix this week or accept with a written rationale.

Tuesday is the meeting most studios do not run, and its absence is one of the largest reasons AI engagements quietly degrade. The agenda is mechanical: open the eval dashboard, walk every shipping AI feature against its test set, and surface every metric that moved more than the threshold the team set on day one.

Three classes of regression get triaged:

Quality regressions. Accuracy, faithfulness, or rubric scores below the floor on any test set. Blocking; the feature does not ship until resolved or the case is explicitly downgraded with a written rationale.
Latency regressions. P95 latency above the contracted ceiling, usually after a prompt change or model upgrade. Often resolved by streaming, caching, or routing to a smaller model for the easy 80% of cases.
Cost regressions. Per-call cost up more than ~15% week-over-week. Usually traceable to a grown prompt, a fanned-out tool call, or a model upgrade without a cost re-baseline.
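The three regression classes above can be sketched as a small triage function. This is an illustration under stated assumptions, not SFAI's tooling: the `FeatureMetrics` fields, the units, and the example numbers are hypothetical, and only the roughly 15% week-over-week cost limit comes from the text.

```python
from dataclasses import dataclass

COST_INCREASE_LIMIT = 0.15  # the ~15% week-over-week figure from the text


@dataclass
class FeatureMetrics:
    """Hypothetical per-feature snapshot pulled from an eval dashboard."""
    name: str
    quality: float                  # score on the feature's test set, 0..1
    quality_floor: float            # floor agreed for that test set
    p95_latency_ms: float
    latency_ceiling_ms: float       # contracted ceiling
    cost_per_call: float            # this week
    cost_per_call_last_week: float


def triage(m: FeatureMetrics) -> list[str]:
    """Return the regression classes this feature falls into."""
    findings = []
    if m.quality < m.quality_floor:
        findings.append("quality (blocking)")
    if m.p95_latency_ms > m.latency_ceiling_ms:
        findings.append("latency")
    if m.cost_per_call_last_week > 0:
        increase = (m.cost_per_call - m.cost_per_call_last_week) / m.cost_per_call_last_week
        if increase > COST_INCREASE_LIMIT:
            findings.append("cost")
    return findings


# Below the quality floor and about 20% more expensive than last week.
summarizer = FeatureMetrics("summarizer", 0.90, 0.92, 1800, 2000, 0.012, 0.010)
print(triage(summarizer))  # ['quality (blocking)', 'cost']
```

An empty list corresponds to a green status in the memo; anything else becomes a line on the triage list.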

Threshold drift is the subtle category. A team shipping at 94% against a 92% floor is healthy. A team that quietly moved the floor down to 88% to keep regressions from blocking merges is in trouble, and Tuesday is where that drift gets seen, written down, and either reversed or formally accepted. The full quality system is unpacked in the AI agency quality system.
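One way to make threshold drift visible is to keep the day-one floors in the repo and compare them with the floors currently enforced. The sketch below assumes both are available as plain dictionaries keyed by test-set name; the test-set names and the storage format are hypothetical.

```python
def floor_drift(day_one: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Return test sets whose floor is lower now than on day one, with the size of the drop.

    Test sets missing from `current` are ignored here; a real review would
    probably want to flag those separately.
    """
    drift = {}
    for test_set, original_floor in day_one.items():
        floor_now = current.get(test_set)
        if floor_now is not None and floor_now < original_floor:
            drift[test_set] = round(original_floor - floor_now, 4)
    return drift


day_one_floors = {"support-faq": 0.92, "contracts": 0.90}
current_floors = {"support-faq": 0.88, "contracts": 0.90}

print(floor_drift(day_one_floors, current_floors))  # {'support-faq': 0.04}
```

A non-empty result does not mean the floor change was wrong, only that it needs a written line in the eval-review memo: reversed or formally accepted.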

What success looks like: every shipping AI feature has a written status: green, regression triaged, or threshold drift acknowledged. The memo is committed; nobody has to remember.

Wednesday: architecture review

Cadence: weekly, 60 minutes, on-camera, full senior engineering plus founder. Attendees: all senior engineers, the founder, and any junior engineer whose ADR is on the agenda. Artifacts produced: newly proposed ADRs (architecture decision records), approved or rejected ADRs from the prior week, and notes on any model-upgrade tests run since the last review.

Wednesday is the meeting that prevents the codebase from accumulating three different LLM wrappers and four conflicting retry policies. Three standing sections:

Proposed ADRs. Any engineer needing an architectural decision (vector store, retrieval strategy, prompt templating, agent framework) writes a one-page ADR (context, options, decision, consequences) before Wednesday. Approval is binary and recorded; rejection comes with a written reason.

Approved ADRs from last week. A re-read of decisions made the prior week, with any implementation findings that should amend the original. ADRs are living documents.

Model-upgrade tests. A standing section because frontier models ship monthly and quietly change the behavior of every feature that calls them. The team runs a fixed protocol (full eval suite on the new model, latency and cost re-baseline, rubric spot-check on twenty edge cases) and reviews results before any production switch. The protocol itself is an ADR.
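The model-upgrade protocol can be expressed as a completeness check that runs before the results reach Wednesday's review. This is a sketch only: the `UpgradeReport` structure and its field names are assumptions, and it checks that each protocol step was performed, not whether the results are acceptable, which remains a decision for the review itself.

```python
from dataclasses import dataclass

REQUIRED_EDGE_CASES = 20  # "twenty edge cases" in the protocol above


@dataclass
class UpgradeReport:
    """Hypothetical record of one model-upgrade test run."""
    model: str
    eval_suite_run: bool         # full eval suite executed on the candidate model
    latency_rebaselined: bool
    cost_rebaselined: bool
    edge_cases_checked: int      # rubric spot-check count


def missing_steps(report: UpgradeReport) -> list[str]:
    """List the protocol steps that have not been completed yet."""
    missing = []
    if not report.eval_suite_run:
        missing.append("full eval suite")
    if not report.latency_rebaselined:
        missing.append("latency re-baseline")
    if not report.cost_rebaselined:
        missing.append("cost re-baseline")
    if report.edge_cases_checked < REQUIRED_EDGE_CASES:
        missing.append(f"rubric spot-check ({report.edge_cases_checked}/{REQUIRED_EDGE_CASES})")
    return missing


report = UpgradeReport("candidate-model", True, True, False, 12)
print(missing_steps(report))  # ['cost re-baseline', 'rubric spot-check (12/20)']
```

An empty list means the report is complete enough to put on the agenda; it is not an approval to switch production traffic.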

Two rules. ADRs are short; one page is the target, two is the limit. And the founder attends, because architectural drift is the failure mode hardest to detect from outside the code.
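To keep ADRs short and uniform, a team might generate the skeleton rather than write it from scratch. The sketch below uses the four sections named above (context, options, decision, consequences); the `ADR-0007` numbering, the status line, and the plain-text layout are illustrative assumptions, not a format the article prescribes.

```python
ADR_SECTIONS = ("Context", "Options", "Decision", "Consequences")
ADR_STATUSES = ("proposed", "approved", "rejected")


def adr_skeleton(number: int, title: str, status: str = "proposed") -> str:
    """Return an empty one-page ADR with the four standard sections."""
    if status not in ADR_STATUSES:
        raise ValueError(f"unknown status: {status}")
    lines = [f"ADR-{number:04d}: {title}", f"Status: {status}", ""]
    for section in ADR_SECTIONS:
        lines += [section, "", "(to be written)", ""]
    return "\n".join(lines)


print(adr_skeleton(7, "Choose a vector store"))
```

Keeping the status in the file itself means the approve-or-reject outcome of Wednesday's review is recorded in the same commit history as the decision.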

What success looks like: every architectural decision is traceable to a one-page memo. New engineers can be onboarded by reading the ADR folder. The deeper structure surrounding architecture review is laid out in decoding the AI agency stack.

Thursday: deep-work day

Cadence: weekly, no meetings, full team. Attendees: none; that is the point. Artifacts produced: the week’s largest pull requests, typically the ones the engineer committed to on Monday.

Thursday is on the calendar as a no-meeting day. No standups, no architecture reviews, no client calls. Engineers work on the PRs they committed to Monday, paired with their AI coding agent of choice (Claude Code, Cursor, Codex, Aider) and a senior reviewer on Slack for blocking questions only.

Thursday exists as a named ritual rather than a default state because without an explicit no-meeting day, calendar entropy guarantees the deep-work block disappears. A studio that protects Thursday ships substantially more code per engineer per week. The rule is enforced by the founder; if a client asks for a Thursday meeting, the answer is Wednesday or Friday.

A useful pattern is the Thursday review buddy: every engineer is paired with one teammate for the day, available on Slack for sub-five-minute reviews of small diffs as they emerge. This keeps PRs small and merged the same day.

What success looks like: by Friday morning, the PRs each engineer committed to on Monday are open, reviewed, and either merged or one round of feedback away.

Friday: client demo via Loom

Cadence: weekly, 5–12 minute Loom recording, asynchronous. Attendees: one engineer or the founder records; the client receives the link. Artifacts produced: a Loom recording, a one-paragraph written summary, and links to the merged PRs covered in the recording.

Friday’s demo is not a meeting. It is a recorded, forwardable, time-stamped artifact that the client’s executive sponsor can send to a CFO without scheduling a sixty-minute call. Loom (or Tella, or any equivalent) gives the client a permanent record that survives reorgs, vendor reviews, and contract renewals.

The structure is consistent. Sixty seconds of context; what shipped and why it matters. Three to seven minutes of live walkthrough; the feature running against staging or, where contracts allow, production. One to two minutes on what ships next week, with PR-level specificity. Thirty seconds to close: links to merged PRs, the eval dashboard, contact for questions.
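As a quick arithmetic check, the four segments above add up to between 5.5 and 10.5 minutes, which sits inside the 5–12 minute window given for the recording. A short sketch, using only the durations stated in the text:

```python
# Segment lengths in seconds as (shortest, longest), taken from the structure above.
SEGMENTS = {
    "context": (60, 60),
    "walkthrough": (3 * 60, 7 * 60),
    "next week": (1 * 60, 2 * 60),
    "close": (30, 30),
}


def total_range_minutes(segments: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """Return the shortest and longest total running time in minutes."""
    shortest = sum(low for low, _ in segments.values())
    longest = sum(high for _, high in segments.values())
    return shortest / 60, longest / 60


low, high = total_range_minutes(SEGMENTS)
print(low, high)  # 5.5 10.5
```

The remaining headroom up to twelve minutes leaves room for a longer walkthrough in a heavy week without breaking the format.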

The Loom replaces the steering-committee deck that 50-person consultancies build into their cost structure. It is cheaper to produce, harder to fake, and asymmetrically useful: it forwards. Six months of Fridays is a video archive of delivery, indexed by week, browsable by anyone with the link.

One rule: if a Friday Loom cannot be recorded because nothing shipped, the founder records a one-minute Loom explaining what happened. An honest Loom about a hard week is one of the highest-trust artifacts an agency can deliver.

What the cadence deliberately excludes

What is not on the calendar is as deliberate as what is.

No daily standup. Async Slack covers the same surface area without the meeting tax.
No weekly client status meeting. Replaced by Friday Loom plus a Slack channel.
No PM check-in. Monday is the planning meeting.
No quarterly business review. Replaced by the rolling Loom archive plus a written quarterly memo.
No all-hands under thirty engineers. Monday is the all-hands.

Calendar minimalism is the point. Each ritual that survives must justify itself by the artifact it produces.

How to evaluate a vendor cadence in 15 minutes

A CTO can audit a vendor’s cadence in fifteen minutes by asking for five artifacts. The vendor either has them or does not.

Last Monday’s plan memo. A real cadence has a plan-2026-W18.md in a repo. A studio without one will offer a deck.
Last Tuesday’s eval review notes. A studio that runs eval review has a per-feature status memo. A studio that does not will explain that evals are “ongoing.”
A recent ADR. A studio with architecture review has a folder of one-page memos. A studio without one will send a Lucidchart diagram from 2024.
Last Friday’s Loom. A studio with a real Friday demo has a six-month archive. A studio without one will offer a sample steering-committee deck.
What does Thursday look like? A studio with a deep-work day has a documented no-meeting policy. A studio without one will describe a calendar full of meetings.

Five artifacts, fifteen minutes, no proposal required. The cadence either compounds week over week or it does not.

Frequently asked questions

What is an AI studio operating cadence?

An AI studio operating cadence is the set of weekly rituals (Monday plan-and-demo, Tuesday eval review, Wednesday architecture review, a protected Thursday deep-work day, and a Friday client demo recorded as a Loom) that govern how a senior-heavy AI team plans, ships, reviews, and reports. Each ritual produces a re-runnable artifact: plan memo, eval memo, ADR, merged PRs, Loom. The cadence is load-bearing because it forces the artifacts a serious client should expect.

Why does Tuesday’s eval review matter?

It detects quiet degradation. Without it, threshold drift, prompt regressions, and model-upgrade quality drops can ship to production for weeks unnoticed. A working eval review walks every shipping AI feature against its test set, flags any metric that moved more than the agreed threshold, and produces a triage list of regressions to fix or formally accept. Most AI engagements that fail in production fail because no one was running this meeting.

What is an ADR in an AI engineering studio?

An ADR (architecture decision record) is a one-page memo capturing context, options, decision, and consequences for any significant architectural choice. Typical examples: vector store, retrieval strategy, prompt templating, agent framework, model-upgrade test protocol. ADRs live in the repository, are versioned, and are reviewed at Wednesday’s meeting. A studio with a real ADR practice can onboard new engineers by handing them the ADR folder.

Why a Loom demo on Friday instead of a meeting?

A Loom replaces the synchronous Friday status call with a recorded, forwardable artifact. The client’s executive sponsor can send it to a CFO or board member without scheduling a meeting. It is cheaper to produce, harder to fake, and asymmetrically useful: it forwards. A six-month archive is a defensible record of delivery that survives reorgs and contract renewals.

How is the studio cadence different from agile or scrum?

The cadence shares scrum’s weekly rhythm but inverts several defaults. Planning and review collapse into one Monday meeting tied to merged PRs rather than story points. Standups are async and written. It adds a dedicated eval review day and a dedicated architecture review day. And it adds a protected no-meeting deep-work day. The result is closer to a weekly engineering review than a sprint.

Does a small AI studio need a project manager to run this cadence?

A senior-heavy studio of eight to twelve engineers runs this cadence with no dedicated PM. Monday is planning; eval and architecture meetings are run by their owners; Thursday is no-meeting; Friday is one engineer with Loom. Information a PM would relay is communicated directly by the engineer doing the work, on the recording. Removing the PM layer is a velocity decision; it eliminates a translation step between the people writing the code and the client receiving it.

How does the cadence handle model upgrades like a new Claude or GPT release?

Model upgrades are handled by a standing section of Wednesday’s architecture review. The team runs a fixed protocol (full eval suite on the new model, latency and cost re-baseline, rubric spot-check on twenty edge cases) and reviews results before any production switch. The protocol is itself an ADR. Without this discipline, a frontier-model upgrade can quietly change the behavior of every feature, and the regression often surfaces only after a client complaint.

What signals that a vendor’s cadence is theatre rather than real?

The vendor sends a status deck instead of a plan memo. They cannot produce a recent eval review memo. Architecture decisions live in Slack threads rather than ADRs. The Friday demo is a thirty-minute call rather than a ten-minute Loom. Thursday is full of meetings. The founder does not attend Monday. Any one is a yellow flag; three or more is red. A studio with a real cadence can produce all five artifacts on fifteen minutes’ notice.
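The flag rule can be written down as a small scoring function. The signal labels below paraphrase the six signals in the answer above. The text does not say how to rate exactly two signals, so treating two as still yellow is an assumption of this sketch, as is the "no flags" label for zero.

```python
THEATRE_SIGNALS = (
    "status deck instead of plan memo",
    "no recent eval review memo",
    "architecture decisions in Slack threads",
    "Friday demo is a thirty-minute call",
    "Thursday is full of meetings",
    "founder does not attend Monday",
)


def rate_vendor(observed: set[str]) -> str:
    """Rate a vendor from the set of theatre signals observed during the audit."""
    unknown = observed - set(THEATRE_SIGNALS)
    if unknown:
        raise ValueError(f"unrecognized signals: {sorted(unknown)}")
    count = len(observed)
    if count >= 3:
        return "red"
    if count >= 1:
        return "yellow"  # assumption: two signals are rated the same as one
    return "no flags"


print(rate_vendor({"Thursday is full of meetings"}))  # yellow
```

Writing the signals down as a fixed list keeps the audit comparable across vendors, since every studio is checked against the same six questions.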

How long does it take to install this cadence in an existing team?

Two to three weeks. Week one: Monday plan-and-demo and the weekly plan memo. Week two: Tuesday eval review against whatever test sets exist. Week three: Wednesday architecture review and the first three ADRs. Thursday is enforced by blocking the calendar. Friday Loom starts the first Friday after Monday is in place. Each week’s artifacts make the next week’s meetings more useful.
