Rethinking the AI Agency Project Manager Role in an LLM-First Stack

The traditional agency project manager (the one who runs the Monday status, grooms the JIRA backlog, and assembles the Friday client deck) is the wrong shape for an LLM-first engagement. The role does not disappear; the work it should be doing has migrated. What an AI agency now needs is an artifact orchestrator: someone who owns the eval suite, curates the client-context bundle, edits the prompt registry, and facilitates regression triage. The human-coordination work stays. The JIRA-status work goes away.

This is a spoke under the AI agency manifesto, which argues that a 2026 AI development agency is a forward-deployed engineering team, not a strategy practice. If that thesis is right, the PM role has to change shape. This piece is the decomposition.

Why the legacy PM shape stopped fitting

The agency PM role was designed for a delivery shape that no longer describes AI work. The classical 2018-era SOW assumed scope was decided up front, decomposed into tickets, executed by engineers, and reported back in weekly status increments. The PM owned three artifacts: the JIRA board, the status report, the change-order log. The role was load-bearing because engineering work was deterministic enough that “where are we against the plan” had a numeric answer.

LLM-first work breaks that assumption at the artifact level. The system being built is non-deterministic. Its quality is a distribution over inputs, not a checkbox against a spec. Its behavior changes when the model is silently updated, and frontier models update continuously. McKinsey’s State of AI put adoption at 78% of organizations in early 2025; what those organizations buy is not “build feature X by date Y” but “operate a probabilistic system that meets a quality bar against an evolving workload.”

That shift breaks the PM artifacts in three places at once. JIRA tickets cannot represent “the regulatory-eval threshold went from 84% to 91%.” Status meetings cannot replace a Friday demo against a fixed eval bar. Change orders cannot price the discovery that the agent’s tool-use depth has to go from two hops to five. Each of these is a measurement that happens in the artifact (the eval suite, the trace viewer, the prompt registry), not in the project plan.

The Standish Group’s CHAOS dataset has held a 30% software project failure rate steady for two decades, with the cause overwhelmingly misaligned scope, missed requirements, and stakeholder churn: in short, coordination breakdown. AI engagements inherit that base rate and compound it with continuous re-decisioning. A different shape is required. Not less coordination; more, applied to different objects.

What the role keeps

It is a mistake to read this as “agencies do not need PMs anymore.” Several pieces of the legacy role are more important than before:

  • Stakeholder cartography. Who at the buyer signs off on the eval bar, who owns the data, who blocks at security review, who is the human-in-the-loop for ambiguous outputs: this is human-coordination work, mapped in week one.
  • Cadence design. Setting the rhythm (when the demo runs, who attends, what happens when an eval threshold is missed) is judgement work, not status-running.
  • Risk and dependency calling. “We need legal to clear PII before we wire retrieval over the customer warehouse” is the kind of cross-functional call senior engineers under-prioritize and PMs catch.
  • Scope-boundary writing. Translating “the agent handles account questions” into “handles billing, address changes, plan upgrades; defers refunds, escalations, disputes; never modifies state without explicit confirmation” is product-PM work that prevents the dispute spiral.
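
A scope boundary like the one in the last bullet can also be written down as data, so that it is checkable rather than only readable. The following is a minimal sketch, not a prescribed format: the `ScopeBoundary` class, the intent names, and the choice of which intents count as state-changing are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    HANDLE = "handle"
    DEFER = "defer"
    OUT_OF_SCOPE = "out_of_scope"


@dataclass(frozen=True)
class ScopeBoundary:
    """Hypothetical structured form of a written scope boundary."""
    handles: frozenset[str]
    defers: frozenset[str]
    # Assumption: these handled intents change account state.
    state_changing: frozenset[str] = field(default_factory=frozenset)

    def route(self, intent: str) -> Route:
        # Deferral wins if an intent was listed in both sets by mistake.
        if intent in self.defers:
            return Route.DEFER
        if intent in self.handles:
            return Route.HANDLE
        return Route.OUT_OF_SCOPE

    def may_modify_state(self, intent: str, confirmed: bool) -> bool:
        """Allow a state change only for a handled intent with explicit confirmation."""
        return (
            self.route(intent) is Route.HANDLE
            and intent in self.state_changing
            and confirmed
        )


# Illustrative boundary for an account-questions agent.
ACCOUNT_AGENT = ScopeBoundary(
    handles=frozenset({"billing", "address_change", "plan_upgrade"}),
    defers=frozenset({"refund", "escalation", "dispute"}),
    state_changing=frozenset({"address_change", "plan_upgrade"}),
)
```

Each line of the structure can then become an eval input, for example asserting that a refund request is deferred and that a plan upgrade without confirmation leaves state untouched.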

These four threads are why the role survives. They are also what most agencies under-staff because the legacy role spent its calendar on JIRA grooming and status decks.

What the role sheds

What goes away is the stack of activities that track determinate progress against a decomposed plan:

  • Sprint-by-sprint ticket grooming. When quality is a distribution, “tickets closed” stops correlating with “the system got better.” Ship-against-eval is the truer signal.
  • Weekly written status as the primary cadence. The Friday demo against a fixed eval set replaces it. Buyers see the system run, see which inputs improved, see which regressed.
  • Per-feature change-order logs. Discovery is constant in AI work. Change-order accounting at the feature level produces a dispute spiral. Scope is fixed at the artifact level (eval set, latency budget, failure-mode taxonomy) and re-baselined when those move.
  • Status-deck assembly. Slides built from JIRA exports describe the project plan, not the system under construction. The deck habit consumes senior calendars and produces nothing a five-minute demo would not show better.

The cleaner build-vs-outsource framing for what the buyer is purchasing is in AI agency vs in-house team decision. Once that framing lands, the legacy PM role falls out of the SOW naturally.

The four artifact-orchestrator functions

What the role becomes is an orchestrator over four artifacts, each load-bearing in an LLM-first delivery. None are JIRA boards. All four are checked into version control, owned by a single person, and updated weekly. The functions overlap with what a senior engineer can do but rarely has the calendar for, and with what a product manager can do but rarely has the engineering literacy for.

Function 1: Eval suite and trace ownership

The eval suite is the contract between buyer and agency. Once the suite exists, someone has to own it day to day. That someone is the orchestrator. The work:

  • Curating the input set as new failure modes appear in production traces.
  • Versioning the suite alongside prompts and code so a regression caught next quarter can be traced to the commit that introduced it.
  • Re-setting the threshold with the buyer as the system matures; the threshold is a ratchet, not a constant.
  • Owning the trace viewer (Langfuse, LangSmith, Arize, Helicone, or rolled-in OpenTelemetry) and pulling representative failure traces into the weekly demo.
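
To make "versioned suite with a moving threshold" concrete, here is a small sketch of how a pass rate, a buyer-agreed bar, and a new threshold could be represented. The class and field names are hypothetical, and the rule that the threshold may rise but never fall is an assumption about how an agency might choose to define the ratchet, not something a particular tool enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    passed: bool


@dataclass(frozen=True)
class EvalSuite:
    version: str
    threshold: float  # minimum pass rate agreed with the buyer, 0.0 to 1.0

    def pass_rate(self, results: list[EvalCase]) -> float:
        if not results:
            raise ValueError("an eval run needs at least one case")
        return sum(case.passed for case in results) / len(results)

    def meets_bar(self, results: list[EvalCase]) -> bool:
        return self.pass_rate(results) >= self.threshold

    def ratchet(self, new_threshold: float, new_version: str) -> "EvalSuite":
        """Return a new suite version; assumed rule: the bar never moves down."""
        if new_threshold < self.threshold:
            raise ValueError("threshold is a ratchet: it cannot be lowered")
        return EvalSuite(version=new_version, threshold=new_threshold)


# Illustrative run: 9 of 10 cases pass.
suite = EvalSuite(version="v1", threshold=0.84)
run = [EvalCase(f"case-{i}", passed=(i < 9)) for i in range(10)]
tightened = suite.ratchet(0.91, "v2")

print(suite.meets_bar(run))      # True: 0.90 clears the 0.84 bar
print(tightened.meets_bar(run))  # False: the same run misses the 0.91 bar
```

The same run passing one suite version and failing the next is exactly the kind of status a ticket count cannot express.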

This is the artifact that replaces the JIRA board as the primary status object. The deeper practice is covered in “stop scoping AI projects in features, scope them in evaluations” and in “the AI agency quality system.”

Function 2: Client-context curation

The largest performance lever after model choice is the quality of the context window: system prompts, retrieved documents, tool descriptions, few-shot examples, and the structured representation of buyer-specific knowledge such as domain glossaries, escalation policies, and edge cases. The buyer owns that knowledge. They cannot hand it over in one document. It accumulates, contradicts itself, goes stale.

Client-context curation converts messy knowledge into a structured, versioned bundle the system consumes reliably:

  • Discovery interviews that surface the rules nobody wrote down, the ones in a senior CSM’s head.
  • A living glossary, often a YAML file in the repo, pinning buyer-specific terms.
  • Mapping each point where buyer knowledge is injected into a prompt back to its source, so that when the buyer changes a refund policy, the orchestrator knows which prompt, which retrieval source, and which eval input need updating.
  • Owning the spec sheet for persona, tone, and refusal posture in a single canonical document the buyer signs off on.
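
The source-to-consumer mapping described in this list can be sketched as a small lookup: for each piece of buyer knowledge, record which prompts, retrieval sources, and eval inputs depend on it. Everything below is hypothetical, including the `ContextSource` structure and the file and prompt names; a real bundle would more likely live in a versioned file in the repo.

```python
from dataclasses import dataclass, field


@dataclass
class ContextSource:
    """One piece of buyer knowledge and everything in the system that consumes it."""
    name: str
    prompts: set[str] = field(default_factory=set)
    retrieval_sources: set[str] = field(default_factory=set)
    eval_inputs: set[str] = field(default_factory=set)


def impact_of_change(
    sources: dict[str, ContextSource], changed: str
) -> dict[str, list[str]]:
    """List what has to be reviewed when the named buyer policy changes."""
    source = sources[changed]  # a KeyError here means the policy was never mapped
    return {
        "prompts": sorted(source.prompts),
        "retrieval_sources": sorted(source.retrieval_sources),
        "eval_inputs": sorted(source.eval_inputs),
    }


# Hypothetical map, for illustration only.
CONTEXT_MAP = {
    "refund_policy": ContextSource(
        name="refund_policy",
        prompts={"support/refund_deferral"},
        retrieval_sources={"kb/refunds.md"},
        eval_inputs={"refund-017", "refund-003"},
    ),
}

impact = impact_of_change(CONTEXT_MAP, "refund_policy")
print(impact["eval_inputs"])  # ['refund-003', 'refund-017']
```

A lookup that fails for an unmapped policy is useful in itself: it shows a gap in the curation work before a stale prompt shows it in production.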

This is the function that makes an LLM-first system feel like it works at the buyer’s company rather than at a generic SaaS company.

Function 3: Prompt registry editorship

A production AI system in 2026 does not have one prompt. It has a registry: dozens of named prompts, each with a version, an owner, a test set, and a performance history. Each is code.

Prompt-registry editorship prevents the registry from becoming legacy code nobody dares change:

  • Locating each prompt in a single registry (a directory of Markdown files with YAML front matter, or Promptfoo, LangSmith, Helicone, Pezzo).
  • Reviewing prompt changes like an engineering manager reviews code: what is the diff, what eval catches the regression, who is on-call if it breaks.
  • Deprecation discipline: when replaced, the old version stays with a tombstone explaining why.
  • A per-prompt README: what this prompt is for, what it must never do, what it depends on.
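
For the "directory of Markdown files with YAML front matter" option, a registry check might look like the sketch below. It assumes a flat front matter of `key: value` lines and parses it with the standard library only, so it is deliberately not a full YAML parser; the required field names are also an assumption, chosen to mirror the version, owner, and test set mentioned earlier.

```python
from dataclasses import dataclass

# Assumed required metadata for every prompt in the registry.
REQUIRED_FIELDS = ("name", "version", "owner", "test_set")


@dataclass(frozen=True)
class PromptEntry:
    name: str
    version: str
    owner: str
    test_set: str
    body: str


def parse_prompt_file(text: str) -> PromptEntry:
    """Parse a Markdown prompt whose front matter is flat `key: value` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening front-matter delimiter")
    end = next(
        (i for i, line in enumerate(lines) if i > 0 and line.strip() == "---"),
        None,
    )
    if end is None:
        raise ValueError("missing closing front-matter delimiter")

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed front-matter line: {line!r}")
        meta[key.strip()] = value.strip()

    missing = [name for name in REQUIRED_FIELDS if not meta.get(name)]
    if missing:
        raise ValueError(f"prompt is missing required fields: {missing}")

    body = "\n".join(lines[end + 1:]).strip()
    return PromptEntry(body=body, **{name: meta[name] for name in REQUIRED_FIELDS})


# Hypothetical registry file contents.
SAMPLE = """---
name: support/refund_deferral
version: 3
owner: jane.doe
test_set: evals/refunds.jsonl
---
Defer refund requests to a human agent.
"""

entry = parse_prompt_file(SAMPLE)
print(entry.name, entry.version, entry.owner)
```

Running a check like this in CI means a prompt without an owner or a test set cannot be merged, which is the editorial rule turned into a gate.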

This function is the biggest reason a senior engineer’s calendar is freed up. The orchestrator handles the editorial layer; the engineer handles the architecture layer. Editor and writer, applied to prompts.

Function 4: Regression facilitation

LLM-first systems regress for reasons alien to traditional software. The model is updated. A retrieval schema changes. A prompt edit interacts badly with a tool description. A new failure mode shows up in production.

Regression triage is partly forensic, partly editorial, partly facilitation. The orchestrator runs each session the way an incident commander runs a postmortem:

  • Convene fast: the prompt owner, the buyer-side stakeholder, the eval owner, the data owner if retrieval is implicated.
  • Anchor in traces, not opinions: pull the failing trace and set it side by side with a passing one from before the regression.
  • Drive the loop: hypothesis → eval input added → fix → re-run → close. Each loop closure documented in registry and suite.
  • Write the postmortem: what changed, what was missed in the eval suite, what is now in the suite to prevent a repeat.
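
The "anchor in traces" and "eval input added" steps both reduce to comparing two eval runs. The sketch below assumes each run is a plain mapping from eval input id to pass/fail; the function names and ids are illustrative, and a trace viewer or eval tool would normally supply this data.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return eval input ids that passed in `before` and fail in `after`.

    Inputs missing from `after` are not flagged; they were removed, not regressed.
    """
    return sorted(
        case_id
        for case_id, passed_before in before.items()
        if passed_before and after.get(case_id) is False
    )


def new_cases(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return eval input ids added since `before`, e.g. during a triage session."""
    return sorted(set(after) - set(before))


# Hypothetical runs before and after a prompt edit.
run_before = {"billing-001": True, "refund-003": True, "plan-010": False}
run_after = {
    "billing-001": True,
    "refund-003": False,  # regressed
    "plan-010": True,     # improved
    "refund-021": False,  # added from a production failure trace
}

print(find_regressions(run_before, run_after))  # ['refund-003']
print(new_cases(run_before, run_after))         # ['refund-021']
```

The two lists map directly onto the postmortem: the first is what changed, the second is what is now in the suite to prevent a repeat.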

Postmortems become an artifact buyers ask for in the next vetting cycle, which is the case the field guide to evaluating an AI agency makes explicitly.

Hiring against this role

The orchestrator profile is not a classical PMP-track delivery PM, and not a senior engineer who happens to write status updates. It is a third profile most agencies under-staff because they are not yet hiring against it.

The traits that matter:

  • Editorial discipline. Comfort with “this prompt has the wrong tone” being a serious finding, written up like a code review comment.
  • Light technical literacy. Reads YAML, Markdown, Git diffs, JSON schemas, eval reports. Does not need to write production Python but needs to read it.
  • Stakeholder navigation. Lands the hard conversation with the buyer’s compliance officer, the on-call engineer, and the senior engineer on the agency side without losing trust.
  • Bias toward the artifact. Answers “what is the status” with a link to the eval-run dashboard or the latest prompt-registry diff, not a written summary.

Strong candidates come out of editorial backgrounds, AI-native product roles, eval-design work at frontier labs, and a long tail of senior PMs at engineering-led companies who have already started doing this work informally.

The operating-model context (how the rest of the agency stack changes to support this role) is covered in “inside the AI agency operating system.” The hire-versus-build comparison between agency PMs and in-house sits in “AI project management: client-agency collaboration.” Sprint-cadence variants are contrasted in “agile AI development sprint planning.”

What buyers should ask

Three high-yield questions about the PM-equivalent role:

  1. “Show me the prompt registry from your last shipped engagement.” No registry, no function. Chaotic registry, title without discipline.
  2. “Who owns the eval suite week to week, and who are they not?” The right answer is a named human who is neither the senior engineer nor the account manager. “We all do” is the wrong answer.
  3. “Walk me through your last regression postmortem.” Postmortems are the single strongest positive signal in agency vetting. Absence is the only signal a buyer needs to keep looking.

Three questions, under twenty minutes, and the operating-model truth is on the table.

Frequently asked questions

Does this mean we should fire our project manager?

No. The role’s calendar should shift. The same human can absorb most of this if the agency commits to retiring the legacy artifacts: JIRA-as-status, weekly written reports, per-feature change orders. If the agency keeps those, hiring a second person will not help; they will be pulled into status work by the surrounding system.

How is this different from a TPM (technical program manager)?

The orchestrator is more editorial and less program-architectural than a TPM at an infra company. A TPM coordinates teams across long-running technical programs; an orchestrator owns the four LLM-specific artifacts for a small number of engagements. A senior TPM with editorial instincts is one of the strongest candidate pools.

Does this work on small engagements, or only large ones?

It works at engagement-of-one scale. On a two-engineer pilot the function is roughly half a senior person; on a longer engagement with multiple buyer-side teams it fills a full calendar.

What if the buyer insists on JIRA and weekly status decks?

Run a parallel artifact track for sixty days. Keep the JIRA board if contractually required, but layer the four artifacts on top. Ship a Friday demo against the eval set. Send the prompt-registry diff alongside the status email. Within two months the demo and the diff become the cadence the buyer asks for; the JIRA board becomes a residual log.
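
As an illustration of the "prompt-registry diff alongside the status email" step, the snippet below produces a unified diff between two versions of one prompt using Python's standard `difflib` module. The prompt name, version labels, and prompt text are made up for the example.

```python
import difflib


def prompt_diff(old: str, new: str, name: str, old_version: str, new_version: str) -> str:
    """Return a unified diff between two versions of one prompt."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"{name}@{old_version}",
        tofile=f"{name}@{new_version}",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical prompt versions.
old_prompt = "You are a support agent.\nHandle refund requests directly."
new_prompt = "You are a support agent.\nDefer refund requests to a human."

weekly_diff = prompt_diff(old_prompt, new_prompt, "support/refund", "v2", "v3")
print(weekly_diff)
```

A diff of a few lines, sent with the eval delta for the same week, tells the buyer what changed in the system rather than what changed in the plan.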

How does this map to in-house AI teams?

Cleanly. In-house teams pay the same coordination cost in calendar time rather than billable hours, which makes it less visible. The four artifacts apply identically. The structural lessons are in the AI agency tax.

Where does the buyer’s product manager fit?

The buyer’s PM is the partner, not a duplicate. The buyer’s PM owns “what should this system do for our customers.” The orchestrator owns “what it does, evidenced in evals and traces, against the spec the PM wrote.” That split is the working model.

How do we measure whether the role is working?

Three signals. The senior engineer’s calendar moves from ~30% status to ~70% engineering inside two months. Regression-to-fix loop time: a failure is captured as an eval input within forty-eight hours, and the fix ships within a week. Buyer trust on the eval threshold: willingness to negotiate ratchets at the cadence demo without dispute.
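
The loop-time signal is the easiest of the three to compute from postmortem records. The sketch below is one possible reading of the targets: it assumes both the forty-eight-hour window and the one-week window are measured from the moment the failure first occurred, and the record fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DETECTION_TARGET = timedelta(hours=48)  # failure captured as an eval input
FIX_TARGET = timedelta(days=7)          # fix shipped


@dataclass(frozen=True)
class RegressionRecord:
    occurred_at: datetime          # when the failure first appeared
    eval_input_added_at: datetime  # when it was captured in the eval suite
    fix_shipped_at: datetime       # when the fix reached production

    def within_targets(self) -> bool:
        # Assumption: both windows start at occurred_at.
        captured_in_time = self.eval_input_added_at - self.occurred_at <= DETECTION_TARGET
        fixed_in_time = self.fix_shipped_at - self.occurred_at <= FIX_TARGET
        return captured_in_time and fixed_in_time


start = datetime(2026, 5, 4, 9, 0)
on_target = RegressionRecord(start, start + timedelta(hours=30), start + timedelta(days=5))
too_slow = RegressionRecord(start, start + timedelta(hours=60), start + timedelta(days=5))

print(on_target.within_targets())  # True
print(too_slow.within_targets())   # False: capture took longer than 48 hours
```

Tracking the share of regressions that land inside both windows gives a trend line the agency and the buyer can review together.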

Is this role full-time or part-time at most agencies?

Full-time on any engagement above roughly 600 hours per quarter. Part-time on smaller ones, but not zero; under-staffing this function is the most common operating-model failure in agencies that have nominally adopted LLM-first delivery.

Key takeaways

  • The legacy agency PM role (JIRA, status, decks) fits 2018-era deterministic delivery, not LLM-first work where quality is a distribution.
  • The role does not disappear; its calendar migrates. Stakeholder cartography, cadence design, risk-calling, and scope-boundary writing all stay.
  • The new shape is the artifact orchestrator: a single owner of four artifacts, namely eval suite and traces, client-context bundle, prompt registry, and regression postmortems.
  • Hire from editorial backgrounds, AI-native product roles, eval-design work, and senior TPMs with editorial instincts, not classical PMP delivery tracks.
  • Buyers ask for the prompt registry, the named eval owner, the last regression postmortem.

Last Updated: May 26, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
