The AI agency security review every CTO should run before kickoff

The pre-kickoff security review is the only point in an AI agency engagement where the cost of saying no is zero. Once the statement of work is signed and the first sprint is burning, most architectural decision the agency made in the proposal; system-prompt structure, key custody, vector store choice, dependency tree, logging strategy; gets harder to undo with each merged PR. By week six the system has shape, by week ten it has scar tissue, and by the time a client security review surfaces a real defect the remediation cost has crossed the threshold where the CTO has to fight for it. The fix is to run that review before kickoff, against named frameworks, with the agency’s lead engineer in the room, and to make the answers a contract appendix.

This piece lays out an eight-section pre-kickoff security review built on the OWASP LLM Top 10, the NIST AI Risk Management Framework, and EU AI Act Article 28 obligations. Each section gives the CTO what to ask, what a credible answer looks like, and what disqualifies the agency. The point is not to turn the CTO into an LLM security specialist; it is to make the cost of bluffing high enough that only agencies with a real security practice get past. This sits inside the AI agency manifesto, which argues that the 2026 buyer is paying for shipped, audit-ready software rather than slide decks.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why a pre-kickoff review and not a post-launch audit

Three forces have made the pre-kickoff review the only review that matters. The cost of a working LLM demo collapsed to a weekend, so the agency can show a running system without having designed for any of the OWASP LLM Top 10 risks. Enterprise security teams have caught up; SOC 2 Type II auditors now ask LLM-specific questions, and any system serving EU users falls under the EU AI Act’s tiered risk regime. And the shape of LLM defects is architectural: prompt injection, system-prompt leakage, excessive agency, and supply-chain risk are not bugs to be patched but properties of the system’s design.

The honest framing is that the pre-kickoff review is not a security exercise. It is a contracting exercise; converting “we will follow security best practices” into eight named commitments mapped to OWASP LLM Top 10 controls, NIST AI RMF functions, and EU AI Act Article 28 obligations. Once they are in writing, the agency is bound to the architecture rather than free to invent it on the fly.

Section 1: prompt injection defense (OWASP LLM01)

What to ask: “Walk me through your LLM01 design pattern. What are your input validation rules? How is the system prompt isolated from user content? Show me your prompt-injection eval pack and one production injection incident, redacted.”

What good looks like: A layered defense the lead engineer can sketch on a whiteboard in five minutes. Input validation against an allowlist of expected request shapes; not a denylist of bad strings. System prompts isolated from user content with structural delimiters and rarely concatenated with retrieved documents that have not been sanitized. Tool calls bound to the original user identity, not to the model’s elevated context. Output validation against typed schemas before any side effect fires. A prompt-injection eval pack with at least 30 attack strings; published jailbreak prompts, indirect injection through retrieved documents, payload smuggling, base64 attacks; run on most PR with a hard CI gate. The agency can name a successful injection incident from a prior client with the remediation that shipped.

What disqualifies: A wave at “guardrails” without naming the specific control points. A claim that the system prompt is “secret” and therefore safe; most prompt leaks eventually. “We will add prompt injection testing in phase 2.” A defense consisting of a single wrapper LLM grading user input. For the deeper treatment of LLM01 controls, see the AI guardrails implementation guide.

Section 2: jailbreak resistance and model behavior (OWASP LLM06, LLM09)

What to ask: “How do you constrain excessive agency? Where is the boundary between what the model can decide and what requires a human in the loop? What is your hallucination rate ceiling, and how is it measured?”

What good looks like: Tools the model can invoke have explicit allowlists, per-tool input validation, and per-tool rate limits. The model rarely has direct database write access; most mutation goes through a typed interface that the application controls. There is a written matrix of decisions classified by reversibility, and irreversible actions (sending money, deleting records, sending external email) require either a confirmation step or a human approver. The hallucination rate is measured against ground-truth eval cases, has a numeric ceiling tied to a business outcome, and gates production deploys. The agency has a red-team eval pack of jailbreak prompts (DAN variants, role-play attacks, encoding tricks) and runs it in CI.

What disqualifies: “The model has access to the database but it would rarely do that.” Tools whose arguments are arbitrary strings passed to a shell, interpreter, or SQL parser. No defined hallucination ceiling. A red-team eval pack run once in week two and rarely updated. For adversarial testing cadence, see the AI security red-teaming services overview.

Section 3: training data and model leakage (OWASP LLM02, LLM07, LLM10)

What to ask: “What flows through the model in production? What ends up in logs, in fine-tuning corpora, in vector indexes, and in eval data? What contractual data-use terms do you have with each model provider, and where are they pinned?”

What good looks like: A written data-flow diagram that traces most production input through the system; model API, observability stack, eval sampler, vector index, fine-tuning corpus if one exists. PII is redacted at write time, not query time, with a tested redaction library and CI cases that prove the redaction holds. Each model provider has a signed enterprise agreement with zero-data-retention or short-retention terms, and the contract is referenced by version in the system documentation. There is no fine-tuning on production traffic without an explicit consent mechanism. System prompts that contain proprietary business logic are stored in the secrets manager, not embedded in the deployed binary.

What disqualifies: “We do not log prompts” without a redaction library; the prompts are still in the provider dashboard. A vector index hydrated from an internal corpus without access controls. System prompts hardcoded in the repository with company secrets. No token rate limit per user, no daily cost cap, no defense against a billing-DoS from a malicious or buggy client.

What to ask: “Show me a four-region map for the system. Where does inference run, where do logs land, where does the vector index live, where does the eval sampler store production traffic? What DPA do we sign, and what subprocessors flow down?”

What good looks like: A one-page diagram with many four regions named, and each region selected to satisfy the client’s residency obligations. For an EU client that is typically an EU-region inference endpoint (Anthropic EU, OpenAI EU, Azure EU), EU-region observability (self-hosted Langfuse, EU-region Datadog), an EU-hosted vector database (pgvector on EU Postgres, Pinecone EU, Weaviate self-hosted), and an EU eval sampler. A Data Processing Agreement signed with the agency that flows down to most subprocessor on the list. A clear party matrix under EU AI Act Article 28; who is the provider, who is the deployer, who carries the conformity assessment burden.

What disqualifies: “We use OpenAI” without specifying the region or the data-use terms. A vector index defaulting to a US region for an EU client because the engineer who set it up did not check. No DPA, or a DPA that does not name subprocessors. A provider/deployer ambiguity under the AI Act that the agency proposes to resolve “later.” A residency story that holds for inference but breaks for logging.

Section 5: key management and token egress

What to ask: “Who holds the production API keys? Where are they stored? How are they rotated? What egress monitoring is in place on token consumption, and what fires if a key is compromised?”

What good looks like: The client holds the production keys, full stop. Keys live in a client-controlled secrets manager; AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault; with audited access. They are scoped per environment, rarely shared across staging and production. Rotation is on a defined cadence (90 days is typical) and is rehearsed before the engagement ends. Egress monitoring on token consumption is wired into the observability stack with alerts that fire on anomalous spend within minutes, not on the bill at month-end. The agency’s development environment uses a separate key with a low cap and no production data access.

What disqualifies: The agency holding the client’s production keys “for convenience.” Keys committed to a private repository, even briefly. A single key shared across staging, production, and the eng team’s laptops. No rotation procedure. Cost monitoring that lives only in the provider dashboard and is checked weekly by an unstaffed inbox. An agency that cannot describe the egress alert path in 60 seconds is operating without one.

Section 6: dependency supply chain (OWASP LLM03, MITRE ATLAS)

What to ask: “What is your dependency policy? How do you pin versions, audit CVEs, and vet a new package? How fast do you patch a transitive vulnerability when one publishes?”

What good looks like: A pinned dependency tree with lockfiles. CVE auditing on most PR through Snyk, GitHub Advanced Security, or Socket; blocking merges on critical advisories. New packages go through a written intake review checking maintainer reputation, release cadence, and the package’s own dependency tree. A patch SLA; typically 24 hours for critical, 7 days for high; that is monitored, not aspirational. The agency can name specific CVEs they patched in 2024 and 2025 across the LLM toolchain (LangChain, LlamaIndex, vector DB clients, eval frameworks, Langfuse SDKs). MITRE ATLAS is referenced as the framework for adversarial supply-chain techniques.

What disqualifies: Latest tags pulled into production. A “security review” of dependencies done once at kickoff. No lockfile. An agency that cannot name a CVE they patched has not been operating long enough to have patched one.

Section 7: incident response SLA and on-call

What to ask: “Show me the incident response playbook for the last AI-specific incident you handled. What is the SLA in our contract; acknowledgment, first response, containment? Who is on call, and through what system?”

What good looks like: A written incident response playbook with AI-specific severity-1 categories: a successful prompt injection that exfiltrated data, a model-update regression that broke production accuracy, a runaway cost incident, a system-prompt leak. The contract specifies a 15-minute acknowledgment, a 1-hour first response with triage, and a 4-hour containment path for severity-1. The on-call rotation is named, runs through PagerDuty or Opsgenie, and the agency carries it for the first 90 days post-launch with an explicit transition plan. The playbook references SOC 2 Type II controls. The agency can walk through one redacted post-mortem in detail.

What disqualifies: “We respond to issues during business hours.” A Slack channel as the incident system. The agency rotation ending at the contract deadline with no transition plan. No AI-specific severity categories. No post-mortem culture; the agency claims they have not had an incident in two years.

Section 8: audit logging and traceability

What to ask: “Show me the audit log schema. What is logged on most LLM call? How long are logs retained? What is the query latency on a one-week-old log? Walk me through reconstructing a successful prompt injection from logs alone.”

What good looks like: Most LLM call is logged with full prompt, full completion, model version, latency, token counts, cost, trace ID, user ID, request ID, and tool-call results. PII is redacted at write time. Logs land in immutable storage (S3 with object lock, BigQuery with retention, or equivalent). Hot retention is 90 days, cold one year, configurable per the client’s policy. Trace IDs propagate across retrieval, model, and tool calls so a multi-hop incident can be reconstructed end-to-end. Access to raw logs is itself audited. The agency demonstrates the stack; Langfuse, LangSmith, OpenTelemetry; with real numbers from the last shipped system. Common acceptable stacks are referenced in the security questions checklist for AI teams.

What disqualifies: “We use the provider’s dashboard”; provider dashboards are not audit-grade and do not survive a SOC 2 Type II review. Logs that contain unredacted PII. Trace IDs that stop at the model call boundary. A logging stack no one can query in under five minutes. Retention policies invented on the call rather than referenced from a written document.

How to run the review

The review takes 90 minutes with the agency’s lead engineer and the security lead. Print the eight sections. For each one, ask the question, listen to the answer, and grade it green (clear and documented), yellow (correct direction, needs commitment in writing), or red (no answer or wrong answer). Yellow items become contract appendices the agency commits to before kickoff. Red items either get resolved before the SOW is signed or the agency is wrong for the engagement.

Two outcomes are possible. The agency answers eight greens and the engagement starts with a shared, named security architecture. Or the agency cannot get past three or four sections, and the CTO has saved the company from buying a system destined to fail its first internal review. The review’s job is not to certify the agency; it is to convert vague proposal language into specific commitments the agency cannot retract.

Security in an LLM system is not a feature; it is a substrate. Each of the eight sections is a habit the agency built before it walked into the meeting or it did not build at many. An agency that promises to “harden the system in the final sprint” has rarely operated a real LLM system through a real security review. The pre-kickoff review is how the CTO tells them apart, while the cost of saying no is still zero.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run pre-kickoff security reviews against more than three dozen AI agencies and tracked which commitments survived the first production incident.

Frequently Asked Questions

Why should a CTO run a security review on an AI agency before kickoff rather than during the engagement?

Most AI agency security defects are architectural; system-prompt design, key-custody choices, dependency selections, logging strategy. Once the agency has built two sprints of code on top of a flawed architecture, the cost to rip it out crosses the threshold where the client will accept it. A pre-kickoff review forces the agency to commit to OWASP LLM Top 10 controls, NIST AI RMF mapping, and EU AI Act Article 28 obligations before code is written. That commitment becomes a contract appendix, and the agency is now bound to the architecture rather than free to invent it on the fly. Reviewing after kickoff is how clients end up paying twice; once for the original system and once for the remediation.

What is the OWASP LLM Top 10, and why does it matter for an AI agency security review?

The OWASP LLM Top 10 is the canonical industry list of the most critical LLM application risks; LLM01 prompt injection, LLM02 sensitive information disclosure, LLM03 supply chain, LLM04 data and model poisoning, LLM05 improper output handling, LLM06 excessive agency, LLM07 system prompt leakage, LLM08 vector and embedding weaknesses, LLM09 misinformation, LLM10 unbounded consumption. It matters because it gives the CTO a shared vocabulary with the agency. When the agency’s lead engineer cannot describe their LLM01 controls in 60 seconds, the conversation is over. The OWASP Top 10 is also what most enterprise security teams will use to audit the system post-launch, so an agency that has not designed against it is shipping a system that will fail its first internal review.

What is the NIST AI Risk Management Framework, and how does it fit into a pre-kickoff review?

NIST AI RMF (AI 100-1) is a voluntary US framework organized around four functions; Govern, Map, Measure, Manage; that provides a structured way to treat AI risk across the system lifecycle. In a pre-kickoff review, NIST AI RMF acts as the meta-framework that the OWASP LLM Top 10 fits inside. The CTO should ask which of the four functions the agency operates against, what artifacts they produce for each, and how they map their measurement function to a CI-gated eval suite. An agency that cannot place its security work inside NIST AI RMF is operating ad hoc, and ad hoc security does not survive a SOC 2 Type II audit or an EU AI Act compliance review.

How should prompt injection defenses be evaluated before kickoff?

Ask the agency to walk through their LLM01 design pattern in concrete terms. A real answer specifies a layered defense; input validation against an allowlist of expected request shapes, system prompts isolated from user content with structural delimiters, tool calls bound to the original user identity rather than the model’s request, output validation against typed schemas before any side effect, and a prompt-injection eval pack with at least 30 published attack strings (jailbreak prompts, indirect injection through retrieved documents, payload smuggling) run in CI. The agency should be able to show one prompt-injection incident from a prior client, redacted, with the remediation. A vague answer that names guardrails as a vendor product without naming the specific control points is disqualifying.

Who should hold the production API keys for the LLM provider, the agency or the client?

The client. Usually. The agency builds the system; the client owns the keys. Production keys live in a client-controlled secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault), are scoped per environment, are rotated on a defined cadence, and have egress monitoring on token consumption. The agency uses a separate development key with a low spending cap. A pre-kickoff review should establish key custody on day one. An agency that holds the client’s production keys for convenience has built a system the client cannot legally hand to a successor and will struggle to pass any enterprise security questionnaire that asks about credential ownership.

What does data residency look like for an AI system in 2026, and how does it affect the agency selection?

Data residency in 2026 is not a single switch but a four-part question. Where does the inference run? Where are the prompts and completions logged? Where does the vector index live, and where do retrieval queries hit? Where does the eval suite store its production traffic samples? For an EU client, many four answers must satisfy GDPR Article 44 cross-border restrictions, which typically requires an EU-region inference endpoint (Anthropic EU, OpenAI EU, Azure EU), EU-region observability, EU-hosted vector database, and a Data Processing Agreement signed with the agency that flows down to most subprocessor. An agency that cannot produce a four-region map of the system’s data flows in 24 hours has not architected for residency.

How does the EU AI Act affect an AI agency engagement that ships a system serving EU users?

The EU AI Act creates a tiered risk regime; prohibited, high-risk, limited-risk, minimal-risk; and Article 28 imposes specific obligations on providers and downstream deployers, including transparency, technical documentation, data governance, human oversight, accuracy and robustness testing, and conformity assessment for high-risk systems. For an AI agency engagement, this means the contract must specify which party is the provider and which is the deployer, what technical documentation the agency hands over, and what conformity assessment evidence travels with the system. An agency that cannot answer ‘are we the provider or the deployer under the AI Act’ has not engaged with the regulation that will govern the system’s first audit.

Why is the supply chain; eval libraries, vector databases, prompt frameworks; a security concern for an AI engagement?

OWASP LLM03 elevates supply chain because the typical AI system depends on dozens of fast-moving open-source packages; LangChain, LlamaIndex, Pinecone or Weaviate or pgvector clients, eval frameworks like Promptfoo or Inspect, observability SDKs like Langfuse; many of which had production CVEs in 2024 and 2025. The pre-kickoff review should ask the agency for their dependency policy: how they pin versions, how they audit for CVEs (Snyk, GitHub Advanced Security, Socket), how they vet a new package before adding it, and what their response time is when a transitive vulnerability publishes. MITRE ATLAS catalogs the adversarial supply-chain techniques for AI systems and is a useful reference. An agency that pulls latest tags into production has not internalized the LLM03 risk.

What is an acceptable incident response SLA for an AI system shipped by an agency?

For a production-traffic system, the agency commits to a 15-minute acknowledgment, a 1-hour first response with triage, and a defined containment path within 4 hours for severity-1 incidents; including AI-specific severity-1s like a successful prompt injection that exfiltrated data, a model-update regression, or a runaway cost incident. The on-call rotation is named in the contract and runs through PagerDuty or Opsgenie, not a Slack channel. The agency carries the rotation for the first 90 days post-launch and transitions the client engineering team in. SOC 2 Type II expects this discipline, and any agency claiming security maturity without an incident response SLA is claiming a posture they cannot demonstrate.

What does production-grade audit logging look like for an LLM application in 2026?

Most LLM call is logged with the full prompt, full completion, model version, latency, token counts, cost, trace ID, user ID, request ID, and tool-call results, written to immutable storage with PII redacted at write time. Logs are queryable in production within five minutes and retained per the client’s data retention policy (commonly 90 days hot, 1 year cold). Trace IDs propagate across retrieval, model, and tool calls so a successful prompt injection can be reconstructed end-to-end. Access to raw logs is audited. The agency demonstrates the logging stack on their last shipped system; Langfuse, LangSmith, or a custom OpenTelemetry pipeline; with real numbers, not screenshots from a vendor demo. Logging that lives only in the LLM provider dashboard is not audit-grade.

The AI agency security review every CTO should run before kickoff

Decision Scope

Why a pre-kickoff review and not a post-launch audit

Section 1: prompt injection defense (OWASP LLM01)

Section 2: jailbreak resistance and model behavior (OWASP LLM06, LLM09)

Section 3: training data and model leakage (OWASP LLM02, LLM07, LLM10)

Section 5: key management and token egress

Section 6: dependency supply chain (OWASP LLM03, MITRE ATLAS)

Section 7: incident response SLA and on-call

Section 8: audit logging and traceability

How to run the review

Frequently Asked Questions

Why should a CTO run a security review on an AI agency before kickoff rather than during the engagement?

What is the OWASP LLM Top 10, and why does it matter for an AI agency security review?

What is the NIST AI Risk Management Framework, and how does it fit into a pre-kickoff review?

How should prompt injection defenses be evaluated before kickoff?

Who should hold the production API keys for the LLM provider, the agency or the client?

What does data residency look like for an AI system in 2026, and how does it affect the agency selection?

How does the EU AI Act affect an AI agency engagement that ships a system serving EU users?

Why is the supply chain; eval libraries, vector databases, prompt frameworks; a security concern for an AI engagement?

What is an acceptable incident response SLA for an AI system shipped by an agency?

What does production-grade audit logging look like for an LLM application in 2026?

See how companies like yours are using AI

Related articles

The 10x Developer Used to Be a Unicorn — Now We're Approaching the 1000x Paradigm

A field guide to evaluating an AI agency in under 90 minutes

Agentic AI Development: Tool Use and Function Calling

Where ideas become AI products

Company

General

Case Studies

Services

Resources

The AI agency security review every CTO should run before kickoff

Decision Scope

Why a pre-kickoff review and not a post-launch audit

Section 1: prompt injection defense (OWASP LLM01)

Section 2: jailbreak resistance and model behavior (OWASP LLM06, LLM09)

Section 3: training data and model leakage (OWASP LLM02, LLM07, LLM10)

Section 4: data residency and DPA (GDPR Article 44, EU AI Act Article 28)

Section 5: key management and token egress

Section 6: dependency supply chain (OWASP LLM03, MITRE ATLAS)

Section 7: incident response SLA and on-call

Section 8: audit logging and traceability

How to run the review

Frequently Asked Questions

Why should a CTO run a security review on an AI agency before kickoff rather than during the engagement?

What is the OWASP LLM Top 10, and why does it matter for an AI agency security review?

What is the NIST AI Risk Management Framework, and how does it fit into a pre-kickoff review?

How should prompt injection defenses be evaluated before kickoff?

Who should hold the production API keys for the LLM provider, the agency or the client?

What does data residency look like for an AI system in 2026, and how does it affect the agency selection?

How does the EU AI Act affect an AI agency engagement that ships a system serving EU users?

Why is the supply chain; eval libraries, vector databases, prompt frameworks; a security concern for an AI engagement?

What is an acceptable incident response SLA for an AI system shipped by an agency?

What does production-grade audit logging look like for an LLM application in 2026?

See how companies like yours are using AI

Related articles

The 10x Developer Used to Be a Unicorn — Now We're Approaching the 1000x Paradigm

A field guide to evaluating an AI agency in under 90 minutes

Agentic AI Development: Tool Use and Function Calling