
The Case for the In-House AI Tiger Team Alongside an External Agency

The two team structures we see most often in AI orgs are wrong alone and right when paired. Pure in-house teams of 2 to 3 cannot cover the rails layer at production quality and still ship the moat layer at the depth the org needs. Pure agency engagements of 5 to 8 cover the rails layer well and leave the moat layer hollow. The hybrid, a 2 to 3 person in-house tiger team alongside a 5-person external agency, covers both layers at the right depth. It is not a compromise; it is the shape the AI build-vs-buy-vs-hire matrix produces when the principles are applied honestly. This piece names the five reasons the hybrid dominates either pure structure, what each body owns, and how the cost shape evolves over three years.

This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s eighth principle says compose: buy the rails, build the moat, hire the judgment. The tiger team plus agency hybrid is the operational shape of that principle. Anything tighter on either side fails predictably; anything looser under-uses the structure the matrix demands.

The two pure structures and why each fails alone

Pure in-house AI team, 2 to 3 people. Strong on the moat layer because the team has institutional context, eval-set authorship, and the architectural decisions that depend on it. Weak on the rails layer because 2 to 3 people cannot cover model gateway integration, observability backend, vector indexing, agent-framework wiring, prompt-registry stack, deployment automation, and the operational throughput required to ship the first 18 months of work. The team produces a deeply-considered slice of the system and a stack that drifts because no one is operating the rest.

Pure agency engagement, 5 to 8 people. Strong on the rails layer because the agency has reusable patterns, capability density, and a delivery cadence that produces output at the rate the org needs. Weak on the moat layer because the moat layer encodes judgment the agency cannot absorb: eval-set tacit knowledge, regulatory accountability, IP-critical defensibility, and institutional context, as set out in the refusal conditions for outsourcing AI. The agency produces a competent system that fails on the moat axis 12 to 18 months in.

The hybrid pairs the two structures and covers both layers at the right depth. The tiger team is small by design (judgment, not throughput), and the agency is sized to deliver throughput against the tiger team’s judgment.

Reason 1: capability coverage at the right depth per layer

A moderately mature AI stack has roughly 35 distinct capabilities. About 10 are moat-layer (eval set authorship, prompt content, agent architecture, model selection criteria, regulated policy logic, kill switches, observability rules, model-routing configuration, retrieval logic over proprietary data, agent orchestration logic). About 25 are rails-layer (model gateway, observability backend, vector index, agent framework wiring, eval harness, prompt registry stack, deployment automation, tracing storage, dashboards, and the rest).

The two layers require different work modes. Moat-layer capabilities require deep judgment, days of debate per decision, and daily exposure to the org’s data and customers. Rails-layer capabilities require pattern reuse, integration velocity, and the willingness to pick the boring choice fast.

A 2 to 3 person in-house team has bandwidth for one layer at production quality. Choosing the moat layer is correct per the matrix; the rails layer then drifts. A 5 to 8 person agency has bandwidth for both layers at production quality but does not have the institutional context for the moat layer. The hybrid splits the layers between the two bodies. Each body works in the mode it is structurally suited for. Coverage hits 100 percent at the right depth.

Reason 2: institutional context survives because the tiger team carries it

Institutional context is the year-over-year accumulation of how the org’s data behaves, what its customers tolerate, where its compliance lines fall, and which engineering trade-offs have been litigated. The agency cannot have it on day one and cannot fully acquire it over the engagement.

The tiger team carries it natively. Each member has been inside the org long enough to recognize that “this output looks weird” maps to “the customer is in regulatory cohort B and we don’t ship that pattern to them.” The agency working alone makes the wrong call here because the agency does not have the cohort-B context. The agency working alongside the tiger team makes the right call because the tiger team flags it.

The flagging is the daily work of the hybrid. Per the anatomy of a great AI agency kickoff, a well-run engagement has the in-house judgment surface integrated into the agency’s review cycle from day one: design reviews, eval reviews, prompt reviews, model-selection reviews. The tiger team brings judgment; the agency brings throughput; the system that ships reflects both.

Reason 3: the moat layer gets eval-set authorship that does not transfer

Per the matrix’s fifth principle and per the case for buying the eval stack and building the evaluator, the eval set must be authored inside the org. It encodes tacit judgment that does not survive a contract boundary. The agency-only structure tries to outsource eval-set authorship and produces a credible-looking artifact that scores green on cases the org does not care about.

The tiger team owns the eval set. The agency builds the harness, the regression tooling, the dashboards, the storage. The agency runs the eval against the tiger team’s set on every change. The tiger team reviews the regression output and signs off. The split is precisely the build-with-help split named in the buy-trap analysis.

Without the tiger team, the eval set has no native authorship inside the org and the system ships against the wrong target. With it, the eval set is the org’s native artifact and the system ships against the right target. This property is the largest single delta the hybrid produces.
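The split described above can be sketched as a small regression gate. This is a minimal illustration, not a prescribed harness: the names `EvalCase`, `GateResult`, and `run_gate` are hypothetical, and exact string matching stands in for whatever scoring the real eval set would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One row of the eval set. Authored by the tiger team, not the agency."""
    case_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class GateResult:
    pass_rate: float
    failed_ids: tuple
    signed_off: bool

    @property
    def can_ship(self) -> bool:
        # Shipping needs a clean regression run AND tiger team sign-off.
        return not self.failed_ids and self.signed_off


def run_gate(
    cases: Sequence[EvalCase],
    system: Callable[[str], str],
    signed_off_by: Optional[str] = None,
) -> GateResult:
    """Agency-side harness: run every tiger team case against the system.

    Exact-match scoring is an assumption made to keep the sketch short.
    """
    if not cases:
        raise ValueError("the eval set must contain at least one case")
    failed = tuple(c.case_id for c in cases if system(c.prompt) != c.expected)
    pass_rate = 1 - len(failed) / len(cases)
    return GateResult(pass_rate, failed, signed_off=signed_off_by is not None)
```

The design point is that the agency can run `run_gate` on every change without owning the cases, while `can_ship` stays false until someone on the tiger team has signed off.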

Reason 4: model-selection velocity matches the quarterly re-litigation cadence

Per the matrix’s seventh principle, AI sourcing decisions must be re-litigated quarterly. Model selection is the highest-velocity decision in the matrix; the right model on January 1 is rarely the right model on April 1, and the right routing configuration in Q1 is rarely the right routing configuration in Q3.

A pure agency engagement re-litigates model selection on the engagement’s cadence, which is typically monthly inside the engagement and not at all between engagements. A pure in-house team re-litigates model selection in theory but lacks the bandwidth to execute the rebuild. Neither structure matches the quarterly re-litigation cadence at production scale.

The hybrid does. The tiger team writes the new model-selection criteria, the routing rules, and the eval-set additions every quarter. The agency executes the integration changes, the cost-tracking updates, the regression run, and the production rollout. The tiger team’s quarterly lift is roughly two weeks; the agency’s quarterly lift is roughly four weeks. The capability ships against the new model on schedule.
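One way to picture the quarterly division of labor is to treat the criteria as data the tiger team writes and the selection as a function the agency runs. The sketch below is illustrative only: the fields on `SelectionCriteria`, the candidate names, and the "cheapest eligible model wins" rule are all assumptions, not something the matrix prescribes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SelectionCriteria:
    """Written by the tiger team each quarter (hypothetical fields)."""
    min_eval_pass_rate: float
    max_cost_per_1k_requests: float
    max_p95_latency_ms: int


@dataclass(frozen=True)
class Candidate:
    """Measured by the agency when it runs the regression suite."""
    name: str
    eval_pass_rate: float
    cost_per_1k_requests: float
    p95_latency_ms: int


def select_model(
    criteria: SelectionCriteria, candidates: Sequence[Candidate]
) -> Optional[Candidate]:
    """Return the cheapest candidate that meets every threshold, or None."""
    eligible = [
        c for c in candidates
        if c.eval_pass_rate >= criteria.min_eval_pass_rate
        and c.cost_per_1k_requests <= criteria.max_cost_per_1k_requests
        and c.p95_latency_ms <= criteria.max_p95_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_1k_requests)


# Hypothetical measurements; none of these numbers come from a real model.
CANDIDATES = (
    Candidate("model-a", 0.95, 12.0, 900),
    Candidate("model-b", 0.92, 8.0, 700),
    Candidate("model-c", 0.97, 30.0, 1500),
)

q1_pick = select_model(SelectionCriteria(0.90, 20.0, 1000), CANDIDATES)
q2_pick = select_model(SelectionCriteria(0.94, 20.0, 1000), CANDIDATES)
```

Tightening a single threshold between quarters changes the pick without touching the integration code, which is the property that keeps the tiger team’s quarterly lift small.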

Reason 5: the agency-to-in-house transition is planned, not improvised

Most pure-agency engagements end with a chaotic handoff. The agency’s engagement clock runs out; the in-house team is half-staffed and three months behind on context; the artifact transfer is incomplete. The org spends 2 to 4 quarters reconstructing what the contract said it had.

The hybrid plans the transition from day one. The tiger team is the body that grows over time as the agency contracts. By month 12, the team is at 4 to 5 people and absorbs the agent-framework operations from the agency. By month 18, the team is at 5 to 6 people and absorbs the eval-harness operations. By month 24, the team is at 6 to 8 people and the agency is in a steady-state rails support role.

The transition is planned at the contract level: phase boundaries, capability transfer milestones, knowledge-transfer obligations. The tiger team is where the knowledge lands, because the tiger team has been alongside the agency from day one and the knowledge transfers conversationally rather than ceremonially. Per the AI hybrid playbook, the inside-outside ratio inverts over time: 30 percent inside in the first year, 70 percent inside by year three.
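The milestones above can be written down as a small phase table, which is roughly what a contract-level transition plan encodes. This is a sketch under one stated assumption: headcount is treated as a step function that holds between milestones, since the text gives values only at months 12, 18, and 24. The `Phase` type and `phase_at` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phase:
    start_month: int
    team_size: Tuple[int, int]  # (low, high) in-house headcount
    absorbs: str                # what moves in-house at this boundary


# Milestones as stated in the text; month 0 is the starting tiger team.
PHASES = (
    Phase(0, (2, 3), "moat-layer ownership only"),
    Phase(12, (4, 5), "agent-framework operations"),
    Phase(18, (5, 6), "eval-harness operations"),
    Phase(24, (6, 8), "agency moves to steady-state rails support"),
)


def phase_at(month: int) -> Phase:
    """Return the most recent phase whose boundary has been reached.

    Assumes values hold flat between milestones (a simplification).
    """
    if month < 0:
        raise ValueError("month must be non-negative")
    current = PHASES[0]
    for phase in PHASES:
        if phase.start_month <= month:
            current = phase
    return current
```

Keeping the table as data makes it easy to compare actual headcount against the plan at each quarterly review.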

What the tiger team owns and what the agency owns

The split is sharp. Each body has a named scope, and crossing the line requires a portfolio-level decision rather than a project-level improvisation.

Tiger team owns: eval set authorship, prompt content, model-selection criteria, agent architecture decisions, regulatory and policy logic, kill switches, observability rules, and the cross-capability investment thesis. Each is a moat-layer capability per the matrix’s eighth principle. The tiger team makes the decisions and signs off on the agency’s work that exercises them.

Agency owns: model gateway integration, observability backend, vector index plumbing, basic agent framework wiring, eval-harness implementation, dashboard construction, prompt-registry stack, deployment automation, and the throughput of the first 12 to 18 months of build work. Each is a rails-layer capability that benefits from the agency’s reusable patterns and economy of scale.

The named ownership matters. Without it, the tiger team and the agency duplicate work or both leave the same capability uncovered. With it, every capability has a named home and the two bodies operate in their natural modes.
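A capability ledger with exactly one owner per row is one simple way to make this ownership checkable. The sketch below uses the two ownership lists from this section; the `Owner` enum and the `uncovered` helper are illustrative names, and a real ledger would carry all of the roughly 35 rows rather than the 16 named here.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, List


class Owner(Enum):
    TIGER_TEAM = "tiger team"
    AGENCY = "agency"


# A dict maps each capability to a single owner, so shared ownership
# cannot be expressed by construction.
LEDGER: Dict[str, Owner] = {
    "eval set authorship": Owner.TIGER_TEAM,
    "prompt content": Owner.TIGER_TEAM,
    "model-selection criteria": Owner.TIGER_TEAM,
    "agent architecture decisions": Owner.TIGER_TEAM,
    "regulatory and policy logic": Owner.TIGER_TEAM,
    "kill switches": Owner.TIGER_TEAM,
    "observability rules": Owner.TIGER_TEAM,
    "cross-capability investment thesis": Owner.TIGER_TEAM,
    "model gateway integration": Owner.AGENCY,
    "observability backend": Owner.AGENCY,
    "vector index plumbing": Owner.AGENCY,
    "basic agent framework wiring": Owner.AGENCY,
    "eval-harness implementation": Owner.AGENCY,
    "dashboard construction": Owner.AGENCY,
    "prompt-registry stack": Owner.AGENCY,
    "deployment automation": Owner.AGENCY,
}


def uncovered(required: Iterable[str], ledger: Dict[str, Owner]) -> List[str]:
    """Capabilities the stack needs that have no named home in the ledger."""
    return sorted(set(required) - set(ledger))


rows_per_owner = Counter(LEDGER.values())
```

Running `uncovered` against the full list of capabilities the stack needs surfaces anything that neither body has claimed, before it drifts.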

The cost shape over three years

First six months: roughly 1.4x the cost of either pure structure. The tiger team is fully staffed; the agency is at full engagement; the org is paying for both. Output per dollar is lower than either pure structure for this window.

Months 6 to 18: roughly the same cost as either pure structure, with materially higher output. The agency is shipping rails-layer capabilities at full velocity; the tiger team is shipping moat-layer capabilities and exercising sign-off on the agency’s work. The output-per-dollar curve crosses the pure-structure curves in this window.

Months 18 onward: lower than either pure structure as the agency’s footprint contracts and the in-house team operates against a mature stack. The capability ledger has 35 named rows; the in-house team owns 25 of them; the agency runs steady-state rails support on the remaining 10.

The total-cost-over-three-years curve favors the hybrid in roughly 70 percent of the cases we audit. The 30 percent where it does not are the cases where the org needed only the rails layer (a pure-buy product) or only the moat layer (a vertical AI startup with no rails dependency).
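The three windows can be turned into rough arithmetic. The sketch below assumes a pure structure costs a flat 1.0 per month, reads "roughly the same" as exactly 1.0x, and leaves the month-18-onward multiplier as a parameter, because the text says only that it is lower and gives no figure. The value 0.8 used in the example is purely hypothetical.

```python
def cumulative_cost_ratio(late_multiplier: float, months: int = 36) -> float:
    """Hybrid cost over `months`, relative to a pure structure at 1.0/month.

    Months 0-5 cost 1.4x, months 6-17 cost 1.0x, and months 18 onward
    cost `late_multiplier` (an assumed value, not given in the text).
    """
    total = 0.0
    for month in range(months):
        if month < 6:
            total += 1.4
        elif month < 18:
            total += 1.0
        else:
            total += late_multiplier
    return total / months


def break_even_late_multiplier(months: int = 36) -> float:
    """Late-window multiplier at which the hybrid's total equals the baseline."""
    if months <= 18:
        raise ValueError("break-even needs a horizon longer than 18 months")
    early_cost = 6 * 1.4 + 12 * 1.0  # 20.4 baseline-months spent by month 18
    return (months - early_cost) / (months - 18)


example_ratio = cumulative_cost_ratio(0.8)   # hypothetical late multiplier
threshold = break_even_late_multiplier()     # 15.6 / 18 over three years
```

Under these assumptions the early 1.4x premium amounts to 2.4 extra baseline-months, so over a three-year horizon the hybrid comes out ahead on total cost only if the month-18-onward run rate falls below about 0.87x of a pure structure.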

What to encode

For orgs deciding the team structure, encode the hybrid as the default and argue exceptions.

  • The named bodies. The tiger team and the agency exist as named entities with separate scopes, separate budgets, and separate decision rights.
  • The capability split. Every named capability sits in either the tiger team’s column or the agency’s column, with no shared ownership and no implicit defaults.
  • The transition plan. The contract names the phase boundaries (months 6, 12, 18, 24) and the capability transfers that happen at each.
  • The quarterly review. The portfolio review covers both bodies’ work; the verbs are re-litigated against both bodies’ capacity.
  • The judgment guarantee. The tiger team’s sign-off is required on every moat-layer change; the agency’s velocity is preserved by making the sign-off cycle short.

The five together produce the structural shape the matrix demands.

Frequently asked questions

What is an AI tiger team?

A 2 to 3 person in-house AI team with deep coverage of the moat layer (eval design, prompt content, agent architecture, model selection), operating alongside an external agency that owns the rails layer. It is small by design: judgment, not throughput.

Why does the hybrid beat a pure agency engagement?

The agency leaves five things on the floor: tacit institutional context, eval-set authorship, regulatory accountability, model-selection velocity, and the moat layer. The tiger team picks all five up.

Why does the hybrid beat a pure in-house team?

A 2 to 3 person in-house team cannot cover both layers at production quality. Adding a 5-person agency for the rails layer doubles the system’s coverage without doubling the senior-judgment cost.

What does the tiger team own?

Eval set authorship, prompt content, model-selection criteria, agent architecture, regulatory and policy logic, kill switches, observability rules, cross-capability investment thesis.

What does the agency own?

Model gateway, observability backend, vector index, agent framework wiring, eval-harness implementation, dashboards, prompt-registry stack, deployment automation, the throughput of the first 12 to 18 months.

How does the hybrid handle the agency-to-in-house transition?

By making the transition explicit from day one. The tiger team scales to 4 to 6 over months 12 to 18; the agency contracts proportionally. By month 24, the agency is in a steady-state rails support role.

What is the cost shape of the hybrid?

Months 0–6: roughly 1.4x either pure structure. Months 6–18: roughly the same with higher output. Months 18 onward: lower than either pure structure. Total-cost-over-three-years favors the hybrid in roughly 70 percent of cases.

Does this work for early-stage companies?

Yes, with the tiger team smaller. A seed-stage version is one senior AI engineer plus a 3-person agency engagement. The structure compresses; the principles hold.

What does this principle imply for the build-vs-buy-vs-hire matrix?

It is the operational shape of the matrix’s eighth principle: buy the rails, build the moat, hire the judgment.

Key takeaways

  • Pure in-house teams of 2 to 3 cannot cover the rails layer at production quality.
  • Pure agency engagements cannot cover the moat layer because moat encodes tacit judgment.
  • The hybrid covers both layers at the right depth: tiger team for moat, agency for rails.
  • Eval-set authorship is the single largest delta the tiger team contributes.
  • Model-selection velocity matches the quarterly re-litigation cadence only in the hybrid.
  • The agency-to-in-house transition is planned at the contract level, not improvised at handoff.
  • Total cost over three years favors the hybrid in roughly 70 percent of audited cases.

Return to the AI build-vs-buy-vs-hire decision matrix manifesto, the anchor piece.

Last Updated: Jun 19, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
