A pattern we see almost every month: a company runs an AI pilot, gets a promising demo, tells the board it’s ready, and then six months later it’s quietly retired. The model wasn’t the problem. It rarely is.
What kills these pilots is the boring middle — the work between “the prototype answers questions correctly” and “real users get value from it inside their actual workflow.” This is the part that doesn’t make it into vendor demos, and the part that consumes most of a real budget.
The three places pilots die
In our engagements we see three failure modes, in roughly this order of frequency:
- Identity and access. The pilot worked because it had a god-mode service account. Production needs row-level permissions, audit trails, and a path for offboarding leavers. Nobody planned for this. (There’s a sketch of what this looks like just after this list.)
- Data freshness. The demo ran on a snapshot. Real users ask about today, not last month. Wiring up a streaming pipeline is a project of its own.
- Workflow placement. Users won’t visit a new portal to ask the AI a question — they want it where they already are: in Slack, in their CRM, in the document they’re already editing.
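To make the first of these concrete, here’s a minimal sketch of what permission-aware retrieval can look like, as opposed to querying through a shared god-mode account. Every name here (Caller, fetch_context, the store and audit_log objects) is hypothetical; the point is that the end user’s identity travels all the way down to the data layer, and every lookup leaves an audit record.

from dataclasses import dataclass

@dataclass(frozen=True)
class Caller:
    user_id: str
    roles: frozenset

def fetch_context(query: str, caller: Caller, store, audit_log) -> list:
    # The end user's identity goes to the data layer, so row-level rules are
    # enforced where the data lives, not re-implemented inside the assistant.
    results = store.search(query, acting_as=caller.user_id)
    # Every lookup leaves a trail: who asked, what they asked, how much came back.
    audit_log.write({"user": caller.user_id, "query": query, "hits": len(results)})
    return results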
The model is maybe 10% of the cost. The other 90% is plumbing — and that’s where most projects underbudget by a factor of three.
What “production” actually means
When we say a system is production-ready, we mean it has, at minimum:
- A clear data contract — what comes in, what goes out, who owns each.
- Authentication and authorisation that mirror the rest of your stack, not a parallel universe.
- Logs and metrics good enough to debug a regression at 11pm on a Friday (a sketch follows below).
- A rollback path that doesn’t require rebuilding from scratch.
- An owner. A specific, named person whose performance review references the system.
If any of these is missing, you don’t have a production AI system — you have a long-running pilot wearing a costume.
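On the logging point specifically, here’s the kind of per-request record we’d want to see, sketched in Python. The field names are illustrative, not a standard; what matters is that every answer can be traced back to a request id, a model version, and a latency figure.

import json
import time
import uuid

def log_decision(logger, model_version: str, latency_ms: float, decision: dict) -> None:
    # One structured line per request: enough to answer "what changed?" when a
    # regression shows up, without digging through free-text logs.
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),   # correlate with upstream systems
        "ts": time.time(),
        "model_version": model_version,    # which model / prompt produced this
        "latency_ms": round(latency_ms, 1),
        "decision": decision,              # whatever the data contract says comes out
    }))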
A small example
Here’s a sketch of the kind of contract we’d write for a customer-support assistant before any model work begins:
intent: classify-and-route
inputs:
  - message: string (max 4kb)
  - customer_id: uuid
  - channel: enum [email, chat, phone-transcript]
outputs:
  - category: enum [billing, technical, account, other]
  - confidence: float (0..1)
  - escalate: boolean
sla:
  p95_latency_ms: 800
  availability: 99.5
Once that contract is fixed, the model becomes swappable. You can start with a mid-tier model, upgrade later, swap to a self-hosted variant if costs spike — without changing anything downstream. Pilots that skip this step end up with a model and the surrounding code so tightly coupled that swapping anything means a rewrite.
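Here’s a rough sketch, in Python, of what that swappability can look like in practice, assuming the contract above. The Classifier protocol and the baseline implementation are illustrative, not a prescription: downstream routing code depends only on the contract’s shape, so a hosted model, a fine-tuned one, or a self-hosted variant can all sit behind the same interface.

from dataclasses import dataclass
from typing import Literal, Protocol

Channel = Literal["email", "chat", "phone-transcript"]
Category = Literal["billing", "technical", "account", "other"]

@dataclass
class RoutingDecision:
    category: Category
    confidence: float    # 0..1, per the contract
    escalate: bool

class Classifier(Protocol):
    def classify(self, message: str, customer_id: str, channel: Channel) -> RoutingDecision: ...

# Downstream routing code depends only on Classifier, never on a vendor SDK.
# Swapping the model means writing a new implementation of this one method.
class KeywordBaseline:
    def classify(self, message: str, customer_id: str, channel: Channel) -> RoutingDecision:
        if "invoice" in message.lower() or "refund" in message.lower():
            return RoutingDecision(category="billing", confidence=0.6, escalate=False)
        return RoutingDecision(category="other", confidence=0.3, escalate=True)

A useful side effect: a dumb baseline like this lets you test the plumbing end to end before any model is involved, which is exactly the order we’d recommend doing the work in.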
How to spot the pattern early
A few questions we ask in the first week of any AI engagement:
- Where does this live? If the answer is “a new portal,” that’s a yellow flag.
- Who breaks the glass at 2am? If there’s no answer, the pilot will quietly degrade and nobody will notice.
- What does success look like in numbers? If success is measured in vibes (“users seem to like it”), the project will struggle to defend itself at next year’s budget review.
These aren’t AI-specific. They’re the same questions you’d ask of any production system. AI pilots stall, in part, because the AI label gets used as a permission slip to skip them.
The shape of a pilot that ships
The pilots that survive look almost boring from the outside. They scope tightly — one workflow, one team, one metric. They wire into the systems people already use. They have an owner who can describe, in one sentence, what good looks like.
If you’re starting a pilot now, our suggestion is simple: spend the first week on the boring middle. Identity, data contracts, where it lives, who owns it. The model can wait. It almost always can.
If you’re working through any of this, we’d genuinely like to hear about it. Drop us a note — even if you’re not looking to engage, the questions help us write better posts.