The demo worked brilliantly. The LLM answered questions accurately, generated content that impressed the room, and the whole thing was live in a weekend. Three months later the team is still trying to ship it to production, and nobody is quite sure why it's so hard.
We've seen this pattern repeatedly. AI integration in 2026 has a unique failure mode: the gap between "impressive prototype" and "reliable product" is wider than for almost any other kind of software. Understanding why — and what to do about it — is the difference between shipping and spinning your wheels.
Why the Gap Exists
Traditional software is deterministic. Given the same inputs, it produces the same outputs. You can write tests, deploy with confidence, and debug failures methodically. LLMs are probabilistic. The same prompt, called twice, might produce different outputs. That single property breaks almost every assumption that normal software engineering is built on.
Combine that with:
- Latency that's 10–100x higher than a database query
- Cost that scales with usage in ways that can surprise you at the end of the month
- Failure modes that are qualitative, not binary — the model didn't crash, it just said something wrong
- Context window limits that turn "just pass it the whole document" into an architecture problem
- Model drift as providers update models and your carefully tuned prompts degrade silently
…and you have a class of software that requires new engineering discipline rather than just applying the old discipline more carefully.
The Five Mistakes We See Most Often
1. No evals
Evals are to AI what tests are to software. They're a set of inputs with expected or graded outputs that let you measure whether your system is working — and catch regressions when it stops.
Almost every team we work with starts without them. The workflow is: make a change, manually try a few prompts, it seems better, ship it. This works fine when the team is small and the feature is simple. It falls apart at scale — when there are dozens of edge cases, when you want to switch models, when a prompt change that improves one use case quietly breaks another.
Building evals isn't glamorous, but it's the single highest-leverage investment in AI reliability. Even a small set of 20–50 representative test cases, manually graded, changes your iteration speed dramatically. You stop being afraid of changes because you can measure the effect.
2. Treating prompts as code without versioning them like code
Prompts are logic. They determine what your system does, how it handles edge cases, and what guardrails exist. Yet many teams store them as string literals in application code, tweak them in production when something breaks, and have no audit trail of what changed and when.
Prompts should be versioned, reviewed, and tested before deployment — with the same rigour as any other code change. At minimum: store them separately from application code, review changes in pull requests, and run your eval suite before merging prompt changes.
3. Ignoring output structure
LLMs produce text. If your application needs structured data — a JSON object, a list, a yes/no decision — you need to engineer for that explicitly. The naive approach is to ask for JSON and parse the response. This works most of the time. It fails unpredictably, and "most of the time" isn't good enough for a production feature.
The right approaches depend on the model and use case: structured output modes (most major API providers now offer these), JSON schema enforcement, function/tool calling for discrete decisions, or a secondary validation layer that catches and retries malformed responses. Each adds engineering complexity that wasn't visible in the prototype.
4. No cost controls
Token costs are easy to underestimate. In a demo, you're sending small, hand-crafted prompts. In production, users will paste in long documents, trigger features repeatedly, and find use cases you didn't anticipate. We've seen AI features that looked like they'd cost £500/month in testing end up costing £8,000/month in production.
Engineering for cost means: input length limits with user-facing feedback, context window management strategies for long documents, caching for repeated or near-identical queries, model tiering (use a smaller, cheaper model for simple tasks), and monitoring that alerts you when costs spike unexpectedly.
5. No guardrails
What happens when a user asks your customer service bot about something it shouldn't answer? What if someone discovers a prompt injection that makes it say something embarrassing? What if the model hallucinates a fact in a context where that matters?
Guardrails aren't optional for production AI. They include: topic scope enforcement, input validation and sanitisation, output review layers for high-stakes decisions, rate limiting, PII detection, and clear user-facing communication about what the AI can and cannot do. Most of these are straightforward to implement once you've decided they're necessary. The mistake is deciding they're not necessary until after something goes wrong in production.
What a Production-Ready AI Feature Actually Requires
Here's the engineering checklist we work through before shipping any AI-powered feature to production:
- Eval suite — minimum 25–50 test cases covering the core use cases and known edge cases, with a grading mechanism (automated where possible, human-reviewed otherwise)
- Prompt versioning — prompts stored in version control, changes reviewed before deployment
- Structured output strategy — explicit approach to extracting structured data from LLM responses, with error handling for malformed outputs
- Cost monitoring — per-request token tracking, monthly cost projections, alerts for unusual spend
- Input/output guardrails — length limits, topic enforcement, PII handling, content moderation where relevant
- Latency budgets — explicit SLOs for response time, with streaming where latency is user-facing
- Fallback behaviour — what happens when the LLM API is down? Slow? Returns garbage? The feature needs defined degraded behaviour
- Observability — logging of inputs, outputs, latencies, costs, and error rates in a format that lets you debug failures and track quality over time
None of this is exotic. Most of it is the same discipline that good software engineering always required — applied to a new class of component that has different failure modes than the components you're used to.
The Model Isn't the Hard Part
The most common misconception we encounter is that AI integration is primarily a model selection problem. Pick the right model, and everything else follows. In reality, model choice is one of the smaller decisions. GPT-4o, Claude, Gemini, and Llama are all capable of excellent results on most business tasks. The hard part is the engineering around the model: the eval pipeline, the prompt management, the cost controls, the guardrails, and the observability infrastructure that makes the whole thing maintainable.
Teams that focus obsessively on finding the best model and neglect the surrounding engineering end up with a fragile system that they're afraid to touch. Teams that invest in the engineering scaffolding first can swap models in an afternoon and immediately know whether the change was an improvement.
Starting Points That Actually Work
If you're trying to move an existing AI prototype to production, here's the order we'd recommend tackling the engineering work:
- Build evals first. Even a rough set of 20 test cases with manual grading changes everything that comes after.
- Add observability. You can't improve what you can't measure. Log everything from the start.
- Add cost tracking. You'll need it before you think you'll need it.
- Version your prompts. This costs almost nothing and saves you repeatedly.
- Add guardrails. Start with the obvious ones — input length, topic scope, basic output validation.
- Define fallback behaviour. What does the feature do when the LLM fails? Document it and implement it.
The good news: none of this requires months of work. A team that takes a week to add proper evals, observability, and cost tracking to an existing prototype ends up in a dramatically stronger position than one that shipped the prototype and is now firefighting in production.
F5 Dev helps product teams take AI features from prototype to production — including eval pipelines, RAG architectures, and the engineering scaffolding that makes AI features maintainable. If you're working through this challenge, we'd be glad to talk.