
What I Learned Putting AI Agents in Production

Lessons from building an AI-powered product — function calling, prompt management, and the gap between demos and production systems.

Written by Alinus Dumitrana

Last year I helped build an AI travel assistant from scratch. Within a week it looked impressive in demos; getting it production-ready took months. Here's what I wish I'd known earlier.

Demos Lie

The first version was a straightforward prompt-in, text-out setup. You'd ask it to plan a trip, and it would generate a plausible-sounding itinerary. The problem: it was making things up. Hotel names that didn't exist, flights on routes that had been cancelled, restaurants that had closed years ago.

This is the fundamental gap with LLM-based products. The model is great at generating plausible text, but you need accurate text. Bridging that gap is where most of the engineering work happens.

Function Calling Changed Everything

The breakthrough was OpenAI's function calling API. Instead of asking the model to generate an itinerary as text, we gave it access to actual data sources through typed function definitions. The model would decide what to look up, and our code would do the actual looking up.

User: "Plan a 3-day trip to Lisbon in April"

Model decides to call: search_hotels(city="Lisbon", dates="2026-04-01..2026-04-03")
Our code: queries real hotel API, returns results
Model decides to call: search_activities(city="Lisbon", category="sightseeing")
Our code: queries activity database, returns results
Model: assembles itinerary from real data

This pattern — the model as an orchestrator, not a data source — is what makes AI products actually usable. The model is good at understanding intent and structuring a response. The data comes from sources you trust.
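The orchestration loop can be sketched roughly like this. It's a minimal sketch, not our actual implementation: the tool names follow the trace above, the data sources are stubbed dicts standing in for real APIs, and the model call itself is omitted so the dispatch logic stands alone.

```python
# Sketch of the orchestrator pattern: the model chooses a tool and
# arguments; our code does the real lookup against trusted sources.

def search_hotels(city, dates):
    # In production this would query a real hotel API.
    return [{"name": "Hotel Exemplo", "city": city, "dates": dates}]

def search_activities(city, category):
    # In production this would query an activity database.
    return [{"name": "Tram 28 ride", "city": city, "category": category}]

# The model never fetches data itself; it only picks from this registry.
TOOLS = {"search_hotels": search_hotels, "search_activities": search_activities}

def dispatch(tool_call):
    """Execute a model-requested tool call against trusted data sources."""
    name, args = tool_call["name"], tool_call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**args)

# A call as the model might emit it for "Plan a 3-day trip to Lisbon in April":
result = dispatch({"name": "search_hotels",
                   "arguments": {"city": "Lisbon", "dates": "2026-04-01..2026-04-03"}})
```

The unknown-tool check matters: models occasionally invent tool names, and failing loudly is better than silently returning nothing.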

Prompt Management Is a Real Problem

By the time we launched, we had dozens of prompts across different flows. Each one had been iterated on, tested, and tuned. Managing them was painful.

What worked for us:

  • Prompts as versioned files — not hardcoded strings in the codebase. Treat them like configuration.
  • A/B testing prompts — small wording changes can dramatically affect output quality. You need a way to test this.
  • Regression testing — when you change a prompt, run it against a set of known inputs and compare outputs. LLMs aren't deterministic, but you can catch obvious regressions.
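A regression check along those lines can start very small. The sketch below assumes a loose textual-similarity comparison and a made-up threshold; your real harness will depend on your stack and on what "obviously wrong" means for your prompts.

```python
# Sketch of a prompt regression check: run a changed prompt against known
# inputs and flag outputs that drift too far from recorded baselines.
# LLM output isn't deterministic, so compare loosely, not byte-for-byte.
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough textual similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def check_regressions(run_prompt, baselines, threshold=0.8):
    """run_prompt: callable sending one known input through the prompt.
    baselines: {input: previously approved output}.
    Returns the inputs whose new output drifted below the threshold."""
    failures = []
    for known_input, approved_output in baselines.items():
        new_output = run_prompt(known_input)
        if similarity(new_output, approved_output) < threshold:
            failures.append(known_input)
    return failures

# A stub standing in for a real model call, so the harness is testable:
fake_model = lambda text: text.upper()
assert check_regressions(fake_model, {"plan a trip": "PLAN A TRIP"}) == []
```

Run it in CI on every prompt change; even a crude similarity metric catches the worst regressions.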

What didn't work:

  • Letting the model "figure it out" with a vague system prompt. The more specific and constrained the prompt, the better the output.
  • Chaining too many model calls — latency adds up fast. Users notice when your app takes 15 seconds to respond.

Error Handling Is Different

Traditional error handling is straightforward: the function either succeeds or throws an error. With LLMs, there's a third category: the function succeeds but the output is wrong.

You need validation layers that check the model's output before passing it to the user. For our travel assistant, this meant:

  • Verifying that referenced hotels actually exist in our database
  • Checking that suggested flights are on valid routes
  • Ensuring dates in the itinerary match what the user asked for

This validation code is boring to write but essential. Without it, you'll ship hallucinations to users.
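The validation layer can be a plain set of checks over the structured itinerary before it ever reaches the user. The field names and lookup sets below are illustrative, not our actual schema; in production the sets were database queries.

```python
# Sketch of a post-generation validation layer: reject model output that
# references entities we can't verify. Field names are illustrative.

KNOWN_HOTELS = {"Hotel Exemplo"}   # stands in for a database lookup
VALID_ROUTES = {("LHR", "LIS")}    # stands in for a flight-route table

def validate_itinerary(itinerary, requested_dates):
    """Return a list of problems; an empty list means the itinerary passes."""
    problems = []
    for hotel in itinerary.get("hotels", []):
        if hotel not in KNOWN_HOTELS:
            problems.append(f"unknown hotel: {hotel}")
    for origin, dest in itinerary.get("flights", []):
        if (origin, dest) not in VALID_ROUTES:
            problems.append(f"invalid route: {origin}->{dest}")
    if itinerary.get("dates") != requested_dates:
        problems.append("dates do not match the user's request")
    return problems

good = {"hotels": ["Hotel Exemplo"], "flights": [("LHR", "LIS")],
        "dates": "2026-04-01..2026-04-03"}
assert validate_itinerary(good, "2026-04-01..2026-04-03") == []
```

When validation fails, you can retry the model call with the problems appended to the prompt, or fall back to a safe response; either beats shipping the hallucination.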

Cost Matters More Than You Think

GPT-4 calls are expensive when you're making multiple per user request. We spent real time optimizing which calls used GPT-4 (complex reasoning) vs GPT-3.5 (simple extraction). Token counting, response caching, and prompt compression became part of our regular optimization work.
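Both optimizations can start as very small functions. In this sketch the task categories are placeholders for whatever your flows actually distinguish, and the cache is an in-memory dict standing in for the PostgreSQL-backed response cache mentioned below.

```python
# Sketch of cost-aware model routing plus response caching.
# Task categories and model names are placeholders.

CHEAP, EXPENSIVE = "gpt-3.5-turbo", "gpt-4"
REASONING_TASKS = {"itinerary_planning", "conflict_resolution"}

def pick_model(task):
    """Route complex reasoning to the expensive model, everything else cheap."""
    return EXPENSIVE if task in REASONING_TASKS else CHEAP

_cache = {}

def cached_call(model, prompt, call_fn):
    """Reuse a prior response for an identical (model, prompt) pair.
    call_fn stands in for the actual API call."""
    key = (model, prompt)
    if key not in _cache:
        _cache[key] = call_fn(model, prompt)
    return _cache[key]
```

Caching only works for identical prompts, so it pays off most on repeated extraction calls, not free-form user requests.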

The Stack

For anyone building something similar, here's what we landed on:

  • Spring Boot for the API server — it handled the orchestration between model calls and data sources
  • OpenAI function calling for the AI layer — not raw completions
  • React frontend with streaming responses — users see the itinerary being built in real-time
  • PostgreSQL for data persistence and caching model responses

Takeaway

AI products are 20% model integration and 80% everything else — data validation, error handling, prompt management, cost optimization, and the infrastructure to support it all. The "AI" part is the easy part. Making it reliable is the real work.
