Jettro is a software architect, search relevance expert, and data enthusiast who enjoys discussing his job, hobbies, and other topics that inspire people. Jettro truly believes in the Luminis mantra that the only thing that grows by sharing is knowledge. After more than ten years of creating the best search engines for multiple customers, Jettro is very active in the Generative AI domain. He has extensive experience with Retrieval Augmented Generation and AI Agents.
Abstract
Would you let a stranger handle your customer data?
Would you let a new hire talk to a client on their first day?
Would you put your kid in a self-driving car and just say, "Have fun at school"?
Then why do we trust our shiny new AI Agents to behave correctly in production without testing them?
In this talk, we share our journey exploring how to evaluate Agentic Systems before and after deployment. We’ll walk through how to move from “it works in the demo” to trustworthy, observable systems that you can confidently run in production.
We’ll show practical examples of building evaluation pipelines and how we experiment with simple, measurable ways to understand an agent’s behavior over time. We’ll share what we’ve learned so far: where things go wrong, what helps, and what remains an open challenge as we build toward more mature evaluation practices.
Expect real experiences, not just theory. Expect live examples and ideas you can take home to build trust into your own agents.
Key Takeaways
- Why testing AI Agents is different from traditional software testing
- How to design evaluation frameworks that fit your use case
- How to combine offline testing with live production observation
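To make the offline-testing takeaway concrete, here is a minimal sketch of what an evaluation pipeline for an agent could look like. Everything here is hypothetical and not from the talk itself: `run_agent` stands in for a real agent call, and a plain substring match stands in for a proper judge.

```python
def run_agent(question: str) -> str:
    # Placeholder for the real agent call (e.g. an LLM-backed tool loop).
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is included in the Enterprise plan.",
    }
    return canned.get(question, "I don't know.")

# Golden test set: (input, expected substring) pairs collected from real traffic.
CASES = [
    ("What is the refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
    ("Do you ship to the moon?", "don't know"),  # the agent should admit uncertainty
]

def evaluate(cases):
    """Score each case; a substring match stands in for a richer LLM-as-judge check."""
    results = [(q, expected, expected in run_agent(q)) for q, expected in cases]
    pass_rate = sum(ok for _, _, ok in results) / len(results)
    return pass_rate, results

pass_rate, results = evaluate(CASES)
print(f"pass rate: {pass_rate:.0%}")  # track this number per agent version over time
```

The same pass rate, logged per agent version, is the kind of simple, measurable signal that also carries over to observing live production traffic.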
Target Audience
Developers, architects, and AI practitioners who are experimenting with or building agent-based systems and want to learn how to evaluate and test them effectively.
