From demos to dependability: Practical evaluation strategies for AI applications
Description
The advent of Generative AI has transformed how we build AI models and software, but the speed we gain comes at a cost. Large language models (LLMs) let us add "intelligence" to an app in minutes, yet most prototypes stall before production because no one trusts their behaviour. We trace the shift from the traditional "data first, then model" pipeline to today's "app first, then API" workflow and explain why it leaves teams flying blind, often without benchmark datasets or KPIs to judge performance.
The talk presents ways Research Software Engineers can close this gap by making evaluation a first-class citizen. We survey three families of evaluation for AI applications, namely manual prototypes, data-gathering tools, and synthetic benchmark generation, to give an overview of the field. We then dive into the latest research and reflect on lessons learned over the past few years. To ground the theory, we walk through concrete examples from our own projects.
Attendees will leave with practical guidance on matching an evaluation strategy to project maturity, regulatory constraints, and available resources. Evaluation has become the critical bottleneck, and RSEs are uniquely placed to turn it into an enabler for reproducible, sustainable science.
A recording of this session is available on YouTube: https://youtu.be/dCHzTfT1zw8
Files
slides_117_-_Roman_Luca_Wixinger.pdf (1.3 MB, md5:4d1b80c5525ccdef80a7d983e16b6987)