Published December 16, 2025 | Version 1.0.0
Presentation | Open Access

From demos to dependability: Practical evaluation strategies for AI applications

  • ORCID: 0000-0001-7113-4370

Description

The advent of Generative AI has transformed how we build AI models and software, but the speed we gain comes at a cost. Large language models (LLMs) let us add "intelligence" to an app in minutes, yet most prototypes stall before production because no one trusts their behaviour. We trace the shift from the traditional "data first, then model" pipeline to today's "app first, then API" workflow and explain why it leaves teams flying blind, often without benchmark datasets or KPIs to judge performance.

The talk presents ways Research Software Engineers (RSEs) can close this gap by making evaluation a first-class citizen. We survey three families of evaluation approaches for AI applications, giving an overview of the field: manual prototypes, data-gathering tools, and synthetic benchmark generation. We then take a deep dive into the latest research and reflect on lessons learned over the past few years. To ground the theory, we walk through concrete examples from our own projects.
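To make "evaluation as a first-class citizen" concrete, the sketch below shows the kind of minimal benchmark harness an RSE might wrap around an LLM-backed prototype. It is an illustrative example, not material from the talk: call_llm stands in for whatever model API the application actually uses, and the two benchmark items and the keyword-overlap metric are placeholders for a real dataset and KPI.

# Minimal, illustrative evaluation harness for an LLM-backed application.
# `call_llm`, the benchmark items, and the keyword metric are placeholders,
# not the datasets, metrics, or tools discussed in the talk.

from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    prompt: str
    expected_keywords: list[str]  # crude proxy for a "correct" answer


def call_llm(prompt: str) -> str:
    """Placeholder for the application's real model call (e.g. an LLM API)."""
    return "A Research Software Engineer builds trustworthy software for research."


def score(item: BenchmarkItem, answer: str) -> float:
    """Fraction of expected keywords that appear in the answer (a simple KPI)."""
    hits = sum(kw.lower() in answer.lower() for kw in item.expected_keywords)
    return hits / len(item.expected_keywords)


def evaluate(benchmark: list[BenchmarkItem]) -> float:
    """Run the application over the benchmark and report the mean score."""
    scores = [score(item, call_llm(item.prompt)) for item in benchmark]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    benchmark = [
        BenchmarkItem("What does RSE stand for?", ["research", "software", "engineer"]),
        BenchmarkItem("Name one risk of shipping an unevaluated LLM app.", ["trust"]),
    ]
    print(f"Mean benchmark score: {evaluate(benchmark):.2f}")

Even a toy harness like this gives a prototype a number that can be tracked over time; loosely speaking, the three families surveyed in the talk differ mainly in where such benchmark items come from, whether hand-written, harvested by data-gathering tools, or generated synthetically.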

Attendees will leave with practical insights for matching an evaluation strategy to project maturity, regulatory constraints, and available resources. Evaluation has become the critical bottleneck, and RSEs are uniquely placed to turn it into an enabler for reproducible, sustainable science.


A recording of this session is available on YouTube: https://youtu.be/dCHzTfT1zw8

Files (1.3 MB)

slides_117_-_Roman_Luca_Wixinger.pdf (1.3 MB)
md5:4d1b80c5525ccdef80a7d983e16b6987