Large language models have transformed the AI industry and have become part of our everyday lives, used by individuals, businesses and governments. Yet the same qualities that make LLMs stand out also make them notoriously hard to evaluate. Given an LLM, how do we know what it can or cannot do? How do we know whether its behavior holds across different application contexts? What evidence should we accept as proof of an LLM capability, or the lack thereof? What shortcuts should we avoid when thinking and talking about intelligent machines? Against the backdrop of anecdotal evidence, media hype, ungrounded worries and legitimate concerns about LLM capabilities, this year's seminar takes a deep dive into the topic of LLM evaluation. Grounded in work on measurement theory and state-of-the-art LLM evaluation practice, we will explore the basic LLM evaluation paradigms, the fundamental challenges in making robust statements about the behavior of LLMs and other intelligent agents, and potential ways forward.
- Lecturer: Ilia Kuznetsov