The course provides an introduction to Multimodal Artificial Intelligence, a research area at the intersection of Computer Vision, Natural Language Processing, and Deep Learning. We will cover approaches to modeling multiple input and output modalities (with an emphasis on text, images, and video), from early methods to modern cutting-edge AI technology. Topics may include:
- Multimodal Architectures, e.g., Joint Embedding Models, Multimodal Transformers, Neural Modular Approaches
- Applications such as Image and Video Description, Visual Question Answering, Text-to-Image Synthesis, Vision-and-Language Navigation, and Multimodal Dialog
- Multimodal Generative Models
- Foundational Multimodal Large Language Models (LLMs)
- Open issues such as Bias, Compositionality, Explainability, and Scaling Laws
- Emergent topics in Multimodal AI
- Lecturer: Anna Rohrbach
- Lecturer: Marcus Rohrbach