The Web contains more than 10 billion indexable web pages, which can be retrieved via search queries. The lecture will (1) present Natural Language Processing (NLP) methods for automatically processing large amounts of unstructured text from the Web and (2) analyze the use of Web data as a resource for other NLP tasks.

Processing of unstructured web content

  • Introduction
  • NLP Basics - Tokenization, Part-of-Speech Tagging, Chunking, Stemming, Lemmatization
  • Web contents and their characteristics - diverse genres of web content, e.g. personal websites, news sites, blogs, forums, wikis
  • Web contents and their characteristics - continued
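
To give a flavor of the NLP basics listed above, the following is a minimal sketch of tokenization followed by stemming. It is an illustration only and not taken from the lecture material: the names `tokenize`, `naive_stem`, and `SUFFIXES` are hypothetical, and the suffix stripping is a deliberately crude stand-in for a real rule-based stemmer such as the Porter stemmer.

```python
import re

# Hypothetical suffix list, for illustration only; real stemmers such as
# Porter's apply ordered rule sets with many more conditions.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("Bloggers posted links to wikis and forums.")
stems = [naive_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Stemming of this kind only chops suffixes and can produce non-words; lemmatization instead maps each token to a dictionary form and usually needs part-of-speech information, which is why the two appear as separate topics.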

NLP applications for the web

  • Information retrieval - introduction to the basics of information retrieval
  • Web information retrieval - natural language interfaces for web information retrieval
  • Question answering (QA): Factoid QA, Knowledge Base QA, Community QA
  • Crowdsourcing
  • Reproducibility
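
As a rough illustration of the information retrieval basics named in the list above, the sketch below builds an inverted index over a few toy documents and ranks them for a query by a simple tf-idf score. The documents, the function names `build_index` and `search`, and the exact weighting (raw term frequency times the natural log of inverse document frequency) are assumptions chosen for brevity, not the scheme used in the lecture.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term occurs in no document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection, invented for this example.
docs = {
    "d1": "news sites publish daily",
    "d2": "forums and blogs host discussions",
    "d3": "wikis and news blogs",
}
index = build_index(docs)
ranking = search(index, len(docs), "news blogs")
print(ranking)
```

Here `d3` ranks first because it is the only document containing both query terms. Whitespace splitting is used to keep the sketch short; a real system would apply proper tokenization and normalization before indexing.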