Web Information Extraction and Retrieval

The web is an almost unlimited source of information. Using search engines such as Google or Bing, we can easily find web pages with possibly relevant information. However, the number of returned pages is usually very large, which makes manual processing impractical. The solution is computer programs that can find and extract relevant information from a possibly very large number of unstructured or semi-structured documents and return the results in structured form.

COURSE GOAL

The main objective of this course is to teach students how to develop programs for web search (including surface web and deep web search) and for extraction of structured data from both static and dynamic web pages. Besides the basic concepts of web search and retrieval, students will learn about relevant techniques and approaches. After successfully completing the course, students will be able to develop programs for automatic web search and structured data extraction from web pages (including search and extraction from online social media).

COURSE CONTENT

The main topics addressed in the course are:

Information Retrieval and Web Search (Basic Concepts of Information Retrieval, Information Retrieval Models, Relevance Feedback, Evaluation Measures, Text and Web Page Pre-Processing, Inverted Index and Its Compression, Latent Semantic Indexing, Web Search, Meta-Search...)
Web Crawling (A Basic Crawler Algorithm, Implementation Issues, Universal Crawlers, Focused Crawlers, Topical Crawlers)
Structured Data Extraction (Wrapper Induction, Instance-Based Wrapper Learning, Automatic Wrapper Generation, String Matching and Tree Matching, Multiple Alignment, Building DOM Trees, Extraction Based on a Single List Page or Multiple Pages...)

REQUIRED KNOWLEDGE

Students are expected to know at least the basics of programming languages and technologies such as Java, JavaScript, Python, HTML, and CSS, as well as web page structure.

COURSE GRADING

To receive a passing grade in this course, students are expected to successfully complete three projects (seminars) and a written examination (at least 50% of all points).