The web contains semi-structured information in HTML. Extracting structured data from a web page whose user interface changes constantly requires a method that does not break when the page layout is updated. This project addresses the problem of building a robust and fast record-level wrapper from a single annotated web page. Current state-of-the-art methods include mining data regions to recognize template-generated areas on a page, probabilistic wrapper induction to extract data robustly from a single data region, and partial tree alignment to repeatedly extract data from multiple regions. In this thesis, we combine these three ideas into a new method and design a system for robust data extraction. Experimental results on a large number of web pages from multiple domains show that the proposed approach achieves high precision within reasonable execution time on commodity hardware.
Refer to status updates for more information about the project.