Summary

The web contains semi-structured information encoded in HTML. Because web pages receive constant user interface updates, extracting structured data from them requires a method that does not break when the page layout changes. This project is about building a robust and fast record-level wrapper from a single annotated web page. Current state-of-the-art methods include mining data regions to recognize template-generated areas on a page, probabilistic wrapper induction to robustly extract data from a single data region, and partial tree alignment to repeatedly extract data from multiple regions. In this thesis, we combine these three ideas into a new method and design a system for robust data extraction. Experimental results on a large number of web pages from multiple domains show that the proposed approach achieves high precision within reasonable execution time on commodity hardware.

Refer to status updates for more information about the project.

mantask/thesis-wrapper
