DEplain: Web Harvester

We have built an open-source web harvester in Python that downloads parallel documents from a given list of web pages, aligns them, and extracts their text (including paragraphs). We used this web crawler to download the parallel documents of DEplain-web. For reproducibility, we have made the code and the list of web pages available. Please use this code to crawl the web documents with a closed license and so extend the document simplification data of DEplain-web. If you use one of the alignment methods, you can also extend the sentence simplification data of DEplain-web.
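To illustrate the extract-and-align idea, here is a minimal sketch that uses only the Python standard library. It is not the harvester's actual code: the names `ParagraphExtractor`, `extract_paragraphs`, and `align_by_position` are hypothetical, downloading is left out, and pairing paragraphs by position is a simplifying assumption that stands in for the real alignment methods.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def _flush(self):
        # Store the buffered paragraph, collapsing runs of whitespace.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs


def align_by_position(original, simplified):
    """Pair paragraphs by index (an assumption, not the harvester's method)."""
    return list(zip(original, simplified))


if __name__ == "__main__":
    original_html = "<p>Der Antrag ist schriftlich einzureichen.</p>"
    simplified_html = "<p>Sie müssen den Antrag schreiben.</p>"
    pairs = align_by_position(
        extract_paragraphs(original_html),
        extract_paragraphs(simplified_html),
    )
    for complex_text, simple_text in pairs:
        print(complex_text, "=>", simple_text)
```

Positional pairing only works when both versions of a page keep the same paragraph order and count. `zip` silently drops unmatched trailing paragraphs, so real pages generally need a more robust alignment step.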

Installation

You can find instructions on how to install and use the web harvester here: https://github.com/rstodden/data_collection_german_simplification.

License

This code is licensed under the GPL-3.0 license.