GitHub - ksylva/WikiExtractorG6_2021: Wikipedia tables extractor

Scraping tables on Wikipedia (2020-2021)

This project is coursework for students in the first year of the MIAGE master's programme (M1 MIAGE, classic track) at ISTIC, University of Rennes 1.

Context of the project

This project scrapes tables from Wikipedia, the universal and multilingual encyclopedia. It was carried out by Group 6 of M1 MIAGE (classic track) during the 2020-2021 academic year.

Scraping Wikipedia?

As the name suggests, this project extracts tables from Wikipedia pages.

Process

1. Access Wikipedia pages through their URLs
2. Collect and extract the tables (simple or nested)
3. Write their contents to CSV files

Note: We only process Wikipedia pages in HTML format.
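
Step 3 of the process can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's own converter (which relies on pandas). The function name `rows_to_csv` and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of cell strings) to CSV text.

    Hypothetical helper for illustration; cells containing commas or
    quotes are quoted automatically by the csv module.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Sample data invented for the example.
    table = [["City", "Remark"], ["Rennes", "capital, Brittany"]]
    print(rows_to_csv(table), end="")
```

To write a real file instead of a string, the same `csv.writer` can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.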

Objectives

* Make the work simple and accessible to everyone by using the Python programming language
* Develop scraping methods
* Test these methods

Current functionality

The software reads a file (wikiurls.txt) containing a list of Wikipedia page titles and builds the URL of each HTML page by adding the https://en.wikipedia.org/wiki/ prefix to the title. After checking that the URL is valid, it processes the full HTML code of each page and tries to extract as many tables as possible to CSV.
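
The title-to-URL step can be sketched as follows. This is a minimal illustration, not the code in main.py: the function names `title_to_url` and `read_titles` are hypothetical, and we assume that spaces in titles are replaced by underscores, as Wikipedia does in its page URLs.

```python
from urllib.parse import quote

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def title_to_url(title):
    """Build a page URL from a Wikipedia page title (hypothetical helper)."""
    # Assumption: spaces become underscores, as in Wikipedia page URLs.
    normalized = title.strip().replace(" ", "_")
    # Percent-encode the remaining characters that are not URL-safe.
    return WIKI_PREFIX + quote(normalized, safe="/:()")


def read_titles(lines):
    """Yield the non-empty titles found in an iterable of lines."""
    for line in lines:
        title = line.strip()
        if title:
            yield title


if __name__ == "__main__":
    # In the project the lines would come from the titles file;
    # a literal list keeps this example self-contained.
    sample = ["Comparison of web browsers\n", "\n", "Python_(programming_language)\n"]
    for page_title in read_titles(sample):
        print(title_to_url(page_title))
```

Checking that each URL actually answers requires a network request and is left out of this sketch.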

We only process tables marked with class="wikitable".
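
The class filter can be illustrated with the standard library HTML parser. This is a sketch under our own naming (`WikitableCounter` and `count_tables` are hypothetical), not the project's extractor, which uses Beautiful Soup. With Beautiful Soup the same selection is commonly written `soup.find_all("table", class_="wikitable")`.

```python
from html.parser import HTMLParser


class WikitableCounter(HTMLParser):
    """Count tables that carry the "wikitable" class and those that do not."""

    def __init__(self):
        super().__init__()
        self.wikitables = 0
        self.other_tables = 0

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if "wikitable" in classes:
            self.wikitables += 1
        else:
            self.other_tables += 1


def count_tables(html):
    """Return (wikitable count, other table count) for an HTML string."""
    counter = WikitableCounter()
    counter.feed(html)
    counter.close()
    return counter.wikitables, counter.other_tables


if __name__ == "__main__":
    page = (
        '<table class="wikitable sortable"></table>'
        '<table class="infobox"></table>'
        "<table></table>"
    )
    print(count_tables(page))
```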
During the conversion to CSV, when a cell has both a colspan and a rowspan, the pandas library splits the merged cell and copies its content into every row and column it covered.
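
The effect of splitting a merged cell can be shown with a small pure-Python sketch. It is not pandas' implementation, only an illustration of the resulting grid. The function name `expand_spans` and the input format, a list of rows made of `(text, rowspan, colspan)` tuples, are assumptions made for the example.

```python
def expand_spans(rows):
    """Expand merged cells into a rectangular grid of strings.

    rows: list of rows, each a list of (text, rowspan, colspan) tuples.
    The text of a merged cell is copied into every position it covers.
    """
    grid = {}  # (row index, column index) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are left empty.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


if __name__ == "__main__":
    table = [
        [("A", 2, 2), ("B", 1, 1)],  # "A" spans 2 rows and 2 columns
        [("C", 1, 1)],
        [("D", 1, 1), ("E", 1, 1), ("F", 1, 1)],
    ]
    for line in expand_spans(table):
        print(line)
```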
When a table contains a link, we keep it after the conversion. For each image in a table, we collect the link of the image and put it in the corresponding cell in place of the image.
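
One possible way to handle links and images inside a cell is sketched below with the standard library HTML parser. This is not the project's code. The names `CellTextParser` and `cell_to_text` are hypothetical, and the output layout (link target in parentheses after the link text, image replaced by its `src`) is an assumption chosen for the example.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of one table cell, keeping link targets and image sources."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            # The image is replaced by its link.
            self.parts.append(attributes["src"])
        elif tag == "a":
            self._href = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            # Assumed layout: the link target follows the link text.
            self.parts.append(f"({self._href})")
            self._href = None

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def cell_to_text(cell_html):
    """Return the CSV-ready text for the inner HTML of one cell."""
    parser = CellTextParser()
    parser.feed(cell_html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(cell_to_text('<a href="/wiki/Rennes">Rennes</a>'))
    print(cell_to_text('<img src="//upload.wikimedia.org/flag.png" alt="flag"> France'))
```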

Functionality to develop

Generate URLs from URIs (Wikipedia page names).
Correctly analyze all tables by improving the extractor.
Speed up the detection of the different tables.
Find a better approach for the extraction of tables and their conversion to CSV.

Technologies used

Git - The distributed version control system used.
IntelliJ IDEA + Python plugin - The IDE mainly used by our team.
Beautiful Soup - The Python library used to parse HTML.
unittest - The unit test framework used.
pandas - The Python library used for the conversion to CSV.
Word - The document editor used to create specifications.

Authors

Sylvanus KONAN, Jean-Théodore ESSOH, Mariama TAHA, Ange SIBOMANA, Bénédicte AHOUA

