A first web scraping project, run on Wikipedia's 'List of scientific journals' page, as Wikipedia approves of scraping.
requests
BeautifulSoup
Both are included in the Anaconda distribution. Otherwise you have to install them yourself (with pip, the packages are requests and beautifulsoup4).
You need a Python environment.
Because the program scrapes web pages, you need a working internet connection.
Download the repository through Clone Repository or Download Zip
git clone https://github.com/Clair1234/Scrapping-List-of-scientific-journals.git
After downloading, open 'cmd' and navigate to the project folder:
cd project
If you are using VS Code, run the project with Ctrl+Alt+N.
Once you run the project, it will go through the Wikipedia page. Two .json files are produced as outputs:
- all_journals.json: contains the hierarchy of journals (here only one level)
- _all_journals_parsed.json: contains all the information gathered on the Wikipedia page
To evaluate the program, there is the STATISTICS variable. Each page in the List of scientific journals is assumed to be built the same way in HTML.
The HTML element used as an anchor on the different pages is the infobox on the right of the page (see example page).
The information gathered is ['Discipline', 'Language', 'History', 'Publisher', 'Frequency'].
The STATISTICS variable has the following form:
STATISTICS = {
'journals_checked':0,
'discipline_null':0,
'language_null':0,
'history_null': 0,
'publisher_null': 0,
'frequency_null': 0,
}
where:
- journals_checked is the number of journals checked
- discipline_null is the number of journals with no Discipline in the infobox
- language_null is the number of journals with no Language in the infobox
- history_null is the number of journals with no History in the infobox
- publisher_null is the number of journals with no Publisher in the infobox
- frequency_null is the number of journals with no Frequency in the infobox
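The extraction and counting described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names parse_infobox and update_statistics, the sample HTML strings, and the assumption that the infobox is a table with class "infobox" whose rows pair a th label with a td value are all hypothetical. It runs on inline HTML, so no network access is needed.

```python
from bs4 import BeautifulSoup

# Fields the README says are gathered from each journal page.
FIELDS = ["Discipline", "Language", "History", "Publisher", "Frequency"]


def parse_infobox(html):
    """Return {field: text or None} read from the page's infobox (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    record = {field: None for field in FIELDS}
    # Assumption: the infobox is a <table> carrying the CSS class "infobox".
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return record  # no infobox at all: every field stays None
    for row in infobox.find_all("tr"):
        label, value = row.find("th"), row.find("td")
        if label is None or value is None:
            continue
        name = label.get_text(" ", strip=True)
        if name in record:
            record[name] = value.get_text(" ", strip=True)
    return record


def update_statistics(stats, record):
    """Count one journal and increment the *_null counter of each missing field."""
    stats["journals_checked"] += 1
    for field in FIELDS:
        if not record.get(field):
            stats[f"{field.lower()}_null"] += 1
    return stats


# Two made-up pages: one with a complete infobox, one without any infobox.
WITH_INFOBOX = """
<table class="infobox hproduct">
  <tr><th>Discipline</th><td>Physics</td></tr>
  <tr><th>Language</th><td>English</td></tr>
  <tr><th>History</th><td>1900 to present</td></tr>
  <tr><th>Publisher</th><td>Example Press</td></tr>
  <tr><th>Frequency</th><td>Monthly</td></tr>
</table>
"""
WITHOUT_INFOBOX = "<p>A page with no infobox table.</p>"

stats = {"journals_checked": 0, **{f"{f.lower()}_null": 0 for f in FIELDS}}
for page in (WITH_INFOBOX, WITHOUT_INFOBOX):
    update_statistics(stats, parse_infobox(page))
print(stats)
```

In this sketch a page without an infobox increments all five counters at once, which is one way the counters can end up with identical values.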
As of March 1, 2024, the STATISTICS variable is:
{'journals_checked': 77,
'discipline_null': 13,
'language_null': 13,
'history_null': 13,
'publisher_null': 13,
'frequency_null': 13}
From this we can see that out of the 77 pages checked, 13 do not have a discipline, language, history, publisher or frequency section in an infobox (the table on the right of the page). On inspection, the pages missing this information do not have an infobox table at all.
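To put these counts in proportion, the share of pages missing each field can be computed from the reported values. The helper name null_rates is hypothetical and is not part of the project; the numbers are the ones listed above.

```python
# Values reported in this README as of March 1, 2024.
STATISTICS = {
    "journals_checked": 77,
    "discipline_null": 13,
    "language_null": 13,
    "history_null": 13,
    "publisher_null": 13,
    "frequency_null": 13,
}


def null_rates(stats):
    """Return the fraction of checked journals missing each field (hypothetical helper)."""
    checked = stats["journals_checked"]
    if checked == 0:
        return {}  # nothing checked yet, so no rate can be computed
    return {
        key: count / checked
        for key, count in stats.items()
        if key.endswith("_null")
    }


for key, rate in null_rates(STATISTICS).items():
    print(f"{key}: {rate:.1%}")  # each line shows 16.9%
```

So roughly one page in six (13 out of 77) has no infobox to read from.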