Scrapping-List-of-scientific-journals

A first web scraping project, targeting Wikipedia's 'List of scientific journals' page, chosen because Wikipedia approves of scraping.

Prerequisite

Libraries used

requests  
BeautifulSoup

Both are included in the Anaconda distribution. Otherwise, you have to install them yourself (for example with pip install requests beautifulsoup4).

Other requirements

You need a Python environment.

Because the script scrapes web pages, you need a working internet connection.

TODO

Download

Download the repository through Clone Repository or Download Zip

git clone https://github.com/Clair1234/Scrapping-List-of-scientific-journals.git

Installation

After downloading, open a command prompt ('cmd') and navigate to the project folder:

cd project

Run

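If you are using VS Code, run the script with Ctrl+Alt+N. Alternatively, run scientific_journals.py with your Python interpreter from the command line.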

Description

When you run the project, it tries to go through the Wikipedia page. Two .json files are produced as output:

all_journals.json: contains the hierarchy of journals (currently only one level)
_all_journals_parsed.json: contains all the information gathered from the Wikipedia pages
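
As an illustration of how two such files could be written, here is a minimal sketch using only the standard library. The data shapes, the journal names, and the write_outputs helper are assumptions made for this example; the actual structure produced by scientific_journals.py may differ.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical data shapes, for illustration only.
# A one-level hierarchy: list title -> journal names.
all_journals = {
    'List of scientific journals': ['Example Journal A', 'Example Journal B'],
}
# Per-journal details; None marks a field that was not found.
all_journals_parsed = {
    'Example Journal A': {
        'Discipline': 'Chemistry',
        'Language': 'English',
        'History': None,
        'Publisher': 'Example Press',
        'Frequency': 'Monthly',
    },
}


def write_outputs(directory, hierarchy, parsed):
    """Write both JSON files into `directory` and return their paths by name."""
    directory = Path(directory)
    paths = {}
    outputs = (
        ('all_journals.json', hierarchy),
        ('_all_journals_parsed.json', parsed),
    )
    for name, data in outputs:
        path = directory / name
        with path.open('w', encoding='utf-8') as handle:
            # ensure_ascii=False keeps accented journal names readable.
            json.dump(data, handle, indent=2, ensure_ascii=False)
        paths[name] = path
    return paths


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_outputs(tmp, all_journals, all_journals_parsed)
    reloaded = json.loads(
        paths['_all_journals_parsed.json'].read_text(encoding='utf-8')
    )
```

Reloading the file, as done in the last lines, is a quick way to check that the output is valid JSON.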

To evaluate the program, there is the STATISTICS variable. Every journal page reached from the List of scientific journals is assumed to be built the same way in HTML.

The HTML element used as an anchor on each page is the infobox on the right of the page (see example page). The information gathered is ['Discipline', 'Language', 'History', 'Publisher', 'Frequency']. The STATISTICS variable has the following form:

STATISTICS = {
    'journals_checked': 0,
    'discipline_null': 0,
    'language_null': 0,
    'history_null': 0,
    'publisher_null': 0,
    'frequency_null': 0,
}

where:

journals_checked is the number of journals checked
discipline_null is the number of journals with no Discipline in the infobox
language_null is the number of journals with no Language in the infobox
history_null is the number of journals with no History in the infobox
publisher_null is the number of journals with no Publisher in the infobox
frequency_null is the number of journals with no Frequency in the infobox
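
The infobox lookup and the counters described above could be sketched as follows. This is not the code from scientific_journals.py: the 'infobox' class name, the th/td row layout, the sample HTML, and the parse_infobox and update_statistics helpers are all assumptions made for this example. It parses inline HTML strings so it runs without a network connection.

```python
from bs4 import BeautifulSoup

FIELDS = ['Discipline', 'Language', 'History', 'Publisher', 'Frequency']


def parse_infobox(html):
    """Return {field: text or None} for the fields of interest."""
    soup = BeautifulSoup(html, 'html.parser')
    info = {field: None for field in FIELDS}
    # Assumption: the infobox is a <table> whose classes include 'infobox'.
    infobox = soup.find('table', class_='infobox')
    if infobox is None:
        return info  # no infobox: every field stays None
    for row in infobox.find_all('tr'):
        header, cell = row.find('th'), row.find('td')
        if header is None or cell is None:
            continue  # skip title rows and separators
        label = header.get_text(strip=True)
        if label in info:
            info[label] = cell.get_text(' ', strip=True)
    return info


def update_statistics(statistics, info):
    """Count one checked journal and every field that came back empty."""
    statistics['journals_checked'] += 1
    for field in FIELDS:
        if info[field] is None:
            statistics[field.lower() + '_null'] += 1


# Made-up sample pages: one with a partial infobox, one without any.
SAMPLE_WITH_INFOBOX = """
<table class="infobox hproduct">
  <tr><th>Discipline</th><td>Chemistry</td></tr>
  <tr><th>Language</th><td>English</td></tr>
  <tr><th>Publisher</th><td>Example Press</td></tr>
</table>
"""
SAMPLE_WITHOUT_INFOBOX = '<p>This page has no infobox.</p>'

statistics = {
    'journals_checked': 0,
    'discipline_null': 0,
    'language_null': 0,
    'history_null': 0,
    'publisher_null': 0,
    'frequency_null': 0,
}
for page_html in (SAMPLE_WITH_INFOBOX, SAMPLE_WITHOUT_INFOBOX):
    update_statistics(statistics, parse_infobox(page_html))
print(statistics)
```

A page without an infobox increments all five counters at once, which is why identical counts across fields point to a missing infobox rather than a missing row.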

Output

As of March 1st, 2024, the STATISTICS variable is:

{'journals_checked': 77,
 'discipline_null': 13,
 'language_null': 13,
 'history_null': 13,
 'publisher_null': 13,
 'frequency_null': 13}

From this we can see that, out of the 77 pages checked, 13 have no Discipline, Language, History, Publisher, or Frequency entry in an infobox (the table on the right of the page). On inspection, the pages missing this information have no infobox table at all.
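
To put these counts in proportion, the short sketch below turns each counter into a percentage of the journals checked. The null_shares helper is a hypothetical name introduced for this example; the numbers are the ones reported above.

```python
STATISTICS = {
    'journals_checked': 77,
    'discipline_null': 13,
    'language_null': 13,
    'history_null': 13,
    'publisher_null': 13,
    'frequency_null': 13,
}


def null_shares(statistics):
    """Return each *_null counter as a percentage of journals checked."""
    checked = statistics['journals_checked']
    if checked == 0:
        return {}  # nothing checked yet, so no meaningful percentage
    return {
        key: round(100 * count / checked, 1)
        for key, count in statistics.items()
        if key.endswith('_null')
    }


shares = null_shares(STATISTICS)
print(shares['discipline_null'])  # 16.9
```

All five shares are identical here, which is consistent with the same 13 pages lacking an infobox entirely.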
