A tool to rank a company or institute by EC contribution using the Cordis dataset.
The tool has been tested on Ubuntu 18.04, Windows 10, and macOS Catalina. It requires:
- Python 3.6+
- pandas
- pytest
It's a typical Python 3 setup. Once you have installed Python 3.6+, open a terminal, e.g. in your $HOME directory, and follow these steps:
git clone https://github.com/fabriziomiano/cordis-rank.git
On Ubuntu, install the venv module:
sudo apt install -y python3-venv
On macOS, install the command-line tools and virtualenv:
xcode-select --install
sudo easy_install virtualenv
On Windows, virtualenv is shipped with the Python 3.6+ installer.
Then create a new directory for the virtual environment, e.g. $HOME/.envs/cordis-rank:
mkdir -p ~/.envs/cordis-rank
Assuming you're still in a terminal in your $HOME directory, create and activate the virtual environment. On Ubuntu or macOS:
python3 -m venv ~/.envs/cordis-rank
source ~/.envs/cordis-rank/bin/activate
On Windows:
python3 -m venv cordis-rank
cordis-rank\Scripts\activate.bat
Check that your command prompt now starts with (cordis-rank).
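If the prompt prefix is hidden by your shell theme, the same check can be made from Python itself. The snippet below is a minimal sketch that relies only on the standard library; the helper name in_virtualenv is made up for this example and is not part of the tool.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv or virtualenv."""
    # Older virtualenv releases set sys.real_prefix; the venv module and
    # newer virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

With the environment activated, the reported prefix should point at the directory you created for it.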
Update pip and install the requirements listed in requirements.txt (run the second command from inside the cloned cordis-rank directory):
pip install --upgrade pip
pip install -r requirements.txt
You're now ready to run it
Although the tool accepts user input at run time, the file constants.py
contains a number of constants that can be modified to suit the data in use
or the analysis to carry out.
Some of these parameters are:
- COMPANY_NAME: the name of the company to rank
- ACTIVITY_TYPE_FILTER: the activity type to keep, e.g. "PRC" to consider only private for-profit companies
- APPLY_ACTIVITY_FILTER: boolean; the activity-type filter is applied if True
- BUDGET_COLUMN_NAME: name of the budget / EC contribution column, e.g. "ecContribution"
- INTERESTING_COLUMNS: the list of columns to filter the raw Cordis dataset with
- DEFAULT_LOCAL_DATA_PATH: the default path of the Cordis dataset in this repo
- DEFAULT_URL: the default URL used to download the Cordis H2020 CSV file
- COLUMNS_MAP: a dict used to rename the columns of the processed data frame so that the ranking results print nicely
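To make the list above concrete, here is a hypothetical sketch of what constants.py could look like. The scalar values are copied from the example run in this README; the contents of INTERESTING_COLUMNS and COLUMNS_MAP are assumptions inferred from the column names and table header the tool prints, and the flag name is the one shown in the tool's start-up output. The real file may differ.

```python
# Hypothetical sketch of constants.py, not a copy of the real file.

# Values taken from the example run shown in this README.
COMPANY_NAME = "THE UNIVERSITY OF SUSSEX"
ACTIVITY_TYPE_FILTER = "HES"
APPLY_ACTIVITY_FILTER = True  # name as printed by the tool at start-up
BUDGET_COLUMN_NAME = "ecContribution"
DEFAULT_LOCAL_DATA_PATH = "cordis-h2020organizations.csv"
DEFAULT_URL = "https://cordis.europa.eu/data/cordis-h2020organizations.csv"

# Assumption: the columns kept from the raw dataset are the four
# column names reported in the start-up output.
INTERESTING_COLUMNS = ["name", "activityType", "country", "ecContribution"]

# Assumption: raw column names mapped to the headers of the printed table.
COLUMNS_MAP = {
    "name": "Company / Institute",
    "country": "Country",
    "ecContribution": "EC Contribution",
}
```

If you rename a column in one constant, keep the others consistent: the budget column has to be one of the selected columns, and COLUMNS_MAP should only refer to columns that survive the selection.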
- Navigate to the working copy of the repo you previously cloned
cd cordis-rank
- Run the tool with
python rank.py
An initialization output should show up, saying
config - [INFO] - --------------------------------------------------
config - [INFO] - Initializing with the following configuration
config - [INFO] - Check constants.py to change any of the following
config - [INFO] - --------------------------------------------------
config - [INFO] - COMPANY_NAME: THE UNIVERSITY OF SUSSEX
config - [INFO] - ACTIVITY_TYPE_FILTER: HES
config - [INFO] - APPLY_ACTIVITY_FILTER: True
config - [INFO] - --------------------------------------------------
config - [INFO] - Assuming an input dataset with the following features
config - [INFO] - --------------------------------------------------
config - [INFO] - BUDGET_COLUMN_NAME: ecContribution
config - [INFO] - COMPANY_COLUMN_NAME: name
config - [INFO] - ACTIVITY_COLUMN_NAME: activityType
config - [INFO] - COUNTRY_COLUMN_NAME: country
config - [INFO] - --------------------------------------------------
config - [INFO] - Fallback data sources
config - [INFO] - --------------------------------------------------
config - [INFO] - DEFAULT_URL: https://cordis.europa.eu/data/cordis-h2020organizations.csv
config - [INFO] - DEFAULT_LOCAL_DATA_PATH: cordis-h2020organizations.csv
config - [INFO] - --------------------------------------------------
at the end of which you will be asked whether to download the data or to run on a local CSV file
Read Cordis data_tools from URL? [y/n]: n
This example uses the cordis-h2020organizations.csv file shipped in this repo, so leave the path blank to accept the default
Data file path (default: cordis-h2020organizations.csv):
and you should get the following output
data_tools - [INFO] - Reading data_tools from cordis-h2020organizations.csv
data_tools - [INFO] - Data frame loaded in 0.5 seconds
then, if you set the activity-type filter to True in constants.py, you'll get a message informing you that the filter is being applied
data_tools - [INFO] - Considering only activityType = HES
lastly, you should get the following results
printer - [INFO] - --------------------------------------------------
printer - [INFO] - Ranking:
Rank Company / Institute Country EC Contribution
0 124 THE UNIVERSITY OF SUSSEX UK 43154405.56
printer - [INFO] - --------------------------------------------------
printer - [INFO] - Overall company budget: 43154405.56
printer - [INFO] - Company Ranking: 124 out of 1753
printer - [INFO] - Done
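The ranking above boils down to three steps: optionally keep one activity type, sum the EC contribution per organisation, and sort by that total. The sketch below illustrates the idea on invented toy rows using only the standard library. The tool itself uses pandas and its actual code is not shown in this README, so the function names, the tie handling, and the rank starting at 1 are all assumptions.

```python
from collections import defaultdict

# Toy rows standing in for the Cordis CSV; names and amounts are invented.
ROWS = [
    {"name": "ALPHA UNIVERSITY", "activityType": "HES", "country": "UK", "ecContribution": 300.0},
    {"name": "BETA LABS", "activityType": "PRC", "country": "IT", "ecContribution": 900.0},
    {"name": "GAMMA INSTITUTE", "activityType": "HES", "country": "DE", "ecContribution": 500.0},
    {"name": "ALPHA UNIVERSITY", "activityType": "HES", "country": "UK", "ecContribution": 450.0},
]


def rank_companies(rows, activity_type=None):
    """Return (rank, name, country, total) tuples, largest total first."""
    totals = defaultdict(float)
    countries = {}
    for row in rows:
        # Skip rows of other activity types when a filter is requested.
        if activity_type is not None and row["activityType"] != activity_type:
            continue
        totals[row["name"]] += row["ecContribution"]
        countries[row["name"]] = row["country"]
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (position, name, countries[name], total)
        for position, (name, total) in enumerate(ordered, start=1)
    ]


def find_company(ranking, company_name):
    """Return the ranking entry for company_name, or None if it is absent."""
    for entry in ranking:
        if entry[1] == company_name:
            return entry
    return None


if __name__ == "__main__":
    ranking = rank_companies(ROWS, activity_type="HES")
    print(find_company(ranking, "ALPHA UNIVERSITY"), "out of", len(ranking))
```

Note how the filter changes the result: ALPHA UNIVERSITY is first among HES organisations but second once the PRC row is included, which is why the reported rank depends on the filter settings in constants.py.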
Note: if you choose to read the data from the default Cordis H2020 URL:
https://cordis.europa.eu/data/cordis-h2020organizations.csv
the process may take a while, as pandas needs to download the data. Furthermore, the final results may differ, as the CSV file may have been updated since the copy in this repo was saved.
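The choice between the remote file and a local copy can be summarised as a small decision function. This is a sketch of the behaviour described in this README, not the tool's own code: resolve_data_source is a hypothetical name, and rejecting answers other than y or n is an assumption.

```python
DEFAULT_URL = "https://cordis.europa.eu/data/cordis-h2020organizations.csv"
DEFAULT_LOCAL_DATA_PATH = "cordis-h2020organizations.csv"


def resolve_data_source(read_from_url: str, path_input: str = "") -> str:
    """Map the two prompt answers to the location the data is read from."""
    answer = read_from_url.strip().lower()
    if answer not in ("y", "n"):
        raise ValueError("expected 'y' or 'n', got %r" % read_from_url)
    if answer == "y":
        return DEFAULT_URL
    # A blank path means "use the CSV shipped with the repo".
    return path_input.strip() or DEFAULT_LOCAL_DATA_PATH


if __name__ == "__main__":
    print(resolve_data_source("n"))
    print(resolve_data_source("y"))
```

Answering n and leaving the path blank, as in the example run, resolves to the CSV file in the repo, which keeps the results reproducible.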
That's it!
To run the tests from the root of the repo, e.g. $HOME/cordis-rank, simply run
pytest
Tests may take a while, as the data has to be downloaded twice to run the various fixtures.
Do not forget to rerun the tests if you change any of the configuration parameters in
constants.py