Make sure you have Python 3 installed and that the packages listed in Requirements.txt are installed.
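For example, assuming pip is available for the Python 3 environment you plan to use:

```bash
# Install the Python dependencies listed in Requirements.txt
pip3 install -r Requirements.txt
```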
Knowledge graph (KG) dumps can be found here.
NOTE: Last test was run on the 07/27/18 KG dump.
After importing the KG, use a Cypher query (such as `match (n) where n.id =~ "(?i)CHEMBL.*" return n.id limit 5`) to look at the format of the ChEMBL CURIE ids. If you get a result like `CHEMBL.COMPOUND:CHEMBL153`, then you don't have to do anything. If, however, you get a result like `ChEMBL:153`, then you are using an older version of the knowledge graph and need to rename the files source_map_old.csv, target_map_old.csv, ndf_tp_old.csv, and ndf_tn_old.csv. Just remove "_old" from the end of the file names (replacing the other files of the same name) and you will be good to go.
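For example, the renaming can be done from the directory holding those CSVs (a sketch; adjust the paths if your copy keeps them elsewhere):

```bash
# Replace the default csv files with their *_old counterparts (older curie format)
mv source_map_old.csv source_map.csv
mv target_map_old.csv target_map.csv
mv ndf_tp_old.csv ndf_tp.csv
mv ndf_tn_old.csv ndf_tn.csv
```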
node2vec is a program developed by researchers at Stanford to vectorize the nodes in a graph. It can be found here, and information about it is listed here.
The download can be found here: https://skr3.nlm.nih.gov/SemMedDB/download/download.html
Download the PREDICATION table from the above link and place it into the data directory.
NOTE: Don't unzip it.
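A minimal sketch of that step, assuming the download landed in ~/Downloads and the file name contains PREDICATION (the exact name depends on the SemMedDB release, so treat the wildcard as a placeholder):

```bash
# Copy the compressed PREDICATION dump into the data directory without unzipping it
mkdir -p data
cp ~/Downloads/*PREDICATION*.sql.gz data/
```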
There are several variables at the top of the MLDR.sh bash script. They are separated into groups and have descriptions of what they do written above them. You will need to edit some of these (like the Neo4j login credentials and url:port) to fit your local system before running MLDR.sh (see the sketch after this list).

- `py_name` - How you call Python 3 from the terminal in the environment you will be running MLDR.sh in (most likely python or python3).
- `semmed` - The path to the SemMedDB PREDICATION table MySQL dump.
- `neo4j_user` - The username for the Neo4j instance hosting the knowledge graph. (This defaults to neo4j)
- `neo4j_pass` - The password for the Neo4j instance hosting the knowledge graph. (This defaults to neo4j)
- `neo4j_url` - The bolt url and port for access to the Neo4j instance. (This defaults to bolt://localhost:7687)
- `PVAR`, `QVAR`, `EVAR`, `DVAR`, `LVAR`, and `RVAR` - The parameters p, q, e, d, l, & r for node2vec; descriptions of these can be found on GitHub. Further reading about how node2vec works and more in-depth descriptions of what these parameters do can be found here.
- `cutoff` - The number of times a relationship needs to be encountered in SemMedDB for it to be included in the training data. This helps cut down on some of the noise in SemMedDB.
- `roc` - True or False, depending on whether you want a ROC curve generated for the model.
- `data_file` - The path to the data file you want to predict on. A description of how to set this up is in the next section.
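A rough sketch of what that block might look like once edited for a local setup. The grouping, example values, and the SemMedDB file name shown here are assumptions for illustration; keep whatever defaults MLDR.sh ships with unless you have a reason to change them:

```bash
## Python and data locations
py_name="python3"                                  # how Python 3 is invoked in this environment
semmed="data/semmedVER_PREDICATION.sql.gz"         # path to the SemMedDB PREDICATION dump (illustrative name)

## Neo4j connection
neo4j_user="neo4j"
neo4j_pass="neo4j"
neo4j_url="bolt://localhost:7687"

## node2vec parameters (p, q, e, d, l, r)
PVAR=1
QVAR=1
EVAR=5
DVAR=128
LVAR=80
RVAR=10

## Training options
cutoff=2                       # minimum SemMedDB count for a relationship to be kept
roc="True"                     # generate a ROC curve for the model
data_file="data/my_pairs.csv"  # csv of source/target pairs to predict on (see next section)
```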
If you want to make predictions on a data set, format it as a csv with the headers source and target, using CURIE ids (CHEMBL for drugs and DOID/OMIM/HP for diseases) as identifiers, like so:
| source | target |
|---|---|
| ChEMBL:1622 | DOID:399 |
| ChEMBL:730 | OMIM:613658 |
| ChEMBL:1484 | OMIM:601367 |
| ChEMBL:1237022 | DOID:0060407 |
| ChEMBL:504 | DOID:0050275 |
| ChEMBL:714 | DOID:9675 |
| ChEMBL:504 | DOID:0050638 |
| ChEMBL:1909300 | HP:0000421 |
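The same example written as a raw csv (a sketch; my_pairs.csv is just a placeholder name, use whatever path data_file points at):

```bash
# Write the example prediction set as a plain csv with a source,target header
cat > data/my_pairs.csv << 'EOF'
source,target
ChEMBL:1622,DOID:399
ChEMBL:730,OMIM:613658
ChEMBL:1484,OMIM:601367
ChEMBL:1237022,DOID:0060407
ChEMBL:504,DOID:0050275
ChEMBL:714,DOID:9675
ChEMBL:504,DOID:0050638
ChEMBL:1909300,HP:0000421
EOF
```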
Then, if you wish to make a prediction in the same run, simply uncomment the predictor.py commands at the end of MLDR.sh and it will all be done in one go. Otherwise, follow the skeleton of that command and run it in the terminal after MLDR.sh has generated the model.
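Either way, the pipeline is started by running the script (a sketch; for the standalone prediction step, copy the exact commented-out predictor.py invocation from the end of MLDR.sh rather than inventing your own arguments):

```bash
# Train the model; with the predictor.py lines uncommented this also runs the prediction
bash MLDR.sh
```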