Loading BioKG in Neo4j #4

Open
DimitrisAlivas opened this issue Feb 3, 2022 · 2 comments
@DimitrisAlivas
Contributor

Hey folks,

First of all, I'd like to thank you for this contribution. Having a unified biomedical KG is an essential resource for research in this domain.

I would like to use BioKG in my work. Specifically, we would like to train a link predictor to perform the task of drug-target interaction prediction and utilise the benchmarks you so thoughtfully include, in order to compare the performance of our DTI approach vs others.

For this, I thought it would be useful to load the final BioKG data (in /data/biokg/) into a Neo4j property graph store, to enable querying for specific benchmarks (using hyper-relations, for example: a relation DTI with qualifier (benchmark: 'FDA') or DDI with qualifier (benchmark: 'MINERAL')). Furthermore, having BioKG as a Neo4j-ready graph could increase usability and visibility, so I plan on making it public once I manage to get it done.
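To make the qualifier idea concrete, here is a minimal sketch of how such a benchmark lookup could be phrased. It assumes the qualifier is stored as a relationship property named `benchmark` and that every node carries an `id` property; both names, and the helper `benchmark_query`, are illustrative assumptions and not part of BioKG.

```python
def benchmark_query(rel_type, benchmark):
    """Build a Cypher query and parameter map for one benchmark-tagged relation.

    Relationship types cannot be passed as query parameters in classic Cypher,
    so the type is validated and formatted into the query text, while the
    benchmark value travels as a regular parameter.
    """
    if not (rel_type.isascii() and rel_type.isidentifier()):
        raise ValueError(f"unsafe relationship type: {rel_type!r}")
    query = (
        f"MATCH (s)-[r:{rel_type}]->(t) "
        "WHERE r.benchmark = $benchmark "        # 'benchmark' property is an assumption
        "RETURN s.id AS source, t.id AS target"  # 'id' property is an assumption
    )
    return query, {"benchmark": benchmark}


query, params = benchmark_query("DTI", "FDA")
print(query)
print(params)
```

With the official Neo4j Python driver, the pair could then be passed on as `session.run(query, params)`.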

The 2 issues I'm facing:

  1. The number of unique entities/relations that I see after loading the .tsv data in Pandas differs from the numbers reported in the paper, so I've been looking into what could have gone wrong.
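For reference, the kind of count I mean can be sketched as below. This assumes a header-less, three-column head/relation/tail TSV; the function name and the sample identifiers are made up for illustration, and the same numbers can of course be obtained with Pandas.

```python
import csv
import io


def count_entities_and_relations(tsv_file):
    """Count distinct entities, relation types, and triples in a head/relation/tail TSV."""
    entities, relations, triples = set(), set(), set()
    # QUOTE_NONE keeps stray quote characters in identifiers from merging fields.
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        head, relation, tail = (field.strip() for field in row)
        entities.update((head, tail))
        relations.add(relation)
        triples.add((head, relation, tail))  # duplicate rows collapse here
    return {
        "entities": len(entities),
        "relation_types": len(relations),
        "triples": len(triples),
    }


# Tiny in-memory sample with made-up identifiers; for a real file,
# pass open(path, newline="") instead.
sample = io.StringIO(
    "P1\tPROTEIN_DISEASE\tM1\n"
    "P1\tPROTEIN_DISEASE\tM1\n"  # duplicate row
    "P2\tPROTEIN_DISEASE\tM1\n"
    "P1\tMEMBER_OF_COMPLEX\tC1\n"
)
stats = count_entities_and_relations(sample)
print(stats)
```

Counting distinct triples separately from rows helps tell apart a genuine change in the sources from duplicated lines in the files.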

  2. The way I create the Neo4j graph is as follows:

  • Load all entity types from metadata + properties.
  • Get the unique IDs and use them to create nodes with Cypher.
  • Load the links.
  • Match on the already created nodes by node_id and, if both subject and object match, create the link.
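The steps above can be sketched roughly as follows. This is not my exact code: the node property `id`, the row keys (`id`, `props`, `source`, `target`), and the helper names are assumptions for illustration. The statements are built as plain strings so they can be inspected before anything is sent to the database.

```python
import re

# Labels and relationship types cannot be query parameters in classic Cypher,
# so they are validated before being formatted into the statement text.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name):
    if not _NAME.match(name):
        raise ValueError(f"unsafe Cypher name: {name!r}")
    return name


def node_query(label):
    """Statement that creates (or reuses) one node per row and sets its properties."""
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{_checked(label)} {{id: row.id}}) "
        "SET n += row.props"
    )


def link_query(source_label, rel_type, target_label):
    """Statement that links two existing nodes; rows without both endpoints create nothing."""
    return (
        "UNWIND $rows AS row "
        f"MATCH (s:{_checked(source_label)} {{id: row.source}}) "
        f"MATCH (t:{_checked(target_label)} {{id: row.target}}) "
        f"MERGE (s)-[:{_checked(rel_type)}]->(t)"
    )


def batches(rows, size=1000):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


print(node_query("Protein"))
print(link_query("Protein", "MEMBER_OF_COMPLEX", "Complex"))
```

With the official Neo4j Python driver, each batch could be sent as `session.run(node_query("Protein"), rows=batch)`. Because the link statement uses `MATCH` for both endpoints, a link whose subject or object node was never created is silently skipped, which is the behaviour described in the last step.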

Following the logic above, everything runs smoothly up to the point where I try to load the links that include COMPLEXES + PATHWAYs, for which I cannot find any matching nodes.

If I understand the data model correctly, complex_ids exist only as part of the LINKS file and do not appear in the properties + metadata files (?).

Which identifiers are the ones that I should use to create the unique Complex nodes?
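One way to check whether these IDs really are absent from the node set is to count, per relation, how many link endpoints have no matching node. The sketch below assumes links are (head, relation, tail) tuples and that `known_ids` is the set of IDs collected from the metadata and properties files; the function name and all identifiers in the sample are made up.

```python
from collections import Counter


def unmatched_endpoints(links, known_ids):
    """Return per-relation counts of endpoints without a node, plus the missing IDs."""
    counts = Counter()
    missing_ids = set()
    for head, relation, tail in links:
        if head not in known_ids:
            counts[(relation, "head")] += 1
            missing_ids.add(head)
        if tail not in known_ids:
            counts[(relation, "tail")] += 1
            missing_ids.add(tail)
    return counts, missing_ids


# Made-up sample: the complex ID "C1" appears only in the links.
known_ids = {"P1", "P2", "M1"}
links = [
    ("P1", "PROTEIN_DISEASE", "M1"),
    ("P1", "MEMBER_OF_COMPLEX", "C1"),
    ("P2", "MEMBER_OF_COMPLEX", "C1"),
]
counts, missing_ids = unmatched_endpoints(links, known_ids)
for (relation, side), n in sorted(counts.items()):
    print(relation, side, n)
print(sorted(missing_ids))
```

If all unmatched endpoints fall under the complex and pathway relations, that would support the reading that those IDs live only in the links file.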

Apologies for the lengthy post and for potential inaccuracies on my end.

Minor comment:
A typo I found while reading your documentation:

<uniprot_acc> MEMBER_OF_COMPLEX <mesh_id>

The relation should be PROTEIN_DISEASE if I'm not mistaken.

Thank you again for your great contribution! I would greatly appreciate any help :-)

Cheers!

@samehkamaleldin
Contributor

Hi Dimitris,

Thanks a lot for the great description and the details in this issue. It has been a good year since I last touched this project, and I have changed jobs a few times since then. However, I want to provide you with some support on the issues you have.

I am going to give you a very lazy answer now and probably in a few days I can look at this more carefully and give you a better answer.

In relation to issue 1, have you tried using the ready-made KG located in the releases section? It should be the same as in the paper.
My quick guess is that this problem is caused by a change in a source dataset. I vaguely remember, for example, that DrugBank and Reactome made some changes after we published, and this changed the output of our script relative to what was reported in the paper.
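To narrow down which source changed, the release files and a fresh build could be compared per relation type. The following is only a sketch under the assumption that both versions are available as (head, relation, tail) triples; the function name and the sample identifiers are hypothetical.

```python
from collections import Counter


def relation_count_diff(old_triples, new_triples):
    """Return {relation: new_count - old_count} for relations whose distinct-triple count changed."""
    old = Counter(relation for _, relation, _ in set(old_triples))
    new = Counter(relation for _, relation, _ in set(new_triples))
    return {
        relation: new[relation] - old[relation]
        for relation in sorted(set(old) | set(new))
        if new[relation] != old[relation]
    }


# Made-up triples standing in for the release and for a fresh build.
release = [
    ("P1", "PROTEIN_DISEASE", "M1"),
    ("P1", "MEMBER_OF_COMPLEX", "C1"),
]
rebuilt = [
    ("P1", "PROTEIN_DISEASE", "M1"),
    ("P2", "PROTEIN_DISEASE", "M1"),
]
diff = relation_count_diff(release, rebuilt)
print(diff)
```

Relations with a non-zero difference point to the sources worth checking first.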

I did not fully understand issue 2, so I will look at it again later and try to give you an answer.

In relation to the typo, thanks for noticing that. It is a small thing, I know, but would you be so kind as to fix it and open a pull request? I will accept it immediately.

Thanks a lot.
Sameh

@samehkamaleldin samehkamaleldin self-assigned this Feb 8, 2022
@DimitrisAlivas
Contributor Author

Hey Sameh,

Thank you for taking the time and for your answer! I think it makes a lot of sense given how frequently many of the integrated data sources are updated (e.g. DrugBank, as you mentioned).

I followed the instructions in the repository's README to compile BioKG, in order to make sure it includes the latest versions of the sources. I will also check the version in the releases, as you suggest. The goal here is to get the most accurate, and hence most up-to-date, graph for our experiments.

Regarding issue 2, it is related to the semantics of pathways and complexes in relation to proteins, because they affect the way I convert the data for Neo4j. Thanks for taking some more time to look into it.

Best,
Dimitrios
