VaidhyaMegha/vaidhyamegha-knowledge-graphs

Introduction

This repo will host open knowledge graphs from VaidhyaMegha.

Open Knowledge Graph on Clinical Trials

VaidhyaMegha has built an open knowledge graph on clinical trials.

  • This repository contains the source code along with instructions to generate and use this knowledge graph.
  • More information, including references, is available in the article and also here.

Knowledge graph for technical decision making

VaidhyaMegha is building an open knowledge graph on technical decision making.

  • This repository contains the source code along with instructions to use this periodically curated knowledge graph.
  • More information, including references, is available in the article and also here.

Getting Started

  • Pre-requisite steps

    • Create a folder named 'lib'. Download the algs4.jar file from here and place it in the 'lib' folder.
    • Download the hypergraphql jar file from here and place it in the 'lib' folder.
    • Download the 'vocabulary_1.0.0.ttl' file from here and place it in the 'data/open_knowledge_graph_on_clinical_trials' folder.
    • Download mesh2022.nt.gz from here and unzip it. Place the mesh2022.nt file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
    • Download PheGenI from here and place the PheGenI_Association_full.tab file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
    • Download detailed_CoOccurs_2021.txt.gz from here and unzip it. Place the detailed_CoOccurs_2021.txt file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
      • Generate the detailed_CoOccurs_2021_selected_fields.txt and detailed_CoOccurs_2021_selected_fields_sorted.txt files using the following commands. The commands write both files directly into the 'data/open_knowledge_graph_on_clinical_trials' folder.
      cut -d '|' -f1,9,15 data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021.txt > data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields.txt
      
      sort -u  data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields.txt > data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields_sorted.txt
      
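To make the effect of the cut and sort commands above easier to see, here is a minimal Python sketch of the same field selection and de-duplication. The function name `select_fields` is hypothetical and the sample rows are made-up placeholders, not real MRCOC data; the ordering matches `sort -u` only under a plain byte-order locale.

```python
def select_fields(lines, indices=(1, 9, 15), sep="|"):
    """Mimic `cut -d '|' -f1,9,15` followed by `sort -u` on an iterable of lines."""
    selected = set()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # cut uses 1-based field numbers; fields that do not exist are skipped
        picked = [fields[i - 1] for i in indices if i <= len(fields)]
        selected.add(sep.join(picked))
    # sort -u: unique lines in sorted order (code-point order here)
    return sorted(selected)


# Made-up rows with 15 pipe-separated placeholder fields (not real MRCOC data).
row_a = "|".join(f"a{i}" for i in range(1, 16))
row_b = "|".join(f"b{i}" for i in range(1, 16))
print(select_fields([row_b, row_a, row_a]))  # ['a1|a9|a15', 'b1|b9|b15']
```

The duplicate row collapses to a single entry, which is why the sorted file can be much smaller than the selected-fields file.
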
  • To compile and package

    mvn clean package assembly:single -DskipTests
    
  • To build RDF

    java -jar -Xms4096M -Xmx8192M target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar
    
  • To query using SPARQL

    java -jar -Xms4096M -Xmx8144M target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar -m cli -q src/main/sparql/1_count_of_records.rq
    ...
    Results:
    -------- 
    5523173^^http://www.w3.org/2001/XMLSchema#integer
    
  • To query using GraphQL (via HyperGraphQL)

    java -cp "target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar:lib/*" com.vaidhyamegha.data_cloud.kg.App -m server
    
    • From Postman, with an N-Triples response
    • From Postman, with a JSON response
    • In a separate terminal, execute a GraphQL query using curl (alternatively, use Postman)
      $ curl --location --request POST 'http://localhost:8080/graphql' --header 'Accept: application/ntriples' --header 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8,kn;q=0.7' --header 'Content-Type: application/json' --data-raw '{"query":"{\n  trial_GET(limit: 30, offset: 1) {\n    label\n  }\n \n}","variables":{}}'
      <https://www.who.int/clinical-trials-registry-platform/EUCTR2007-006072-11-SE> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://vaidhyamegha.com/open_kg/ct> .
      <https://www.who.int/clinical-trials-registry-platform/EUCTR2007-006072-11-SE> <http://www.w3.org/2000/01/rdf-schema#label> "EUCTR2007-006072-11-SE"^^<http://www.w3.org/2001/XMLSchema#string> .
      <https://clinicaltrials.gov/ct2/show/NCT02954757> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://vaidhyamegha.com/open_kg/ct> .
      <https://clinicaltrials.gov/ct2/show/NCT02954757> <http://www.w3.org/2000/01/rdf-schema#label> "NCT02954757"^^<http://www.w3.org/2001/XMLSchema#string> .
      <https://www.who.int/clinical-trials-registry-platform/EUCTR2014-005525-13-FI> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://vaidhyamegha.com/open_kg/ct> .
      <https://www.who.int/clinical-trials-registry-platform/EUCTR2014-005525-13-FI> <http://www.w3.org/2000/01/rdf-schema#label> "EUCTR2014-005525-13-FI"^^<http://www.w3.org/2001/XMLSchema#string> .
      <https://clinicaltrials.gov/ct2/show/NCT02721914> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://vaidhyamegha.com/open_kg/ct> .
      <https://clinicaltrials.gov/ct2/show/NCT02721914> <http://www.w3.org/2000/01/rdf-schema#label> "NCT02721914"^^<http://www.w3.org/2001/XMLSchema#string> .
      ...
      <http://hypergraphql.org/query> <http://hypergraphql.org/query/trial_GET> <https://www.who.int/clinical-trials-registry-platform/EUCTR2016-002461-66-IT> .
      <http://hypergraphql.org/query> <http://hypergraphql.org/query/trial_GET> <https://www.who.int/clinical-trials-registry-platform/CTRI/2020/08/027368> .
      <http://hypergraphql.org/query> <http://hypergraphql.org/query/trial_GET> <https://www.who.int/clinical-trials-registry-platform/EUCTR2013-001294-24-DE> .
      
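The N-Triples response shown above can be post-processed with a few lines of code. The following is a minimal sketch that pulls the `rdfs:label` value for each trial out of the response text. The function name `extract_labels` is hypothetical, and the regular expression is a simplification that does not handle escaped quotes; a full N-Triples parser would be more robust.

```python
import re

# Matches: <subject> <http://www.w3.org/2000/01/rdf-schema#label> "value"...
LABEL_TRIPLE = re.compile(
    r'^<(?P<subject>[^>]+)>\s+'
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
    r'"(?P<label>[^"]*)"'
)


def extract_labels(ntriples_text):
    """Return {trial IRI: label} for every rdfs:label triple in the text."""
    labels = {}
    for line in ntriples_text.splitlines():
        match = LABEL_TRIPLE.match(line.strip())
        if match:
            labels[match.group("subject")] = match.group("label")
    return labels


# Two lines copied from the sample response above.
sample = """\
<https://clinicaltrials.gov/ct2/show/NCT02954757> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://vaidhyamegha.com/open_kg/ct> .
<https://clinicaltrials.gov/ct2/show/NCT02954757> <http://www.w3.org/2000/01/rdf-schema#label> "NCT02954757"^^<http://www.w3.org/2001/XMLSchema#string> .
"""
print(extract_labels(sample))
```

Lines that carry other predicates, such as the `rdf:type` triples, are simply ignored.
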

Features as of the current release (v0.9)

Summary: Using any trial ID from across the globe, find the associated diseases/interventions, research articles and genes. Also discover relationships between medical topics through co-occurrences in articles. Query the graph using SPARQL from the CLI, or using GraphQL from any API client tool, e.g. Postman or curl.

Feature list :

  • The knowledge graph can be queried through the GraphQL API using any API client tool, e.g. curl or Postman.
  • The graph includes trials from across the globe. Data is sourced from WHO's ICTRP and clinicaltrials.gov.
  • Links from each trial to the MeSH vocabulary are added for the conditions and interventions employed in the trial.
  • Links from each trial to PubMed articles are added. PubMed's experts curate this metadata for each article.
  • MRCOC is added to the graph for the selected articles linked to clinical trials.
  • PheGenI links are added, i.e. links from phenotype to genotype, represented as links between MeSH DUI and GeneID.
  • SPARQL query execution is added, along with a CLI mode and a sample count query for demonstration.
  • Five co-existing bipartite graphs together comprise this knowledge graph: trial --> condition, trial --> intervention, trial --> article, article --> MeSH DUI, and gene ID --> MeSH DUI.

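To illustrate how the bipartite graphs in the feature list combine, here is a small sketch that walks from a trial to genes through shared MeSH terms. The predicate names follow the sample triples in the release notes, but every identifier in `EDGES` is made up and the function names are hypothetical. Note that the sample triples in the release notes use different IRI forms for MeSH terms, so a real query may need to normalise them before joining.

```python
from collections import defaultdict

# Toy edges (subject, predicate, object); all identifiers are made up.
EDGES = [
    ("trial:T1", "Condition", "mesh:M1"),
    ("trial:T1", "Intervention", "mesh:M2"),
    ("trial:T1", "Pubmed_Article", "article:A1"),
    ("article:A1", "MeSH_DUI", "mesh:M3"),
    ("gene:G1", "Gene", "mesh:M1"),
    ("gene:G2", "Gene", "mesh:M3"),
]


def build_index(edges):
    """Index edges in both directions for quick lookups."""
    forward = defaultdict(set)   # (subject, predicate) -> objects
    backward = defaultdict(set)  # (object, predicate) -> subjects
    for s, p, o in edges:
        forward[(s, p)].add(o)
        backward[(o, p)].add(s)
    return forward, backward


def genes_for_trial(trial, edges):
    """Genes reachable via the trial's conditions or its articles' MeSH terms."""
    forward, backward = build_index(edges)
    mesh_terms = set(forward[(trial, "Condition")])
    for article in forward[(trial, "Pubmed_Article")]:
        mesh_terms |= forward[(article, "MeSH_DUI")]
    genes = set()
    for term in mesh_terms:
        genes |= backward[(term, "Gene")]
    return sorted(genes)


print(genes_for_trial("trial:T1", EDGES))  # ['gene:G1', 'gene:G2']
```
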
Changes in this release : Server mode of execution is added.

Release notes

  • v0.9
    • Added server mode of execution
    java -cp "target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar:lib/*" com.vaidhyamegha.data_cloud.kg.App -m server
    
  • v0.8
    • Enable GraphQL interface to the knowledge graph using HyperGraphQL
    java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -jar lib/hypergraphql-3.0.1-exe.jar --config src/main/resources/hql-config.json
    
  • v0.7
    • Enable SPARQL queries
      $ cat src/main/sparql/1_count_of_records.rq 
      SELECT (count(*) as ?count)
      where { ?s ?p ?o}
    
      $ sparql --data=data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt --query=src/main/sparql/1_count_of_records.rq
      -----------
      | count   |
      ===========
      | 4766048 |
      -----------
    
      $ wc -l data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt 
      4766048 data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
    
  • v0.6.1
    • Externalize the Entrez API invocation threshold probability
    • Patch for the issue below, where a tab character in a trial ID produced an invalid IRI
      $ sparql --data=data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt --query=src/main/sparql/example.rq
      04:33:04 ERROR riot            :: [line: 1085476, col: 71] Bad character in IRI (Tab character): <https://www.who.int/clinical-trials-registry-platform/SLCTR/2020/014[tab]...>
      Failed to load data
    
      $ grep "SLCTR/2020/014" data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt 
      <https://www.who.int/clinical-trials-registry-platform/SLCTR/2020/014	> <TrialId> "SLCTR/2020/014\t" .
    
    
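The failure above comes from a tab character that ended up inside the trial's IRI. As an illustration only (this is not the actual v0.6.1 patch), the sketch below strips surrounding whitespace from a raw trial ID and checks an IRI for characters that N-Triples does not allow between angle brackets. The function names are hypothetical; the base IRI is taken from the sample triples.

```python
ICTRP_BASE = "https://www.who.int/clinical-trials-registry-platform/"


def clean_trial_id(raw_id):
    """Strip surrounding whitespace (including tabs) from a raw trial ID."""
    return raw_id.strip()


def has_forbidden_iri_chars(iri):
    """True if the IRI contains characters N-Triples disallows inside <...>."""
    forbidden = set('<>"{}|^`\\')
    # Control characters and space (code points 0x00-0x20) are also disallowed.
    return any(ch in forbidden or ord(ch) <= 0x20 for ch in iri)


raw = "SLCTR/2020/014\t"
print(has_forbidden_iri_chars(ICTRP_BASE + raw))                  # True
print(has_forbidden_iri_chars(ICTRP_BASE + clean_trial_id(raw)))  # False
```
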
  • v0.6
    • Added PheGenI links i.e. links from phenotype to genotype as links between MeSH DUI and GeneID.
    <https://www.ncbi.nlm.nih.gov/gene/10014> <Gene> <http://id.nlm.nih.gov/mesh/2022/T046007> .
    <https://www.ncbi.nlm.nih.gov/gene/10014> <GeneID> "10014" .
    <https://www.ncbi.nlm.nih.gov/gene/6923> <Gene> <http://id.nlm.nih.gov/mesh/2022/T032324> .
    <https://www.ncbi.nlm.nih.gov/gene/6923> <GeneID> "6923" .
    <https://www.ncbi.nlm.nih.gov/gene/3198> <Gene> <http://id.nlm.nih.gov/mesh/2022/T032324> .
    <https://www.ncbi.nlm.nih.gov/gene/3198> <GeneID> "3198" .
    
  • v0.5
    • Adding MRCOC to the graph for the selected articles linked to clinical trials.
    <https://pubmed.ncbi.nlm.nih.gov/20926522> <MeSH_DUI> <https://meshb.nlm.nih.gov/record/ui?ui=D064451> .
    <https://pubmed.ncbi.nlm.nih.gov/17404119> <MeSH_DUI> <https://meshb.nlm.nih.gov/record/ui?ui=D008297> .
    <https://pubmed.ncbi.nlm.nih.gov/17404119> <MeSH_DUI> <https://meshb.nlm.nih.gov/record/ui?ui=D006801> .
    
  • v0.4
    • Trial IDs are incrementally queried against the Entrez API to generate the necessary incremental mappings between trials and PubMed articles
    $ grep "Pubmed_Article" data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt 
    <https://clinicaltrials.gov/ct2/show/NCT00400075> <Pubmed_Article> "25153486" .
    <https://clinicaltrials.gov/ct2/show/NCT03934957> <Pubmed_Article> "34064657" .
    
  • v0.3
    • Added links between trials and interventions, in addition to trials and conditions.
    • Conditions and interventions are fetched from the database (instead of files). Corresponding edges between trials and conditions, and between trials and interventions, are added to the RDF. For example:
      <https://clinicaltrials.gov/ct2/show/NCT00093782> <Condition> <http://id.nlm.nih.gov/mesh/2022/T000687> .
      <https://clinicaltrials.gov/ct2/show/NCT00093782> <Intervention> <http://id.nlm.nih.gov/mesh/2022/T538652> .
    
    • All global trials (756,169) are added to the RDF. For example:
    <https://clinicaltrials.gov/ct2/show/NCT00172328> <TrialId> "NCT00172328" .
    <https://www.who.int/clinical-trials-registry-platform/CTRI/2021/05/033487> <TrialId> "CTRI/2021/05/033487" .
    
    • Starting with a fresh model for the final RDF. MeSH IDs that are not linked to any trial are not considered. This reduces the graph size considerably.
    • Trial records are fetched from ICTRP's weekly + periodic full export and AACT's daily + monthly full snapshot.
    • Trials are written down to a file (will be used later) : vaidhyamegha_clinical_trials.csv
      $ wc -l vaidhyamegha_clinical_trials.csv
      755272 vaidhyamegha_clinical_trials.csv
    
    • Download the RDF from here.
  • v0.2
    • Clinical trials are linked to the RDF nodes corresponding to the MeSH terms for conditions. For example :
    • Download the enhanced RDF from here.

Documentation

More information, including references, is available in the article and also here.

Prequels to this project

VaidhyaMegha's prior work on

  • clinical trial registries data linking.
  • symptoms to diseases linking.
  • phenotype to genotype linking.
  • trials to research articles linking.

The last three are covered in the "examples" folder here. They were earlier covered in separate public repos here.

Next steps

  • Complete article
  • Use the full list of trial IDs in combination with the id_information table to generate a final list of unique trials using the WQUPC (weighted quick-union with path compression) algorithm.
  • Add secondary trial IDs to the graph (this may increase graph size considerably). However, it could be of utility.
  • Build a SPARQL + GraphQL version of the API to allow direct querying of the graph. Provide some reasonable examples that are harder in SQL.
  • SNOMED CT, ICD-10.
  • Host the knowledge graph on Neo4j's cloud service, AuraDB.
  • Use Neo4j's GraphQL API from Postman to demonstrate sample queries on clinical trials.
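
The WQUPC step mentioned in the list above could look roughly like the sketch below: pairs of IDs that refer to the same trial are merged with a weighted quick-union structure with path compression, and each resulting group represents one unique trial. This is a Python illustration under assumptions, not the project's implementation; the class and function names are hypothetical, the ID pairs are made up, and the layout of the id_information table is not assumed here.

```python
class WeightedQuickUnion:
    """Weighted quick-union with path compression over hashable IDs."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, item):
        # Register unseen items as their own root.
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Weighting: attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]


def unique_trials(id_pairs):
    """Group IDs that are linked, directly or transitively, into unique trials."""
    uf = WeightedQuickUnion()
    for primary, secondary in id_pairs:
        uf.union(primary, secondary)
    groups = {}
    for trial_id in list(uf.parent):
        groups.setdefault(uf.find(trial_id), set()).add(trial_id)
    return sorted(sorted(group) for group in groups.values())


# Made-up ID pairs: A1, B1 and C1 name the same trial; A2 stands alone.
pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "A2")]
print(unique_trials(pairs))  # [['A1', 'B1', 'C1'], ['A2']]
```
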