Convert SSSOM TSV to nodes and edges CSV files that can be ingested by neo4j-admin import.
To build:
mvn clean package
To run, assuming you have a mappings file called mappings.sssom.tsv:
java -jar target/sssom2neo-1.0-SNAPSHOT.jar \
--input mappings.sssom.tsv \
--output-edges edges.csv \
--output-nodes nodes.csv
You can also run it over a directory containing many mapping files, such as the OLS SSSOM dataset:
java -jar target/sssom2neo-1.0-SNAPSHOT.jar \
--input ./mappings/ \
--output-edges edges.csv \
--output-nodes nodes.csv
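For orientation, the kind of transformation involved can be sketched in a few lines of Python. This is only an illustration, not the code in this repo: the header names `id:ID`, `:START_ID`, `:END_ID` and `:TYPE` are the standard neo4j-admin import conventions, but the columns this tool actually writes are an assumption here, and the function names and sample row are hypothetical.

```python
import csv
import io

# Hypothetical one-row SSSOM file. Lines starting with '#' form the
# optional metadata block that precedes the TSV table.
SAMPLE = (
    "# curie_map:\n"
    "#   MONDO: http://purl.obolibrary.org/obo/MONDO_\n"
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "MONDO:0005015\toboInOwl:hasDbXref\tUMLS:C0011849\tsemapv:UnspecifiedMatching\n"
)


def sssom_to_rows(sssom_text):
    """Return (node_rows, edge_rows), each including a header row."""
    # Drop the commented metadata block and blank lines before parsing.
    data_lines = [
        line for line in sssom_text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.DictReader(data_lines, delimiter="\t")

    node_ids = set()
    edges = []
    for row in reader:
        subject, predicate, obj = (
            row["subject_id"], row["predicate_id"], row["object_id"]
        )
        node_ids.update((subject, obj))
        edges.append([subject, obj, predicate])

    # Assumed layout: one node per distinct CURIE, one edge per mapping.
    node_rows = [["id:ID"]] + [[node_id] for node_id in sorted(node_ids)]
    edge_rows = [[":START_ID", ":END_ID", ":TYPE"]] + edges
    return node_rows, edge_rows


def to_csv(rows):
    """Serialise rows to CSV text with Unix line endings."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    nodes, edges = sssom_to_rows(SAMPLE)
    print(to_csv(nodes))
    print(to_csv(edges))
```

An `id:ID` header column would be stored as an `id` property on each node, which is consistent with the `a.id` lookups used in the Cypher queries further down.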
Now you have two files, nodes.csv and edges.csv.
Let's load them into Neo4j! Assuming you already have Docker installed, we can do this quite easily. We will populate a new folder called neo with our Neo4j database. First we use neo4j-admin to import the CSV files:
docker run \
-v $(pwd)/neo:/data \
-v $(pwd)/nodes.csv:/mnt/nodes.csv \
-v $(pwd)/edges.csv:/mnt/edges.csv \
neo4j:4.4.20-community \
neo4j-admin import --force --database=neo4j --array-delimiter="u+0000" --nodes=/mnt/nodes.csv --relationships=/mnt/edges.csv
If everything worked correctly, the neo folder should now contain a Neo4j database populated with the SSSOM mappings from nodes.csv and edges.csv generated by the code in this repo. We can now start Neo4j:
docker run \
-v $(pwd)/neo:/data \
-p 7474:7474 \
-p 7687:7687 \
--env=NEO4J_AUTH=none \
neo4j:4.4.20-community
Hit up http://localhost:7474 to go forth and cypher!
This query returns all mappings to/from MONDO:0005015 (diabetes mellitus). Note that the pattern (a)<-[mapping]->(b) matches relationships in either direction, so both outgoing mappings (defined by MONDO) and incoming mappings (defined by other ontologies) are included in the results.
MATCH (a)<-[mapping]->(b) WHERE a.id="MONDO:0005015" RETURN *
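The same query can also be run from a script instead of the browser. The sketch below assumes the official `neo4j` Python driver is installed (`pip install neo4j`) and that the container started above is listening on bolt://localhost:7687 with authentication disabled. The function names are hypothetical, and the term ID is passed as a query parameter rather than pasted into the query string.

```python
# Same pattern as the query above, with the term ID as a parameter.
QUERY = "MATCH (a)<-[mapping]->(b) WHERE a.id = $id RETURN *"


def mapped_ids(records):
    """Collect the distinct ids of the 'b' nodes from result records."""
    return sorted({record["b"]["id"] for record in records})


def fetch_mapped_ids(term_id, uri="bolt://localhost:7687"):
    """Run QUERY against a local Neo4j instance (assumes NEO4J_AUTH=none)."""
    # Imported here so the helper above can be used without the driver.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=None) as driver:
        with driver.session() as session:
            # Records are consumed while the session is still open.
            return mapped_ids(session.run(QUERY, id=term_id))


if __name__ == "__main__":
    for mapped in fetch_mapped_ids("MONDO:0005015"):
        print(mapped)
```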
We can also follow mappings to an arbitrary depth, e.g. to search for mappings up to 3 hops away (the lower bound of 0 means the starting node itself is also matched as b):
MATCH (a)<-[mapping*0..3]->(b) WHERE a.id="MONDO:0005015" RETURN *
This result set includes transitive mappings, e.g. MONDO:0005015 -hasDbXref-> UMLS:C0011849 <-hasDbXref- ORDO:101952 -hasDbXref-> UMLS:C0011860.
Therefore UMLS:C0011860 (Type 2 diabetes mellitus) is included in the result set. Note that this is a more specific term than the one we started with! This is a consequence of the weak semantics of hasDbXref, and a good example of why ontologies should use richer mapping metadata.
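To see why the depth bound matters, here is a small Python sketch that walks the three mappings from the chain above while ignoring direction, much as the (a)<-[mapping*0..3]->(b) pattern does. It is a plain breadth-first search, not a reimplementation of Cypher's path matching, although for this example it reaches the same set of nodes. The edge list contains only the mappings named in the text, and the function name is hypothetical.

```python
from collections import deque

# (subject, object) pairs for the hasDbXref chain described above.
EDGES = [
    ("MONDO:0005015", "UMLS:C0011849"),
    ("ORDO:101952", "UMLS:C0011849"),
    ("ORDO:101952", "UMLS:C0011860"),
]


def reachable(start, edges, max_depth):
    """Map each node within max_depth hops of start to its hop count."""
    # Build an undirected adjacency map, since mappings are followed
    # in both directions.
    neighbours = {}
    for subject, obj in edges:
        neighbours.setdefault(subject, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subject)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth bound
        for nxt in neighbours.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


if __name__ == "__main__":
    for term, hops in sorted(reachable("MONDO:0005015", EDGES, 3).items()):
        print(hops, term)
```

With a bound of 3 the more specific term UMLS:C0011860 is reached at the third hop; with a bound of 2 the walk stops at ORDO:101952 and the Type 2 term is not returned.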