Building Knowledge Networks
The KnetMiner web application requires, as input, a genome-scale knowledge graph (KG). More introductory background on the KnetMiner KGs can be found in our paper Hassani-Pak et al. 2016. Here we provide an overview of the steps involved in building KGs for new species. Knowledge graphs can be created using the Ondex CLI (also known as KnetBuilder). This guide uses the Ondex CLI to build KGs and the Ondex Desktop application to inspect them.
- Software and Hardware Requirements
- Data Requirements
- Building networks from your data
- Homology to Arabidopsis KG
- Troubleshooting
Software and Hardware Requirements
- Linux server with 24 GB RAM or more
- Java 8 (Java 11 if you want to use the snapshot/development version)
- ondex-knet-builder v3.0 download
- Ondex-Desktop download
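Before going further, it is worth confirming the environment. A minimal check, assuming a typical Linux setup (the exact version string varies by Java vendor):

```bash
# Confirm the Java version on the PATH (expect a 1.8.x / Java 8 runtime)
java -version
# Confirm available memory in GB (look for 24 GB or more)
free -g
```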
Data Requirements
We are going to use Solanum tuberosum (potato) as an example organism.
- Potato GFF3 download
- Potato peptide FASTA download - Save as pep.all.fa
- Potato gene-protein mapping download - Save as mapping.txt
- Potato protein domains download - Save as protein_domains.txt
- Potato-Arabidopsis orthologs download
- Arabidopsis KG v45 download (not part of tutorial-data.zip)
Download and unzip ondex-knet-builder; this creates a top-level (root) folder called something like "ondex-mini". Now create a tutorial-data folder anywhere on your file system. You can download the individual datasets into the tutorial-data folder, or download and unzip the tutorial-data.zip bundle (note: the bundle does not contain arabidopsis_45.oxl, which needs to be downloaded separately).
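A minimal sketch of these setup steps; the archive names and paths are assumptions, so adjust them to your actual downloads:

```bash
# Unpack KnetBuilder; this creates the ondex-mini root folder
unzip ondex-knet-builder-3.0.zip

# Create the data folder and unpack the tutorial bundle into it
mkdir -p ~/tutorial-data
cd ~/tutorial-data
unzip ~/Downloads/tutorial-data.zip   # or copy the individual files here

# arabidopsis_45.oxl is not in the bundle and must be placed here separately
```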
Building networks from your data
Ondex contains parsers (or importers) for a range of data formats including FASTA, GFF3, tabular, UniProt-XML, PubMed-XML, OWL, etc. The role of an Ondex parser is to transform the raw data into the graph model using the standardized Ondex metadata. Here we describe how to build a core network of genes, proteins and domains for a particular organism.
Download the GFF3 and protein FASTA files, and unzip them if they are zipped. There is an Ondex parser plugin, called fastagff, that we can use to create a gene-protein network. The parser has the following parameters:
- GFF3 File: Path to GFF3
- Fasta File: Path to peptide FASTA
- Mapping File: Path to a tabular gene-to-protein ID mapping file. Required if protein IDs are not equal to gene_id.x (see the example after this list)
- TaxId [Int]: Taxonomy ID of your organism
- Accession [String]: Cross-reference database (xref)
- DataSource [String]: Data origin (provenance)
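For reference, the mapping file is a two-column, tab-separated table of gene and protein IDs, matching the "Column of the genes"/"Column of the proteins" arguments in the workflow below. A hypothetical excerpt, using IDs that appear in the compara sample later in this guide:

```
PGSC0003DMG400020332	PGSC0003DMT400052378
PGSC0003DMG402020324	PGSC0003DMT400052354
```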
We are now going to create an Ondex workflow file (my_workflow.xml) that instructs Ondex-CLI to run the fastagff parser, export the graph to OXL (Ondex Exchange Language) and produce some basic statistics (XML).
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <!-- Gene-Protein -->
    <Parser name="fastagff">
      <Arg name="GFF3 File">${baseDir}/gff3</Arg>
      <Arg name="Fasta File">${baseDir}/protein_fa</Arg>
      <Arg name="Mapping File">${baseDir}/mapping.txt</Arg>
      <Arg name="TaxId">4113</Arg> <!-- Set to TAXID of your organism -->
      <Arg name="Accession">ENSEMBL-PLANTS</Arg>
      <Arg name="DataSource">ENSEMBL</Arg>
      <Arg name="Column of the genes">0</Arg>
      <Arg name="Column of the proteins">1</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">${baseDir}/kg_1.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
    <Export name="graphinfo">
      <Arg name="ExportFile">${baseDir}/kg_1_stats.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>
To run the workflow, load Java 8, go to the ondex-mini root folder and execute the runme.sh script:
module load Java/1.8.0_192
cd /home/data/knetminer/software/ondex-mini-3.0/
export JAVA_TOOL_OPTIONS="-Xmx8G"
echo $JAVA_TOOL_OPTIONS
./runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"
Note about memory: most datasets will probably require you to give Java more memory (RAM) than the small default set in the launching scripts above. In Bash, this can be done by setting an environment variable in your terminal before running either ondex-mini or Ondex Desktop:
export JAVA_TOOL_OPTIONS="-Xmx8G"
This tells the Java runtime to allocate 8 GB of RAM for Java (and hence, Ondex). Don't set this to more than 80% of the RAM in your computer, because that could make it unstable or even crash it. This instruction is needed every time you start a new Bash session (or only once, if you put it in your Bash configuration file).
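For instance, to make the setting permanent you could append it to your shell configuration; a sketch, assuming Bash and the default ~/.bashrc location:

```bash
# Persist the Java memory setting for all future Bash sessions
echo 'export JAVA_TOOL_OPTIONS="-Xmx8G"' >> ~/.bashrc
```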
Once the workflow has completed, it should create a new file named kg_1.oxl in the folder specified by the OXL exporter. This file can be opened and viewed in Ondex Desktop, where it should produce a graphic like the one shown in the figure below. You can use the Ondex Metagraph and Legend to explore options, browse some useful information and do basic data sanity checks. For example: are the gene and protein counts the same as in the GFF and FASTA files? Are the gene and protein concepts connected via a relation? Search for a few gene names and check that the gene/protein names are correct. You can also check the kg_1_stats.xml report that was generated by the graphinfo exporter.
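As a quick cross-check against the stats report, you can count features in the raw files and compare them with the concept counts in Ondex. A minimal sketch, assuming the inputs are named genes.gff3 and pep.all.fa:

```bash
# Count gene features in the GFF3 (column 3 holds the feature type)
awk -F'\t' '$3 == "gene"' genes.gff3 | wc -l
# Count protein sequences in the peptide FASTA
grep -c '^>' pep.all.fa
```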
If everything looks OK, congratulations, you have your beginner's network of genes connected to the proteins they encode.
Download the protein-domain information from BioMart and choose "Solanum tuberosum". Click on "Features", unselect everything under "Attributes" and select only "Protein stable ID". Open "Protein Domains" and select "InterPro ID", "InterPro short description" and "InterPro description". Under "Filters -> Protein Domains" you can select "Limit to genes with Interpro ID(s)".
The downloaded tabular file should look like this:
| Protein stable ID | InterPro ID | InterPro Short Description | InterPro Description |
|---|---|---|---|
| PGSC0003DMT400092517 | IPR025558 | DUF4283 | Domain of unknown function DUF4283 |
| PGSC0003DMT400092522 | IPR009518 | PSII_PsbX | Photosystem II PsbX |
| PGSC0003DMT400092528 | IPR003105 | SRA_YDG | SRA-YDG |
Fortunately, Ondex has a flexible generic parser for tabular files, called tabParser2, that can be configured via XML. The XML schema can be found here (human-readable version). The tabParser2 configuration for the above protein-domain table could look like this:
<?xml version = "1.0" encoding = "UTF-8" ?>
<parser
  xmlns = "http://www.ondex.org/xml/schema/tab_parser"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <delimiter>\t</delimiter>
  <quote>"</quote>
  <encoding>UTF-8</encoding>
  <start-line>1</start-line>
  <concept id = "prot">
    <class>Protein</class>
    <data-source>ENSEMBL</data-source>
    <accession data-source="ENSEMBL-PLANTS">
      <column index='0' />
    </accession>
  </concept>
  <concept id = "protDomain">
    <class>ProtDomain</class>
    <data-source>ENSEMBL</data-source>
    <name preferred="true">
      <column index='2' />
    </name>
    <accession data-source="IPRO">
      <column index='1' />
    </accession>
    <attribute name="Description" type="TEXT">
      <column index='3' />
    </attribute>
  </concept>
  <relation source-ref="prot" target-ref="protDomain">
    <type>has_domain</type>
  </relation>
</parser>
We are now going to create a new workflow (my_workflow_2.xml) with instructions to parse the tabular file and export to OXL:
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <Parser name="tabParser2">
      <Arg name="InputFile">${baseDir}/protein_domains.txt</Arg>
      <Arg name="configFile">${baseDir}/protein_domains_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">${baseDir}/kg_2.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>
As before, we run Ondex-CLI with the new workflow:
bash runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow_2.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"
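Since the exporter was configured with GZip=true, the resulting .oxl files are gzip-compressed XML. A quick way to peek at one and confirm the export worked (the stdin redirect avoids zcat's file-suffix checks):

```bash
# Print the first lines of the compressed OXL export
zcat < kg_2.oxl | head -n 20
```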
Homology to Arabidopsis KG
Our next goal is to connect our organism's data to a rich knowledge graph for Arabidopsis that can be licensed from Rothamsted.
Download the full Ensembl homologies dataset and filter the rows that contain your species of interest (e.g. solanum_tuberosum) and arabidopsis_thaliana; see the sketch after the sample rows below. Alternatively, you can download smaller subsets using Ensembl BioMart.
An example compara.txt file, along with the matching tabParser2 compara_config.xml, is provided in the tutorial-data.
AT5G01150 AT5G01150.1 arabidopsis_thaliana 29.7405 ortholog_many2many PGSC0003DMG400020332 PGSC0003DMT400052378 solanum_tuberosum 30.9129 NULL NULL NULL 0.00 0 181051656
AT5G01160 AT5G01160.2 arabidopsis_thaliana 48.6111 ortholog_one2one PGSC0003DMG402020324 PGSC0003DMT400052354 solanum_tuberosum 40.3226 NULL NULL NULL 81.34 1 145692276
AT5G01170 AT5G01170.1 arabidopsis_thaliana 49.8239 ortholog_many2many PGSC0003DMG400020347 PGSC0003DMT400052416 solanum_tuberosum 42.9439 NULL NULL NULL 0.00 0 173225540
(etc.)
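A minimal sketch of the filtering step; the dump file name here is an assumption, so adjust it to the file you actually downloaded:

```bash
# Keep only the rows that mention both species of interest
zcat Compara.homologies.tsv.gz \
  | awk -F'\t' '/arabidopsis_thaliana/ && /solanum_tuberosum/' \
  > compara.txt
```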
We want to transform this data into a (Protein)-[ortholog]->(Protein) graph, with some properties added to the homology relation. The tabParser2 configuration for the above tabular file would look like this:
<?xml version = "1.0" encoding = "UTF-8" ?>
<parser
  xmlns = "http://www.ondex.org/xml/schema/tab_parser"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <delimiter>\t</delimiter>
  <quote>"</quote>
  <encoding>UTF-8</encoding>
  <start-line>1</start-line>
  <concept id="protL">
    <class>Protein</class>
    <data-source>EnsemblCompara</data-source>
    <accession data-source="TAIR">
      <column index='1' />
    </accession>
  </concept>
  <concept id="protR">
    <class>Protein</class>
    <data-source>EnsemblCompara</data-source>
    <accession data-source="ENSEMBL-PLANTS">
      <column index='6' />
    </accession>
  </concept>
  <relation source-ref="protL" target-ref="protR">
    <type>ortho</type>
    <evidence>EnsemblCompara</evidence>
    <attribute name="Homology_type" type="TEXT">
      <column index='4' />
    </attribute>
    <attribute name="%Identity_Arabidopsis" type="NUMBER">
      <column index='3' />
    </attribute>
    <attribute name="%Identity_Potato" type="NUMBER">
      <column index='8' />
    </attribute>
  </relation>
</parser>
You can again construct a workflow similar to the protein-domain workflow and create a network with ortholog relations between potato and Arabidopsis.
However, we are going to skip building an individual workflow for the compara data and instead assemble all the previous steps into a single workflow (workflow.xml). This will connect the potato gene-protein-domain information to a pre-integrated Arabidopsis KG that contains many types of information, including publications, phenotypes and GO annotations.
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <!-- Gene-Protein -->
    <Parser name="fastagff">
      <Arg name="GFF3 File">${baseDir}/gff3</Arg>
      <Arg name="Fasta File">${baseDir}/protein_fa</Arg>
      <Arg name="Mapping File">${baseDir}/mapping.txt</Arg>
      <Arg name="TaxId">4113</Arg> <!-- Set to TAXID of your organism -->
      <Arg name="Accession">ENSEMBL-PLANTS</Arg>
      <Arg name="DataSource">ENSEMBL</Arg>
      <Arg name="Column of the genes">0</Arg>
      <Arg name="Column of the proteins">1</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <!-- Protein Domain -->
    <Parser name="tabParser2">
      <Arg name="InputFile">${baseDir}/protein_domains.txt</Arg>
      <Arg name="configFile">${baseDir}/protein_domains_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <!-- Homology -->
    <Parser name="tabParser2">
      <Arg name="InputFile">${baseDir}/compara.txt</Arg>
      <Arg name="configFile">${baseDir}/compara_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <!-- Arabidopsis KG from Rothamsted -->
    <Parser name="oxl">
      <Arg name="InputFile">${baseDir}/arabidopsis_45.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <!-- Mapping -->
    <Mapping name="lowmemoryaccessionbased">
      <Arg name="IgnoreAmbiguity">false</Arg>
      <Arg name="RelationType">collapse_me</Arg>
      <Arg name="WithinDataSourceMapping">true</Arg>
      <Arg name="graphId">default</Arg>
    </Mapping>
    <!-- Collapsing -->
    <Transformer name="relationcollapser">
      <Arg name="CloneAttributes">true</Arg>
      <Arg name="CopyTagReferences">true</Arg>
      <Arg name="graphId">default</Arg>
      <Arg name="RelationType">collapse_me</Arg>
    </Transformer>
    <!-- Export knowledge graph -->
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">${baseDir}/kg-final.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>
The accession-based mapping step creates temporary collapse_me relations between concepts that share accessions (for example, the potato proteins created by the different parsers), and the relationcollapser then merges each group of linked concepts into a single concept. Run the workflow like this:
module load Java/1.8.0_192
export JAVA_TOOL_OPTIONS="-Xmx24G"
echo $JAVA_TOOL_OPTIONS
cd /home/data/knetminer/software/ondex-mini-3.0/
./runme.sh /home/data/knetminer/pub/tutorial-data/workflow.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"
All data and config files used in this workflow are located in the tutorial-data folder, whose path is passed to KnetBuilder (ondex-mini) via baseDir=. To run this workflow you will need Java 8 and 24 GB RAM. The resulting knowledge graph has over a million relations, but it can still be opened in Ondex if enough memory is available. Ondex won't be able to visualise the entire KG, but it can produce some useful information and provides simple search and filter tools for first-pass quality checks of the knowledge graph before it is deployed in KnetMiner for further checks.
The final OXL (in this case named tutorial-data/kg-final.oxl) will be used in the KnetMiner server.
Troubleshooting
Not all of the tools available from the KnetBuilder framework are supported on Windows; a Linux or macOS system is recommended. If you are on Windows, the most straightforward workaround is to use virtual machine software such as VirtualBox or VMware (or even Docker).
For the workflow tools and the desktop application, PowerShell scripts (.ps1) are available to launch them.
For other tools (e.g. the RDF exporter or the KnetMiner initialiser), you might be able to run the available .sh scripts from systems like Cygwin or WSL; however, that isn't supported.
In particular, in Cygwin:
- you might need to fix line endings in these scripts (e.g. via dos2unix)
- you might need to change a .sh script that launches Java, to fix the CLASSPATH format it uses. Add this in the script, after CLASSPATH has been set:
export CLASSPATH="`cygpath --path --windows $CLASSPATH:$mydir:$mydir/lib/*`"
See also:
- RDF Exporter
- Neo4j Exporter
- New Tab/CSV Importer
- BK-Net Ontology
- rdf2neo tool for RDF->Neo4j