
Building Knowledge Networks


The KnetMiner web application requires, as input, a genome-scale knowledge graph (KG). More introductory background on the KnetMiner KGs can be found in our paper Hassani-Pak et al. 2016. Here we provide an overview of the steps involved in building KGs for new species. Knowledge graphs can be created using the Ondex CLI (also known as KnetBuilder). This guide uses the Ondex CLI to build KGs and the Ondex Desktop application to inspect them.

Summary

Software and Hardware Requirements

  • Linux server with 24 GB RAM or more
  • Java 8 (Java 11 if you want to use the snapshot/development version)
  • ondex-knet-builder v3.0 download
  • Ondex-Desktop download

Data Requirements

We are going to use Solanum tuberosum (potato) as an example organism.

  • Potato GFF3 download
  • Potato peptide FASTA download - Save as pep.all.fa
  • Potato gene-protein mapping download - Save as mapping.txt
  • Potato protein domains download - Save as protein_domains.txt
  • Potato-Arabidopsis orthologs download
  • Arabidopsis KG v45 download (not part of tutorial-data.zip)

Download and unzip ondex-knet-builder, which will create a top-level (root) folder named something like "ondex-mini". Now create a tutorial-data folder anywhere on your file system. You can download the individual datasets into the tutorial-data folder, or download and unzip the tutorial-data.zip bundle (note: it does not contain arabidopsis_45.oxl, which needs to be downloaded separately).
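If you prefer the command line, here is a minimal sketch for fetching the bundle (the URL is a placeholder; substitute the actual tutorial-data.zip link from the list above):

# Create the working folder and fetch the tutorial bundle
mkdir -p tutorial-data && cd tutorial-data
# Placeholder URL -- use the tutorial-data.zip download link given above
wget -O tutorial-data.zip "https://example.org/tutorial-data.zip"
unzip tutorial-data.zip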

Building networks from your data

Ondex contains parsers (or importers) for a range of data formats including FASTA, GFF3, Tabular, UniProt-XML, Pubmed-XML, OWL etc. The role of an Ondex parser is to transform the raw data into the graph model using the standardized Ondex metadata. Here we describe how to build a core network of genes, proteins and domains for a particular organism.

Gene-Protein

Download the GFF3 and protein FASTA files, and unzip them if they are zipped. There is an Ondex parser plugin, called fastagff, that we can use to create a gene-protein network.

Fastagff parser parameters:

  • GFF3 File: Path to GFF3
  • Fasta File: Path to peptide FASTA
  • Mapping File: Path to a tabular gene-to-protein ID mapping file. Required if protein IDs do not follow the gene_id.x pattern (see the excerpt after this list)
  • TaxId [Int]: Taxonomy ID of your organism
  • Accession [String]: Cross-reference database (xref)
  • DataSource [String]: Data origin (provenance)
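For reference, the mapping file is a plain two-column, tab-separated list of gene and protein IDs (columns 0 and 1, as configured in the workflow below). A hypothetical excerpt using potato IDs:

PGSC0003DMG400020332	PGSC0003DMT400052378
PGSC0003DMG402020324	PGSC0003DMT400052354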

We are now going to create an Ondex workflow file (my_workflow.xml) that instructs the Ondex CLI to run the fastagff parser, export the graph into OXL (Ondex Exchange Language) and create some basic stats (XML).

<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <!-- Gene-Protein -->
    <Parser name="fastagff">
      <Arg name="GFF3 File">${baseDir}/gff3</Arg>
      <Arg name="Fasta File">${baseDir}/protein_fa</Arg>
      <Arg name="Mapping File">${baseDir}/mapping.txt</Arg>
      <Arg name="TaxId">4113</Arg>	<!-- Set to TAXID of your organism -->
      <Arg name="Accession">ENSEMBL-PLANTS</Arg>
      <Arg name="DataSource">ENSEMBL</Arg>
      <Arg name="Column of the genes">0</Arg>
      <Arg name="Column of the proteins">1</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>

    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">${baseDir}/kg_1.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>

    <Export name="graphinfo">
      <Arg name="ExportFile">${baseDir}/kg_1_stats.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>

To run the workflow, load Java 8, go to the ondex-mini root folder and execute runme.sh:

module load Java/1.8.0_192
cd /home/data/knetminer/software/ondex-mini-3.0/
export JAVA_TOOL_OPTIONS="-Xmx8G"
echo $JAVA_TOOL_OPTIONS

./runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"

Note about memory: most datasets will probably require that you tell Java to use more memory (RAM) than the small default set in the launch scripts above. In Bash, you can do this by setting an environment variable in your terminal before running either ondex-mini or Ondex Desktop:

export JAVA_TOOL_OPTIONS="-Xmx8G"

This tells the Java runtime to allocate 8 GB of RAM for Java (and hence for Ondex). Don't set this to more than 80% of the RAM in your computer, because that could make it unstable and even crash it. The instruction above is needed every time you start a new Bash session (or once only, if you put it in your Bash configuration file).
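For example, a minimal way to make the setting persistent (assuming Bash and ~/.bashrc as your configuration file):

# Append the setting so every new Bash session picks it up
echo 'export JAVA_TOOL_OPTIONS="-Xmx8G"' >> ~/.bashrc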

Once the workflow has completed, it should create a new file named kg_1.oxl in the folder specified by the OXL exporter. This file can be opened and viewed in Ondex Desktop, where it should generate a graphic like the one shown in the figure below. You can use the Ondex Metagraph and Legend to explore options, browse some useful information and do basic data sanity checks. For example: are the gene and protein counts the same as in the GFF3 and FASTA files? Are the gene and protein concepts connected via a relation? Do searches for known gene names return the correct gene/protein names?

[Figure: Ondex Metagraph and Legend]

You can also check the kg_1_stats.xml report that was generated by the graphinfo Exporter.
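A quick way to eyeball that report from the shell (the element names inside the stats XML aren't documented here, so inspect the output to see what's available):

# Pretty-print the stats report and look at the first lines
xmllint --format kg_1_stats.xml | head -n 40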

If everything looks OK, congratulations, you have your beginner's network of genes connected to the proteins they encode.

Protein-Domains

Download the protein-domain information from BioMart: choose "Solanum tuberosum", click on "Features", unselect everything under "Attributes" and select only "Protein stable ID". Open "Protein Domains" and select "Interpro ID", "Interpro Short Description" and "Interpro Description". Under "Filters -> Protein Domains" you can select "Limit to genes with Interpro ID(s)".

The downloaded tabular file should look like this:

Protein stable ID Interpro ID Interpro Short Description Interpro Description
PGSC0003DMT400092517 IPR025558 DUF4283 Domain of unknown function DUF4283
PGSC0003DMT400092522 IPR009518 PSII_PsbX Photosystem II PsbX
PGSC0003DMT400092528 IPR003105 SRA_YDG SRA-YDG
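Before parsing, a couple of quick sanity checks on the export (assuming it is saved as protein_domains.txt, as in the workflow below):

# Number of data rows (skipping the header line)
tail -n +2 protein_domains.txt | wc -l
# Number of distinct proteins that have at least one domain
tail -n +2 protein_domains.txt | cut -f1 | sort -u | wc -l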

Fortunately, Ondex has a flexible generic parser for tabular files, called tabParser2, that can be configured via XML. The XML schema can be found here (human-readable version).

The tabParser2 configuration for the above protein-domain table could look like this:

<?xml version = "1.0" encoding = "UTF-8" ?>
<parser 
	xmlns = "http://www.ondex.org/xml/schema/tab_parser"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<delimiter>\t</delimiter>
	<quote>"</quote>
	<encoding>UTF-8</encoding>
	<start-line>1</start-line>
	
	<concept id = "prot">
		<class>Protein</class>
		<data-source>ENSEMBL</data-source>
		<accession data-source="ENSEMBL-PLANTS">
		       <column index='0' />
		</accession>
	</concept>

	<concept id = "protDomain">
		<class>ProtDomain</class>
		<data-source>ENSEMBL</data-source>
		<name preferred="true">
			<column index='2' />
		</name>
		<accession data-source="IPRO">
			<column index='1' />
		</accession>
		<attribute name="Description" type="TEXT">
			<column index='3' />
		</attribute>
	</concept>

	<relation source-ref="prot" target-ref="protDomain">
		<type>has_domain</type>
	</relation>

</parser>

We are now going to create a new workflow (my_workflow_2.xml) with instructions to parse the tabular file and export to OXL:

<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <Parser name="tabParser2">
      <Arg name="InputFile">${baseDir}/protein_domains.txt</Arg>
      <Arg name="configFile">${baseDir}/protein_domain_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">${baseDir}/kg_2.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>

As before, we run the Ondex CLI with the new workflow:

bash runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow_2.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"

Homology to Arabidopsis KG

Our next goal is to connect our organism's data to a rich knowledge graph for Arabidopsis, which can be licensed from Rothamsted.

Ensembl Compara Data

Download the full Ensembl Compara homologies dataset and keep only the rows that contain both your species of interest (e.g. solanum_tuberosum) and arabidopsis_thaliana, as sketched below. Alternatively, you can download smaller subsets using Ensembl BioMart.
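A minimal filtering sketch (the dump file name is an assumption; check the current Ensembl Plants release for the actual name, and note that a homology pair may also be listed with the two species swapped):

# Keep rows where the two species columns match our pair (fields 3 and 8)
zcat Compara.homologies.tsv.gz \
  | awk -F'\t' '$3 == "arabidopsis_thaliana" && $8 == "solanum_tuberosum"' \
  > compara.txt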

An example compara.txt file along with the matching tabParser2 compara_config.xml is provided in the tutorial-data.

AT5G01150	AT5G01150.1	arabidopsis_thaliana	29.7405	ortholog_many2many	PGSC0003DMG400020332	PGSC0003DMT400052378	solanum_tuberosum	30.9129	NULL	NULL	NULL	0.00	0	181051656
AT5G01160	AT5G01160.2	arabidopsis_thaliana	48.6111	ortholog_one2one	PGSC0003DMG402020324	PGSC0003DMT400052354	solanum_tuberosum	40.3226	NULL	NULL	NULL	81.34	1	145692276
AT5G01170	AT5G01170.1	arabidopsis_thaliana	49.8239	ortholog_many2many	PGSC0003DMG400020347	PGSC0003DMT400052416	solanum_tuberosum	42.9439	NULL	NULL	NULL	0.00	0	173225540
(etc.)

We want to transform this data into a (Protein)-[ortholog]->(Protein) graph with some properties added to the homology relationship. The 'tabParser2' configuration for the above tabular file would look like this:

<?xml version = "1.0" encoding = "UTF-8" ?>
<parser 
	xmlns = "http://www.ondex.org/xml/schema/tab_parser"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<delimiter>\t</delimiter>
	<quote>"</quote>
	<encoding>UTF-8</encoding>
	<start-line>1</start-line>
	
	<concept id="protL">
		<class>Protein</class>
		<data-source>EnsemblCompara</data-source>
		<accession data-source="TAIR">
			<column index='1' />
		</accession>
	</concept>
	
	<concept id="protR">
		<class>Protein</class>
		<data-source>EnsemblCompara</data-source>
		<accession data-source="ENSEMBL-PLANTS">
			<column index='6' />
		</accession>
	</concept>
	
	<relation source-ref="protL" target-ref="protR">
		<type>ortho</type>
		<evidence>EnsemblCompara</evidence>
		<attribute name="Homology_type" type="TEXT">
			<column index='4' />
		</attribute>
		<attribute name="%Identity_Arabidopsis" type="NUMBER">
			<column index='3' />
		</attribute>
		<attribute name="%Identity_Potato" type="NUMBER">
			<column index='8' />
		</attribute>
	</relation>
</parser>

You can again construct a workflow similar to the protein-domain workflow and create a network with ortholog relations between potato and Arabidopsis.

We are going to skip building an individual workflow for the compara data and instead assemble all the previous steps into a single workflow, saved as my_workflow_3.xml. This will connect the potato gene-protein-domain information with a pre-integrated Arabidopsis KG that contains many types of information, including publications, phenotypes and GO annotations.

<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    
    <!-- Gene-Protein -->
    <Parser name="fastagff">
      <Arg name="GFF3 File">${baseDir}/gff3</Arg>
      <Arg name="Fasta File">${baseDir}/protein_fa</Arg>
      <Arg name="Mapping File">${baseDir}/mapping.txt</Arg>
      <Arg name="TaxId">4113</Arg>	<!-- Set to TAXID of your organism -->
      <Arg name="Accession">ENSEMBL-PLANTS</Arg>
      <Arg name="DataSource">ENSEMBL</Arg>
      <Arg name="Column of the genes">0</Arg>
      <Arg name="Column of the proteins">1</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
	
    <!-- Protein Domain -->
    <Parser name="tabParser2">
      <Arg name="InputFile">${baseDir}/protein_domains.txt</Arg>
      <Arg name="configFile">${baseDir}/protein_domains_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>

    <!-- Homology -->
    <Parser name="tabParser2">
      <Arg name="InputFile">${baseDir}/compara.txt</Arg>
      <Arg name="configFile">${baseDir}/compara_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
	
    <!-- Arabidopsis KG from Rothamsted -->
    <Parser name="oxl">
      <Arg name="InputFile">${baseDir}/arabidopsis_45.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
	
    <!-- Mapping -->
    <Mapping name="lowmemoryaccessionbased">
      <Arg name="IgnoreAmbiguity">false</Arg>
      <Arg name="RelationType">collapse_me</Arg>
      <Arg name="WithinDataSourceMapping">true</Arg>
      <Arg name="graphId">default</Arg>
    </Mapping>
    
	<!-- Collapsing -->
    <Transformer name="relationcollapser">
      <Arg name="CloneAttributes">true</Arg>
      <Arg name="CopyTagReferences">true</Arg>
      <Arg name="graphId">default</Arg>
      <Arg name="RelationType">collapse_me</Arg>
    </Transformer>

    <!-- Export knowledge graph -->
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">${baseDir}/kg-final.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>

Run the workflow like this:

module load Java/1.8.0_192
export JAVA_TOOL_OPTIONS="-Xmx24G"
echo $JAVA_TOOL_OPTIONS

cd /home/data/knetminer/software/ondex-mini-3.0/
./runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow_3.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"

All data and config files used in this workflow are located in the tutorial-data folder, whose path is passed to KnetBuilder (ondex-mini) via baseDir=. To run this workflow you will need Java 8 and 24 GB RAM. The resulting knowledge graph will have over a million relationships, but it can still be opened in Ondex if enough memory is available. Ondex won't be able to visualise the entire KG, but it can produce some useful information and provides simple search and filter tools for first-pass quality checks of the knowledge graph before deploying it in KnetMiner for further checks.

The final OXL (in this case named tutorial-data/kg-final.oxl) will be used in the KnetMiner server.
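Since the workflow exports with GZip enabled, the OXL file is gzip-compressed XML; a quick way to verify it was written correctly:

# Decompress to stdout and peek at the first lines of the XML
gzip -dc kg-final.oxl | head -n 5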

Troubleshooting

Running KnetBuilder tools under Windows

We don't support Windows for all of the tools available from the KnetBuilder framework; a Linux or macOS system is recommended. If you are on Windows, the most straightforward way to run them is to use virtual machine software such as VirtualBox or VMware (or even Docker).

For the workflow tools and the desktop application, PowerShell scripts (.ps1) are available to launch these tools.

For other tools (e.g. the RDF exporter, the KnetMiner initialiser), you might be able to run the available .sh scripts from environments like Cygwin or WSL; however, that isn't supported.

In particular, in Cygwin:

  • you might need to fix line endings for these scripts (e.g. via dos2unix)
  • you might need to change a .sh that launches Java to fix the CLASSPATH format it uses:
    # Add this in the script, after CLASSPATH has been set
    export CLASSPATH="`cygpath --path --windows $CLASSPATH:$mydir:$mydir/lib/*`"