diff --git a/index.html b/index.html index f08182f..e829775 100644 --- a/index.html +++ b/index.html @@ -17,11 +17,11 @@

The DIGS Tool

Database-Integrated Genome Screening

- Background - Manual - Download + Background + User Guide + Download GitHub - Publications + Publications Twitter diff --git a/website/user-guide/explore.html b/website/explore.html similarity index 92% rename from website/user-guide/explore.html rename to website/explore.html index ed7207d..3ed69c1 100644 --- a/website/user-guide/explore.html +++ b/website/explore.html @@ -7,10 +7,10 @@ Exploring Genomes Using DIGS - + - - + + @@ -25,7 +25,7 @@

Database-Integrated Genome Screening

Home - Manual + User Guide Download GitHub @@ -54,7 +54,7 @@

- Terra incognita + Terra incognita

@@ -110,7 +110,7 @@

- Bryce Canyon + Bryce Canyon
Bryce Canyon: Geological structures show the impact of erosion over macroevolutionary time scales. @@ -122,7 +122,7 @@


- Palimpsest + Palimpsest
@@ -160,7 +160,7 @@

- Alice and the Red Queen + Alice and the Red Queen
@@ -187,7 +187,7 @@

- Tangled tree + Tangled tree @@ -223,7 +223,7 @@


- Transposition and the embryo + Transposition and the embryo
@@ -291,7 +291,7 @@


-

Here be dragons

+

Here be dragons

@@ -326,7 +326,7 @@

-

Big data Hoskins

+

Big data Hoskins

diff --git a/website/user-guide/publications.html b/website/publications.html similarity index 94% rename from website/user-guide/publications.html rename to website/publications.html index fa9841a..c3f5ae7 100644 --- a/website/user-guide/publications.html +++ b/website/publications.html @@ -7,10 +7,10 @@ DIGS Tool: Publications - + - - + + @@ -24,9 +24,9 @@

The DIGS Tool

Database-Integrated Genome Screening

- Home + Home Background - Manual + User Guide Download GitHub diff --git a/website/user-guide/user-guide.html b/website/user-guide.html similarity index 60% rename from website/user-guide/user-guide.html rename to website/user-guide.html index 686e807..95c6b93 100755 --- a/website/user-guide/user-guide.html +++ b/website/user-guide.html @@ -7,10 +7,10 @@ DIGS: User Guide - + - - + + @@ -71,14 +71,6 @@


-
  • Structure of a DIGS-based investigation
  • -
    -
      -
    1. Overview of a DIGS-based investigation
    2. -
    3. Exploration phase
    4. -
    5. Analysis phase
    6. -
    - @@ -271,7 +263,7 @@

    -

    DIGS target directory structure

    +

    DIGS target directory structure

    @@ -284,7 +276,7 @@


    -

    DIGS target directory structure

    +

    DIGS target directory structure


    @@ -400,7 +392,7 @@

    -

    DIGS target directory structure

    +

    DIGS target directory structure

    @@ -666,7 +658,7 @@

    -

    DIGS screenshot

    +

    DIGS screenshot

    @@ -685,7 +677,7 @@

    -

    DIGS screenshot (counts)

    +

    DIGS screenshot (counts)

    @@ -758,431 +750,12 @@


    - - - - - - - - - - - - - - - - - - - - - - - - -

    - 2.4. Incorporating linked data into a DIGS project -

    -
    - - -

    - Incorporating additional, linked data tables into DIGS project databases allows - users to reference these data in SQL queries. -

    - - -

    - For example, adding a table that contains taxonomic information about the species - screened will allow SQL queries to reference higher taxonomic ranks than species. -

    - - - - -

    - The digs_tool.pl script provides functions for console-based management of - additional tables in DIGS project databases. - For example, to add a table with virus taxonomy, first run DIGS as follows: - -

    -  giff01r@Alpha:~/DIGS/DIGS-tool$ ./digs_tool.pl -d=1 -i ../projects/eve/eve_1_parvo.ctl
    -  
    - -

    - - - - -

    - The console will prompt for input, beginning with the path to a file containing tabular data - -

    -  Connecting to DB:  erv_lenti
    -
    -	 #### WARNING: This function expects a tab-delimited data table with column headers!
    -
    -	 Please enter the path to the file with the table data and column headings
    -
    -	 : ../projects/eve/tabular/ncbi_virus_taxonomy.txt
    - - -

    - -

    - If valid tabular input is received, a breakdown of the column headers will be shown. - If it looks correct enter 'y' (yes), and select appropriate options. For example: -

    - - -

    - -

    -  The following cleaned column headers (i.e. table fields) were obtained
    -
    -		 Column 1: 'Target_species'
    -		 Column 2: 'Target_class'
    -		 Column 3: 'Target_superorder'
    -		 Column 4: 'Target_order'
    -		 Column 5: 'Target_family'
    -		 Column 6: 'Target_genus'
    -
    -	 Is this correct? (y/n): y
    -
    -		 1. Create new ancillary table
    -		 2. Append data to existing ancillary table
    -		 3. Flush existing ancillary table and import fresh data
    -
    -	 Choose an option (1/2/3): 1
    -
    -	 What is the name of the new table? : host_taxonomy
    -
    -	 Creating ancillary table 'host_taxonomy' in eve_1_parvoviridae screening database
    -
    -	 #### IMPORTING to table 'host_taxonomy'
    -
    -	 # Exit
    -
    - -

    - - -

    - NOTE - in the example above I have deliberately prefixed the - column names - which are taxonomic ranks - with 'Target_'. - This avoids conflicts with any of - MySQL's reserved words - - one of which is 'ORDER'. -

    - - - -

    - Now that I have added the host_taxonomy table, I can select - database entries based on any of the taxonomic ranks included in my file, - through reference to the 'host_taxonomy' table, as shown below. -

    - - - -

    DIGS screenshot
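
A query of this kind can be sketched in a self-contained way. The sketch below uses Python's built-in sqlite3 module as a stand-in for the MySQL screening database, purely so that it runs without a server. The 'Target_' column names come from the example above; the digs_results columns (organism, assigned_name) and all rows are made-up assumptions for illustration, not the actual DIGS schema or real screening output.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL screening database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE digs_results (          -- hypothetical, simplified columns
    record_id     INTEGER PRIMARY KEY,
    organism      TEXT,
    assigned_name TEXT
);
CREATE TABLE host_taxonomy (         -- column names follow the 'Target_' convention
    Target_species TEXT,
    Target_order   TEXT,
    Target_family  TEXT
);
INSERT INTO digs_results (organism, assigned_name) VALUES
    ('Mus_musculus',      'example_hit_A'),
    ('Rattus_norvegicus', 'example_hit_B'),
    ('Canis_lupus',       'example_hit_C');
INSERT INTO host_taxonomy VALUES
    ('Mus_musculus',      'Rodentia',  'Muridae'),
    ('Rattus_norvegicus', 'Rodentia',  'Muridae'),
    ('Canis_lupus',       'Carnivora', 'Canidae');
""")

# Count hits per host order by joining results to the linked taxonomy table.
rows = conn.execute("""
    SELECT t.Target_order, COUNT(*) AS hits
    FROM digs_results AS r
    JOIN host_taxonomy AS t ON r.organism = t.Target_species
    GROUP BY t.Target_order
    ORDER BY t.Target_order
""").fetchall()

for order_name, hits in rows:
    print(order_name, hits)
```

Because the rank columns carry the 'Target_' prefix, the query can group by host order without colliding with the SQL keyword ORDER.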

    -
    -

    - 2.5. Merging hits in the DIGS results table into larger sequences -

    -
    - -

    - When relying on sequence similarity as a means of recovering the sequences of - related genome features, a limitation is that the sequences of many interesting - genome features are only partially conserved, and large regions of sequence - within these features may be rearranged or divergent. -

    -

    - However, when two or more conserved features occur contiguously, their relationship - can be used to determine the coordinates of a more complete sequence for the - genome feature of interest.

    - - -

    - For example, integrated retroviruses ('proviruses') comprise internal coding domains - (gag, pol, env - in that order), flanked by terminal LTRs - that are usually (though not always entirely) non-coding. - However, - endogenous retroviruses (ERVs) frequently have much more complex genome arrangements, - with many being fragmentary or mosaic in structure, and large regions of - the integrated provirus often being highly divergent from anything seen previously. -

    - - - -

    - Accordingly, it makes sense to screen first using individual features - (i.e. Gag, Pol, Env polypeptides, plus LTR nucleotide sequences) - as probes and references, then to 'consolidate' the hits to these - probes into larger sequences comprising the hits plus the intervening sequences. - At the same time, we can record the relationship between the component parts of - the merged sequence, where merging occurs. -

    - -

    - The DIGS tool can be used to implement a ‘consolidation’ of this kind. - Contiguous hits in the ‘digs_results’ table are merged based on whether they - are within a user-defined distance of one another. -
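
The distance-based merging described above can be illustrated with a minimal sketch. This is not the DIGS tool's own implementation: the Hit and Locus classes, their field names, the max_gap parameter, and the rule that only hits on the same scaffold are merged are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:
    scaffold: str
    start: int
    end: int
    name: str         # feature matched, e.g. "gag", "pol", "LTR"
    orientation: str  # "+" or "-"

@dataclass
class Locus:
    scaffold: str
    start: int
    end: int
    structure: list = field(default_factory=list)  # ordered component hits

def consolidate(hits, max_gap):
    """Merge hits on the same scaffold that lie within max_gap bases of each other."""
    loci = []
    for hit in sorted(hits, key=lambda h: (h.scaffold, h.start)):
        label = f"{hit.name}({hit.orientation})"
        last = loci[-1] if loci else None
        if last and last.scaffold == hit.scaffold and hit.start - last.end <= max_gap:
            # Close enough to the previous locus: extend it and record the component.
            last.end = max(last.end, hit.end)
            last.structure.append(label)
        else:
            # Too far away, or a different scaffold: start a new locus.
            loci.append(Locus(hit.scaffold, hit.start, hit.end, [label]))
    return loci

hits = [
    Hit("chr1", 100, 400, "LTR", "+"),
    Hit("chr1", 450, 1200, "gag", "+"),
    Hit("chr1", 9000, 9500, "pol", "-"),
    Hit("chr2", 120, 300, "env", "+"),
]
loci = consolidate(hits, max_gap=100)
for locus in loci:
    print(locus.scaffold, locus.start, locus.end, "-".join(locus.structure))
```

Sorting by scaffold and start position first means each hit only needs to be compared with the most recently built locus, and the structure list preserves the order and orientation of the merged components.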

    - - -

    Consolidate

    - - -

    - Running the consolidation process produces a set of merged sequences, and - also classifies these sequences using the same approach applied when - generating the digs_results table. - The results - i.e. a non-overlapping set of sequences, merged as determined by - user-specified rules - are entered into the 'loci' table - (see the database schema page for details). - A separate reference sequence library that is appropriate for classifying - the longer sequences should be used for classifying the consolidated results, and is - specified by a distinct parameter (see section 4 in the set-up stages above). -

    - - -

    - The loci table contains most of the same fields as the digs_results table, - but also includes a 'locus_structure' field that records the relationship - between merged hits, including their orientation relative to one another. -

    - - -

    - The loci table includes a field 'locus_structure' that shows the order - and orientation of the individual hits from the digs_results table that - were combined to create the merged hit, as shown below. -

    - - -
    - - - -

    Consolidation - loci table query

    - - - -
    - - - -

    - 2.6. Extracting hit sequences and their flanks using the DIGS tool -

    -
    - - -

    - Working within the framework of the DIGS tool (i.e. using SQL to query DIGS results, - and reclassifying sequences through merging and updating reference libraries) - can provide many useful insights into the distribution and diversity of a given - genome feature. -

    - -

    - For further investigations, however, it will often be necessary to export sequences - from the DIGS screening database so that they can be analysed using other kinds - of bioinformatic and comparative approaches. -

    - -

    - As well as extracting the sequence matches themselves, it is often helpful to extract - the upstream and downstream flanking sequences. -

    - -

    - - To do this, run the digs_tool.pl script using the -d=6 option, - and providing a tabular file containing locus data using the -i option - as illustrated here: - -

    -    giff01r@Alpha:~/DIGS/DIGS-tool$ ./digs_tool.pl -d=6 -i loci.tsv
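
The coordinate arithmetic behind flank extraction can be sketched as follows. This is an illustration only, not what digs_tool.pl does internally: the function names, the 1-based inclusive coordinate convention, the flank parameter, and the handling of negative-strand hits are all assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def extract_with_flanks(sequence, start, end, orientation="+", flank=0):
    """Return (upstream, match, downstream) for a hit at 1-based inclusive start..end.

    Flanks are clamped to the ends of the sequence. For hits on the negative
    strand, all three parts are reverse-complemented and the flanks swapped,
    so 'upstream' is always relative to the hit's own orientation.
    """
    left = max(1, start - flank)
    right = min(len(sequence), end + flank)
    upstream = sequence[left - 1:start - 1]
    match = sequence[start - 1:end]
    downstream = sequence[end:right]
    if orientation == "-":
        upstream, match, downstream = (
            reverse_complement(downstream),
            reverse_complement(match),
            reverse_complement(upstream),
        )
    return upstream, match, downstream

scaffold = "AAACCCGGGTTTACGT"
forward = extract_with_flanks(scaffold, 7, 12, "+", flank=3)
reverse = extract_with_flanks(scaffold, 7, 12, "-", flank=3)
clamped = extract_with_flanks(scaffold, 2, 4, "+", flank=5)
print(forward, reverse, clamped)
```

Clamping matters in practice because hits near the end of a contig or scaffold cannot yield a full-length flank.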
    - - -

    - -
    - -

    - 3. Structure of a DIGS-based investigation -

    - - -

    - 3.1 Overview -

    -
    - - -

    - Comparative studies using database-integrated genome screening (DIGS) entail - separate 'exploration' and 'analysis' phases, - with each of these phases being split into two component parts as follows: - - -

      - -
    • Exploration: (1) Setting up and (2) running a similarity search-based screen.
    • -
    • Analysis: (3) Inspecting screening output via a relational database, and (4) - performing comparative analysis of exported sequence data.
    • -
    - - - -

    - - -

    Overview - phases

    - -

    - - As shown above, this process is usually iterative - at least to some degree - since - analysis of screening results often reveals new information that can be used to - design more informative or comprehensive screens. - -

    - - -
    - - - -

    - 3.2 Exploration phase: Setting up and running an in silico screen -

    -
    - -

    - - DIGS is a project-based framework in which investigations are centred around - a genome feature of interest. Any genome feature can be investigated in principle, - so long as it contains sufficient sequence conservation to be reliably detected in a similarity - search. - -

    - - -

    - The 'reference sequence library' - is a curated set of sequences relevant to the genome feature under investigation. - Usually this will consist of: - -

      -
    • A set of conserved DNA or polypeptide sequences derived from the genome feature of interest.
    • -
    - -

    - - -

    - However, depending on the kind of investigation being performed, it may also contain: - -

      -
    • Sequences that do not derive from the genome feature under investigation, - but can provide useful information about the locus in which it occurs.
    • -
    • Sequences representing genome features that are not relevant to the investigation, - but are sufficiently similar to the feature of interest to generate 'false positive' matches.
    • -
    - - - Screening entails selecting particular sequences from the reference library for use as - 'probes' in a BLAST search of a specific 'target database'. -

    - - -

    - Sequences that match to the query ('hits') can then be extracted and classified. - A convenient way of rapidly classifying or 'genotyping' hits is via BLAST-based comparison to the - reference library, as indicated in the illustration below. -

    - - - - -

    Exploration phase

    - - - -
    - - Schematic representation of the exploration phase of a DIGS-based investigation. -
    - Here, the genome features being investigated are a set of related genes. - In step (1), a sequence from the reference library is selected and used - as a 'probe' or 'query' in a BLAST-based search of a chosen target database. - In step (2), sequences identified in this search are extracted - and classified via BLAST-based comparison to the reference library. - These searches provide a way to effectively 'delve into' - genomic databanks and recover related sequences and, as such, they provide a means - to survey unmapped regions of the genomic 'landscape'. - -
    - - -
    -

    - Analysis phase: Inspecting results and exporting data for comparative analysis -

    -
    - - -

    - In DIGS, a similarity search-based screening pipeline is linked to a relational - database management system (RDBMS), and the outputs of screening are captured - in a project-specific relational database. -

    - -

    - This approach not only provides a convenient and robust - basis for implementing systematic, automated screens that proceed in an efficient, - non-redundant way, it also allows screening data to be interrogated using - structured query language (SQL) - a well-established, powerful approach for - querying relational databases. -

    - -

    - -

      - -
    1. Investigation of output via the relational database.
    2. -
    3. Comparative genomic analysis of exported sequence data
    4. -
      -
    - - -
    - - - -

    Analysis phase

    - - - -
    - Analysing screening output: A schematic representation of the two component - parts of the 'analysis' phase of a DIGS-based screen - (some comparative analyses do not require an alignment, but most do). -
    -