diff --git a/index.html b/index.html index f08182f..e829775 100644 --- a/index.html +++ b/index.html @@ -17,11 +17,11 @@
Bryce Canyon: Geological structures show the impact of erosion over macroevolutionary time scales. @@ -122,7 +122,7 @@
- +@@ -160,7 +160,7 @@- +
@@ -187,7 +187,7 @@- + @@ -223,7 +223,7 @@
- +@@ -291,7 +291,7 @@+
-@@ -326,7 +326,7 @@-
+ diff --git a/website/user-guide/publications.html b/website/publications.html similarity index 94% rename from website/user-guide/publications.html rename to website/publications.html index fa9841a..c3f5ae7 100644 --- a/website/user-guide/publications.html +++ b/website/publications.html @@ -7,10 +7,10 @@DIGS Tool: Publications - + - - + + @@ -24,9 +24,9 @@The DIGS Tool
Database-Integrated Genome Screening
- Home + Home Background - Manual + User Guide Download GitHub diff --git a/website/user-guide/user-guide.html b/website/user-guide.html similarity index 60% rename from website/user-guide/user-guide.html rename to website/user-guide.html index 686e807..95c6b93 100755 --- a/website/user-guide/user-guide.html +++ b/website/user-guide.html @@ -7,10 +7,10 @@DIGS: User Guide - + - - + + @@ -71,14 +71,6 @@
-Structure of a DIGS-based investigation -
--
- @@ -271,7 +263,7 @@- Overview of a DIGS-based investigation
-- Exploration phase
-- Analysis phase
--
+ @@ -284,7 +276,7 @@+
-
@@ -400,7 +392,7 @@-
+ @@ -666,7 +658,7 @@-
+@@ -685,7 +677,7 @@-
+@@ -758,431 +750,12 @@
- - - - - - - - - - - - - - - - - - - - - - - - -- 2.4. Incorporating linked data into a DIGS project -
-
- - -- Incorporating additional, linked data tables into DIGS project databases allows - users to reference these data in SQL queries. -
- - -- For example, adding a table that contains taxonomic information about the species - screened will allow SQL queries to reference higher taxonomic ranks than species. -
- - - - -- The digs_tool.pl script provides functions for console-based management of - additional tables in DIGS project databases. - For example, to add a table with virus taxonomy, first run DIGS as follows: - -
-- - - - - - -giff01r@Alpha:~/DIGS/DIGS-tool$ ./digs_tool.pl -d=1 -i ../projects/eve/eve_1_parvo.ctl -
- The console will prompt for input, beginning with the path to a file containing tabular data - -
-- - - - -Connecting to DB: erv_lenti - - #### WARNING: This function expects a tab-delimited data table with column headers! - - Please enter the path to the file with the table data and column headings - - : ../projects/eve/tabular/ncbi_virus_taxonomy.txt
- If valid tabular input is received, a breakdown of the column headers will be shown. - If it looks correct enter 'y' (yes), and select appropriate options. For example: -
- - -- -
-- - - - -The following cleaned column headers (i.e. table fields) were obtained - - Column 1: 'Target_species' - Column 2: 'Target_class' - Column 3: 'Target_superorder' - Column 4: 'Target_order' - Column 5: 'Target_family' - Column 6: 'Target_genus' - - Is this correct? (y/n): y - - 1. Create new ancillary table - 2. Append data to existing ancillary table - 3. Flush existing ancillary table and import fresh data - - Choose an option (1/2/3): 1 - - What is the name of the new table? : host_taxonomy - - Creating ancillary table 'host_taxonomy' in eve_1_parvoviridae screening database - - #### IMPORTING to table 'host_taxonomy' - - # Exit -
- NOTE - in the example above I have deliberately prefixed the - column names - which are taxonomic ranks - with 'Target_'. - This avoids conflicts with any of - MySQL's reserved words - - one of which is 'ORDER'. -
- - - -- Now that I have added the host_taxonomy table, I can select - database entries based on any of the taxonomic ranks included in my file, - through reference to the 'host_taxonomy' table, as shown below. -
- - - - -
-- 2.5. Merging hits in the DIGS results table into larger sequences -
-
- -- When relying on sequence similarity as a means of recovering the sequences of - related genome features, a limitation is that the sequences of many interesting - genome features are only partially conserved, and large regions of sequence - within these features may be rearranged or divergent. -
-- However, when two or more conserved features occur contiguously, their relationship - can be used to determine the coordinates of a more complete sequence for the - genome feature of interest.
- - -- For example, integrated retroviruses ('proviruses') are comprised of internal coding domains - (gag, pol, env - in that order), flanked by terminal LTRs - that are usually (though not always entirely) non-coding. - However, - endogenous retroviruses (ERVs) frequently have much more complex genome arrangements, - with many being fragmentary or mosaic in structure, and large regions of - the integrated provirus often being highly divergent from anything seen previously. -
- - - -- Accordingly, it makes sense to screen first using individual features - (i.e. Gag, Pol, Env polypeptides, plus LTR nucleotide sequences), - as probes and references, then to 'consolidate' the hits to these - probes into larger sequences comprising the hits plus the intervening sequences. - At the same time, we can record the relationship between the component parts of - the merged sequence, where merging occurs. -
- -- The DIGS tool can be used to implement a ‘consolidation’ of this kind. - Contiguous hits in the ‘digs_results’ table are merged based on whether they - are within a user-defined distance of one another. -
- - - - - -- Running the consolidation process produces a set of merged sequences, and - also classifies these sequences using the same approach applied when - generating the digs_results table. - The results - i.e. a non-overlapping set of sequences, merged as determined by - user-specified rules - are entered into the 'loci' table - (see the database schema page for details). - A separate reference sequence library that is appropriate for classifying - the longer sequences should be used for classifying the consolidated results, and is - specified by a distinct parameter (see section 4 in the set-up stages above). -
- - -- The loci table contains most of the same fields as the digs_results table, - but also includes a 'locus_structure' field that records the relationship - between merged hits, including their orientation relative to one another. -
- - -- The loci table includes a field 'locus_structure' that shows the order - and orientation of the individual hits from the digs_results table that - were combined to create the merged hit, as shown below. -
- - -
- - - - - - - -
- - - -- 2.6. Extracting hit sequences and their flanks using the DIGS tool -
-
- - -- Working within the framework of the DIGS tool (i.e. using SQL to query DIGS results, - and reclassifying sequences through merging and updating reference libraries) - can provide many useful insights into the distribution and diversity of a given - genome feature. -
- -- For further investigations, however, it will often be necessary to export sequences - from the DIGS screening database so that they can be analysed using other kinds - of bioinformatic and comparative approaches. -
- -- As well as extracting the sequence matches themselves, it is often helpful to extract - the upstream and downstream flanking sequences. -
- -- - To do this, run the digs_tool.pl script using the -d=6 option, - and providing a tabular file containing locus data using the -i option - as illustrated here: - -
-- - - - -giff01r@Alpha:~/DIGS/DIGS-tool$ ./digs_tool.pl -d=6 -i loci.tsv
- -- 3. Structure of a DIGS-based investigation -
- - -- 3.1 Overview -
-
- - -- Comparative studies using database-integrated genome screening (DIGS) entail - separate 'exploration' and 'analysis' phases, - with each of these phases being split into two component parts as follows: - - -
- -
- - - - - - - - -- Exploration: (1) Setting up and (2) running a similarity search-based screen.
-- Analysis: (3) Inspecting screening output via a relational database, and (4) - performing comparative analysis of exported sequence data.
-- - As shown above, this process is usually iterative - at least to some degree - since - analysis of screening results often reveals new information that can be used to - design more informative or comprehensive screens. - -
- - -
- - - -- 3.2 Exploration phase: Setting up and running an in silico screen -
-
- -- - DIGS is a project-based framework in which investigations are centred around - a genome feature of interest. Any genome feature can be investigated in principle, - so long as it contains sufficient sequence conservation to be reliably detected in a similarity - search. - -
- - -- The 'reference sequence library' - is a curated set of sequences relevant to the genome feature under investigation. - Usually this will consist of: - - -
-
- - - - -- A set of conserved DNA or polypeptide sequences derived from the genome feature of interest.
-- However, depending on the kind of investigation being performed, it may also contain: - - -
-
- - - Screening entails selecting particular sequences from the reference library for use as - 'probes' in a BLAST search of a specific 'target database'. - - - -- Sequences that do not derive from the genome feature under investigation, - but can provide useful information about the locus in which it occurs.
-- Sequences representing genome features that are not relevant to the investigation, - but are sufficiently similar to them to generate 'false positive' matches.
-- Sequences that match to the query ('hits') can then be extracted and classified. - A convenient way of rapidly classifying or 'genotyping' hits is via BLAST-based comparison to the - reference library, as indicated in the illustration below. -
- - - - - - - - -- - Schematic representation of the exploration phase of a DIGS-based investigation. -- - -
- Here, the genome features being investigated are a set of related genes. - In step (1) a sequence from the reference library is selected and used - as a 'probe' or 'query' in a BLAST-based search of a chosen target database. - In step (2), sequences identified in this search are extracted - and classified via BLAST-based comparison to the reference library. - These searches provide a way to effectively 'delve in' to - genomic databanks and recover related sequences and, as such, they provide a means - to survey unmapped regions of the genomic 'landscape'. - -
-- Analysis phase: Inspecting results and exporting data for comparative analysis -
-
- - -- In DIGS, a similarity search-based screening pipeline is linked to a relational - database management system (RDBMS), and the outputs of screening are captured - in a project-specific relational database. -
- -- This approach not only provides a convenient and robust - basis for implementing systematic, automated screens that proceed in an efficient, - non-redundant way, it also allows screening data to be interrogated using - structured query language (SQL) - a well-established, powerful approach for - querying relational databases. -
- -- -
- -
- - -- Investigation of output via the relational database.
-- Comparative genomic analysis of exported sequence data
- -
- - - - - - - -- Analysing screening output: A schematic representation of the two component - parts of the 'analysis' phase of a DIGS-based screen - (some comparative analyses do not require an alignment, but most do). --