Determining contiguity

Yu Wan edited this page Jan 28, 2016 · 11 revisions

The 'Colour by contiguity' colour scheme allows you to choose a node and then view which nodes in a graph are likely to be contiguous with it. In this context, 'contiguous' means that the nodes' sequences are likely to have originated from the same piece of DNA.

Logic

Bandage uses the following logic to determine contiguity. For a node (A) to be contiguous with another node (B), one of the following two statements must be true:

  • One of the edges connected to node A leads unambiguously to node B across all possible paths.
  • The opposite scenario: one of the edges connected to node B leads unambiguously to node A across all possible paths.
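The rule above can be sketched in Python. This is only an illustration, not Bandage's own code: the function names and the input format (a dictionary mapping each edge label to the paths found when leaving a node via that edge) are assumptions made for the example. The 'possibly contiguous' branch follows the usage in the example on this page, where only some paths encounter the other node.

```python
from typing import Dict, List

Path = List[str]


def leads_unambiguously(paths: List[Path], target: str) -> bool:
    """True if at least one path exists and every path contains `target`."""
    return bool(paths) and all(target in path for path in paths)


def contiguity_status(paths_from_a: Dict[str, List[Path]],
                      paths_from_b: Dict[str, List[Path]],
                      a: str, b: str) -> str:
    """Classify node `b` relative to node `a`.

    Each dictionary maps an edge label to the paths found when leaving
    the node via that edge (a hypothetical input format).
    """
    # Contiguous: some edge of A always reaches B, or some edge of B always reaches A.
    if (any(leads_unambiguously(paths, b) for paths in paths_from_a.values())
            or any(leads_unambiguously(paths, a) for paths in paths_from_b.values())):
        return "contiguous"
    # Possibly contiguous: at least one path connects the two nodes, but not all.
    some_path_connects = (
        any(b in path for paths in paths_from_a.values() for path in paths)
        or any(a in path for paths in paths_from_b.values() for path in paths)
    )
    return "possibly contiguous" if some_path_connects else "not contiguous"


# Paths leaving node 5 via edge (g), taken from the loop discussion on this page.
# Paths traced from the other node are omitted to keep the example short.
from_5 = {"g": [["5", "7", "9", "11"], ["5", "7", "9", "10", "9", "11"]]}

print(contiguity_status(from_5, {}, "5", "11"))
print(contiguity_status(from_5, {}, "5", "10"))
```

With these inputs, node 11 appears on every path and node 10 on only one of them, so the two calls return different statuses.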

Colours

The default colours are:

  • Bright green = starting node
  • Dark green = contiguous node
  • Light green = possibly contiguous node
  • Grey = not contiguous node

These colours are configurable in the 'Contiguity colour scheme' section of the Bandage settings.

Example

[Figure: Contiguity example]

The above figure shows a graph to illustrate Bandage's contiguity logic. Here are explanations of why particular contiguity statuses are assigned:

  • Node 1 is contiguous with node 5. When leaving node 5 via edge (e), all possible paths encounter node 1.
  • Node 12 is contiguous with node 5. When leaving node 12 via edge (m), all possible paths encounter node 5.
  • Node 2 is possibly contiguous with node 5. When leaving node 5 via edge (e), only some paths encounter node 2. This is also the case for the opposite direction: when leaving node 2 via edge (c), only some paths encounter node 5.
  • Node 6 is not contiguous with node 5. There are no paths connecting those two nodes. Note that a path in the assembly graph must enter a node on one end and leave via the other. For example, the path 5-4-6 via edges (e) and (f) is not valid.

Loops in the graph, as occur with nodes 9 and 10, complicate the matter. When considering all possible paths leaving node 5 via edge (g), we can compile the following list: 5-7-9-11, 5-7-9-10-9-11, 5-7-9-10-9-10-9-11, 5-7-9-10-9-10-9-10-9-11, etc. This loop creates an infinite number of possible paths, so we must limit our path search to a finite number of steps. However, whatever depth is chosen, there will be a path that does not reach node 11, instead terminating at node 9 or 10. We would therefore be unable to conclude that all paths reach node 11, and it would be classified as possibly contiguous. Since infinite loops do not exist in real DNA, this behaviour is undesirable. To avoid this pitfall, we only consider paths in which the same node is visited at most twice. This allows only the first two paths, 5-7-9-11 and 5-7-9-10-9-11, both of which contain node 11, thus classifying node 11 as contiguous.
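The visit limit can be illustrated with a small path enumeration in Python. This is a sketch under stated assumptions, not Bandage's implementation: the toy graph contains only the edges implied by the paths listed above, the names `GRAPH` and `enumerate_paths` are made up for the example, and a branch that could only continue by visiting a node a third time is assumed to be discarded rather than counted as a path.

```python
from typing import Dict, List

# Directed edges inferred from the paths listed in the text (an assumption;
# the full graph in the figure contains more nodes).
GRAPH: Dict[str, List[str]] = {
    "5": ["7"],
    "7": ["9"],
    "9": ["10", "11"],
    "10": ["9"],
    "11": [],
}


def enumerate_paths(graph: Dict[str, List[str]], start: str, first_step: str,
                    max_visits: int = 2, max_depth: int = 15) -> List[List[str]]:
    """List the paths that leave `start` via `first_step`.

    A path is recorded when it reaches a dead end or contains `max_depth`
    nodes. Steps that would visit a node more than `max_visits` times are
    skipped, so branches stuck in a loop are dropped.
    """
    complete: List[List[str]] = []

    def extend(path: List[str]) -> None:
        successors = graph.get(path[-1], [])
        if not successors or len(path) >= max_depth:
            complete.append(path)
            return
        for nxt in successors:
            if path.count(nxt) < max_visits:
                extend(path + [nxt])

    extend([start, first_step])
    return complete


paths = enumerate_paths(GRAPH, "5", "7")
for path in paths:
    print("-".join(path))
print("node 11 on every path:", all("11" in p for p in paths))
print("node 10 on every path:", all("10" in p for p in paths))
```

Raising `max_visits` while keeping a small `max_depth` reproduces the problem described above: some paths are then cut off inside the loop before they reach node 11.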

An equivalent way to look at the concept of contiguity is to consider the possible underlying sequences. Considering only nodes 5, 7, 9, 10 and 11, there are two principal sets of underlying sequence compatible with the graph. The first is a single sequence, 5-7-9-10-9-11, which contains a repeated sequence in node 9. The second possibility is two separate sequences: a linear sequence of 5-7-9-11 and a circular sequence 9-10. In both of these cases, node 5 always occurs with nodes 7, 9 and 11, hence their classification as contiguous. Node 10 occurs with node 5 only in the first possibility, hence its classification as possibly contiguous.

Search depth

The search depth is the maximum number of nodes on a search path. To keep calculation time reasonable, Bandage limits its search depth when determining contiguity. This value (default of 15) can be adjusted in the 'Contiguity colour scheme' section of the Bandage settings. Larger values can potentially find more distant contiguous nodes, at the cost of additional computation time.
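The effect of the depth limit can be seen on a simple linear chain of nodes. The Python sketch below is illustrative only: the function name `nodes_within_depth` is hypothetical, it uses a breadth-first walk rather than Bandage's path search, and it assumes the starting node counts towards the depth.

```python
from typing import Dict, List, Set


def nodes_within_depth(graph: Dict[int, List[int]], start: int,
                       max_depth: int = 15) -> Set[int]:
    """Return the nodes that lie on some path of at most `max_depth` nodes.

    Assumes the starting node counts as the first node of the path.
    """
    seen = {start}
    frontier = {start}
    for _ in range(max_depth - 1):
        # Step one edge further from every node reached in the previous round.
        frontier = {nxt for node in frontier
                    for nxt in graph.get(node, []) if nxt not in seen}
        seen |= frontier
    return seen


# A linear chain 1 -> 2 -> ... -> 20 (made-up example graph).
chain = {i: [i + 1] for i in range(1, 20)}

print(max(nodes_within_depth(chain, 1, max_depth=15)))
print(max(nodes_within_depth(chain, 1, max_depth=20)))
```

In this chain, nodes beyond the limit are never examined at the smaller depth, so they could not be reported as contiguous with node 1 even though every path from node 1 passes through them.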

Example application: gene context in a metagenome

One application of this functionality is to find the maximal amount of context (neighbouring sequence) for a gene in a metagenome. Using Bandage, you can search for the gene of interest using BLAST (see BLAST searches) and use its node(s) as the starting nodes of a contiguity search.

The resulting contiguous nodes are the nodes which are likely to have come from the same piece(s) of DNA as the gene of interest, and as such may be useful in determining the taxonomy of the species which contain that gene.

Node selection

Under the 'Select' menu in Bandage, there is a submenu titled 'Select nodes based on contiguity'. This allows you to quickly select all nodes with a particular contiguity status – useful if you wish to save all contiguous nodes to file in a complex graph.