Skip to content

Latest commit

 

History

History
948 lines (939 loc) · 24.5 KB

nwk.org

File metadata and controls

948 lines (939 loc) · 24.5 KB
 \section{Introduction}
 Phylogenies are trees, which are often written using nested
 parentheses~\cite[p.312]{knu97:ar1}. For example,
 \[
 (A(B)(C(D)(E)))
 \] 
 corresponds to the tree
 \begin{center}
   \pstree[levelsep=1cm]{\Toval{A}}{
     \Toval{B}
     \pstree{\Toval{C}}{
	\Toval{D}
	\Toval{E}
     }
   }
 \end{center}
 Each node has a name and all branches have the same
 length. Phylogenies are often written in a specialized parenthesis
 notation, the Newick
 format\footnote{\ty{evolution.genetics.washington.edu/phylip/newick\_doc.html}}. In
 Newick format, the tree above is
 \[
 (B,(D,E)C)A;
 \]
 Now only internal nodes are delimited by parentheses and the tree is
 terminated by a semicolon, while subtrees are separated by
 commas. Phylogenies typically only contain leaf labels, interpreted as
 extant species, while the internal nodes, the common ancestors usually
 remain anonymous. That would be
 \[
 (B,(D,E));
 \]

 In addition to the topology of a phylogeny, its branch lengths are
 meaningful. In Newick format, branch lengths are delimited by colons,
 for example
 \[
 (B:1,(D:1,E:1));
 \]

 We can summarize the Newick format as a set of rules, where items that
 appear at most once are written in square brackets, items that appear
 between zero and many times in curly brackets:
 \begin{center}
   \begin{tabular}{rcl}
     tree & $\rightarrow$ & children[label][:length];\\
     children & $\rightarrow$ & (child\{,child\})\\
     child & $\rightarrow$ & children[label][:length]\\
     & $\rightarrow$ & label[:length]\\
     label & $\rightarrow$ & unquotedLabel\\
     & $\rightarrow$ & quotedLabel\\
     unquotedLabel & $\rightarrow$ & printingCharacters\\
     quotedLable & $\rightarrow$ & 'printingCharacters'\\
     length & $\rightarrow$ & signedNumber\\
     & $\rightarrow$ & unsignedNumber
   \end{tabular}
 \end{center}

 There are a few additional stipulations:
 \begin{itemize}
 \item Comments are enclosed in square brackets.
 \item White space is ignored everywhere except in quoted labels.
 \item Single quote characters in a quoted label are denoted by two
   single quotes.
 \item Underscores in unquoted labels are converted to blanks.
 \end{itemize}

 Our implementation consists of a scanner that returns phylogenies in
 Go notation. It is based on a scanner for the Go language contained in
 the package
 \[
 \ty{text/scanner}
 \]
\section{Implementation}
!Package \ty{nwk} implements data structures, methods, and functions
!to manipulate phylogenies in Newick format.

The outline of the \ty{nwk} package contains hooks for imports, types,
variables, methods, and functions.
package nwk

import (
	  //<<Imports>>
)
//<<Types>>
//<<Variables>>
//<<Methods>>
//<<Functions>>
\subsection{Data Structure \ty{Node}}
!A \ty{Node} is the basic unit of a Newick tree.
It holds an Id, references to parent, child, and sibling, a label,
a branch length, and an indicator whether it does have a branch
length. The indicator distinguishes the default length of zero from a
genuine branch length. There is also a hook for adding further fields.
type Node struct {
	  Id int
	  Child, Sib, Parent *Node
	  Label string
	  Length float64
	  HasLength bool
	  //<<Node fields>>
}
\subsubsection{Function \ty{NewNode}}
!\ty{NewNode} returns a new node with a unique Id.
func NewNode() *Node {
	  n := new(Node)
	  n.Id = nodeId
	  nodeId++
	  return n
}
The variable \ty{nodeId} is global and initialized to 1.
var nodeId = 1
\subsubsection{Method \ty{AddChild}}
!Method \ty{AddChild} adds a child node to a \ty{Node}.  Inside
\texttt{AddChild}, we set the parent link. The child is either
assigned to the \texttt{Child} link added to the sibling list.
func (n *Node) AddChild(v *Node) {
	  v.Parent = n
	  if n.Child == nil {
		  n.Child = v
	  } else {
		  //<<Scan sibling list>>
	  }
}
We walk along the sibling list and extend it by the new node.
w := n.Child
for w.Sib != nil {
	  w = w.Sib
}
w.Sib = v
\subsubsection{Method \ty{RemoveChild}}
!The method \ty{RemoveChild} removes a direct child node, if
!present. If not, it returns an error.
func (v *Node) RemoveChild(c *Node) error {
	  if v.Child == nil { return errors.New("no children") }
	  //<<Remove first child?>>
	  //<<Remove a sibling?>>
	  return nil
}
If the first child node is to be removed, it is reset to its sibling
and the sibling is detached from the tree.
if v.Child.Id == c.Id {
	  w := v.Child
	  v.Child = v.Child.Sib
	  w.Sib = nil
	  w.Parent = nil
	  return nil
}
If the removable child is among the siblings, the corresponding sibling
link is reset, and the excluded sibling is detached from the rest of
the tree.
w := v.Child
for w.Sib != nil && w.Sib.Id != c.Id {
	  w = w.Sib
}
if w.Sib == nil {
	  return errors.New("child not found")
} else {
	  x := w.Sib
	  w.Sib = w.Sib.Sib
	  x.Sib = nil
	  x.Parent = nil
}
\subsubsection{Method \ty{LCA}}
!The method \ty{LCA} returns the lowest common ancestor of two nodes
!or nil, if none could be found.
func (v *Node) LCA(w *Node) *Node {
	  clearPath(v)
	  clearPath(w)
	  markPath(v)
	  for w != nil && !w.marked {
		  w = w.Parent
	  }
	  return w
}
In \ty{clearPath} we remove the marks along the path from a node to
the root.
func clearPath(v *Node) {
	  for v != nil {
		  v.marked = false
		  v = v.Parent
	  }
}
In \ty{markPath} we mark the path from the node to the root.
func markPath(v *Node) {
	  for v != nil {
		  v.marked = true
		  v = v.Parent
	  }
}
We add the node field \ty{marked}.
marked bool
\subsubsection{Method \ty{UpDistance}}
! The method \ty{UpDistance} returns the distance between the node
! and one of its ancestors.
func (v *Node) UpDistance(w *Node) float64 {
	  s := 0.0
	  x := v
	  for x != nil && x.Id != w.Id {
		  s += x.Length
		  x = x.Parent
	  }
	  //<<Return distance>>
}
If the ancestor wasn't found, we abort with message.
if x == nil {
	  log.Fatal("can't find ancestor")
}
return s
We import \ty{errors}.
"errors"
\subsubsection{Method \ty{UniformLabels}}
! The method \ty{UniformLabels} labels all nodes in the subtree with
! a prefix followed by the node ID.
func (v *Node) UniformLabels(pre string) {
	  label(v, pre)
}
We use the function \ty{label} to recursively label the nodes.
func label(v *Node, pre string) {
	  if v == nil { return }
	  label(v.Child, pre)
	  label(v.Sib, pre)
	  v.Label = pre + strconv.Itoa(v.Id)
}
We import \ty{strconv}.
"strconv"
\subsubsection{Method \ty{String}}
Apart from reading trees, we can write them by implementing the method
\ty{String} on \ty{Node}.
!\ty{String} turns a tree into its Newick string.
func (n *Node) String() string {
	  w := new(bytes.Buffer)
	  writeTree(n, w)
	  return w.String()
}
We import \ty{bytes}.
"bytes"
To convert our tree to Newick, we traverse it and ask four questions
about each node, $v$:
\begin{enumerate}
\item Is $v$ not a first child? Then it is delimited by a comma.
\item Is $v$ an internal node? Then it is a recursive structure in
  parentheses.
\item Is $v$ the root? Then it is marked by a semicolon.
\end{enumerate}
func writeTree(v *Node, w *bytes.Buffer) {
	  if v == nil {
		  return
	  }
	  //<<Is $v$ not a first child?>>
	  //<<Is $v$ an internal node?>>
	  //<<Is $v$ the root?>>
}
A child is not the first child, if its Id is different from that of
its parent's first child.
if v.Parent != nil && v.Parent.Child.Id != v.Id {
	  fmt.Fprint(w, ",")
}
We import \ty{fmt}.
"fmt"
If a node has no child, it is a leaf characterized by its
label. Since internal nodes are also labeled, we use a function here,
\ty{printLabel}. 
if v.Child == nil {
	  printLabel(w, v)
}
In \ty{printLabel}, we first clean up the label. Then we print it,
followed, perhaps, by the branch length. The root has no parent and
hence no branch length.
func printLabel(w *bytes.Buffer, v *Node) {
	  label := v.Label
	  //<<Clean up label>>
	  fmt.Fprintf(w, "%s", label)
	  if v.HasLength && v.Parent != nil {
		  fmt.Fprintf(w, ":%.3g", v.Length)
	  }
}
To clean up the label, we check whether it contains parentheses,
commas, or full stops. In that case we convert single quotes to double
single quotes and print the label in single quotes. Otherwise, we
convert blanks to underscores and print an ordinary label.
if strings.IndexAny(label, "(),.") != -1 {
	  label = strings.ReplaceAll(label, "'", "''")
	  label = fmt.Sprintf("'%s'", label)
} else {
	  label = strings.ReplaceAll(label, " ", "_")
}
We import \ty{strings}.
"strings"
If $v$ is an internal node, we place the subtree rooted on $v$ in
parentheses and print the subsequent label.
if v.Child != nil {
	  fmt.Fprint(w, "(")
}
writeTree(v.Child, w)
printLabel(w, v)
writeTree(v.Sib, w)
if v.Parent != nil && v.Sib == nil {
	  fmt.Fprint(w, ")")
}
If $v$ is the root, we mark it by a semicolon.
if v.Parent == nil {
	  fmt.Fprint(w, ";")
}
\subsubsection{Method \ty{Print}}
! Method \ty{Print} prints nodes indented to form a tree. The code is
! taken from Sedgewick, R. (1998). Algorithms in C, Parts 1-4. 3rd
! Edition, p. 237.
func (v *Node) Print() string {
	  h := 0
	  var b []byte
	  buf := bytes.NewBuffer(b)
	  show(v, h, buf)
	  return buf.String()
}
We recursively traverse the tree in the function \ty{show}.
func show(v *Node, h int, b *bytes.Buffer) {
	  if v == nil { return }
	  show(v.Sib, h, b)
	  printNode(v.Label, h, b)
	  show(v.Child, h+1, b)
}
The function \ty{printNode} takes care of the correct indentation. If
a node has no label, we print it as an asterisk.
func printNode(l string, h int, b *bytes.Buffer) {
	  for i := 0; i < h; i++ {
		  fmt.Fprintf(b, "   ")
	  }
	  if len(l) == 0 { l = "*" }
	  fmt.Fprintf(b, "%s\n", l)
}
\subsubsection{Method \ty{Key}}
! Method \ty{Key} returns a string key for the nodes rooted on its
! receiver. The key consists of the sorted, concatenated labels of the
! nodes in the subtree. The labeles are joined on a separator supplied
! by the caller.
func (v *Node) Key(sep string) string {
	  labels := make(map[string]bool)
	  if v.Label != "" { labels[v.Label] = true }
	  collectLabels(v.Child, labels)
	  var keys []string
	  for k, _ := range labels {
		  keys = append(keys, k)
	  }
	  sort.Strings(keys)
	  key := strings.Join(keys, sep)
	  return key
}
We import \ty{sort}.
"sort"
The function \ty{collectLabels} iterates over the tree and collects
the non-empty labels in a map.
func collectLabels(v *Node, labels map[string]bool) {
	  if v == nil { return }
	  if v.Label != "" {
		  labels[v.Label] = true
	  }
	  collectLabels(v.Child, labels)
	  collectLabels(v.Sib, labels)
}
\subsubsection{Method \ty{RemoveClade}}
!The method \ty{RemoveClade} removes the clade rooted on the node on
!which it is called and sets that node to nil.

We begin by checking $v$, the node on which \ty{RemoveClade} is
called, the root of the clade we wish to remove. Then we check the
parent of $v$, $p$. Then we check the child of $p$, and finally we
check the siblings of $v$.
func (v *Node) RemoveClade() {
	  //<<Check $v$>>
	  //<<Check parent of $v$, $p$>>
	  //<<Check child of $p$>>
	  //<<Check siblings of $v$>>
}
If $v$ is nil, there's nothing to be removed and we return.
if v == nil {
	  return
}
If the parent of $v$ is nil, $v$ is the root and we remove the
clade by setting $v$ to nil before we return.
p := v.Parent
if p == nil {
	  v = nil
	  return
}
If $v$ is the direct child of $p$, we reset the child link of $p$, set
the child nil and return.
w := p.Child
if w.Id == v.Id {
	  p.Child = w.Sib
	  v = nil
	  return
}
Now we walk along the siblings of $v$ until we find a node $w$ whose
sibling link points to $v$. Then we reset the sibling link of $w$ to
the sibling of $v$ and set $v$ nil.
for w.Sib != nil && w.Sib.Id != v.Id {
	  w = w.Sib
}
w.Sib = w.Sib.Sib
v = nil
\subsubsection{Method \ty{CopyClade}}
!The method \ty{CopyClade} copies the clade rooted on the node on
!which it is called and returns a copy of the root of the new clade.

We copying the root using the function \ty{copyNode} before we copy
the rest of the tree using the function \ty{copyTree}.
func (v *Node) CopyClade() *Node {
	  w := copyNode(v)
	  w = copyTree(v, w)
	  return w
}
The function \ty{copyNode} takes as argument a node, $a$, which we
copy to $b$ before returning $b$.
func copyNode(a *Node) *Node {
	  if a == nil {
		  return a
	  }
	  b := NewNode()
	  b.Label = a.Label
	  b.Length = a.Length
	  b.HasLength = a.HasLength
	  b.marked = a.marked
	  return b
}
The function \ty{copyTree} takes as input a node to be copied, $a$,
and the parent of the copy, $p$. It then copies $a$ into $b$ and sets
$b$'s parent, unless $a$ is the root, in which case $b$'s parent
remains the default nil. Then we copy the child of $a$ and its
sibling.
func copyTree(a, p *Node) *Node {
	  b := copyNode(a)
	  if a.Parent != nil {
		  b.Parent = p
	  }
	  //<<Copy child>>
	  //<<Copy sibling>>
	  return b
}
If $a$ has a child, we copy it to the child of $b$ and pass $b$ as the
parent.
if a.Child != nil {
	  b.Child = copyTree(a.Child, b)
}
Similarly, if $a$ has a sibling, we copy it to the sibling of $b$ and
pass $b$'s parent as the parent.
if a.Sib != nil {
	  b.Sib = copyTree(a.Sib, b.Parent)
}
\subsection{Data Structure \ty{Scanner}}
!\ty{Scanner} scans an input file one tree at a time.
We achieve this by wrapping the reader provided by the \ty{bufio}
package. The scanner has a hook for additional fields.
type Scanner struct {
	  r *bufio.Reader
	  //<<Scanner fields>>
}
We import \ty{bufio}.
"bufio"
\subsubsection{Function \ty{NewScanner}}
! \ty{NewScanner} returns a scanner for scanning Newick-formatted
! phylogenies.
func NewScanner(r io.Reader) *Scanner {
	  sc := new(Scanner)
	  sc.r = bufio.NewReader(r)
	  return sc
}
We import \ty{io}.
"io"
\subsubsection{Method \ty{Scan}}
!The method \ty{Scan} advances the scanner by one tree. A tree starts
!at the first opening parenthesis encountered and ends at the next
!semi colon.
\ty{Scan} wraps the \ty{bufio}  method \ty{ReadString}.
func (s *Scanner) Scan() bool {
	  var err error
	  text, err := s.r.ReadString(';')
	  if err != nil {
		  return false
	  }
	  //<<Remove extraneous prefix>>
	  s.text = text
	  return true
}
Any character to the left of the first opening parenthesis must be
superfluous and is removed. If there is no opening parenthesis, we
report that there's no tree.
i := strings.Index(text, "(")
if i == -1 {
	  return false
}
text = text[i:]
We add the scanner field \ty{text}.
text string
\subsubsection{Method \ty{Tree}}
!The method \ty{Tree} returns the most recent tree scanned.
It takes the text-version of a tree and
prepares it for scanning. Then it scans the string and constructs the
tree, which is returned by its root.
func (s *Scanner) Tree() *Node {
	  var root *Node
	  var tokens []string
	  tree := s.Text()
	  //<<Prepare tree string for scanning>>
	  //<<Convert tree string to tokens>>
	  //<<Convert tokens to tree>>
	  return root
}
!The method \ty{Text} returns the text scanned most recently.
It wraps access to the \ty{text}
field.
func (s *Scanner) Text() string {
	  return s.text
}
We scan the tree string using the token scanner from the standard
library. It is intended for tokenizing Go code, so we convert Newick
comments to Go comments, single quotes to double quotes, and we quote
numbers.
//<<Convert comments to Go comments>>
//<<Convert single quotes to double quotes>>
//<<Quote numbers>>
We convert the square brackets of Newick comments to the corresponding
Go markers.
tree = strings.ReplaceAll(tree, "[", "/*")
tree = strings.ReplaceAll(tree, "]", "*/")
We convert the single quotes to double quotes. And since literal
single quotes are marked by two single quotes in Newick, we convert
double-double quotes back to single quotes.
tree = strings.ReplaceAll(tree, "'", "\"")
tree = strings.ReplaceAll(tree, "\"\"", "'")
To quote numbers, we convert the tree into a slice of runes and
iterate over them. For each rune we decide whether it is part of a
number or not.
c1 := []rune(tree)
var c2 []rune
isNum := false
for _, r := range c1 {
	  //<<Is number?>>
	  c2 = append(c2, r)
}
tree = string(c2)
Numbers start with a colon and end with a right parenthesis or a
comma. If the user gave the root a branch length, a number might also
be terminated by a semicolon. We include the starting colon in the
number string to make it easier later on to distinguish between branch
lengths and node labels.
if r == ':' {
	  isNum = true
	  c2 = append(c2, '"')
}
if isNum && (r == ',' || r == ';' || r == ' ' || r == ')') {
	  isNum = false
	  c2 = append(c2, '"')
}
The tree string is now ready to be split into its syntactic tokens.
Some of these tokes come with enclosing quotes, which we remove. The
other tokens might be ordinary labels, where we convert underscores to
blanks.
var tsc scanner.Scanner
tsc.Init(strings.NewReader(tree))
for t := tsc.Scan(); t != scanner.EOF;  t = tsc.Scan() {
	  text := tsc.TokenText()
	  if text[0] == '"' {
		  //<<Unquote token>>
	  } else {
		  //<<Convert underscores to blanks>>
	  }
	  tokens = append(tokens, text)
}
We import \ty{scanner}.
"text/scanner"
We unquote the token and check for errors.
var err error
text, err = strconv.Unquote(text)
if err != nil {
	  log.Fatalf("couldn't unquote %q\n", text)
}
We import \ty{strconv} and \ty{log}.
"strconv"
"log"
We convert all underscores to blanks.
text = strings.ReplaceAll(text, "_", " ")
We iterate across the tokens and classify them into tree elements.
i := 0
v := root
for i < len(tokens) {
	  t := tokens[i]
	  //<<Classify token>>
	  i++
}
root = v
There are five syntactic tokens, left and right parenthesis, comma,
colon, and semicolon. They each cause an operation on the tree:
\begin{center}
  \begin{tabular}{cl}
    \hline
    Token & Action\\\hline
    ( & Add child\\
    ) & Move to parent\\
    , & Add sibling\\
    : & Add branch length\\
    ; & Break from loop and return tree\\\hline
  \end{tabular}
\end{center}
If none of these apply, the token is a node label.
//<<Add child?>>
//<<Move to parent?>>
//<<Add sibling?>>
//<<Add branch length?>>
//<<Tree finished?>>
//<<Add label?>>
On a left parenthesis we add a child to our current node and
move to it. If the current node is nil, we initialize it first.
if t == "(" {
	  if v == nil {
		  v = NewNode()
	  }
	  v.AddChild(NewNode())
	  v = v.Child
}
On a right parenthesis we move to the parent.
if t == ")" {
	  v = v.Parent
}
On a comma, we add a sibling with the same parent as the current node
and move to it.
if t == "," {
	  s := NewNode()
	  s.Parent = v.Parent
	  v.Sib = s
	  v = v.Sib
}
On a leading colon we add a branch length.
if t[0] == ':' {
	  l, err := strconv.ParseFloat(t[1:], 64)
	  if err != nil {
		  log.Fatalf("didn't understand %q\n", t[1:])
	  }
	  v.Length = l
	  v.HasLength = true
}
On a semicolon the tree is finished and we break from the loop.
if t == ";" {
	  break
}
If none of the above, we are dealing with part of a label, if not the
full label.
if strings.IndexAny(t[:1], ")(,:;") == -1 {
	  v.Label += t
}
The package is finished, time to test it.