\section{Introduction}
Phylogenies are trees, which are often written using nested
parentheses~\cite[p.312]{knu97:ar1}. For example,
\[
(A(B)(C(D)(E)))
\]
corresponds to the tree
\begin{center}
\pstree[levelsep=1cm]{\Toval{A}}{
\Toval{B}
\pstree{\Toval{C}}{
\Toval{D}
\Toval{E}
}
}
\end{center}
Each node has a name and all branches have the same
length. Phylogenies are often written in a specialized parenthesis
notation, the Newick
format\footnote{\ty{evolution.genetics.washington.edu/phylip/newick\_doc.html}}. In
Newick format, the tree above is
\[
(B,(D,E)C)A;
\]
Now only internal nodes are delimited by parentheses and the tree is
terminated by a semicolon, while subtrees are separated by
commas. Phylogenies typically only contain leaf labels, interpreted as
extant species, while the internal nodes, the common ancestors usually
remain anonymous. That would be
\[
(B,(D,E));
\]
In addition to the topology of a phylogeny, its branch lengths are
meaningful. In Newick format, branch lengths are delimited by colons,
for example
\[
(B:1,(D:1,E:1));
\]
We can summarize the Newick format as a set of rules, where items that
appear at most once are written in square brackets, items that appear
between zero and many times in curly brackets:
\begin{center}
\begin{tabular}{rcl}
tree & $\rightarrow$ & children[label][:length];\\
children & $\rightarrow$ & (child\{,child\})\\
child & $\rightarrow$ & children[label][:length]\\
& $\rightarrow$ & label[:length]\\
label & $\rightarrow$ & unquotedLabel\\
& $\rightarrow$ & quotedLabel\\
unquotedLabel & $\rightarrow$ & printingCharacters\\
quotedLable & $\rightarrow$ & 'printingCharacters'\\
length & $\rightarrow$ & signedNumber\\
& $\rightarrow$ & unsignedNumber
\end{tabular}
\end{center}
There are a few additional stipulations:
\begin{itemize}
\item Comments are enclosed in square brackets.
\item White space is ignored everywhere except in quoted labels.
\item Single quote characters in a quoted label are denoted by two
single quotes.
\item Underscores in unquoted labels are converted to blanks.
\end{itemize}
Our implementation consists of a scanner that returns phylogenies in
Go notation. It is based on a scanner for the Go language contained in
the package
\[
\ty{text/scanner}
\]
\section{Implementation}
!Package \ty{nwk} implements data structures, methods, and functions
!to manipulate phylogenies in Newick format.
The outline of the \ty{nwk} package contains hooks for imports, types,
variables, methods, and functions.
package nwk
import (
//<<Imports>>
)
//<<Types>>
//<<Variables>>
//<<Methods>>
//<<Functions>>
\subsection{Data Structure \ty{Node}}
!A \ty{Node} is the basic unit of a Newick tree.
It holds an Id, references to parent, child, and sibling, a label,
a branch length, and an indicator whether it does have a branch
length. The indicator distinguishes the default length of zero from a
genuine branch length. There is also a hook for adding further fields.
type Node struct {
Id int
Child, Sib, Parent *Node
Label string
Length float64
HasLength bool
//<<Node fields>>
}
\subsubsection{Function \ty{NewNode}}
!\ty{NewNode} returns a new node with a unique Id.
func NewNode() *Node {
n := new(Node)
n.Id = nodeId
nodeId++
return n
}
The variable \ty{nodeId} is global and initialized to 1.
\subsubsection{Method \ty{AddChild}}
!Method \ty{AddChild} adds a child node to a \ty{Node}. Inside
\texttt{AddChild}, we set the parent link. The child is either
assigned to the \texttt{Child} link added to the sibling list.
func (n *Node) AddChild(v *Node) {
v.Parent = n
if n.Child == nil {
n.Child = v
} else {
//<<Scan sibling list>>
}
}
We walk along the sibling list and extend it by the new node.
w := n.Child
for w.Sib != nil {
w = w.Sib
}
w.Sib = v
\subsubsection{Method \ty{RemoveChild}}
!The method \ty{RemoveChild} removes a direct child node, if
!present. If not, it returns an error.
func (v *Node) RemoveChild(c *Node) error {
if v.Child == nil { return errors.New("no children") }
//<<Remove first child?>>
//<<Remove a sibling?>>
return nil
}
If the first child node is to be removed, it is reset to its sibling
and the sibling is detached from the tree.
if v.Child.Id == c.Id {
w := v.Child
v.Child = v.Child.Sib
w.Sib = nil
w.Parent = nil
return nil
}
If the removable child is among the siblings, the corresponding sibling
link is reset, and the excluded sibling is detached from the rest of
the tree.
w := v.Child
for w.Sib != nil && w.Sib.Id != c.Id {
w = w.Sib
}
if w.Sib == nil {
return errors.New("child not found")
} else {
x := w.Sib
w.Sib = w.Sib.Sib
x.Sib = nil
x.Parent = nil
}
\subsubsection{Method \ty{LCA}}
!The method \ty{LCA} returns the lowest common ancestor of two nodes
!or nil, if none could be found.
func (v *Node) LCA(w *Node) *Node {
clearPath(v)
clearPath(w)
markPath(v)
for w != nil && !w.marked {
w = w.Parent
}
return w
}
In \ty{clearPath} we remove the marks along the path from a node to
the root.
func clearPath(v *Node) {
for v != nil {
v.marked = false
v = v.Parent
}
}
In \ty{markPath} we mark the path from the node to the root.
func markPath(v *Node) {
for v != nil {
v.marked = true
v = v.Parent
}
}
We add the node field \ty{marked}.
\subsubsection{Method \ty{UpDistance}}
! The method \ty{UpDistance} returns the distance between the node
! and one of its ancestors.
func (v *Node) UpDistance(w *Node) float64 {
s := 0.0
x := v
for x != nil && x.Id != w.Id {
s += x.Length
x = x.Parent
}
//<<Return distance>>
}
If the ancestor wasn't found, we abort with message.
if x == nil {
log.Fatal("can't find ancestor")
}
return s
\subsubsection{Method \ty{UniformLabels}}
! The method \ty{UniformLabels} labels all nodes in the subtree with
! a prefix followed by the node ID.
func (v *Node) UniformLabels(pre string) {
label(v, pre)
}
We use the function \ty{label} to recursively label the nodes.
func label(v *Node, pre string) {
if v == nil { return }
label(v.Child, pre)
label(v.Sib, pre)
v.Label = pre + strconv.Itoa(v.Id)
}
\subsubsection{Method \ty{String}}
Apart from reading trees, we can write them by implementing the method
\ty{String} on \ty{Node}.
!\ty{String} turns a tree into its Newick string.
func (n *Node) String() string {
w := new(bytes.Buffer)
writeTree(n, w)
return w.String()
}
To convert our tree to Newick, we traverse it and ask four questions
about each node, $v$:
\begin{enumerate}
\item Is $v$ not a first child? Then it is delimited by a comma.
\item Is $v$ an internal node? Then it is a recursive structure in
parentheses.
\item Is $v$ the root? Then it is marked by a semicolon.
\end{enumerate}
func writeTree(v *Node, w *bytes.Buffer) {
if v == nil {
return
}
//<<Is $v$ not a first child?>>
//<<Is $v$ an internal node?>>
//<<Is $v$ the root?>>
}
A child is not the first child, if its Id is different from that of
its parent's first child.
if v.Parent != nil && v.Parent.Child.Id != v.Id {
fmt.Fprint(w, ",")
}
If a node has no child, it is a leaf characterized by its
label. Since internal nodes are also labeled, we use a function here,
\ty{printLabel}.
if v.Child == nil {
printLabel(w, v)
}
In \ty{printLabel}, we first clean up the label. Then we print it,
followed, perhaps, by the branch length. The root has no parent and
hence no branch length.
func printLabel(w *bytes.Buffer, v *Node) {
label := v.Label
//<<Clean up label>>
fmt.Fprintf(w, "%s", label)
if v.HasLength && v.Parent != nil {
fmt.Fprintf(w, ":%.3g", v.Length)
}
}
To clean up the label, we check whether it contains parentheses,
commas, or full stops. In that case we convert single quotes to double
single quotes and print the label in single quotes. Otherwise, we
convert blanks to underscores and print an ordinary label.
if strings.IndexAny(label, "(),.") != -1 {
label = strings.ReplaceAll(label, "'", "''")
label = fmt.Sprintf("'%s'", label)
} else {
label = strings.ReplaceAll(label, " ", "_")
}
If $v$ is an internal node, we place the subtree rooted on $v$ in
parentheses and print the subsequent label.
if v.Child != nil {
fmt.Fprint(w, "(")
}
writeTree(v.Child, w)
printLabel(w, v)
writeTree(v.Sib, w)
if v.Parent != nil && v.Sib == nil {
fmt.Fprint(w, ")")
}
If $v$ is the root, we mark it by a semicolon.
if v.Parent == nil {
fmt.Fprint(w, ";")
}
\subsubsection{Method \ty{Print}}
! Method \ty{Print} prints nodes indented to form a tree. The code is
! taken from Sedgewick, R. (1998). Algorithms in C, Parts 1-4. 3rd
! Edition, p. 237.
func (v *Node) Print() string {
h := 0
var b []byte
buf := bytes.NewBuffer(b)
show(v, h, buf)
return buf.String()
}
We recursively traverse the tree in the function \ty{show}.
func show(v *Node, h int, b *bytes.Buffer) {
if v == nil { return }
show(v.Sib, h, b)
printNode(v.Label, h, b)
show(v.Child, h+1, b)
}
The function \ty{printNode} takes care of the correct indentation. If
a node has no label, we print it as an asterisk.
func printNode(l string, h int, b *bytes.Buffer) {
for i := 0; i < h; i++ {
fmt.Fprintf(b, " ")
}
if len(l) == 0 { l = "*" }
fmt.Fprintf(b, "%s\n", l)
}
\subsubsection{Method \ty{Key}}
! Method \ty{Key} returns a string key for the nodes rooted on its
! receiver. The key consists of the sorted, concatenated labels of the
! nodes in the subtree. The labeles are joined on a separator supplied
! by the caller.
func (v *Node) Key(sep string) string {
labels := make(map[string]bool)
if v.Label != "" { labels[v.Label] = true }
collectLabels(v.Child, labels)
var keys []string
for k, _ := range labels {
keys = append(keys, k)
}
sort.Strings(keys)
key := strings.Join(keys, sep)
return key
}
The function \ty{collectLabels} iterates over the tree and collects
the non-empty labels in a map.
func collectLabels(v *Node, labels map[string]bool) {
if v == nil { return }
if v.Label != "" {
labels[v.Label] = true
}
collectLabels(v.Child, labels)
collectLabels(v.Sib, labels)
}
\subsubsection{Method \ty{RemoveClade}}
!The method \ty{RemoveClade} removes the clade rooted on the node on
!which it is called and sets that node to nil.
We begin by checking $v$, the node on which \ty{RemoveClade} is
called, the root of the clade we wish to remove. Then we check the
parent of $v$, $p$. Then we check the child of $p$, and finally we
check the siblings of $v$.
func (v *Node) RemoveClade() {
//<<Check $v$>>
//<<Check parent of $v$, $p$>>
//<<Check child of $p$>>
//<<Check siblings of $v$>>
}
If $v$ is nil, there's nothing to be removed and we return.
If the parent of $v$ is nil, $v$ is the root and we remove the
clade by setting $v$ to nil before we return.
p := v.Parent
if p == nil {
v = nil
return
}
If $v$ is the direct child of $p$, we reset the child link of $p$, set
the child nil and return.
w := p.Child
if w.Id == v.Id {
p.Child = w.Sib
v = nil
return
}
Now we walk along the siblings of $v$ until we find a node $w$ whose
sibling link points to $v$. Then we reset the sibling link of $w$ to
the sibling of $v$ and set $v$ nil.
for w.Sib != nil && w.Sib.Id != v.Id {
w = w.Sib
}
w.Sib = w.Sib.Sib
v = nil
\subsubsection{Method \ty{CopyClade}}
!The method \ty{CopyClade} copies the clade rooted on the node on
!which it is called and returns a copy of the root of the new clade.
We copying the root using the function \ty{copyNode} before we copy
the rest of the tree using the function \ty{copyTree}.
func (v *Node) CopyClade() *Node {
w := copyNode(v)
w = copyTree(v, w)
return w
}
The function \ty{copyNode} takes as argument a node, $a$, which we
copy to $b$ before returning $b$.
func copyNode(a *Node) *Node {
if a == nil {
return a
}
b := NewNode()
b.Label = a.Label
b.Length = a.Length
b.HasLength = a.HasLength
b.marked = a.marked
return b
}
The function \ty{copyTree} takes as input a node to be copied, $a$,
and the parent of the copy, $p$. It then copies $a$ into $b$ and sets
$b$'s parent, unless $a$ is the root, in which case $b$'s parent
remains the default nil. Then we copy the child of $a$ and its
sibling.
func copyTree(a, p *Node) *Node {
b := copyNode(a)
if a.Parent != nil {
b.Parent = p
}
//<<Copy child>>
//<<Copy sibling>>
return b
}
If $a$ has a child, we copy it to the child of $b$ and pass $b$ as the
parent.
if a.Child != nil {
b.Child = copyTree(a.Child, b)
}
Similarly, if $a$ has a sibling, we copy it to the sibling of $b$ and
pass $b$'s parent as the parent.
if a.Sib != nil {
b.Sib = copyTree(a.Sib, b.Parent)
}
\subsection{Data Structure \ty{Scanner}}
!\ty{Scanner} scans an input file one tree at a time.
We achieve this by wrapping the reader provided by the \ty{bufio}
package. The scanner has a hook for additional fields.
type Scanner struct {
r *bufio.Reader
//<<Scanner fields>>
}
\subsubsection{Function \ty{NewScanner}}
! \ty{NewScanner} returns a scanner for scanning Newick-formatted
! phylogenies.
func NewScanner(r io.Reader) *Scanner {
sc := new(Scanner)
sc.r = bufio.NewReader(r)
return sc
}
\subsubsection{Method \ty{Scan}}
!The method \ty{Scan} advances the scanner by one tree. A tree starts
!at the first opening parenthesis encountered and ends at the next
!semi colon.
\ty{Scan} wraps the \ty{bufio} method \ty{ReadString}.
func (s *Scanner) Scan() bool {
var err error
text, err := s.r.ReadString(';')
if err != nil {
return false
}
//<<Remove extraneous prefix>>
s.text = text
return true
}
Any character to the left of the first opening parenthesis must be
superfluous and is removed. If there is no opening parenthesis, we
report that there's no tree.
i := strings.Index(text, "(")
if i == -1 {
return false
}
text = text[i:]
We add the scanner field \ty{text}.
\subsubsection{Method \ty{Tree}}
!The method \ty{Tree} returns the most recent tree scanned.
It takes the text-version of a tree and
prepares it for scanning. Then it scans the string and constructs the
tree, which is returned by its root.
func (s *Scanner) Tree() *Node {
var root *Node
var tokens []string
tree := s.Text()
//<<Prepare tree string for scanning>>
//<<Convert tree string to tokens>>
//<<Convert tokens to tree>>
return root
}
!The method \ty{Text} returns the text scanned most recently.
It wraps access to the \ty{text}
field.
func (s *Scanner) Text() string {
return s.text
}
We scan the tree string using the token scanner from the standard
library. It is intended for tokenizing Go code, so we convert Newick
comments to Go comments, single quotes to double quotes, and we quote
numbers.
//<<Convert comments to Go comments>>
//<<Convert single quotes to double quotes>>
//<<Quote numbers>>
We convert the square brackets of Newick comments to the corresponding
Go markers.
tree = strings.ReplaceAll(tree, "[", "/*")
tree = strings.ReplaceAll(tree, "]", "*/")
We convert the single quotes to double quotes. And since literal
single quotes are marked by two single quotes in Newick, we convert
double-double quotes back to single quotes.
tree = strings.ReplaceAll(tree, "'", "\"")
tree = strings.ReplaceAll(tree, "\"\"", "'")
To quote numbers, we convert the tree into a slice of runes and
iterate over them. For each rune we decide whether it is part of a
number or not.
c1 := []rune(tree)
var c2 []rune
isNum := false
for _, r := range c1 {
//<<Is number?>>
c2 = append(c2, r)
}
tree = string(c2)
Numbers start with a colon and end with a right parenthesis or a
comma. If the user gave the root a branch length, a number might also
be terminated by a semicolon. We include the starting colon in the
number string to make it easier later on to distinguish between branch
lengths and node labels.
if r == ':' {
isNum = true
c2 = append(c2, '"')
}
if isNum && (r == ',' || r == ';' || r == ' ' || r == ')') {
isNum = false
c2 = append(c2, '"')
}
The tree string is now ready to be split into its syntactic tokens.
Some of these tokes come with enclosing quotes, which we remove. The
other tokens might be ordinary labels, where we convert underscores to
blanks.
var tsc scanner.Scanner
tsc.Init(strings.NewReader(tree))
for t := tsc.Scan(); t != scanner.EOF; t = tsc.Scan() {
text := tsc.TokenText()
if text[0] == '"' {
//<<Unquote token>>
} else {
//<<Convert underscores to blanks>>
}
tokens = append(tokens, text)
}
We unquote the token and check for errors.
var err error
text, err = strconv.Unquote(text)
if err != nil {
log.Fatalf("couldn't unquote %q\n", text)
}
We import \ty{strconv} and \ty{log}.
We convert all underscores to blanks.
text = strings.ReplaceAll(text, "_", " ")
We iterate across the tokens and classify them into tree elements.
i := 0
v := root
for i < len(tokens) {
t := tokens[i]
//<<Classify token>>
i++
}
root = v
There are five syntactic tokens, left and right parenthesis, comma,
colon, and semicolon. They each cause an operation on the tree:
\begin{center}
\begin{tabular}{cl}
\hline
Token & Action\\\hline
( & Add child\\
) & Move to parent\\
, & Add sibling\\
: & Add branch length\\
; & Break from loop and return tree\\\hline
\end{tabular}
\end{center}
If none of these apply, the token is a node label.
//<<Add child?>>
//<<Move to parent?>>
//<<Add sibling?>>
//<<Add branch length?>>
//<<Tree finished?>>
//<<Add label?>>
On a left parenthesis we add a child to our current node and
move to it. If the current node is nil, we initialize it first.
if t == "(" {
if v == nil {
v = NewNode()
}
v.AddChild(NewNode())
v = v.Child
}
On a right parenthesis we move to the parent.
if t == ")" {
v = v.Parent
}
On a comma, we add a sibling with the same parent as the current node
and move to it.
if t == "," {
s := NewNode()
s.Parent = v.Parent
v.Sib = s
v = v.Sib
}
On a leading colon we add a branch length.
if t[0] == ':' {
l, err := strconv.ParseFloat(t[1:], 64)
if err != nil {
log.Fatalf("didn't understand %q\n", t[1:])
}
v.Length = l
v.HasLength = true
}
On a semicolon the tree is finished and we break from the loop.
If none of the above, we are dealing with part of a label, if not the
full label.
if strings.IndexAny(t[:1], ")(,:;") == -1 {
v.Label += t
}
The package is finished, time to test it.