Skip to content

Latest commit

 

History

History
522 lines (518 loc) · 13.3 KB

dist.org

File metadata and controls

522 lines (518 loc) · 13.3 KB
\section{Introduction}
Distance matrices crop up in many contexts in biology, for example,
phylogeny reconstruction. In phylogeny reconstruction, the most widely
used format for distance matrices is the
PHYLIP\footnote{\texttt{https://evolution.genetics.washington.edu/phylip.html}}
format, which consists of the number of taxa, $n$, followed by $n$
rows of data, each preceded by the taxon name, Table~\ref{tab:dm}
shows an example with columns separated by white space, that is,
blanks or tabs. Notice that the matrix is symmetrical, except for one
pair of entries, the top right cell is 0.186, the bottom let hand cell
0.184.
\begin{table}
  \caption{Distance matrix in PHYLIP format.}\label{tab:dm}
  \begin{center}
    \input{dm}
  \end{center}
\end{table}

There
may be several distance matrices in a data stream, and we ignore rows
that don't contain numbers between them. However, there can be no
extra lines between the number of taxa and the actual data. So we can summarize:

!Package \texttt{dist} implements data structures and functions for
!reading, writing, and manipulating distance matrices in PHYLIP format.
package dist
import (
	  //<<Imports>>
)
//<<Data structures>>
//<<Methods>>
//<<Functions>>
The package is centered on two data structures, \texttt{DistMat} for
storing a distance matrix, and \texttt{Scanner} for reading distance
matrices.

\section{Structure \texttt{DistMat}}
!A \texttt{DistMat} holds a square matrix of distances and a
!slice of taxon names.
type DistMat struct {
	  Matrix [][]float64
	  Names []string
}
\subsection{Function \texttt{NewDistMat}}
!Function NewSequence returns a new n x n \texttt{DistMat}.
func NewDistMat(n int) *DistMat {
	  d := new(DistMat)
	  d.Matrix = make([][]float64, n)
	  for i := 0; i < n; i++ {
		  d.Matrix[i] = make([]float64, n)
	  }
	  d.Names = make([]string, n)
	  return d
}
\subsection{Method \texttt{String}}
Writing a \texttt{DistMat} is delegated to its \texttt{String} method.
!\texttt{String} returns a distance matrix as a table.
We construct this table using a tab writer.
func (d *DistMat) String() string {
	  mat := ""
	  //<<Construct tab writer>>
	  //<<Write number of taxa>>
	  //<<Write matrix>>
	  return mat
}
Our tab writer writes to a byte buffer. The columns of the table
written to the buffer are delimited with blanks.
var buf []byte
buffer := bytes.NewBuffer(buf)
w := new(tabwriter.Writer)
w.Init(buffer, 1, 0, 2, ' ', 0)
We import \ty{bytes} and \ty{tabwriter}.
"bytes"
"text/tabwriter"
We write the number of taxa and flush the tab writer, as the number of
taxa is not part of the matrix table.
n := len(d.Matrix)
fmt.Fprintf(w, "%d\n", n)
w.Flush()
We import \ty{fmt}.
"fmt"
We iterate over the matrix and write its rows. At the end, we convert
the buffer contents to string.
for i := 0; i < n; i++ {
	  fmt.Fprintf(w, "%s", d.Names[i])
	  for j := 0; j < n; j++ {
		  fmt.Fprintf(w, "\t%.3g", d.Matrix[i][j])
	  }
	  fmt.Fprint(w, "\n")
}
w.Flush()
mat = buffer.String()
\subsection{Method \texttt{MakeSymmetrical}}
!The method \texttt{MakeSymmetrical} forms the arithmetic average between
!corresponding entries in a \texttt{DistMat}.
func (d *DistMat) MakeSymmetrical() {
	  n := len(d.Matrix)
	  for i := 0; i < n-1; i++ {
		  for j := i+1; j < n; j++ {
			  m1 := d.Matrix[i][j]
			  m2 := d.Matrix[j][i]
			  a := (m1 + m2) / 2.0
			  d.Matrix[i][j] = a
			  d.Matrix[j][i] = a
		  }
	  }
}
\subsection{Method \texttt{Delete}}
!The method \texttt{Delete} removes one taxon from
!\texttt{DistMat}.
We remove the taxon from the names list and from the distance matrix.
func (d *DistMat) Delete(taxon int) {
	  //<<Delete taxon from names>>
	  //<<Delete taxon from distances>>
}
We delete the taxon from the list of names by reslicing.
j := 0
for i, name := range d.Names {
	  if i != taxon {
		  d.Names[j] = name
		  j++
	  }
}
d.Names = d.Names[:j]
For each row of distances, we check whether it's to be included and,
if so, add it as row \ty{ii}. Afterwards, we adjust the matrix
dimensions.
n := len(d.Matrix)
ii := 0
for i := 0; i < n; i++ {
	  if i == taxon { continue }
	  //<<Add row>>
	  ii++
}
//<<Adjust matrix dimensions>>
In the row we add, we skip the deleted column.
jj := 0
for j := 0; j < n; j++ {
	  if j == taxon { continue }
	  d.Matrix[ii][jj] = d.Matrix[i][j]
	  jj++
}
We use reslicing to remove the last row and column.
n--
for i := 0; i < n; i++ {
	  d.Matrix[i] = d.Matrix[i][:n]
}
d.Matrix = d.Matrix[:n]
\subsection{Method \texttt{DeletePair}}
In phylogeny reconstruction we often need to delete a pair of taxa in
one step.
!Method \texttt{DeletePair} deletes a pair of taxa.
func (d *DistMat) DeletePair(t1, t2 int) {
	  //<<Delete taxon pair from names>>
	  //<<Delete taxon pair from matrix>>
}
We delete the two names.
j := 0
for i, n := range d.Names {
	  if i != t1 && i != t2 {
		  d.Names[j] = n
		  j++
	  }
}
d.Names = d.Names[:j]
We delete the two taxa from the distance matrix by first deleting
their rows, then their columns.
//<<Delete rows>>
//<<Delete columns>>
We keep the rows that don't correspond to the deleted taxa.
r := 0
for i, row := range d.Matrix {
	  if i == t1 || i == t2 { continue }
	  d.Matrix[r] = row
	  r++
}
d.Matrix = d.Matrix[:r]
Similarly, we keep the columns that don't correspond to the deleted
taxa.
for i, row := range d.Matrix {
	  c := 0
	  for j, col := range row {
		  if j == t1 || j == t2 { continue }
		  row[c] = col
		  c++
	  }
	  d.Matrix[i] = row[:c]
}
\subsection{Method \texttt{Append}}
!Method \texttt{Append} appends a taxon consisting of a name and a
!slice of distances to the \texttt{DistMat}.
We first check that the incoming data and the existing matrix have the
same dimensions. Then we append the name, followed by a row and a
column of distances.
func (d *DistMat) Append(name string, data []float64) {
	  //<<Check dimensions>>
	  d.Names = append(d.Names, name)
	  //<<Append distance row>>
	  //<<Append distance column>>
}
If the data supplied doesn't have the same dimension as the matrix,
something has gone wrong and we bail with a friendly message.
n := len(d.Matrix)
if n != len(data) {
	  i := "matrix length (%d) != data length (%d)"
	  log.Fatalf(i, n, len(data))
}
We import \ty{log}.
"log"
We extend the data by the last zero on the main diagonal and append it
as the last row.
data = append(data, 0.0)
d.Matrix = append(d.Matrix, data)
We go through the rows and append the last column to each.
for i := 0; i < n; i++ {
	  d.Matrix[i] = append(d.Matrix[i], data[i])
}
\subsection{Method \texttt{Min}}
!Method \texttt{Min} returns the minimum entry of the distance matrix
!and its indexes.
Each off-diagonal entry is analyzed.
func (d *DistMat) Min() (min float64, minI, minJ int) {
	  min = math.MaxFloat64
	  n := len(d.Names)
	  for i := 0; i < n; i++ {
		  for j := 0; j < n; j++ {
			  if i != j {
				  //<<Analyze entry>>
			  }
		  }
	  }
	  return min, minI, minJ
}
We import \ty{math}.
"math"
An entry is compared to the current minimum.
if d.Matrix[i][j] < min {
	  min = d.Matrix[i][j]
	  minI = i
	  minJ = j
}
\subsection{Method \texttt{Max}}
!Method \texttt{Max} returns the maximum entry of the distance matrix
!and its indexes.
At each off-diagonal entry we search for the maximum.
func (d *DistMat) Max() (min float64, maxI, maxJ int) {
	  max := -math.MaxFloat64
	  n := len(d.Names)
	  for i := 0; i < n; i++ {
		  for j := 0; j < n; j++ {
			  if i != j {
				  //<<Search for maximum>>
			  }
		  }
	  }
	  return max, maxI, maxJ
}
We search for the maximum.
if max < d.Matrix[i][j] {
	  max = d.Matrix[i][j]
	  maxI = i
	  maxJ = j
}
\section{Structure \texttt{Scanner}}
!A \texttt{DistMat} is read using a \texttt{Scanner}.
It is modeled on \texttt{bufio.Scanner}, but can handle lines of any
length, hence it wraps a buffered reader rather than a buffered scanner. We also
add a hook for additional fields.
type Scanner struct {
	  r *bufio.Reader
	  //<<Scanner fields>>
}
We import \texttt{bufio}.
"bufio"
\subsection{Function \ty{NewScanner}}
!The function \texttt{NewScanner} returns a new \ty{Scanner}.
func NewScanner(r io.Reader) *Scanner {
	  sc := new(Scanner)
	  sc.r = bufio.NewReader(r)
	  return sc
}
We import \ty{io}.
"io"
When parsing a distance matrix, we read three types of lines: Lines
without numbers between matrices, which we skip, a header line
containing a number, $n$, or one of the $n$ matrix lines.
\subsection{Method \texttt{Scan}}
!\texttt{Scan} reads input matrix by matrix.
func (s *Scanner) Scan() bool {
	  //<<Skip to header line>>
	  //<<Read header>>
	  //<<Read matrix>>
	  return true
}
Lines between matrices don't contain numbers.
var err error
const num = "1234567890"
l, err := s.r.ReadString('\n')
for err == nil && strings.IndexAny(l, num) < 0 {
	  l, err = s.r.ReadString('\n')
}
if err != nil { return false }
We import \ty{strings}.
"strings"
We read the header and construct a new distance matrix.
l = strings.TrimRight(l, "\r\n")
n, err := strconv.Atoi(l)
if err != nil {
	  log.Fatalf("can't convert %q", l)
}
s.mat = NewDistMat(n)
We import \ty{strconv}.
"strconv"
We declare the scanner field \ty{mat}.
mat *DistMat
We read the $n$ rows of the distance matrix and store each one.
for i := 0; i < n; i++ {
	  l, err = s.r.ReadString('\n')
	  if err != nil { return false }
	  fields := strings.Fields(l)
	  //<<Store row>>
}
The first field in a row is the name, the other fields are the
distances. Parsing them might lead to an error, which we
catch. Alternatively, it might lead to a not-a-number
value, which we warn about.
s.mat.Names[i] = fields[0]
for j := 1; j <= n; j++ {
	  s.mat.Matrix[i][j-1], err =
		  strconv.ParseFloat(fields[j], 64)
	  if err != nil {
		  log.Fatal(err)
	  } else if math.IsNaN(s.mat.Matrix[i][j-1]) {
		  fmt.Fprintf(os.Stderr, "Warning: %s\n", fields[j])
	  }
}
We import \ty{fmt} and \ty{os}.
"fmt"
"os"
\subsection{Method \texttt{DistanceMatrix}}
!The method \texttt{DistanceMatrix} returns the last \ty{DistMat}
!scanned.
func (s *Scanner) DistanceMatrix() *DistMat {
	  return s.mat
}