\section{Introduction}
Consider the text $t=\texttt{abracadabra}$ and think for a moment of
sorcerers, or the eponymous song by the Steve Miller Band. For it is
almost magical what we can do when we rearrange $t$. For example,
let's search it for a pattern like $p=\texttt{abra}$. You might think
that in the best case this takes time proportional to the length of
$t$. But it is possible to search in time proportional to the length
of $p$. If $t$ is the human genome with its 3.2 billion nucleotides
and $p$ is a PCR primer of 20 nucleotides, searching in time
proportional to the length of $p$ rather than $t$ might make a huge
difference. The data structure that makes this feat possible is
called a \emph{suffix tree}~\cite{gus97:alg} shown in
Figure~\ref{fig:abr} for \ty{abracadabra}.
\begin{figure}
\begin{center}
\input{abr}
\end{center}
\caption{Suffix tree of \ty{abracadabra}.}\label{fig:abr}
\end{figure}
A suffix tree is usually constructed from its corresponding
\emph{suffix array}, which starts from the suffixes of $t$ listed in
Table~\ref{tab:suf}A.
\begin{table}
\caption{The suffixes of text $t=\texttt{abracadabra}$ (\textbf{A})
and its suffix array, $\mbox{sa}$ (\textbf{B}).}\label{tab:suf}
\begin{center}
\begin{tabular}{cc}
\textbf{A} & \textbf{B}\\
\begin{tabular}{cl}
\hline
$i$ & $\mbox{suf}[i]$\\\hline
\input{suf}\\\hline
\end{tabular}
&
\begin{tabular}{ccl}
\hline
$i$ & $\mbox{sa}[i]$ & $\mbox{suf}[i]$\\\hline
\input{sa}\\\hline
\end{tabular}
\end{tabular}
\end{center}
\end{table}
By sorting them alphabetically, we obtain the suffix array of $t$,
$\mbox{sa}$, in Table~\ref{tab:suf}B. Strictly speaking, a suffix array
only consists of the array of starting positions, but explicit
suffixes are easier to think about than their mere starting positions,
hence the three columns in Table~\ref{tab:suf}B.
In the suffix array, some neighboring suffixes have matching
prefixes. For example, the first suffix, $\mbox{sa}[1]=\texttt{a}$
matches the first position of its successor,
$\mbox{sa}[2]=\texttt{abra}$, which in turn matches the first four
positions of its successor, $\mbox{sa}[3]=\texttt{abracadabra}$. The
lengths of these common prefixes, also known as the \emph{longest
common prefix}, $\mbox{lcp}$, is added to the suffix array in
Table~\ref{tab:esa}A. The longest common prefix array is said to
\emph{enhance} the underlying suffix array, because the combination of
suffix array and $\mbox{lcp}$ array can be used to implement suffix trees
with their near magical search properties~\cite{abo02:enh}.
\begin{figure}
\caption{Lcp interval tree of \ty{abracadabra}.}\label{fig:abrI}
\begin{center}
\input{abrI}
\end{center}
\end{figure}
An additional data structure for getting from suffix array to suffix
tree is the child
array~\cite[Sec. 4.3.4]{ohl13:bio}. Table~\ref{tab:esa}B shows the
suffix array enhanced by the $\mbox{lcp}$ array and the child array,
$\mbox{cld}$. This data structure allows efficient top-down traversal
of the so-called ``lcp interval tree''. Figure~\ref{fig:abrI} shows
the lcp interval tree of the corresponding suffix tree in
Figure~\ref{fig:abr}. The lcp interval tree has the topology of the
suffix tree when stripped of its terminal branches, leaving only the
internal nodes. Given an interval in the lcp interval tree and a
character, the function \ty{GetInterval} returns the child interval
whose edge-label starts with the character. The function
\ty{MatchPref} takes as input a pattern and repeatedly calls
\ty{GetInterval} to find the longest common prefix between the text
and the pattern.
\begin{table}
\caption{Suffix array of $t=\texttt{abracadabra}$ enhanced by the
$\mbox{lcp}$ array (\textbf{A}), and, additionally, by the child
array, $\mbox{cld}$ (\textbf{B}).}\label{tab:esa}
\begin{center}
\newcommand{\x}{0.90}
\begin{tabular}{cc}
\textbf{A} & \textbf{B}\\
\scalebox{\x}{
\begin{tabular}{cccl}
\hline
$i$ & $\mbox{sa}[i]$ & $\mbox{lcp}[i]$ & $\mbox{suf}[i]$\\\hline
\input{esaTab}\\\hline
\end{tabular}
}
&
\scalebox{\x}{
\begin{tabular}{ccccl}
\hline
$i$ & $\mbox{sa}[i]$ & $\mbox{lcp}[i]$ & $\mbox{cld}[i]$ & $\mbox{suf}[i]$\\\hline
\input{esaTab2}\\\hline
\end{tabular}
}
\end{tabular}
\end{center}
\end{table}
\section{Implementation}
!Package \texttt{esa} provides functions for the computation of
!suffix arrays and related data structures. It is based on Yuta Mori's
!highly optimized divsufsort library published at
!\texttt{https://github.com/y-256/libdivsufsort}
The divsufsort library is written in C, so our package outline
contains hooks for bridging the gap between Go and C, as well as for
imports, types, methods, and functions.
package esa
//<<Bridge to C>>
import (
//<<Imports>>
)
//<<Types>>
//<<Methods>>
//<<Functions>>
The bridge to C consists of three elements, \texttt{cgo} instructions,
includes, and the import of package \texttt{C}. As explained in
\cite[p. 361ff]{don16:go}, the \texttt{cgo} instructions and the
includes are commented out, while the import of \texttt{C} is regular
Go code.
/*
//<<Cgo>>
//<<Includes>>
*/
import "C"
There is only one \texttt{cgo} instruction, a linker flag refering to
the 64-bit version of the divsufsort library. This means we assume we
are working on a 64 bit machine. We also accommodate homebrew on macOS.
#cgo LDFLAGS: -ldivsufsort64 -L/opt/homebrew/lib
#cgo CFLAGS: -I/opt/homebrew/include
We include the corresponding header.
#include <divsufsort64.h>
\subsection{Function \texttt{Sa}}
!\texttt{Sa} takes as argument a text as byte slice and returns the
!corresponding suffix array.
func Sa(t []byte) []int {
var sa []int
//<<Compute suffix array>>
return sa
}
The prototype of the library call for suffix array construction is
\begin{verbatim}
saint_t
divsufsort64(const sauchar_t *t, saidx64_t *sa, saidx64_t n);
\end{verbatim}
where \ty{t} is the text of length \texttt{n}, and \texttt{sa} its
suffix array. The function returns 0 upon success, -1 or -2
otherwise. To use it, we convert the text to C, allocate space for the
suffix array in C, and convert the text length to C. Then we call
\texttt{divsufsort64} with these three arguments, and finally convert
the result back to Go.
//<<Convert text>>
//<<Allocate suffix array>>
//<<Convert text length>>
//<<Call \texttt{divsufsort}>>
//<<Convert result>>
To convert the text byte slice into a pointer to \texttt{char}, we
obtain the slice header and convert its data portion using the types
listed in the prototype of \texttt{divsufsort64}.
header := (*reflect.SliceHeader)(unsafe.Pointer(&t))
ct := (*C.sauchar_t)(unsafe.Pointer(header.Data))
We import \texttt{reflect} and \texttt{unsafe}.
We allocate the suffix array by calling \texttt{malloc} of the
standard C library.
n := len(t)
csa := (*C.saidx64_t)(C.malloc(C.size_t(n * C.sizeof_saidx64_t)))
We include the header of the C standard library.
We cast the text length to its C equivalent.
If \texttt{divsufsort} returns an error, we abort with an error
message.
err := int(C.divsufsort64(ct, csa, cn))
if err != 0 {
log.Fatalf("divsufsort failed with code %d\n", err)
}
To convert the C suffix array, \texttt{csa}, back to the integer slice
promised, we access the header of \ty{sa} and cast \ty{csa} to its Go
equivalent.
header = (*reflect.SliceHeader)((unsafe.Pointer(&sa)))
header.Cap = n
header.Len = n
header.Data = uintptr(unsafe.Pointer(csa))
\subsection{Function \texttt{Lcp}}
!\texttt{Lcp} takes as argument a text and its suffix array, and
!returns the longest common prefix array, lcp.
The implementation is based on Algorithm~\ref{alg:lcp}, which runs in
time linear in the length of the text. The algorithm proceeds in two
phases, computation of the inverse suffix array, $\mbox{isa}$ (lines 1--3), which
is then used for the computation of the $\mbox{lcp}$ array (lines
6--16). In between, we initialize the $\mbox{lcp}$ array and the
length of the last common prefix found, $\ell$ (lines 4 \& 5).
\begin{algorithm}
\caption{Linear-time computation of $\mbox{lcp}$
array~\cite{kas01:lin}.}\label{alg:lcp}
\begin{algorithmic}[1]
\input{lcpAlg}
\end{algorithmic}
\end{algorithm}
func Lcp(t []byte, sa []int) []int {
n := len(t)
lcp := make([]int, n)
isa := make([]int, n)
//<<Compute inverse suffix array>>
//<<Initialize variables>>
//<<Fill $\mbox{lcp}$ array>>
return lcp
}
The inverse suffix array is computed by transcribing lines 1--3 of
Algorithm~\ref{alg:lcp}.
for i := 0; i < n; i++ {
isa[sa[i]] = i
}
We initialize the $\mbox{lcp}$ array and $\ell$.
We transcribe lines 6--16 of Algorithm~\ref{alg:lcp} to fill the
$\mbox{lcp}$ array.
for i := 0; i < n; i++ {
j := isa[i]
if j == 0 { continue }
k := sa[j - 1]
for k+l < n && i+l < n && t[k+l] == t[i+l] { l++ }
lcp[j] = l
l -= 1
if l < 0 { l = 0 }
}
\subsection{Function \ty{Cld}}
!\texttt{Cld} takes an lcp array as input and returns the
!corresponding child array.
\ty{Cld} implements Algorithm~\ref{alg:cld} with his space-saving
notation that $\mbox{cld}[i].r\equiv\mbox{cld}[i]$ and
$\mbox{cld}[i].l\equiv\mbox{cld}[i-1]$~\cite[p. 109f]{ohl13:bio}.
\begin{algorithm}
\caption{Computation of child table~\cite[p. 109f]{ohl13:bio}.}\label{alg:cld}
\begin{algorithmic}[1]
\input{cldAlg}
\end{algorithmic}
\end{algorithm}
The child array is based on the lcp interval with a stop entry
appended. We first construct and initialize the child array. Then we
construct and initialize the stack of tree intervals. Then we iterate
over the child array to fill it. Finally, we remove the stop entry
from the lcp array.
func Cld(lcp []int) []int {
var cld []int
lcp = append(lcp, -1)
//<<Construct and initialize child array>>
//<<Construct and initialize interval stack>>
//<<Iterate over child array>>
lcp = lcp[:len(lcp)-1]
return cld
}
The length of the input text is $n$, which is one less than the length
of the given lcp array. The child array has length $n+1$ and we
initialize its first entry to $n$.
n := len(lcp) - 1
cld = make([]int, n + 1)
cld[0] = n
To prepare the construction of the interval stack, we declare it as a
slice of pointers to intervals. Then we declare a tree interval as
consisting of its start index and its lcp value.
type Stack []*Interval
type Interval struct {
Idx int
Lcp int
}
Our stack of intervals has the usual stack methods, \ty{Top},
\ty{Pop}, and \ty{Push}.
func (s *Stack) Top() *Interval {
return (*s)[len(*s) - 1]
}
func (s *Stack) Pop() *Interval {
i := (*s)[len(*s) - 1]
(*s) = (*s)[0:len(*s) - 1]
return i
}
func (s *Stack) Push(i *Interval) {
(*s) = append(*s, i)
}
Having declared our stack type, we construct an actual stack. We
initialize it with the root interval, which we construct using a call
to \ty{newInterval}.
stack := new(Stack)
iv := newInterval(0, -1)
stack.Push(iv)
The function \ty{newInterval} takes as arguments the index and the lcp
value of the interval, and returns a pointer to the newly minted
interval.
func newInterval(i, l int) *Interval {
iv := new(Interval)
iv.Idx = i
iv.Lcp = l
return iv
}
We iterate over the child array, searching for the next tree interval,
which we push on the stack.
for i := 1; i <= n; i++ {
top := stack.Top()
//<<Find next interval>>
iv = newInterval(i, lcp[i])
stack.Push(iv)
}
While looking for the next interval, we fill the child array.
for lcp[i] < top.Lcp {
last := stack.Pop()
top = stack.Top()
//<<Fill child array>>
}
First we fill the right entries of the child array for intervals with
identical string depths. The last interval we reach either refers to
a right child or a left child.
for top.Lcp == last.Lcp {
cld[top.Idx] = last.Idx
last = stack.Pop()
top = stack.Top()
}
top = stack.Top()
if lcp[i] < top.Lcp {
cld[top.Idx] = last.Idx
} else {
cld[i - 1] = last.Idx
}
\subsection{Function \ty{MakeEsa}}
!\texttt{MakeEsa} takes as input a text and returns a pointer to the
!corresponding enhanced suffix array.
We append a sentinel to the text and a stop element to the lcp array.
func MakeEsa(t []byte) *Esa {
esa := new(Esa)
esa.T = t
esa.T = append(esa.T, 0)
esa.Sa = Sa(esa.T)
esa.Lcp = Lcp(esa.T, esa.Sa)
esa.Lcp = append(esa.Lcp, -1)
esa.Cld = Cld(esa.Lcp)
return esa
}
\subsection{Type \ty{Esa}}
!\texttt{Esa} is an enhanced suffix array consisting of the underlying
!text, the suffix array, the longest common prefix array, and the
!child array.
type Esa struct {
T []byte
Sa []int
Lcp []int
Cld []int
}
\subsection{Method \texttt{MatchPref}}
!Method \texttt{MatchPref} takes as input a query sequence and returns
!the match with the longest prefix of the query.
I follow Algorithm 5.2 in~\cite[p. 119]{ohl13:bio}. We set the pattern
position, $k$, to the beginning and call the pattern length $m$. Then
we declare a parent and a child interval as a match interval, which is
yet to be implemented. The first parent interval we construct, is the
root of the tree. Starting from the root, we iterate over the pattern
and walk the tree until we return the child representing the longest
match.
func (e *Esa) MatchPref(p []byte) *Minterval {
k := 0
m := len(p)
var parent, child *Minterval
//<<Construct parent>>
//<<Iterate over pattern>>
child.L = k
return child
}
!A \texttt{Minterval} is a match interval. It has a beginning, an end,
!and a match length.
type Minterval struct {
I, J int
L int
}
At this point, the parent interval is the root of the lcp interval
tree and spans the full lcp array.
parent = new(Minterval)
parent.I = 0
parent.J = len(e.T) - 1
We iterate over the pattern and get the child for the character at the
current pattern position, $k$. If this child is nil, we've run out of
matches. In that case, we set the parent's match length to $k$ and
return it. Otherwise, we initialize the length of the current path
label, $l$, to $m$ and determine its correct value. Having determined
$l$, we can match along that path label. After matching, we adjust the
pattern position to the length of the path label.
for k < m {
child = e.GetInterval(parent, p[k])
if child == nil {
parent.L = k
return parent
}
l := m
//<<Determine length of path label>>
//<<Match path label>>
k = l
}
If the child interval is not a singleton interval, we determine the
length of the path label of the child array.
i := child.I
j := child.J
if i < j {
r := 0
if e.Lcp[i] <= e.Lcp[j+1] {
r = e.Cld[j]
} else {
r = e.Cld[i]
}
l = min(l, e.Lcp[r])
}
The function \ty{min} returns the smaller of two integers.
func min(i, j int) int {
if i < j {
return i
}
return j
}
We walk along the path label until we run out of matches. If we find a
mismatch, we set the child's match length and return it.
for w := k+1; w < l; w++ {
if e.T[e.Sa[i]+w] != p[w] {
child.L = w
return child
}
}
\subsection{Method \ty{GetInterval}}
!Method \texttt{GetInterval} takes as input an
!enhanced suffix array, a match interval, and a character. It returns
!the interval of the suffixes starting with that character.
I follow Algorithm~\ref{alg:giv}, which is adapted from
two Algorithms, Algorithm~5.1~\cite[p. 118]{ohl13:bio} and
Algorithm~4.10~\cite[p. 109]{ohl13:bio}.
\begin{algorithm}
\caption{Get matching child interval; adapted from Algorithms 5.1 and 4.10~\cite{ohl13:bio}}\label{alg:giv}.
\begin{algorithmic}
\input{givAlg}
\end{algorithmic}
\end{algorithm}
Inside \ty{GetInterval}, we first check whether we are dealing with a
singleton interval. If not, we iterate over the children of the proper
interval and pick the matching one. By default, we don't find a
matching child and hence return nil.
func (e *Esa) GetInterval(iv *Minterval, c byte) *Minterval {
i := iv.I
j := iv.J
if i == j {
//<<Deal with singleton interval>>
}
//<<Deal with proper interval>>
return nil
}
For a singleton interval, we check whether it's a match, in which case
we return it.
if e.T[e.Sa[i]] == c {
return iv
}
A proper interval is analyzed by determining the new match length and
then iterating with that over the children. During the iteration a
match may be found, which is returned. Finally, we check the last
child reached.
//<<Determine match length>>
//<<Iterate over child intervals>>
//<<Check last child>>
The match length is looked up in the lcp array with the aid of the
child array.
m := 0
if e.Lcp[i] <= e.Lcp[j+1] {
m = e.Cld[j]
} else {
m = e.Cld[i]
}
l := e.Lcp[m]
We iterate over the child intervals with the same lcp value, that is,
string depth. If we find a match, we return it. If we find a
singleton, we break from the loop.
k := i
for e.Lcp[m] == l {
if e.T[e.Sa[k]+l] == c {
iv.I = k
iv.J = m - 1
return iv
}
k = m
if k == j { break }
m = e.Cld[m]
}
We check the last interval.
if e.T[e.Sa[k]+l] == c {
iv.I = k
iv.J = j
return iv
}
\section{Testing}
The testing outline contains hooks for imports and the testing logic.
package esa
import (
"testing"
//<<Testing imports>>
)
func TestEsa(t *testing.T) {
//<<Testing>>
}
We compute a small suffix array and the corresponding $\mbox{lcp}$ and
cld arrays. Then we compare what we get to what we want. After that,
we test the matching.
//<<Compute small suffix array>>
//<<Compute $\mbox{lcp}$ array>>
//<<Compute cld array>>
//<<Compare results>>
//<<Test Matching>>
Our testing text is \texttt{abracadabra}. We compute its suffix array
by calling \ty{Sa}.
text := []byte("abracadabra")
sa := Sa(text)
Similarly, getting the $\mbox{lcp}$ array is just a call to \ty{Lcp}.
We compute the cld array from the lcp array by calling \ty{Cld}.
To compare the results, we write what we get and read from file what
we want.
var get, want []byte
//<<Write what we get>>
//<<Read what we want>>
if !bytes.Equal(want, get) {
t.Errorf("want:\n%s\nget:\n%s\n", want, get)
}
We import \texttt{bytes}.
We write the results we get to a buffer.
w := new(bytes.Buffer)
for i, s := range sa {
fmt.Fprintf(w, "%d\t%d\t%d\t%d\t%s\n",
i, s, lcp[i], cld[i], text[s:])
}
get = w.Bytes()
The results we want are contained in the file \texttt{r1.txt}.
want, err := ioutil.ReadFile("r1.txt")
if err != nil {
t.Errorf("couldn't read r1.txt\n")
}
We import \texttt{ioutil}.
We generate an enhanced suffix array for our text and match
\ty{racket} against it. The result should be a match of length 3,
starting at a single index position, 2, in the text.
e := MakeEsa(text)
pattern := []byte("racket")
iv := e.MatchPref(pattern)
if iv.L != 3 || e.Sa[iv.I] != 2 || e.Sa[iv.J] != 2 {
t.Errorf("couldn't match %s\n", pattern)
}