| 1 | +\section{Model data format}\label{sec:model-data-format} |
| 2 | + |
| 3 | +Because the model data does not need to be very concise, MessagePack has been |
| 4 | +chosen as the data storage format for the models. |
| 5 | +The file contains the model type (acids or quality scores), context specifier |
| 6 | +type (described in \Cref{subsec:contexts}), list of contexts (symbol |
| 7 | +probabilities), and map of context specifiers to context indices. |
| 8 | +Precisely, the data stored in such a MessagePack file corresponds to the |
| 9 | +following JSON file: |
| 10 | + |
| 11 | +\colorlet{punct}{red!60!black} |
| 12 | +\definecolor{background}{HTML}{EEEEEE} |
| 13 | +\definecolor{delim}{RGB}{20,105,176} |
| 14 | +\colorlet{numb}{magenta!60!black} |
| 15 | + |
| 16 | +\lstdefinelanguage{json}{ |
| 17 | + basicstyle=\normalfont\ttfamily, |
| 18 | + numberstyle=\scriptsize, |
| 19 | + commentstyle=\color{gray}, |
| 20 | + stepnumber=1, |
| 21 | + numbersep=8pt, |
| 22 | + showstringspaces=false, |
| 23 | + breaklines=true, |
| 24 | + frame=single, |
| 25 | + comment=[l]{//}, |
| 26 | + backgroundcolor=\color{background}, |
| 27 | + literate= |
| 28 | + *{0}{{{\color{numb}0}}}{1} |
| 29 | + {1}{{{\color{numb}1}}}{1} |
| 30 | + {2}{{{\color{numb}2}}}{1} |
| 31 | + {3}{{{\color{numb}3}}}{1} |
| 32 | + {4}{{{\color{numb}4}}}{1} |
| 33 | + {5}{{{\color{numb}5}}}{1} |
| 34 | + {6}{{{\color{numb}6}}}{1} |
| 35 | + {7}{{{\color{numb}7}}}{1} |
| 36 | + {8}{{{\color{numb}8}}}{1} |
| 37 | + {9}{{{\color{numb}9}}}{1} |
| 38 | + {:}{{{\color{punct}{:}}}}{1} |
| 39 | + {,}{{{\color{punct}{,}}}}{1} |
| 40 | + {\{}{{{\color{delim}{\{}}}}{1} |
| 41 | + {\}}{{{\color{delim}{\}}}}}{1} |
| 42 | + {[}{{{\color{delim}{[}}}}{1} |
| 43 | + {]}{{{\color{delim}{]}}}}{1}, |
| 44 | +} |
| 45 | + |
| 46 | +\begin{lstlisting}[language=json,firstnumber=1,label={lst:model-json}] |
| 47 | +[ |
| 48 | + // Model identifier (as a byte array) |
| 49 | + [31, 77, 69, 112, ..., 125], |
| 50 | + // Model type |
| 51 | + "Acids", |
| 52 | + // Context specifier type |
| 53 | + "generic_ao4_qo1_pb2", |
| 54 | + [ |
| 55 | + [ |
| 56 | + // Context specifiers (represented as integers) |
| 57 | + [420, 2137, 2403], |
| 58 | + // Context |
| 59 | + [ |
| 60 | + // Context probability |
| 61 | + // The sum of all context probabilities in a model should be 1 |
| 62 | + 0.1234, |
| 63 | + // Symbol probabilities |
| 64 | + // (in this case: N, A, C, T, G, respectively) |
| 65 | + // The sum of all symbol probabilities in a context should be 1 |
| 66 | + [0.0, 0.2, 0.3, 0.4, 0.1] |
| 67 | + ] |
| 68 | + ], |
| 69 | + // ... more contexts |
| 70 | + ] |
| 71 | +] |
| 72 | +\end{lstlisting} |
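The two invariants stated in the listing comments (context probabilities sum to 1 across the model, and symbol probabilities sum to 1 within each context) can be checked once the file has been decoded. The following is a minimal sketch, assuming the decoded data is a nested Python list with the same layout as the listing above; the function name `validate_model` and the tolerance `tol` are hypothetical and not part of the described format.

```python
import math


def validate_model(model, tol=1e-9):
    """Check the probability invariants noted in the listing comments.

    `model` is assumed to mirror the JSON layout:
    [identifier, model_type, specifier_type, entries], where each entry is
    [context_specifiers, [context_probability, symbol_probabilities]].
    """
    _identifier, _model_type, _specifier_type, entries = model

    # The context probabilities of the whole model should sum to 1.
    context_total = math.fsum(context[0] for _specifiers, context in entries)
    if not math.isclose(context_total, 1.0, abs_tol=tol):
        return False

    # The symbol probabilities inside every context should sum to 1.
    for _specifiers, (_probability, symbols) in entries:
        if not math.isclose(math.fsum(symbols), 1.0, abs_tol=tol):
            return False
    return True
```

A tolerance is used because the probabilities are floating-point values, so an exact comparison with 1 would reject models that are correct up to rounding.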
| 73 | + |
| 74 | +The \emph{model identifier} is an SHA-3\cite{1421} 256-bit checksum of the |
| 75 | +entire model contents. |
| 76 | +When deserializing the model from a file, the identifier is used to check |
| 77 | +whether the model has been read correctly. |
| 78 | +When reading a sequence file, the identifier list tells the decompressor |
| 79 | +which models to use. |
| 80 | + |
| 81 | +The identifier generation process starts by serializing the model type, |
| 82 | +context specifier type, model map sorted by keys in ascending order, and |
| 83 | +then the contexts themselves. |
| 84 | +Then, the SHA-3 hash of the resulting blob is calculated. |
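The identifier computation described above can be sketched as follows. The text fixes only the order of the serialized parts and the hash function, so the byte-level encoding used here (length-prefixed UTF-8 strings, little-endian integers and doubles) is purely an assumption made for illustration, as are the function name `model_identifier` and its parameters. Only `hashlib.sha3_256` and `struct.pack` are actual Python standard library calls.

```python
import hashlib
import struct


def model_identifier(model_type, specifier_type, specifier_map, contexts):
    """Return a SHA3-256 digest over a hypothetical serialization of a model.

    specifier_map: dict mapping context specifier (int) -> context index (int)
    contexts: list of (context_probability, symbol_probabilities) pairs
    """
    hasher = hashlib.sha3_256()

    # Model type, then context specifier type (assumed: length-prefixed UTF-8).
    for text in (model_type, specifier_type):
        raw = text.encode("utf-8")
        hasher.update(struct.pack("<I", len(raw)) + raw)

    # Model map, sorted by keys in ascending order so that the result does
    # not depend on the iteration order of the in-memory map.
    for specifier, index in sorted(specifier_map.items()):
        hasher.update(struct.pack("<QI", specifier, index))

    # Finally the contexts themselves.
    for probability, symbols in contexts:
        hasher.update(struct.pack("<d", probability))
        hasher.update(struct.pack(f"<{len(symbols)}d", *symbols))

    return hasher.digest()
```

Sorting the map before hashing is what makes the identifier reproducible: two models with identical contents yield the same 32-byte digest regardless of how the map was built.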