% -*- mode: text -*-
% In Emacs, use "C-x f" to set the fill column; use 80
\documentclass[a4paper,10pt,twoside]{book}
\usepackage{a4wide}
\usepackage{amsmath}
\usepackage{algorithmic}
\usepackage{bm}
\usepackage{noweb}
\usepackage{fancyhdr}
\usepackage{url}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{ccaption}
\usepackage{textcomp}
\usepackage{titlesec}
\usepackage[round]{natbib}
\usepackage{fontspec}
\usepackage[dvipsnames]{xcolor}
\usepackage{verbatim}
\usepackage[bitstream-charter]{mathdesign}
\usepackage[T1]{fontenc}
\newcommand{\DaCapo}{\texttt{Da\,Capo}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\DeclareTextCommandDefault{\nobreakspace}{\leavevmode\nobreak\ }
\noweboptions{smallcode,longchunks}
\def\nwendcode{\endtrivlist \endgroup}
\let\nwdocspar=\par
%\def\nwendcode{\endtrivlist \endgroup \vfil\penalty10\vfilneg}
%\let\nwdocspar=\smallbreak
% Command used to indicate a section
\newcommand{\sectmark}{\S\ }
\titleformat{\section}[block]
{\centering\normalfont\bfseries}
{\sectmark\thesection.}{.5em}{}
\titleformat{\subsection}[runin]
{\normalfont\bfseries}
{\thesubsection.}{.5em}{}[. ]
\titleformat{\subsubsection}[runin]
{\normalfont\bfseries}
{}{.2em}{}[. ]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\sectionmark}[1]{%
\markright{\thesection.\ #1}}
\fancyhf{}
\fancyhead[L,RO]{\bfseries\thepage}
\fancyhead[LO]{\bfseries\rightmark}
\fancyhead[RE]{\bfseries \DaCapo{} calibration}
\fancypagestyle{plain}{%
\fancyhf{}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\matr}[1]{\bm{#1}}
\newcommand{\vect}[1]{\bm{#1}}
\captionnamefont{\small\bfseries}
\captiontitlefont{\small\itshape}
\newcommand\myshade{85}
\definecolor{hypercolor}{RGB}{29,152,29}
\colorlet{mylinkcolor}{hypercolor}
\colorlet{mycitecolor}{hypercolor}
\colorlet{myurlcolor}{hypercolor}
\hypersetup{
pdftitle={Calibrating TODs using the DaCapo algorithm},
pdfauthor=Maurizio Tomasi,
pdfsubject={Commented implementation of a Python program to simulate
the calibration of TODs},
pdfkeywords={CMB {data analysis} {optics}},
pdfborder={0 0 0},
linkcolor = mylinkcolor!\myshade!black,
citecolor = mycitecolor!\myshade!black,
urlcolor = myurlcolor!\myshade!black,
colorlinks = true,
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\bibliographystyle{plainnat}
\title{Calibrating TODs using the \DaCapo{} algorithm}
\author{Maurizio Tomasi}
\maketitle
\frontmatter
\tableofcontents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter*{Introduction}
\label{ch:introduction}
This document presents the implementation of a Python\ 3 program that
is able to calibrate Time Ordered Data (TOD) measured by a CMB space
survey using the signal of the CMB dipole. The underlying assumption
is that the signal of the dipole is strong enough to be used for a
photometric calibration, and that the scanning strategy of the
spacecraft is such that the full dynamic range of the dipole signal is
explored regularly during the survey.
The algorithm used for the calibration is \DaCapo, which was developed
by Dr.~E.~Keih\"anen and is presented in
\citet{planck2015.lfi.calibration}. The algorithm uses many concepts
developed in the context of destriping
\citep{burigana.1997.destriping,keihanen2004}, and it is able to
produce an estimate of three quantities at the same time:
\begin{enumerate}
\item A time-dependent calibration factor;
\item A timestream of offsets, used to model the $1/f$ part of the
noise in the TOD;
\item An estimate of the sky map.
\end{enumerate}
This implementation of the algorithm includes two Python\ 3 programs:
the first program, [[index.py]], scans the potentially large set of
FITS files containing the TODs and creates an \emph{index file}, which
is used by the second program, [[calibrate.py]], to decide how to
split the workload among the MPI processes used in the computation;
then, [[calibrate.py]] runs the \DaCapo{} algorithm and outputs the set
of products listed above. The most critical parts of the code have
been written in Fortran\ 2003, in order to improve the execution
speed; Python bindings have been generated automatically using
[[f2py]].
The source code of the program is provided in this document in its
full form. It has been typeset using
\texttt{noweb}\footnote{\url{http://www.cs.tufts.edu/~nr/noweb/}.}, a
\emph{literate programming} tool that makes it possible to produce source code
files and \LaTeX{} documentation at the same time. This document is
meant to be read from the first to the last page, like a narrative: my
hope is that it is readable enough for other people to understand a
number of concepts related to High Performance Computing, i.e., the
management of large quantities of data, the distribution of work among
MPI processes, and the ways one can produce scientific results that
are both reliable and reproducible.
My biggest thanks go to Elina~Keih\"anen, who helped me a lot in
understanding the details of the \DaCapo{} algorithm and gave me a few
suggestions that simplified my work considerably. I would also like to
thank the CORE systematics working group for their suggestions on how
to improve the scientific quality of the code's results: in
particular, Jacques Delabrouille, Ted Kisner, Diego Molinari, and
Paolo Natoli.
\mainmatter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Presentation of the \DaCapo{} algorithm}
\label{ch:daCapoIntroduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:introduction}
In this document we will provide the complete source code of a program
which reads a set of Time Ordered Data (TOD) files and performs a
photometric calibration of the data using the dipole signal. We use
the [[noweb]]
system\footnote{\url{http://www.cs.tufts.edu/~nr/noweb/}.} by Norman
Ramsey to implement the code: two programs, called \emph{tangler} and
\emph{weaver}, are used to extract the source code to compile
([[notangle]]) and the \LaTeX{} file used to produce a standalone
document ([[noweave]]) from the same document, which is a text file
with extension \texttt{.nw}.
The code is written in Python 3, and it reads a set of FITS files
containing Time-Ordered Information (TOI), i.e., tables containing the
uncalibrated samples produced by a detector scanning the sky and the
timing and pointing information. The program disentangles the noise,
the dipolar signal caused by the Doppler effect induced by the motion
of the spacecraft, and the sky signal, and it produces the following
outputs:
\begin{enumerate}
\item A timestream of gain constants, used to perform the photometric
calibration;
\item A calibrated sky map, in Kelvin;
\item A set of offsets, which give a rough approximation of the $1/f$
component of the noise.
\end{enumerate}
The program uses the \DaCapo{} algorithm developed by
Dr.~E.~Keih\"anen and described in
\citet{planck2015.lfi.calibration}. The algorithm is briefly presented
in Sect.~\ref{sec:DaCapoMathematicalModel}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Modelling the output of a detector scanning the sky}
\label{sec:DaCapoMathematicalModel}
\begin{figure}[tbh]
\centering
\includegraphics{figures/core_scan_example.pdf}
\caption{\label{fig:coreScanExample} Example of a TOD stream. The
overall signal (black line) is mainly due to the sum of two
components, the smooth component $G_k\,D_i$ (thick gray line) due
to the dipole and the fast-changing $G_k T_i$ component due to the
Galaxy and the CMB (thin gray line), plus a noise component due to
$1/f$ fluctuations and white noise.}
\end{figure}
In this work we consider a detector which measures the signal entering
the optical system from a fixed direction of the sky. Each sample in
the Time Ordered Data (TOD) produced by the detector can be modelled
using the following equation:
\begin{equation}
\label{eq:radiometer}
V_i = G_k \bigl(T_i + D_i\bigr) + b_k + N_i,
\end{equation}
where $V_i$ is an uncalibrated sample ($[V_i] = \text{V}$), $G_k$ is
the gain\footnote{We assume from now on that the output of the
detector is a voltage. However, the code runs fine as long as the
detector's output is some quantity that is proportional to the
temperature of the object within the main beam of the instrument.}
($[G] = \text{V/K}$), $T_i$ is the signal associated with Galactic
emissions and the CMB ($[T_i] = \text{K}$), $D_i$ is the signal of the
CMB dipole (it might already have been convolved with the beam
response $\gamma$, i.e., $D_i \equiv \gamma * D$; in any case, $[D_i]
= \text{K}$), $b_k$ is the offset used to model the slow $1/f$ noise
variations ($[b_k] = \text{V}$), and $N_i$ is a white noise component
($[N_i] = \text{V}$). For a realistic example of the behaviour in time
of these components, see Fig.~\ref{fig:coreScanExample}.
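As a rough, stand-alone illustration of Eq.~\eqref{eq:radiometer} (not part of the program itself), the model can be simulated in a few lines of NumPy; every numerical value below (gain, offset, noise level, signal amplitudes) is a hypothetical placeholder:

```python
import numpy as np

rng = np.random.default_rng(12345)

n_samples = 1000   # samples in a single calibration period (fixed k)
G_k = 65.0         # gain [V/K] (hypothetical value)
b_k = 0.02         # 1/f offset for this period [V]
sigma = 1.0e-3     # white-noise RMS [V]

# Hypothetical sky and dipole signals [K]: a small, fast-varying
# Galactic/CMB term plus a smooth dipole sweep, as in the figure.
T = 1.0e-4 * rng.standard_normal(n_samples)
D = 3.3e-3 * np.sin(np.linspace(0.0, 2.0 * np.pi, n_samples))

N = sigma * rng.standard_normal(n_samples)   # white noise [V]
V = G_k * (T + D) + b_k + N                  # uncalibrated samples
```

Note how, with these numbers, the dipole term $G_k D_i$ dominates the sky term $G_k T_i$, which is the regime in which dipole calibration works well.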
\begin{figure}[tbh]
\centering
\includegraphics[height=5cm]{figures/1fnoise.pdf}
\caption{\label{fig:1fnoise} How noise is modelled by \DaCapo. A
$1/f^2$ noise realization is approximated with a set of offsets,
each calculated by averaging the noise samples over periods of
0.5\,s (\textit{left}). In this example, the $1/f^2$ noise knee
frequency is $f_\text{knee} = 0.5\,\text{Hz}$, which corresponds
to a period of $2\,\text{s}$: this is four times greater than the
period of each offset (\textit{middle}). The difference between
the $1/f^2$ noise samples and the offsets reveals an almost pure
white noise component (\textit{right}).}
\end{figure}
The idea behind Eq.~\eqref{eq:radiometer} is to consider the noise
part as composed of a set of constant baselines $b_k$, which
approximate the $1/f$ component of the noise, and a purely white noise
component. This works if the period of each baseline is smaller than
$1/f_\text{knee}$, with $f_\text{knee}$ being the knee frequency of
the $1/f$ noise power spectrum. See Fig.~\ref{fig:1fnoise}.
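The splitting of the noise into baselines plus a white residual can be mimicked in a few lines. The noise model below is only a crude stand-in for correlated noise (a random walk plus white noise, not a true $1/f^2$ realization), and all values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

baseline_len = 50     # samples per baseline, e.g. 0.5 s at 100 Hz
n_baselines = 80
n = baseline_len * n_baselines

# Crude stand-in for correlated noise: a slow random-walk drift
# plus white noise with RMS 0.1 (arbitrary units).
drift = 1.0e-3 * np.cumsum(rng.standard_normal(n))
white = 0.1 * rng.standard_normal(n)
noise = drift + white

# Each offset b_k is the average of the noise over its baseline...
offsets = noise.reshape(n_baselines, baseline_len).mean(axis=1)

# ...and removing the offsets leaves an almost pure white residual,
# as in the right panel of the figure.
residual = noise - np.repeat(offsets, baseline_len)
```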
\begin{figure}[tbh]
\centering
\includegraphics{figures/TOD_indexing.pdf}
\caption{\label{fig:todindexing} How samples and calibration
periods are indexed in this document. A minimal TOD containing
$N=11$ samples is shown here. It has been arbitrarily split into
two calibration periods (indexed by $k$), containing 6 and 5
samples respectively. The first 6 samples are calibrated using the
same gain $G_{k=1}$, the last 5 using $G_{k=2}$. Note that,
given some $i$, the value of $k$ can always be derived
unambiguously.}
\end{figure}
The \DaCapo{} algorithm was developed by E.~Keih\"anen for the
calibration of the Planck/LFI timelines released in 2015. Its purpose
is to find an optimal estimate for the gains $G_k$ and the offsets
$b_k$. In the case of the 2015 Planck data release, both the gains and
the offsets were calculated once every hour. I call such a time frame a
\emph{calibration period}, and I index it using the symbol $k$: thus,
$G_k$ and $b_k$ are the gain and offset to be used for all the samples
acquired during the $k$-th calibration period. (Later, we will
consider the more interesting case where the calibration period and
the offset period are of different lengths.) The $i$ symbol is
reserved for indexing samples in a TOD. See Fig.~\ref{fig:todindexing}
for an example.
The two unknowns of Eq.~\eqref{eq:radiometer} are $G_k$ and $b_k$,
while all the other quantities are supposed either to be known ($D_i$,
$V_i$) or unimportant for the computation ($N_i$, $T_i$ if we apply a
mask first). The value $T_i$ is typically taken from a pixelized map,
so
\begin{equation}
T_i = \sum_p P_{ip} m_p,
\end{equation}
where $P$ is the so-called \emph{pointing matrix}, a rectangular
matrix of size $N \times M$, where $N$ is the number of elements in
the TOD and $M$ is the number of pixels in the map. The index $p$ runs
over all the $M$ pixels in the map. The value of the element $P_{ip}$
is 1 if the $i$-th sample in the TOD has been measured while the
instrument was pointing towards pixel $p$, 0 otherwise.
An example of a pointing matrix is the following (assuming a TOD with length $N
= 9$ and a map with $M = 3$ pixels):
\begin{equation}
\label{eq:pexample}
\matr{P} = \begin{pmatrix}
1& 0& 0\\
1& 0& 0\\
0& 1& 0\\
1& 0& 0\\
0& 1& 0\\
0& 0& 1\\
0& 0& 1\\
1& 0& 0\\
0& 0& 1\\
\end{pmatrix}.
\end{equation}
An important property of this matrix is that the product $P^T P$ is a
square diagonal matrix where each diagonal element at position $pp$ is
equal to the number of times pixel $p$ has been observed. In our
example:
\begin{equation}
\matr{P}^T\matr{P} = \begin{pmatrix}
4& 0& 0\\
0& 2& 0\\
0& 0& 3\\
\end{pmatrix}.
\end{equation}
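The toy example above can be checked numerically. In practice one never stores $\matr P$ densely: a vector of pixel indices carries the same information, and the diagonal of $\matr P^T \matr P$ is just a histogram of hit counts, which NumPy computes with [[np.bincount]]:

```python
import numpy as np

# Pixel hit by each of the N = 9 samples (0-based), matching the
# rows of the example pointing matrix.
pix = np.array([0, 0, 1, 0, 1, 2, 2, 0, 2])
n_pix = 3

# Dense pointing matrix: P[i, p] = 1 if sample i falls in pixel p.
P = np.zeros((len(pix), n_pix))
P[np.arange(len(pix)), pix] = 1.0

# P^T P is diagonal, with the hit count of each pixel on the diagonal...
hits_matrix = P.T @ P

# ...so the dense product is never needed: a bincount suffices.
hits = np.bincount(pix, minlength=n_pix)
```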
To determine the gains, the \DaCapo{} algorithm minimizes the quantity
\begin{equation}
\label{eq:chisq}
\chi^2 = \sum_i \frac{\bigl(V_i - V^\text{model}_i\bigr)^2}{\sigma_i^2},
\end{equation}
where
\begin{equation}
V^\text{model}_i = G_k \left(\sum_p P_{ip} m_p + D_i\right) + b_k
\end{equation}
is our model for the output of the detector, which assumes no white noise
and a static sky signal that does not depend on time (thus, variable
sources have to be masked before applying the algorithm).
Since Eq.~\eqref{eq:chisq} is not quadratic in its unknowns (the model
contains the product of the unknowns $G_k$ and $m_p$),
\citet{planck2015.lfi.calibration} linearizes it in the following way:
\begin{equation}
V^\text{model}_i =
G_k \left(D_i + \sum_p P_{ip} m^0_p\right)
+ G^0_k \sum_p P_{ip} \bigl(m_p - m^0_p\bigr)
+ \left(\bigl(G_k - G^0_k\bigr) \sum_p P_{ip} \bigl(m_p - m^0_p\bigr)\right)
+ b_k
\end{equation}
(in the Planck 2015 paper there is a typo: the part $\sum_p P_{ip}$ is
missing from the third term on the right-hand side), where $m^0$ and $G^0$
are the results of the previous iteration. The algorithm iterates to
converge towards the solution. Since the third term is of the second
order, it can be dropped, and the model becomes
\begin{equation}
\label{eq:vmodapprox}
V^\text{model}_i \approx
G_k \left(D_i + \sum_p P_{ip} m^0_p\right)
+ G^0_k \sum_p P_{ip} \tilde{m}_p
+ b_k,
\end{equation}
where
\begin{equation}
\tilde{m}_p = m_p - m^0_p.
\end{equation}
To further simplify Eq.~\eqref{eq:chisq}, we rewrite part of the terms in
Eq.~\eqref{eq:vmodapprox} by introducing a new matrix $F_{ij}$, where $j$ is an
index which runs over all the available calibration periods, and a vector $a$
which contains all the unknowns ($G_k$ and $b_k$):
\begin{equation}
\sum_j F_{ij} a_j \equiv G_k \left( D_i + \sum_p P_{ip} m^0_p \right) + b_k.
\end{equation}
Taking as example the pointing matrix in Eq.~\eqref{eq:pexample}, we imagine
here that there are two calibration periods, with 5 and 4 samples respectively.
Then,
\begin{equation}
\label{eq:matrF}
\matr F = \begin{pmatrix}
1& 0& D_1 + m^0_1& 0\\
1& 0& D_1 + m^0_1& 0\\
1& 0& D_2 + m^0_2& 0\\
1& 0& D_1 + m^0_1& 0\\
1& 0& D_2 + m^0_2& 0\\
0& 1& 0& D_3 + m^0_3\\
0& 1& 0& D_3 + m^0_3\\
0& 1& 0& D_1 + m^0_1\\
0& 1& 0& D_3 + m^0_3\\
\end{pmatrix},
\qquad \vect{a} = \begin{pmatrix}
b_1\\
b_2\\
G_1\\
G_2\\
\end{pmatrix}.
\end{equation}
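As a sanity check, this toy $\matr F$ can be assembled explicitly. The numerical values chosen below for the pixelized dipole, for $m^0$, and for $\vect a$ are arbitrary placeholders:

```python
import numpy as np

# Toy setup matching the examples above: N = 9 samples, M = 3 pixels,
# two calibration periods with 5 and 4 samples.
pix = np.array([0, 0, 1, 0, 1, 2, 2, 0, 2])     # pixel hit by each sample
period = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])  # calibration period of each sample
D_pix = np.array([3.0, -1.0, 0.5])              # pixelized dipole [K] (hypothetical)
m0 = np.array([0.1, 0.2, -0.3])                 # previous map estimate [K] (hypothetical)

n_samples = len(pix)
n_periods = period.max() + 1

# Unknowns are ordered as in the equation: a = (b_1, b_2, G_1, G_2).
F = np.zeros((n_samples, 2 * n_periods))
F[np.arange(n_samples), period] = 1.0                               # offset columns
F[np.arange(n_samples), n_periods + period] = D_pix[pix] + m0[pix]  # gain columns

a = np.array([0.01, 0.02, 60.0, 61.0])  # (b_1, b_2, G_1, G_2), hypothetical
model = F @ a                           # G_k (D_i + m^0) + b_k, sample by sample
```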
Note that $\matr F$ contains both the dipole signal and the previous guess $m^0$
for the sky map. If we rewrite the second term on the right-hand side of
Eq.~\eqref{eq:vmodapprox} as
\begin{equation}
\label{eq:ptildedef}
\matr{\tilde{P}}_{ip} = G^0_k \matr{P}_{ip},
\end{equation}
then Eq.~\eqref{eq:chisq} reaches its final form
\begin{equation}
\label{eq:chisqfinal}
\chi^2 = \bigl(\vect{V} - \matr{\tilde{P}}\,\vect{\tilde{m}} - \matr{F} \vect{a}\bigr)^T \matr{C}_n^{-1} \bigl(\vect{V} -
\matr{\tilde{P}}\,\vect{\tilde{m}} - \matr{F}\,\vect{a}\bigr),
\end{equation}
with $C_n$ being the noise covariance matrix for the noise component
$N_i$. Equation~\eqref{eq:chisqfinal} is easy to understand: it
requires minimizing the difference between the measured vector
$\vect{V}$ and the vector $\matr{F}\vect{a}$, which was created by
scanning the sky map (dipole and Galaxy) and assuming some gains $G_k$
and a fixed $1/f$ noise realization $b_k$. The term
$\matr{\tilde{P}}\vect{\tilde{m}}$ takes into account the
discrepancies between the map used in building $\matr{F}\vect{a}$
(produced during the previous iteration) and the new map being
estimated now.
The calibration solution is found by minimizing Eq.~\eqref{eq:chisqfinal} with
respect to $\tilde{m}$ and then with respect to $a$. The solution $a$ must
satisfy the equation
\begin{equation}
\label{eq:cjgr}
\matr{A} \vect{a} = \vect{v},
\end{equation}
where
\begin{align}
\label{eq:Amatrix} \matr{A} &= \matr{F}^T \matr{C}_n^{-1} \matr{Z} \matr{F},\\
\label{eq:vVector} \vect{v} &= \matr{F}^T \matr{C}_n^{-1} \matr{Z} \vect{V},\\
\label{eq:zdef} \matr{Z} &= \matr{I} - \matr{\tilde P}\bigl(\matr{\tilde P}^T \matr{C}_n^{-1} \matr{\tilde{P}}\bigr)^{-1} \matr{\tilde{P}}^T \matr{C}_n^{-1}.
\end{align}
(The definition of $\matr Z$ in \citet{planck2015.lfi.calibration} has
a typo.)
Once a first estimate $\vect a_1$ for $\vect a$ is found, we can solve
for $\vect{\tilde m}_1$:
\begin{equation}
\label{eq:tildeMap}
\vect{\tilde m}_1 = \bigl(\matr{\tilde P}^T \matr{C}_n^{-1} \matr{\tilde
P}\bigr)^{-1} \matr{\tilde P}^T \matr{C}_n^{-1} \bigl(\vect V - \matr
F \vect a_1\bigr),
\end{equation}
and then we iterate again, starting from $(\vect a_1, \vect{\tilde
m}_1)$ to find an improved solution $(\vect a_2, \vect{\tilde m}_2)$
and so on until we reach convergence.
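To make the scheme concrete, here is a stand-alone sketch of a single iteration on the toy data set used above, with dense matrices and white noise ($\matr C_n = \sigma^2 \matr I$), so that every quantity in Eqs.~\eqref{eq:Amatrix}--\eqref{eq:tildeMap} can be written out explicitly. All input values are hypothetical, and the real implementation presented later works on far larger, distributed data:

```python
import numpy as np

rng = np.random.default_rng(7)

pix = np.array([0, 0, 1, 0, 1, 2, 2, 0, 2])      # pixel per sample
period = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])   # calibration period per sample
n, n_per, n_pix = len(pix), 2, 3
sigma = 0.1

P = np.zeros((n, n_pix))
P[np.arange(n), pix] = 1.0
D = np.array([3.0, -1.0, 0.5])[pix]              # pixelized dipole, scanned into a TOD

# Simulated data with "true" gains, offsets and sky (all hypothetical)
G_true, b_true = np.array([65.0, 66.0]), np.array([0.01, 0.02])
m_true = np.array([0.1, -0.2, 0.05])
V = G_true[period] * (P @ m_true + D) + b_true[period] + sigma * rng.standard_normal(n)

# Previous iteration: gain estimate G^0 and map estimate m^0
G0, m0 = np.array([60.0, 60.0]), np.zeros(n_pix)

F = np.zeros((n, 2 * n_per))
F[np.arange(n), period] = 1.0
F[np.arange(n), n_per + period] = D + P @ m0

Ptil = G0[period][:, None] * P                   # P~_ip = G0_k P_ip
Cinv = np.eye(n) / sigma**2
Minv = np.linalg.inv(Ptil.T @ Cinv @ Ptil)
Z = np.eye(n) - Ptil @ Minv @ Ptil.T @ Cinv      # the Z matrix

A = F.T @ Cinv @ Z @ F
v = F.T @ Cinv @ Z @ V

# Without further constraints A is singular (gains and map are
# degenerate), so we take the minimum-norm least-squares solution
# instead of inverting A directly.
a1 = np.linalg.lstsq(A, v, rcond=None)[0]
m1 = Minv @ Ptil.T @ Cinv @ (V - F @ a1)         # map update
```

Note that in this toy case $\matr A\,(0,0,1,1)^T = 0$ exactly: the dipole lives entirely in pixel space, so a common shift of the gains can be absorbed by the map. This is precisely the degeneracy discussed next.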
\citet{planck2015.lfi.calibration} explains that the equations derived
so far have serious degeneracies, which prevent them from being solved
for any practical purpose. The most obvious one is the ambiguity
between $D_i$ and $T_i$: in the extreme case $T_i \propto D_i$ (a sky
signal similar to the dipole), there would be a perfect degeneracy
between $G_k$ and the sky map. In a more realistic case, $T_i$ would
be the sum of a dipole-free component and a nonzero dipolar component,
and the degeneracy would be smaller but still present. Therefore, the
authors introduced a constraint into the equations, which is related to
the peculiar way the dipole is used in the calibration. We only report
their results here:
\begin{enumerate}
\item We need to introduce a new matrix, $\matr{m}_c$, which is an
$M\times 2$ matrix containing the dipole and the monopole:
\begin{equation}
\label{eq:constrainMatrix}
\matr{m}_c = \begin{pmatrix}
D_1& 1\\
D_2& 1\\
D_3& 1\\
\hdotsfor{2}\\
D_M& 1
\end{pmatrix}.
\end{equation}
\item The definition of matrix $\matr{A}$ (Eq.~\ref{eq:Amatrix}) remains
the same, but matrix $\matr{Z}$ becomes
\begin{equation}
\label{eq:newZmatrix}
\matr{Z} = \matr{I} - \matr{\tilde P} \bigl(\matr{M} +
\matr{C}_m^{-1}\bigr)^{-1} \matr{\tilde{P}}^T \matr{C}_n^{-1},
\end{equation}
with
\begin{align}
\label{eq:Mmatrix}
\matr{M} &= \matr{\tilde{P}}^T \matr{C}_n^{-1} \matr{\tilde{P}},\\
\label{eq:MCmMatrix}
\bigl(\matr{M} + \matr{C}_m^{-1}\bigr)^{-1} &= \matr{M}^{-1} -
\matr{M}^{-1} \matr{m}_c \bigl(\matr{m}_c^T \matr{M}^{-1}
\matr{m}_c\bigr)^{-1} \matr{m}_c^T \matr{M}^{-1}.
\end{align}
\item With these modifications, the solution has the property that
there is no dipole in the map $\matr{m}$, in the sense that
\begin{equation}
\label{eq:dipoleConstraint}
\sum_p \matr{m}_p \matr{D}_p = 0.
\end{equation}
Moreover, the monopole of the map is zero, i.e.,
\begin{equation}
\label{eq:monopoleConstraint}
\left<\matr m\right> = 0.
\end{equation}
\end{enumerate}
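The effect of the constraint can be sketched in isolation: projecting a map onto the subspace where Eqs.~\eqref{eq:dipoleConstraint} and \eqref{eq:monopoleConstraint} hold. Beware that this sketch uses a plain orthogonal projection, while the actual algorithm weights the projection through $\matr M = \matr{\tilde P}^T \matr C_n^{-1} \matr{\tilde P}$ (Eq.~\ref{eq:MCmMatrix}); the map and dipole values below are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix = 12
D = np.cos(np.linspace(0.0, np.pi, n_pix))    # toy pixelized dipole template
m = rng.standard_normal(n_pix)                # some map estimate

# The constraint matrix m_c holds the dipole and monopole templates.
m_c = np.column_stack([D, np.ones(n_pix)])

# Orthogonal projector onto the complement of span(m_c): the projected
# map has neither a dipole component nor a monopole.
proj = np.eye(n_pix) - m_c @ np.linalg.solve(m_c.T @ m_c, m_c.T)
m_constrained = proj @ m
```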
In the remainder of this document, I will show how the code can be
used to perform the aforementioned calculations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{How to install and build the code}
Before presenting the implementation of the code, I show here how the
source code can be obtained and compiled. The code has been developed
on Linux machines, but it should be fairly portable to POSIX
architectures (e.g., Mac OS X). Windows users might have some
problems, as the code relies on the
Healpy\footnote{\url{https://github.com/healpy/healpy}.} library,
which has not been ported to Windows at the time of writing (December
2016).
\subsection{Installing the code from source}
To install and compile the source code of this program, you will need
the following dependencies:
\begin{enumerate}
\item Python\ 3.x (version 3.5 was used in the development; the code
might work with earlier versions, but this has not been
tested);
\item Noweb (\url{http://www.cs.tufts.edu/~nr/noweb/}), for creating
the [[.py]] source codes and this documentation: you can
install it under Ubuntu with the command [[apt install noweb]];
\item NumPy (\url{http://www.numpy.org/}), for low-level vector/matrix
computations;
\item SciPy (\url{https://www.scipy.org/}), for higher-level
vector/matrix computations;
\item AstroPy (\url{http://www.astropy.org/}), for writing/reading
FITS files;
\item MPI4py (\url{https://pythonhosted.org/mpi4py/}), for MPI
communications;
\item Numba (\url{http://numba.pydata.org/}), to speed up a few
Python routines;
\item Healpy (\url{https://github.com/healpy/healpy}), used to work
with Healpix maps;
\item Click (\url{http://click.pocoo.org/5/}), for implementing a
smart command-line interface;
\item Autopep8, to nicely re-indent the Python files produced by
Noweb;
\item A Fortran compiler supported by [[f2py]] (both [[gfortran]]
      and the Intel Fortran compiler work).
\end{enumerate}
After you have all the requirements installed, you can download the
source codes discussed in this text from the GitHub repository hosted
at the following URL:
\begin{center}
\url{https://github.com/ziotom78/dacapo_calibration}
\end{center}
\begin{table}
\begin{tabular}{lll}
Variable& Default value& Notes\\
[[NOWEAVE]]& [[noweave]]& Part of Noweb\\
[[NOTANGLE]]& [[notangle]]& Part of Noweb\\
[[CPIF]]& [[cpif]]& Part of Noweb\\
[[TEX2PDF]]& [[lualatex]]& \\
[[BIBTEX]]& [[bibtex]]& \\
[[PYTHON]]& [[python3]]& \\
[[F2PY]]& [[f2py]]& Bundled with NumPy\\
[[AUTOPEP8]]& [[autopep8]]& You can use [[yapf]] as an alternative\\
[[DOCKER]]& [[sudo docker]]& You might want to remove [[sudo]]\\
[[MPIRUN]]& [[mpirun -n 2]]& Used for the integration tests\\
[[INKSCAPE]]& [[inkscape]]& Only used to build the PDF file
\end{tabular}
\caption{\label{tab:makeVariables} List of the variables that can
be defined in [[configuration.mk]].}
\end{table}
To compile the program, just run [[make]]. If the defaults used by
[[Makefile]] do not suit you, you can create a file named
[[configuration.mk]] in the same directory where [[Makefile]] resides,
which can provide new values for the variables listed in
Table~\ref{tab:makeVariables}.
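As an example, a hypothetical [[configuration.mk]] overriding a few of the defaults in Table~\ref{tab:makeVariables} might look like this (the values shown are purely illustrative):

```make
# configuration.mk -- local overrides for the Makefile defaults
# (variable names from the table above; values are only examples)
TEX2PDF = xelatex
PYTHON = python3.5
DOCKER = docker            # drop "sudo" if your user is in the "docker" group
MPIRUN = mpirun -n 4       # run the integration tests with 4 MPI processes
```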
\subsection{Using Docker to describe the installation}
\label{sec:dockerContainer}
We provide here a Docker file which installs all the requirements in a
virtual machine based on the [[miniconda3]]
image\footnote{\url{https://hub.docker.com/r/continuumio/miniconda3/}.}
provided by Continuum Analytics. Docker is a containerization platform
that makes it possible to spawn small Virtual Machines (VMs) running some
specific task within a Linux system. Docker's advantage over other
virtualization systems (e.g., QEmu, VirtualBox, VMWare\ldots) is that
such machines are really fast to start. They are typically used to run
a service or a task that needs to be run in an isolated and controlled
environment.
In our context, Docker is the ideal solution to test for missing
dependencies in the environment we use. We describe an environment,
called \emph{image}, by means of one of Docker's preconfigured VMs
available on the Docker Hub website (\url{https://hub.docker.com/}).
There are plenty of them; we will pick one of the official Anaconda
Python images by Continuum Analytics. These virtual machines are
based on well-known Linux distributions (Continuum Analytics uses
Debian), and they come with a basic Python distribution already
installed. For our purposes, Anaconda's VMs are not enough, as we need
a few more packages not included there. So we specify a set of
commands that have to be run in order to configure the VM properly.
The recipe to build a Docker image must be written in a text file,
usually called [[Dockerfile]]. Our [[Dockerfile]] contains the name of
the base Anaconda Python VM, the name of the maintainer (myself), and
a set of commands to be run from the command line which prepare the
VM. In such commands (the lines beginning with [[RUN]]) we can use
Debian's and Anaconda's standard tools to install programs and
libraries:
<<Dockerfile>>=
FROM continuumio/miniconda3
MAINTAINER Maurizio Tomasi <maurizio.tomasi@unimi.it>
RUN apt-get -y update && \
apt-get -y upgrade && \
apt-get install -y \
apt-utils \
g++ \
gcc \
gfortran \
git \
inkscape \
make \
noweb \
openmpi-bin \
texlive-latex-base \
texlive-latex-extra
RUN conda install \
astropy \
click \
matplotlib \
mpi4py \
numba \
numpy \
pytest \
scipy
RUN pip install \
autopep8 \
healpy \
quaternionarray
CMD git clone https://github.com/ziotom78/dacapo_calibration && \
cd dacapo_calibration && \
make all && \
make fullcheck
@
The last [[CMD]] command will be explained later.
To build the Docker image, you have to enter the [[docker]]
directory within the source code repository and run [[docker build]]:
\begin{verbatim}
$ sudo docker build -t="ziotom78:dacapo" .
\end{verbatim}
This task needs to be redone only when [[Dockerfile]] changes. It will
take a while, since it needs to build the entire VM, running all the
[[RUN]] command lines and downloading the necessary packages from
Debian's, Anaconda's and PyPI's repositories: once this is done,
subsequent runs will be much faster.
Once the image has been built, you can run the associated
\emph{container}, which is basically an instance of the VM based on
some image (refer to the Docker's documentation for more details about
the terminology):
\begin{verbatim}
$ sudo docker run --name dacapo "ziotom78:dacapo"
\end{verbatim}
When we use [[docker run]], and only in this case, Docker will run the
command on the line beginning with [[CMD]], thus downloading the most
up-to-date version of the source code presented in this document,
compiling it and running the full set of tests.
The advantage of creating a Docker machine is that the full sequence
of operations needed to configure a barebone system in order to
install and run the code is fully documented.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Running the programs}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/FITS_structure.pdf}
\caption{\label{fig:FitsStructure} Structure of a FITS file
created by the program [[calibrate.py]]. The primary HDU and the
first two extension HDU contain copies of the input files used to
run the calculation; they are used to allow users to trace the
history of the computation that produced the results saved in this
file. HDUs \#3, \#4, and \#5 contain the solution of the \DaCapo{}
problem, namely, the offsets and gains that are in $\vect a$
(Eq.~\protect\ref{eq:cjgr}) and the sky map $\vect m$
(Eq.~\protect\ref{eq:tildeMap}). HDU \#6 and following contain
information about the rate of convergence of the algorithm; in
this example we show just two of them, but the effective number
depends on the details of the computation.}
\end{figure}
The programs we are going to implement in the next chapters work
under the following assumptions:
\begin{enumerate}
\item The user must provide one or more FITS files containing the TODs
that must be used as an input for the calculation;
\item A program, [[index.py]] (implemented in Ch.~\ref{ch:indexPy}),
scans the FITS files one by one and counts the number of samples in
each of them, optionally flagging unwanted data; the output of this
program is an \emph{index file};
\item A second program, [[calibrate.py]] (Ch.~\ref{ch:daCapo}),
implements the \DaCapo{} algorithm. It divides the TODs among the MPI
processes according to the information read from the index file, it
performs the calculations shown above, and it collects the results at
the end of the computation. Results are saved in one FITS file, whose
structure is shown in Fig.~\ref{fig:FitsStructure}.
\end{enumerate}
To see an example of how to invoke the code, suppose that you have a
simulated/measured TOD saved in a set of files in some directory
([[$]] indicates the command-line prompt, where you type commands):
\begin{verbatim}
$ ls /storage/my_data/tods/
tod_0001.fits
tod_0002.fits
tod_0003.fits
tod_0004.fits
\end{verbatim}
Using CFITSIO's [[listhead]]\footnote{\url{http://}.} example program,
we can see the structure of each file. Each of them contains one
tabular HDU with a number of columns:
\begin{verbatim}
$ listhead /storage/my_data/tods/tod_0001.fits
[...snip...]
TTYPE1 = 'TIME '
TFORM1 = 'J '
TUNIT1 = 's '
TTYPE2 = 'PIXIDX '
TFORM2 = 'J '
TTYPE3 = 'SIGNAL '
TFORM3 = 'D '
TUNIT3 = 'V '
[...snip...]
\end{verbatim}
The first column ([[TIME]]) contains the time when each sample was
acquired (in seconds); the second column ([[PIXIDX]]) contains the
index of the pixels on the sky sphere that are associated with each
sample; the third column contains the actual, uncalibrated samples
measured by the detector. We assume that all the samples must be used
in the computation, i.e., there is no need to flag the samples. (Our
program will be able to discard bad data, but we are keeping this
example as simple as possible.)
The first step is to run the program [[index.py]], which reads its
input parameters from a file called the \emph{parameter file}. This
is a text file which follows the traditional syntax of INI files; in
our case, it is the following (file [[examples/index.ini]]):
\begin{verbatim}
[input_files]
path = /storage/my_data/tods
mask = tod_????.fits
hdu = 1
column = TIME
[periods]
length = 1536
[output_file]
file_name = index.fits
\end{verbatim}
The [[input_files]] section specifies where to load the files from,
and which HDU and column contain the timing information; [[index.py]]
needs this to determine the length of each offset period. The
[[periods]] section specifies how long each offset period must be, in
seconds. Finally, the [[output_file]] section specifies the name of
the file that will be produced by the program.
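Parameter files of this kind can be parsed with Python's standard
\texttt{configparser} module, which is what our programs use. The
following fragment is a minimal sketch showing how the fields above
map to Python values; the file contents are embedded in a string only
to keep the example self-contained:

```python
from configparser import ConfigParser

# The same parameter file as above, embedded as a string for the example
ini_text = """
[input_files]
path = /storage/my_data/tods
mask = tod_????.fits
hdu = 1
column = TIME

[periods]
length = 1536

[output_file]
file_name = index.fits
"""

conf = ConfigParser()
conf.read_string(ini_text)   # use conf.read(file_name) for a real file

path = conf['input_files']['path']
mask = conf['input_files']['mask']
hdu = conf['input_files'].getint('hdu')             # converted to int
period_length = conf['periods'].getfloat('length')  # seconds
output_file = conf['output_file']['file_name']
```

The [[getint]] and [[getfloat]] accessors take care of the conversion
from the textual representation in the INI file, raising an error if
the value is malformed.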
To run [[index.py]], we must pass the INI file as the first and only
argument on the command line:
\begin{verbatim}
python3 index.py examples/index.ini
\end{verbatim}
The program will read the TOD files one by one and will create the
file [[index.fits]]. We are now ready to run the \DaCapo{}
calibration.
The [[calibrate.py]] program is more complex than [[index.py]], but it
works in a similar way. It requires the user to specify input
parameters through a parameter file, which in our example is the
following:
\begin{verbatim}
[input_files]
index_file = index.fits
signal_hdu = 1
signal_column = SIGNAL
pointing_hdu = 1
pointing_columns = PIXIDX
[dacapo]
t_cmb_K = 2.72548
solsysdir_ecl_colat_rad = 1.7656131194951572
solsysdir_ecl_long_rad = 2.995889600573578
solsysspeed_m_s = 370082.2332
nside = 256
periods_per_cal_constant = 1
cg_stop_value = 1e-9
[output_file]
file_name = dacapo_results.fits
comment = Dummy
\end{verbatim}
The [[input_files]] section specifies the location of the index file
and the columns containing the pointing information\footnote{Pointing
information can be provided either as a pair (colatitude,
longitude), in radians, in which case the Healpix pixelization is
assumed, or as a sequence of integer pixel indexes, like in this
example.} and the signal. The [[dacapo]] section is used to configure
the way calculations are made. Finally, the [[output_file]] section
tells the program how to save the results of the computation.
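The parameters in the [[dacapo]] section describe the motion of the
Solar System with respect to the CMB rest frame; as the names suggest,
this determines the dipole signal used as the calibration reference.
As a quick sanity check of the numbers above, the first-order
(nonrelativistic) dipole amplitude is $\Delta T \approx T_\mathrm{CMB}
\, v / c$; the following sketch (plain Python, no external
dependencies) computes it:

```python
import math

# Parameters from the [dacapo] section above
t_cmb_K = 2.72548              # CMB monopole temperature [K]
solsysspeed_m_s = 370082.2332  # Solar System speed in the CMB frame [m/s]
c = 299792458.0                # speed of light [m/s]

beta = solsysspeed_m_s / c
# First-order dipole amplitude: a few mK
dipole_amplitude_K = t_cmb_K * beta

def dipole_K(theta_rad):
    """Dipole signal seen at angle `theta_rad` from the direction of
    motion (first-order approximation)."""
    return dipole_amplitude_K * math.cos(theta_rad)
```

The amplitude is roughly $3.4\,\text{mK}$, i.e., three orders of
magnitude below the monopole: this is why a careful estimation of the
gains is needed at all.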
The structure of the output file written by [[calibrate.py]] is
complex: it contains the estimates for the unknowns in the \DaCapo{}
problem (gains $G_k$, offsets $b_k$, and the sky map $T_i$), but also
many other ancillary information. This is a variation of an idea
proposed by K.~Riebe in a
talk\footnote{\url{http://www.adass2016.inaf.it/index.php/participant-list/14-talk/124-riebe-kristin}.}
given at the XXVI
conference\footnote{\url{http://www.adass2016.inaf.it/}.} of the
Astronomical Data Analysis Software and Systems (ADASS): to have a
``provenance model'' that records the processing steps leading from
the input raw data to the scientific products. This allows users to
check the quality of the products, as well as to search for possible
error sources. The structure of the FITS file shown in
Fig.~\ref{fig:FitsStructure} allows the user to reconstruct the path
from the input TODs to the estimates for $G_k$, $b_k$, and $T_i$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Mapping the input data}
\label{ch:indexPy}
Before implementing the algorithm described in
Chapter~\ref{ch:daCapoIntroduction}, we need to decide which is the
best way to load the input data to be used by \texttt{calibrate.py},
our implementation of the \DaCapo{} algorithm. Our program is likely to
be applied to large amounts of data: considering one detector with a
sampling frequency of $\sim 10^2\,\text{Hz}$, in one year such a
detector will produce $\sim 10^9$ samples. Considering that each
sample is 8 bytes wide and must be associated with ancillary
information (e.g., pointing angles, flags), storing these data in
memory is likely to require hundreds of GB. Therefore, we
are going to use MPI to implement a program which splits the necessary
computations among a number of computing units; each unit will load
only a subset of the whole data. (This approach works because, as we
shall see, the operations described in
Chapter~\ref{ch:daCapoIntroduction} can be easily done concurrently.)
In order to do this, we first need to devise a strategy to efficiently
split the data among the MPI processes.
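One possible splitting strategy can be sketched as follows: assign
contiguous runs of periods to the MPI ranks so that each rank gets
roughly the same number of samples. This is only an illustrative
greedy sketch (the function name and interface are hypothetical, not
the code we will actually implement):

```python
def split_periods_among_ranks(samples_per_period, num_ranks):
    """Assign contiguous runs of periods to MPI ranks so that the
    total number of samples per rank is roughly balanced.

    Return a list of `num_ranks` lists of period indexes."""
    target = sum(samples_per_period) / num_ranks
    assignment = [[] for _ in range(num_ranks)]
    rank, acc = 0, 0
    for idx, num in enumerate(samples_per_period):
        # Move to the next rank once the current one has reached its
        # share of the samples, as long as ranks are still available
        if acc >= target and rank < num_ranks - 1:
            rank += 1
            acc = 0
        assignment[rank].append(idx)
        acc += num
    return assignment

chunks = split_periods_among_ranks([100, 100, 50, 150, 100, 100], 3)
# → [[0, 1], [2, 3], [4, 5]]: each rank gets 200 samples
```

Keeping the runs contiguous matters: samples within a period are
correlated ($1/f$ noise), so a period must never be split across two
ranks.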
The subject of this chapter is the implementation of an ancillary
program, \texttt{index.py}, which scans the input data and writes an
\emph{index file}. The contents of the index file will allow
\texttt{calibrate.py} to quickly decide which files each MPI process
needs to read.
This is the skeleton of the program; we will discuss its
implementation throughout this chapter:
<<index.py>>=
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
from collections import namedtuple
from configparser import ConfigParser
from enum import Enum
from glob import glob
from typing import Any, List, Union
import logging as log
import os.path, sys
from astropy.io import fits
from numba import jit
import click
import numpy as np
__version__ = '1.1.1'
<<Flagging code>>
<<Datatypes used by [[index.py]]>>
<<Functions to calculate the length of each period>>
<<Functions to load the parameters for [[index.py]] from an INI file>>
<<Functions to write the index file to disk>>
<<Other functions used by [[index.py]]>>
<<Implementation of the [[index_main]] function>>
if __name__ == '__main__':
index_main()
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{What is an index file?}
\label{sec:inputMappingIntroduction}
As said above, the amount of data to be read by our program is going
to be quite large. It is unlikely that such data will be saved in
one huge file: typically, they will be split into many FITS files to
be loaded by the various MPI processes.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/period_lengths.pdf}
\caption{\label{fig:periods} Example of splitting a TOD into
offset and calibration periods: the length of each offset period is
determined by the number of samples it contains, not by their time
span.
Shorter periods are likely to contain more flagged data than
others. The length of an offset period (labeled as 1, 2, 3,
\ldots) is arbitrary and variable, but calibration periods
(labeled as A, B, C, D) must be made of an integer number of
offset periods. Offset periods are shorter than calibration
periods because they have to keep track of $1/f$ noise, which
fluctuates faster than gain variations in typical situations. A
further complication stems from the fact that the TOD might be
split into FITS files (\texttt{file01.fits}, \texttt{file02.fits})
whose boundaries do not coincide with the boundaries of the
period, as shown here. Nevertheless, we will implement
[[calibrate.py]] with the constraint that each MPI process only
loads an integer number of offset/calibration baselines.}
\end{figure}
We introduce now a requirement on the way data are split into
offset/calibration periods: we recall that a \emph{calibration period}
is a time interval for which the gain $G$ is constant for all the
samples acquired during the same interval, while an \emph{offset
period} is the same, but for the offset $b$ (see
Eq.~\ref{eq:radiometer}). From now on, we allow the two periods to be
different, with the constraint that a calibration period must span an
integer number of offset periods. (This stems from the fact that
offset periods need to keep track of $1/f$ noise, which typically
shows faster variations than gain changes.) See Fig.~\ref{fig:periods} for
an example. In this way, when we implement [[calibrate.py]] in
Chapter~\ref{ch:daCapo}, we will assign an integer number of
calibration periods to each MPI process, which in turn implies that
each process will get an integer number of offset periods as well.
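The constraint can be stated concretely: since a calibration period is
made of an integer number $k$ of offset periods, the calibration-period
boundaries are simply every $k$-th offset-period boundary. A minimal
sketch (the helper names are hypothetical, used here only for
illustration):

```python
def offset_period_boundaries(num_samples, samples_per_period):
    """Start index of each offset period; the last period may be
    shorter than the others."""
    return list(range(0, num_samples, samples_per_period))

def calibration_period_boundaries(offset_boundaries, periods_per_cal_constant):
    """Each calibration period spans an integer number of offset
    periods, so its boundaries are a subset of the offset-period
    boundaries."""
    return offset_boundaries[::periods_per_cal_constant]

# E.g., a 10,000-sample TOD with 1536-sample offset periods and two
# offset periods per calibration constant
off = offset_period_boundaries(10_000, 1536)
cal = calibration_period_boundaries(off, 2)
```

With [[periods_per_cal_constant = 1]], as in the parameter file shown
earlier, the two sets of boundaries coincide.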
This is the most logical way to split data in order to implement the
\DaCapo{} algorithm using MPI. However, it complicates data loading
from disk, because the data might be split into files at boundaries
that do not correspond to the offset/calibration periods. The most
efficient solution would be to make each MPI process load the data
from disk (i.e., the first MPI process reads the first file, the
second one reads the second file, and so on). Then, the processes
should exchange data among themselves using MPI calls, until each process
has the data it needs: in the example shown in Fig.~\ref{fig:periods},
after MPI process \#1 has read \texttt{file01.fits}, it should ask
process \#2 for the first samples in \texttt{file02.fits}. But this
solution is complicated to implement: as each process must both send
and receive data, the possibility of deadlocks\footnote{A
\emph{deadlock} happens when some process A halts after having