% -*- mode: text -*-
% In Emacs, use "C-x f" to set the fill column; use 80
\documentclass[a4paper,10pt,twoside]{book}
\usepackage{a4wide}
\usepackage{amsmath}
\usepackage{algorithmic}
\usepackage{bm}
\usepackage{noweb}
\usepackage{fancyhdr}
\usepackage{url}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{ccaption}
\usepackage{textcomp}
\usepackage{titlesec}
\usepackage[round]{natbib}
\usepackage{fontspec}
\usepackage[dvipsnames]{xcolor}
\usepackage{verbatim}
\usepackage[bitstream-charter]{mathdesign}
\usepackage[T1]{fontenc}
\newcommand{\DaCapo}{\texttt{Da\,Capo}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\DeclareTextCommandDefault{\nobreakspace}{\leavevmode\nobreak\ }
\noweboptions{smallcode,longchunks}
\def\nwendcode{\endtrivlist \endgroup}
\let\nwdocspar=\par
%\def\nwendcode{\endtrivlist \endgroup \vfil\penalty10\vfilneg}
%\let\nwdocspar=\smallbreak
% Command used to indicate a section
\newcommand{\sectmark}{\S\ }
\titleformat{\section}[block]
{\centering\normalfont\bfseries}
{\sectmark\thesection.}{.5em}{}
\titleformat{\subsection}[runin]
{\normalfont\bfseries}
{\thesubsection.}{.5em}{}[. ]
\titleformat{\subsubsection}[runin]
{\normalfont\bfseries}
{}{.2em}{}[. ]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\sectionmark}[1]{%
\markright{\thesection.\ #1}}
\fancyhf{}
\fancyhead[L,RO]{\bfseries\thepage}
\fancyhead[LO]{\bfseries\rightmark}
\fancyhead[RE]{\bfseries \DaCapo{} calibration}
\fancypagestyle{plain}{%
\fancyhf{}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\matr}[1]{\bm{#1}}
\newcommand{\vect}[1]{\bm{#1}}
\captionnamefont{\small\bfseries}
\captiontitlefont{\small\itshape}
\newcommand\myshade{85}
\definecolor{hypercolor}{RGB}{29,152,29}
\colorlet{mylinkcolor}{hypercolor}
\colorlet{mycitecolor}{hypercolor}
\colorlet{myurlcolor}{hypercolor}
\hypersetup{
pdftitle={Calibrating TODs using the DaCapo algorithm},
pdfauthor=Maurizio Tomasi,
pdfsubject={Commented implementation of a Python program to simulate
the calibration of TODs},
pdfkeywords={CMB {data analysis} {optics}},
pdfborder={0 0 0},
linkcolor = mylinkcolor!\myshade!black,
citecolor = mycitecolor!\myshade!black,
urlcolor = myurlcolor!\myshade!black,
colorlinks = true,
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\bibliographystyle{plainnat}
\title{Calibrating TODs using the \DaCapo{} algorithm}
\author{Maurizio Tomasi}
\maketitle
\frontmatter
\tableofcontents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter*{Introduction}
\label{ch:introduction}
This document presents the implementation of a Python\ 3 program that
is able to calibrate Time Ordered Data (TOD) measured by a CMB space
survey using the signal of the CMB dipole. The underlying assumption
is that the signal of the dipole is strong enough to be used for a
photometric calibration, and that the scanning strategy of the
spacecraft is such that the full dynamic range of the dipole signal is
explored regularly during the survey.
The algorithm used for the calibration is \DaCapo, which was developed
by Dr.~E.~Keih\"anen and is presented in
\citet{planck2015.lfi.calibration}. The algorithm uses many concepts
developed in the context of destriping
\citep{burigana.1997.destriping,keihanen2004}, and it is able to
produce an estimate of three quantities at the same time:
\begin{enumerate}
\item A time-dependent calibration factor;
\item A timestream of offsets, used to model the $1/f$ part of the
noise in the TOD;
\item An estimate of the sky map.
\end{enumerate}
This implementation of the algorithm includes two Python\ 3 programs:
the first program, [[index.py]], scans the potentially large set of
FITS files containing the TODs and creates an \emph{index file}, which
is used by the second program, [[calibrate.py]], to decide how to
split the workload among the MPI processes used in the computation;
then, [[calibrate.py]] runs the \DaCapo{} algorithm and outputs the set
of products listed above. The most critical parts of the code have
been written in Fortran\ 2003, in order to improve the execution
speed; Python bindings have been generated automatically using
[[f2py]].
The source code of the program is provided in this document in its
full form. It has been typeset using
\texttt{noweb}\footnote{\url{http://www.cs.tufts.edu/~nr/noweb/}.}, a
\emph{literate programming} tool that makes it possible to produce source code
files and \LaTeX{} documentation at the same time. This document is
meant to be read from the first to the last page, like a narrative: my
hope is that it is readable enough for other people to understand a
number of concepts related to High Performance Computing, i.e., the
management of large quantities of data, the distribution of work among
MPI processes, and the ways one can produce scientific results that
are both reliable and reproducible.
My biggest thanks go to Elina~Keih\"anen, who helped me a lot in
understanding the details of the \DaCapo{} algorithm and gave me a few
suggestions that simplified my work considerably. I would also like to
thank the CORE systematics working group for their suggestions on how
to improve the scientific quality of the code's results: in
particular, Jacques Delabrouille, Ted Kisner, Diego Molinari, and
Paolo Natoli.
\mainmatter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Presentation of the \DaCapo{} algorithm}
\label{ch:daCapoIntroduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:introduction}
In this document we will provide the complete source code of a program
which reads a set of Time Ordered Data (TOD) files and performs a
photometric calibration of the data using the dipole signal. We use
the [[noweb]]
system\footnote{\url{http://www.cs.tufts.edu/~nr/noweb/}.} by Norman
Ramsey to implement the code: two programs, called \emph{tangler} and
\emph{weaver}, are used to extract the source code to compile
([[notangle]]) and the \LaTeX{} file used to produce a standalone
document ([[noweave]]) from the same document, which is a text file
with extension \texttt{.nw}.
The code is written in Python 3, and it reads a set of FITS files
containing Time-Ordered Information (TOI), i.e., tables containing the
uncalibrated samples produced by a detector scanning the sky and the
timing and pointing information. The program disentangles the noise,
the dipolar signal caused by the Doppler effect induced by the motion
of the spacecraft, and the sky signal, and it produces the following
outputs:
\begin{enumerate}
\item A timestream of gain constants, used to perform the photometric
calibration;
\item A calibrated sky map, in Kelvin;
\item A set of offsets, which give a rough approximation of the $1/f$
component of the noise.
\end{enumerate}
The program uses the \DaCapo{} algorithm developed by
Dr.~E.~Keih\"anen and described in
\citet{planck2015.lfi.calibration}. The algorithm is briefly presented
in Sect.~\ref{sec:DaCapoMathematicalModel}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Modelling the output of a detector scanning the sky}
\label{sec:DaCapoMathematicalModel}
\begin{figure}[tbh]
\centering
\includegraphics{figures/core_scan_example.pdf}
\caption{\label{fig:coreScanExample} Example of a TOD stream. The
overall signal (black line) is mainly due to the sum of two
components, the smooth component $G_k\,D_i$ (thick gray line) due
to the dipole and the fast-changing $G_k T_i$ component due to the
Galaxy and the CMB (thin gray line), plus a noise component due to
$1/f$ fluctuations and white noise.}
\end{figure}
In this work we consider a detector which measures the signal entering
the optical system from a fixed direction of the sky. Each sample in
the Time Ordered Data (TOD) produced by the detector can be modelled
using the following equation:
\begin{equation}
\label{eq:radiometer}
V_i = G_k \bigl(T_i + D_i\bigr) + b_k + N_i,
\end{equation}
where $V_i$ is an uncalibrated sample ($[V_i] = \text{V}$), $G_k$ is
the gain\footnote{We assume from now on that the output of the
detector is a voltage. However, the code runs fine as long as the
detector's output is some quantity that is proportional to the
temperature of the object within the main beam of the instrument.}
($[G] = \text{V/K}$), $T_i$ is the signal associated with Galactic
emissions and the CMB ($[T_i] = \text{K}$), $D_i$ is the signal of the
CMB dipole (it might already have been convolved with the beam
response $\gamma$, i.e., $D_i \equiv \gamma * D$; in any case, $[D_i]
= \text{K}$), $b_k$ is the offset used to model the slow $1/f$ noise
variations ($[b_k] = \text{V}$), and $N_i$ is a white noise component
($[N_i] = \text{V}$). For a realistic example of the behaviour in time
of these components, see Fig.~\ref{fig:coreScanExample}.
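As a rough, stand-alone illustration of Eq.~\eqref{eq:radiometer} (not part of the program itself), the model can be simulated in a few lines of NumPy; every numerical value below (gain, offset, noise level, signal amplitudes) is a hypothetical placeholder:

```python
import numpy as np

rng = np.random.default_rng(12345)

n_samples = 1000   # samples in a single calibration period (fixed k)
G_k = 65.0         # gain [V/K] (hypothetical value)
b_k = 0.02         # 1/f offset for this period [V]
sigma = 1.0e-3     # white-noise RMS [V]

# Hypothetical sky and dipole signals [K]: a small, fast-varying
# Galactic/CMB term plus a smooth dipole sweep, as in the figure.
T = 1.0e-4 * rng.standard_normal(n_samples)
D = 3.3e-3 * np.sin(np.linspace(0.0, 2.0 * np.pi, n_samples))

N = sigma * rng.standard_normal(n_samples)   # white noise [V]
V = G_k * (T + D) + b_k + N                  # uncalibrated samples
```

Note how, with these numbers, the dipole term $G_k D_i$ dominates the sky term $G_k T_i$, which is the regime in which dipole calibration works well.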
\begin{figure}[tbh]
\centering
\includegraphics[height=5cm]{figures/1fnoise.pdf}
\caption{\label{fig:1fnoise} How noise is modelled by \DaCapo. A
$1/f^2$ noise realization is approximated with a set of offsets,
each calculated by averaging the noise samples over periods of
0.5\,s (\textit{left}). In this example, the $1/f^2$ noise knee
frequency is $f_\text{knee} = 0.5\,\text{Hz}$, which corresponds
to a period of $2\,\text{s}$: this is four times greater than the
period of each offset (\textit{middle}). The difference between
the $1/f^2$ noise samples and the offsets reveals an almost pure
white noise component (\textit{right}).}
\end{figure}
The idea behind Eq.~\eqref{eq:radiometer} is to consider the noise
part as composed of a set of constant baselines $b_k$, which
approximate the $1/f$ component of the noise, and a purely white noise
component. This works if the period of each baseline is smaller than
$1/f_\text{knee}$, with $f_\text{knee}$ being the knee frequency of
the $1/f$ noise power spectrum. See Fig.~\ref{fig:1fnoise}.
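The splitting of the noise into baselines plus a white residual can be mimicked in a few lines. The noise model below is only a crude stand-in for correlated noise (a random walk plus white noise, not a true $1/f^2$ realization), and all values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

baseline_len = 50     # samples per baseline, e.g. 0.5 s at 100 Hz
n_baselines = 80
n = baseline_len * n_baselines

# Crude stand-in for correlated noise: a slow random-walk drift
# plus white noise with RMS 0.1 (arbitrary units).
drift = 1.0e-3 * np.cumsum(rng.standard_normal(n))
white = 0.1 * rng.standard_normal(n)
noise = drift + white

# Each offset b_k is the average of the noise over its baseline...
offsets = noise.reshape(n_baselines, baseline_len).mean(axis=1)

# ...and removing the offsets leaves an almost pure white residual,
# as in the right panel of the figure.
residual = noise - np.repeat(offsets, baseline_len)
```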
\begin{figure}[tbh]
\centering
\includegraphics{figures/TOD_indexing.pdf}
\caption{\label{fig:todindexing} How samples and calibration
periods are indexed in this document. A minimal TOD containing
$N=11$ samples is shown here. It has been arbitrarily split into
two calibration periods (indexed by $k$), containing 6 and 5
samples respectively. The first 6 samples are calibrated using the
same gain $G_{k=1}$, the last 5 using $G_{k=2}$. Note that,
given some $i$, the value of $k$ can always be derived
unambiguously.}
\end{figure}
The \DaCapo{} algorithm was developed by E.~Keih\"anen for the
calibration of the Planck/LFI timelines released in 2015. Its purpose
is to find an optimal estimate for the gains $G_k$ and the offsets
$b_k$. In the case of the 2015 Planck data release, both the gains and
the offsets were calculated once every hour. I call such a time frame a
\emph{calibration period}, and I index it using the symbol $k$: thus,
$G_k$ and $b_k$ are the gain and offset to be used for all the samples
acquired during the $k$-th calibration period. (Later, we will
consider the more interesting case where the calibration period and
the offset period are of different lengths.) The $i$ symbol is
reserved for indexing samples in a TOD. See Fig.~\ref{fig:todindexing}
for an example.
The two unknowns of Eq.~\eqref{eq:radiometer} are $G_k$ and $b_k$,
while all the other quantities are supposed either to be known ($D_i$,
$V_i$) or unimportant for the computation ($N_i$, $T_i$ if we apply a
mask first). The value $T_i$ is typically taken from a pixelized map,
so
\begin{equation}
T_i = \sum_p P_{ip} m_p,
\end{equation}
where $P$ is the so-called \emph{pointing matrix}, a rectangular
matrix of size $N \times M$, where $N$ is the number of elements in
the TOD and $M$ is the number of pixels in the map. The index $p$ runs
over all the $M$ pixels in the map. The value of the element $P_{ip}$
is 1 if the $i$-th sample in the TOD has been measured while the
instrument was pointing towards pixel $p$, 0 otherwise.
An example of a pointing matrix is the following (assuming a TOD with length $N
= 9$ and a map with $M = 3$ pixels):
\begin{equation}
\label{eq:pexample}
\matr{P} = \begin{pmatrix}
1& 0& 0\\
1& 0& 0\\
0& 1& 0\\
1& 0& 0\\
0& 1& 0\\
0& 0& 1\\
0& 0& 1\\
1& 0& 0\\
0& 0& 1\\
\end{pmatrix}.
\end{equation}
An important property of this matrix is that the product $P^T P$ is a
square diagonal matrix where each diagonal element at position $pp$ is
equal to the number of times pixel $p$ has been observed. In our
example:
\begin{equation}
\matr{P}^T\matr{P} = \begin{pmatrix}
4& 0& 0\\
0& 2& 0\\
0& 0& 3\\
\end{pmatrix}.
\end{equation}
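The toy example above can be checked numerically. In practice one never stores $\matr P$ densely: a vector of pixel indices carries the same information, and the diagonal of $\matr P^T \matr P$ is just a histogram of hit counts, which NumPy computes with [[np.bincount]]:

```python
import numpy as np

# Pixel hit by each of the N = 9 samples (0-based), matching the
# rows of the example pointing matrix.
pix = np.array([0, 0, 1, 0, 1, 2, 2, 0, 2])
n_pix = 3

# Dense pointing matrix: P[i, p] = 1 if sample i falls in pixel p.
P = np.zeros((len(pix), n_pix))
P[np.arange(len(pix)), pix] = 1.0

# P^T P is diagonal, with the hit count of each pixel on the diagonal...
hits_matrix = P.T @ P

# ...so the dense product is never needed: a bincount suffices.
hits = np.bincount(pix, minlength=n_pix)
```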
To determine the gains, the \DaCapo{} algorithm minimizes the quantity
\begin{equation}
\label{eq:chisq}
\chi^2 = \sum_i \frac{\bigl(V_i - V^\text{model}_i\bigr)^2}{\sigma_i^2},
\end{equation}
where
\begin{equation}
V^\text{model}_i = G_k \left(\sum_p P_{ip} m_p + D_i\right) + b_k
\end{equation}
is our model for the output of the detector, which assumes no white noise
and a static sky signal that does not depend on time (thus, variable
sources have to be masked before applying the algorithm).
Since Eq.~\eqref{eq:chisq} is not quadratic in its unknowns (the model
contains the product of the unknowns $G_k$ and $m_p$),
\citet{planck2015.lfi.calibration} linearizes it in the following way:
\begin{equation}
V^\text{model}_i =
G_k \left(D_i + \sum_p P_{ip} m^0_p\right)
+ G^0_k \sum_p P_{ip} \bigl(m_p - m^0_p\bigr)
+ \left(\bigl(G_k - G^0_k\bigr) \sum_p P_{ip} \bigl(m_p - m^0_p\bigr)\right)
+ b_k
\end{equation}
(in the Planck 2015 paper there is a typo: the part $\sum_p P_{ip}$ is
missing from the third term on the right-hand side), where $m^0$ and $G^0$
are the results of the previous iteration. The algorithm iterates to
converge towards the solution. Since the third term is of the second
order, it can be dropped, and the model becomes
\begin{equation}
\label{eq:vmodapprox}
V^\text{model}_i \approx
G_k \left(D_i + \sum_p P_{ip} m^0_p\right)
+ G^0_k \sum_p P_{ip} \tilde{m}_p
+ b_k,
\end{equation}
where
\begin{equation}
\tilde{m}_p = m_p - m^0_p.
\end{equation}
To further simplify Eq.~\eqref{eq:chisq}, we rewrite part of the terms in
Eq.~\eqref{eq:vmodapprox} by introducing a new matrix $F_{ij}$, where $j$ is an
index which runs over all the available calibration periods, and a vector $a$
which contains all the unknowns ($G_k$ and $b_k$):
\begin{equation}
\sum_j F_{ij} a_j \equiv G_k \left( D_i + \sum_p P_{ip} m^0_p \right) + b_k.
\end{equation}
Taking as example the pointing matrix in Eq.~\eqref{eq:pexample}, we imagine
here that there are two calibration periods, with 5 and 4 samples respectively.
Then,
\begin{equation}
\label{eq:matrF}
\matr F = \begin{pmatrix}
1& 0& D_1 + m^0_1& 0\\
1& 0& D_1 + m^0_1& 0\\
1& 0& D_2 + m^0_2& 0\\
1& 0& D_1 + m^0_1& 0\\
1& 0& D_2 + m^0_2& 0\\
0& 1& 0& D_3 + m^0_3\\
0& 1& 0& D_3 + m^0_3\\
0& 1& 0& D_1 + m^0_1\\
0& 1& 0& D_3 + m^0_3\\
\end{pmatrix},
\qquad \vect{a} = \begin{pmatrix}
b_1\\
b_2\\
G_1\\
G_2\\
\end{pmatrix}.
\end{equation}
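As a sanity check, this toy $\matr F$ can be assembled explicitly. The numerical values chosen below for the pixelized dipole, for $m^0$, and for $\vect a$ are arbitrary placeholders:

```python
import numpy as np

# Toy setup matching the examples above: N = 9 samples, M = 3 pixels,
# two calibration periods with 5 and 4 samples.
pix = np.array([0, 0, 1, 0, 1, 2, 2, 0, 2])     # pixel hit by each sample
period = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])  # calibration period of each sample
D_pix = np.array([3.0, -1.0, 0.5])              # pixelized dipole [K] (hypothetical)
m0 = np.array([0.1, 0.2, -0.3])                 # previous map estimate [K] (hypothetical)

n_samples = len(pix)
n_periods = period.max() + 1

# Unknowns are ordered as in the equation: a = (b_1, b_2, G_1, G_2).
F = np.zeros((n_samples, 2 * n_periods))
F[np.arange(n_samples), period] = 1.0                               # offset columns
F[np.arange(n_samples), n_periods + period] = D_pix[pix] + m0[pix]  # gain columns

a = np.array([0.01, 0.02, 60.0, 61.0])  # (b_1, b_2, G_1, G_2), hypothetical
model = F @ a                           # G_k (D_i + m^0) + b_k, sample by sample
```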
Note that $\matr F$ contains both the dipole signal and the previous guess $m^0$
for the sky map. If we rewrite the second term on the right-hand side of
Eq.~\eqref{eq:vmodapprox} as
\begin{equation}
\label{eq:ptildedef}
\matr{\tilde{P}}_{ip} = G^0_k \matr{P}_{ip},
\end{equation}
then Eq.~\eqref{eq:chisq} reaches its final form
\begin{equation}
\label{eq:chisqfinal}
\chi^2 = \bigl(\vect{V} - \matr{\tilde{P}}\,\vect{\tilde{m}} - \matr{F} \vect{a}\bigr)^T \matr{C}_n^{-1} \bigl(\vect{V} -
\matr{\tilde{P}}\,\vect{\tilde{m}} - \matr{F}\,\vect{a}\bigr),
\end{equation}
with $C_n$ being the noise covariance matrix for the noise component
$N_i$. Equation~\eqref{eq:chisqfinal} is easy to understand: it
requires minimizing the difference between the measured vector
$\vect{V}$ and the vector $\matr{F}\vect{a}$, which was created by
scanning the sky map (dipole and Galaxy) and assuming some gains $G_k$
and a fixed $1/f$ noise realization $b_k$. The term
$\matr{\tilde{P}}\vect{\tilde{m}}$ takes into account the
discrepancies between the map used in building $\matr{F}\vect{a}$
(produced during the previous iteration) and the new map being
estimated now.
The calibration solution is found by minimizing Eq.~\eqref{eq:chisqfinal} with
respect to $\tilde{m}$ and then with respect to $a$. The solution $a$ must
satisfy the equation
\begin{equation}
\label{eq:cjgr}
\matr{A} \vect{a} = \vect{v},
\end{equation}
where
\begin{align}
\label{eq:Amatrix} \matr{A} &= \matr{F}^T \matr{C}_n^{-1} \matr{Z} \matr{F},\\
\label{eq:vVector} \vect{v} &= \matr{F}^T \matr{C}_n^{-1} \matr{Z} \vect{V},\\
\label{eq:zdef} \matr{Z} &= \matr{I} - \matr{\tilde P}\bigl(\matr{\tilde P}^T \matr{C}_n^{-1} \matr{\tilde{P}}\bigr)^{-1} \matr{\tilde{P}}^T \matr{C}_n^{-1}.
\end{align}
(The definition of $\matr Z$ in \citet{planck2015.lfi.calibration} has
a typo.)
Once a first estimate $\vect a_1$ for $\vect a$ is found, we can solve
for $\vect{\tilde m}_1$:
\begin{equation}
\label{eq:tildeMap}
\vect{\tilde m}_1 = \bigl(\matr{\tilde P}^T \matr{C}_n^{-1} \matr{\tilde
P}\bigr)^{-1} \matr{\tilde P}^T \matr{C}_n^{-1} \bigl(\vect V - \matr
F \vect a_1\bigr),
\end{equation}
and then we iterate again, starting from $(\vect a_1, \vect{\tilde
m}_1)$ to find an improved solution $(\vect a_2, \vect{\tilde m}_2)$
and so on until we reach convergence.
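To make the scheme concrete, here is a stand-alone sketch of a single iteration on the toy data set used above, with dense matrices and white noise ($\matr C_n = \sigma^2 \matr I$), so that every quantity in Eqs.~\eqref{eq:Amatrix}--\eqref{eq:tildeMap} can be written out explicitly. All input values are hypothetical, and the real implementation presented later works on far larger, distributed data:

```python
import numpy as np

rng = np.random.default_rng(7)

pix = np.array([0, 0, 1, 0, 1, 2, 2, 0, 2])      # pixel per sample
period = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])   # calibration period per sample
n, n_per, n_pix = len(pix), 2, 3
sigma = 0.1

P = np.zeros((n, n_pix))
P[np.arange(n), pix] = 1.0
D = np.array([3.0, -1.0, 0.5])[pix]              # pixelized dipole, scanned into a TOD

# Simulated data with "true" gains, offsets and sky (all hypothetical)
G_true, b_true = np.array([65.0, 66.0]), np.array([0.01, 0.02])
m_true = np.array([0.1, -0.2, 0.05])
V = G_true[period] * (P @ m_true + D) + b_true[period] + sigma * rng.standard_normal(n)

# Previous iteration: gain estimate G^0 and map estimate m^0
G0, m0 = np.array([60.0, 60.0]), np.zeros(n_pix)

F = np.zeros((n, 2 * n_per))
F[np.arange(n), period] = 1.0
F[np.arange(n), n_per + period] = D + P @ m0

Ptil = G0[period][:, None] * P                   # P~_ip = G0_k P_ip
Cinv = np.eye(n) / sigma**2
Minv = np.linalg.inv(Ptil.T @ Cinv @ Ptil)
Z = np.eye(n) - Ptil @ Minv @ Ptil.T @ Cinv      # the Z matrix

A = F.T @ Cinv @ Z @ F
v = F.T @ Cinv @ Z @ V

# Without further constraints A is singular (gains and map are
# degenerate), so we take the minimum-norm least-squares solution
# instead of inverting A directly.
a1 = np.linalg.lstsq(A, v, rcond=None)[0]
m1 = Minv @ Ptil.T @ Cinv @ (V - F @ a1)         # map update
```

Note that in this toy case $\matr A\,(0,0,1,1)^T = 0$ exactly: the dipole lives entirely in pixel space, so a common shift of the gains can be absorbed by the map. This is precisely the degeneracy discussed next.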
\citet{planck2015.lfi.calibration} explains that the equations derived
so far have serious degeneracies, which prevent them from being solved
for any practical purpose. The most obvious one is the ambiguity
between $D_i$ and $T_i$: in the extreme case $T_i \propto D_i$ (a sky
signal similar to the dipole), there would be a perfect degeneracy
between $G_k$ and the sky map. In a more realistic case, $T_i$ would
be the sum of a dipole-free component and a nonzero dipolar component,
and the degeneracy would be smaller but still present. Therefore, the
authors introduced a constraint into the equations, which is related to
the peculiar way the dipole is used in the calibration. We only report
their results here:
\begin{enumerate}
\item We need to introduce a new matrix, $\matr{m}_c$, which is an
$M\times 2$ matrix containing the dipole and the monopole:
\begin{equation}
\label{eq:constrainMatrix}
\matr{m}_c = \begin{pmatrix}
D_1& 1\\
D_2& 1\\
D_3& 1\\
\hdotsfor{2}\\
D_M& 1
\end{pmatrix}.
\end{equation}
\item The definition of matrix $\matr{A}$ (Eq.~\ref{eq:Amatrix}) remains
the same, but matrix $\matr{Z}$ becomes
\begin{equation}
\label{eq:newZmatrix}
\matr{Z} = \matr{I} - \matr{\tilde P} \bigl(\matr{M} +
\matr{C}_m^{-1}\bigr)^{-1} \matr{\tilde{P}}^T \matr{C}_n^{-1},
\end{equation}
with
\begin{align}
\label{eq:Mmatrix}
\matr{M} &= \matr{\tilde{P}}^T \matr{C}_n^{-1} \matr{\tilde{P}},\\
\label{eq:MCmMatrix}
\bigl(\matr{M} + \matr{C}_m^{-1}\bigr)^{-1} &= \matr{M}^{-1} -
\matr{M}^{-1} \matr{m}_c \bigl(\matr{m}_c^T \matr{M}^{-1}
\matr{m}_c\bigr)^{-1} \matr{m}_c^T \matr{M}^{-1}.
\end{align}
\item With these modifications, the solution has the property that
there is no dipole in the map $\matr{m}$, in the sense that
\begin{equation}
\label{eq:dipoleConstraint}
\sum_p \matr{m}_p \matr{D}_p = 0.
\end{equation}
Moreover, the monopole of the map is zero, i.e.,
\begin{equation}
\label{eq:monopoleConstraint}
\left<\matr m\right> = 0.
\end{equation}
\end{enumerate}
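The effect of the constraint can be sketched in isolation: projecting a map onto the subspace where Eqs.~\eqref{eq:dipoleConstraint} and \eqref{eq:monopoleConstraint} hold. Beware that this sketch uses a plain orthogonal projection, while the actual algorithm weights the projection through $\matr M = \matr{\tilde P}^T \matr C_n^{-1} \matr{\tilde P}$ (Eq.~\ref{eq:MCmMatrix}); the map and dipole values below are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix = 12
D = np.cos(np.linspace(0.0, np.pi, n_pix))    # toy pixelized dipole template
m = rng.standard_normal(n_pix)                # some map estimate

# The constraint matrix m_c holds the dipole and monopole templates.
m_c = np.column_stack([D, np.ones(n_pix)])

# Orthogonal projector onto the complement of span(m_c): the projected
# map has neither a dipole component nor a monopole.
proj = np.eye(n_pix) - m_c @ np.linalg.solve(m_c.T @ m_c, m_c.T)
m_constrained = proj @ m
```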
In the remainder of this document, I will show how the code can be
used to perform the aforementioned calculations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{How to install and build the code}
Before presenting the implementation of the code, I show here how the
source code can be obtained and compiled. The code has been developed
on Linux machines, but it should be fairly portable to POSIX
architectures (e.g., Mac OS X). Windows users might have some
problems, as the code relies on the
Healpy\footnote{\url{https://github.com/healpy/healpy}.} library,
which has not been ported to Windows at the time of writing (December
2016).
\subsection{Installing the code from source}
To install and compile the source code of this program, you will need
the following dependencies:
\begin{enumerate}
\item Python\ 3.x (version 3.5 was used in the development; the code
might work with earlier versions, but this has not been
tested);
\item Noweb (\url{http://www.cs.tufts.edu/~nr/noweb/}), for creating
the [[.py]] source codes and this documentation: you can
install it under Ubuntu with the command [[apt install noweb]];
\item NumPy (\url{http://www.numpy.org/}), for low-level vector/matrix
computations;
\item SciPy (\url{https://www.scipy.org/}), for higher-level
vector/matrix computations;
\item AstroPy (\url{http://www.astropy.org/}), for writing/reading
FITS files;
\item MPI4py (\url{https://pythonhosted.org/mpi4py/}), for MPI
communications;
\item Numba (\url{http://numba.pydata.org/}), to speed up a few
Python routines;
\item Healpy (\url{https://github.com/healpy/healpy}), used to work
with Healpix maps;
\item Click (\url{http://click.pocoo.org/5/}), for implementing a
smart command-line interface;
\item Autopep8, to nicely re-indent the Python files produced by
Noweb;
\item A Fortran compiler supported by [[f2py]] (both [[gfortran]]
      and the Intel Fortran compiler work).
\end{enumerate}
After you have all the requirements installed, you can download the
source codes discussed in this text from the GitHub repository hosted
at the following URL:
\begin{center}
\url{https://github.com/ziotom78/dacapo_calibration}
\end{center}
\begin{table}
\begin{tabular}{lll}
Variable& Default value& Notes\\
[[NOWEAVE]]& [[noweave]]& Part of Noweb\\
[[NOTANGLE]]& [[notangle]]& Part of Noweb\\
[[CPIF]]& [[cpif]]& Part of Noweb\\
[[TEX2PDF]]& [[lualatex]]& \\
[[BIBTEX]]& [[bibtex]]& \\
[[PYTHON]]& [[python3]]& \\
[[F2PY]]& [[f2py]]& Bundled with NumPy\\
[[AUTOPEP8]]& [[autopep8]]& You can use [[yapf]] as an alternative\\
[[DOCKER]]& [[sudo docker]]& You might want to remove [[sudo]]\\
[[MPIRUN]]& [[mpirun -n 2]]& Used for the integration tests\\
[[INKSCAPE]]& [[inkscape]]& Only used to build the PDF file
\end{tabular}
\caption{\label{tab:makeVariables} List of the variables that can
be defined in [[configuration.mk]].}
\end{table}
To compile the program, just run [[make]]. If the defaults used by
[[Makefile]] do not suit you, you can create a file named
[[configuration.mk]] in the same directory where [[Makefile]] resides,
which can provide new values for the variables listed in
Table~\ref{tab:makeVariables}.
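As an example, a hypothetical [[configuration.mk]] overriding a few of the defaults in Table~\ref{tab:makeVariables} might look like this (the values shown are purely illustrative):

```make
# configuration.mk -- local overrides for the Makefile defaults
# (variable names from the table above; values are only examples)
TEX2PDF = xelatex
PYTHON = python3.5
DOCKER = docker            # drop "sudo" if your user is in the "docker" group
MPIRUN = mpirun -n 4       # run the integration tests with 4 MPI processes
```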
\subsection{Using Docker to describe the installation}
\label{sec:dockerContainer}
We provide here a Docker file which installs all the requirements in a
virtual machine based on the [[miniconda3]]
image\footnote{\url{https://hub.docker.com/r/continuumio/miniconda3/}.}
provided by Continuum Analytics. Docker is a containerization platform
that makes it possible to spawn small Virtual Machines (VMs) running some
specific task within a Linux system. Docker's advantage over other
virtualization systems (e.g., QEmu, VirtualBox, VMWare\ldots) is that
such machines are really fast to start. They are typically used to run
a service or a task that needs to be run in an isolated and controlled
environment.
In our context, Docker is the ideal solution to test for missing
dependencies in the environment we use. We describe an environment,
called \emph{image}, by means of one of Docker's preconfigured VMs
available on the Docker Hub website (\url{https://hub.docker.com/}).
There are plenty of them; we will pick one of the official Anaconda
Python images by Continuum Analytics. These virtual machines are
based on well-known Linux distributions (Continuum Analytics uses
Debian), and they come with a basic Python distribution already
installed. For our purposes, Anaconda's VMs are not enough, as we need
a few more packages not included there. So we specify a set of
commands that have to be run in order to configure the VM properly.
The recipe to build a Docker image must be written in a text file,
usually called [[Dockerfile]]. Our [[Dockerfile]] contains the name of
the base Anaconda Python VM, the name of the maintainer (myself), and
a set of commands to be run from the command line which prepare the
VM. In such commands (the lines beginning with [[RUN]]) we can use
Debian's and Anaconda's standard tools to install programs and
libraries:
<<Dockerfile>>=
FROM continuumio/miniconda3
MAINTAINER Maurizio Tomasi <maurizio.tomasi@unimi.it>
RUN apt-get -y update && \
apt-get -y upgrade && \
apt-get install -y \
apt-utils \
g++ \
gcc \
gfortran \
git \
inkscape \
make \
noweb \
openmpi-bin \
texlive-latex-base \
texlive-latex-extra
RUN conda install \
astropy \
click \
matplotlib \
mpi4py \
numba \
numpy \
pytest \
scipy
RUN pip install \
autopep8 \
healpy \
quaternionarray
CMD git clone https://github.com/ziotom78/dacapo_calibration && \
cd dacapo_calibration && \
make all && \
make fullcheck
@
The last [[CMD]] command will be explained later.
To build the Docker image, you have to enter the [[docker]]
directory within the source code repository and run [[docker build]]:
\begin{verbatim}
$ sudo docker build -t="ziotom78:dacapo" .
\end{verbatim}
This task needs to be redone only when [[Dockerfile]] changes. It will
take a while, since it needs to build the entire VM, running all the
[[RUN]] command lines and downloading the necessary packages from
Debian's, Anaconda's and PyPI's repositories: once this is done,
subsequent runs will be much faster.
Once the image has been built, you can run the associated
\emph{container}, which is basically an instance of the VM based on
some image (refer to the Docker's documentation for more details about
the terminology):
\begin{verbatim}
$ sudo docker run --name dacapo "ziotom78:dacapo"
\end{verbatim}
When we use [[docker run]], and only in this case, Docker will run the
command on the line beginning with [[CMD]], thus downloading the most
up-to-date version of the source code presented in this document,
compiling it and running the full set of tests.
The advantage of creating a Docker machine is that the full sequence
of operations needed to configure a barebone system in order to
install and run the code is fully documented.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Running the programs}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/FITS_structure.pdf}
\caption{\label{fig:FitsStructure} Structure of a FITS file
created by the program [[calibrate.py]]. The primary HDU and the
first two extension HDU contain copies of the input files used to
run the calculation; they are used to allow users to trace the
history of the computation that produced the results saved in this
file. HDUs \#3, \#4, and \#5 contain the solution of the \DaCapo{}
problem, namely, the offsets and gains that are in $\vect a$
(Eq.~\protect\ref{eq:cjgr}) and the sky map $\vect m$
(Eq.~\protect\ref{eq:tildeMap}). HDU \#6 and following contain
information about the rate of convergence of the algorithm; in
this example we show just two of them, but the effective number
depends on the details of the computation.}
\end{figure}
The programs we are going to implement in the next chapters work
under the following assumptions:
\begin{enumerate}
\item The user must provide one or more FITS files containing the TODs
that must be used as an input for the calculation;
\item A program, [[index.py]] (implemented in Ch.~\ref{ch:indexPy}),
scans the FITS files one by one and counts the number of samples in
each of them, optionally flagging unwanted data; the output of this
program is an \emph{index file};
\item A second program, [[calibrate.py]] (Ch.~\ref{ch:daCapo}),
implements the \DaCapo{} algorithm. It divides the TODs among the MPI
processes according to the information read from the index file, it
performs the calculations shown above, and it collects the results at
the end of the computation. Results are saved in one FITS file, whose
structure is shown in Fig.~\ref{fig:FitsStructure}.
\end{enumerate}
To see an example of how to invoke the code, suppose that you have a
simulated/measured TOD saved in a set of files in some directory
([[$]] indicates the command-line prompt, where you type commands):
\begin{verbatim}
$ ls /storage/my_data/tods/
tod_0001.fits
tod_0002.fits
tod_0003.fits
tod_0004.fits
\end{verbatim}
Using CFITSIO's [[listhead]]\footnote{\url{http://}.} example program,
we can see the structure of each file. Each of them contains one
tabular HDU with a number of columns:
\begin{verbatim}
$ listhead /storage/my_data/tods/tod_0001.fits
[...snip...]
TTYPE1 = 'TIME '
TFORM1 = 'J '
TUNIT1 = 's '
TTYPE2 = 'PIXIDX '
TFORM2 = 'J '
TTYPE3 = 'SIGNAL '
TFORM3 = 'D '
TUNIT3 = 'V '
[...snip...]
\end{verbatim}
The first column ([[TIME]]) contains the time when each sample was
acquired (in seconds); the second column ([[PIXIDX]]) contains the
index of the pixels on the sky sphere that are associated with each
sample; the third column contains the actual, uncalibrated samples
measured by the detector. We assume that all the samples must be used
in the computation, i.e., there is no need to flag the samples. (Our
program will be able to discard bad data, but we are keeping this
example as simple as possible.)
The first step is to run the program [[index.py]], which reads its
input parameters from a file called the \emph{parameter file}. This
is a text file which follows the traditional syntax of INI files; in
our case, it is the following (file [[examples/index.ini]]):
\begin{verbatim}
[input_files]
path = /storage/my_data/tods
mask = tod_????.fits
hdu = 1
column = TIME
[periods]
length = 1536
[output_file]
file_name = index.fits
\end{verbatim}
The [[input_files]] section specifies where to load the files from,
and which HDU and column contain the timing information; [[index.py]]
needs this to determine the length of each offset period. The
[[periods]] section specifies how long each offset period must be, in
seconds. Finally, the [[output_file]] section specifies the name of
the file that will be produced by the program.
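Parameter files of this kind can be parsed with Python's standard
\texttt{configparser} module, which is what our programs use. The
following fragment is a minimal sketch showing how the fields above
map to Python values; the file contents are embedded in a string only
to keep the example self-contained:

```python
from configparser import ConfigParser

# The same parameter file as above, embedded as a string for the example
ini_text = """
[input_files]
path = /storage/my_data/tods
mask = tod_????.fits
hdu = 1
column = TIME

[periods]
length = 1536

[output_file]
file_name = index.fits
"""

conf = ConfigParser()
conf.read_string(ini_text)   # use conf.read(file_name) for a real file

path = conf['input_files']['path']
mask = conf['input_files']['mask']
hdu = conf['input_files'].getint('hdu')             # converted to int
period_length = conf['periods'].getfloat('length')  # seconds
output_file = conf['output_file']['file_name']
```

The [[getint]] and [[getfloat]] accessors take care of the conversion
from the textual representation in the INI file, raising an error if
the value is malformed.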
To run [[index.py]], we must pass the INI file as the first and only
argument on the command line:
\begin{verbatim}
python3 index.py examples/index.ini
\end{verbatim}
The program will read the TOD files one by one and will create the
file [[index.fits]]. We are now ready to run the \DaCapo{}
calibration.
The [[calibrate.py]] program is more complex than [[index.py]], but it
works in a similar way. It requires the user to specify input
parameters through a parameter file, which in our example is the
following:
\begin{verbatim}
[input_files]
index_file = index.fits
signal_hdu = 1
signal_column = SIGNAL
pointing_hdu = 1
pointing_columns = PIXIDX
[dacapo]
t_cmb_K = 2.72548
solsysdir_ecl_colat_rad = 1.7656131194951572
solsysdir_ecl_long_rad = 2.995889600573578
solsysspeed_m_s = 370082.2332
nside = 256
periods_per_cal_constant = 1
cg_stop_value = 1e-9
[output_file]
file_name = dacapo_results.fits
comment = Dummy
\end{verbatim}
The [[input_files]] section specifies the location of the index file
and the columns containing the pointing information\footnote{Pointing
information can be provided either as a pair (colatitude,
longitude), in radians, in which case the Healpix pixelization is
assumed, or as a sequence of integer pixel indexes, like in this
example.} and the signal. The [[dacapo]] section is used to configure
the way calculations are made. Finally, the [[output_file]] section
tells the program how to save the results of the computation.
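The parameters in the [[dacapo]] section describe the motion of the
Solar System with respect to the CMB rest frame; as the names suggest,
this determines the dipole signal used as the calibration reference.
As a quick sanity check of the numbers above, the first-order
(nonrelativistic) dipole amplitude is $\Delta T \approx T_\mathrm{CMB}
\, v / c$; the following sketch (plain Python, no external
dependencies) computes it:

```python
import math

# Parameters from the [dacapo] section above
t_cmb_K = 2.72548              # CMB monopole temperature [K]
solsysspeed_m_s = 370082.2332  # Solar System speed in the CMB frame [m/s]
c = 299792458.0                # speed of light [m/s]

beta = solsysspeed_m_s / c
# First-order dipole amplitude: a few mK
dipole_amplitude_K = t_cmb_K * beta

def dipole_K(theta_rad):
    """Dipole signal seen at angle `theta_rad` from the direction of
    motion (first-order approximation)."""
    return dipole_amplitude_K * math.cos(theta_rad)
```

The amplitude is roughly $3.4\,\text{mK}$, i.e., three orders of
magnitude below the monopole: this is why a careful estimation of the
gains is needed at all.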
The structure of the output file written by [[calibrate.py]] is
complex: it contains the estimates for the unknowns in the \DaCapo{}
problem (gains $G_k$, offsets $b_k$, and the sky map $T_i$), but also
many other ancillary information. This is a variation of an idea
proposed by K.~Riebe in a
talk\footnote{\url{http://www.adass2016.inaf.it/index.php/participant-list/14-talk/124-riebe-kristin}.}
given at the XXVI
conference\footnote{\url{http://www.adass2016.inaf.it/}.} of the
Astronomical Data Analysis Software and Systems (ADASS): to have a
``provenance model'' that records the processing steps leading from
the input raw data to the scientific products. This allows users to
check the quality of the products, as well as to search for possible
error sources. The structure of the FITS file shown in
Fig.~\ref{fig:FitsStructure} allows the user to reconstruct the path
from the input TODs to the estimates for $G_k$, $b_k$, and $T_i$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Mapping the input data}
\label{ch:indexPy}
Before implementing the algorithm described in
Chapter~\ref{ch:daCapoIntroduction}, we need to decide which is the
best way to load the input data to be used by \texttt{calibrate.py},
our implementation of the \DaCapo{} algorithm. Our program is likely to
be applied to large amounts of data: considering one detector with a
sampling frequency of $\sim 10^2\,\text{Hz}$, in one year such a
detector will produce $\sim 10^9$ samples. Considering that each
sample is 8 bytes wide and must be associated with ancillary
information (e.g., pointing angles, flags), storing these data in
memory is likely to require hundreds of GB. Therefore, we
are going to use MPI to implement a program which splits the necessary
computations among a number of computing units; each unit will load
only a subset of the whole data. (This approach works because, as we
shall see, the operations described in
Chapter~\ref{ch:daCapoIntroduction} can be easily done concurrently.)
In order to do this, we first need to devise a strategy to efficiently
split the data among the MPI processes.
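One possible splitting strategy can be sketched as follows: assign
contiguous runs of periods to the MPI ranks so that each rank gets
roughly the same number of samples. This is only an illustrative
greedy sketch (the function name and interface are hypothetical, not
the code we will actually implement):

```python
def split_periods_among_ranks(samples_per_period, num_ranks):
    """Assign contiguous runs of periods to MPI ranks so that the
    total number of samples per rank is roughly balanced.

    Return a list of `num_ranks` lists of period indexes."""
    target = sum(samples_per_period) / num_ranks
    assignment = [[] for _ in range(num_ranks)]
    rank, acc = 0, 0
    for idx, num in enumerate(samples_per_period):
        # Move to the next rank once the current one has reached its
        # share of the samples, as long as ranks are still available
        if acc >= target and rank < num_ranks - 1:
            rank += 1
            acc = 0
        assignment[rank].append(idx)
        acc += num
    return assignment

chunks = split_periods_among_ranks([100, 100, 50, 150, 100, 100], 3)
# → [[0, 1], [2, 3], [4, 5]]: each rank gets 200 samples
```

Keeping the runs contiguous matters: samples within a period are
correlated ($1/f$ noise), so a period must never be split across two
ranks.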
The subject of this chapter is the implementation of an ancillary
program, \texttt{index.py}, which scans the input data and writes an
\emph{index file}. The contents of the index file will allow
\texttt{calibrate.py} to quickly decide which files each MPI process
needs to read.
This is the skeleton of the program; we will discuss its
implementation throughout this chapter:
<<index.py>>=
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
from collections import namedtuple
from configparser import ConfigParser
from enum import Enum
from glob import glob
from typing import Any, List, Union
import logging as log
import os.path, sys
from astropy.io import fits
from numba import jit
import click
import numpy as np
__version__ = '1.1.1'
<<Flagging code>>
<<Datatypes used by [[index.py]]>>
<<Functions to calculate the length of each period>>
<<Functions to load the parameters for [[index.py]] from an INI file>>
<<Functions to write the index file to disk>>
<<Other functions used by [[index.py]]>>
<<Implementation of the [[index_main]] function>>
if __name__ == '__main__':
index_main()
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{What is an index file?}
\label{sec:inputMappingIntroduction}
As said above, the amount of data to be read by our program is going
to be quite large. It is unlikely that such data will be saved in
one huge file: typically, they will be split into many FITS files to
be loaded by the various MPI processes.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/period_lengths.pdf}
\caption{\label{fig:periods} Example of splitting a TOD into
offset and calibration periods: the length of each offset period is
determined by the number of samples it contains, not by their time
span.
Shorter periods are likely to contain more flagged data than
others. The length of an offset period (labeled as 1, 2, 3,
\ldots) is arbitrary and variable, but calibration periods
(labeled as A, B, C, D) must be made of an integer number of
offset periods. Offset periods are shorter than calibration
periods because they have to keep track of $1/f$ noise, which
fluctuates faster than gain variations in typical situations. A
further complication stems from the fact that the TOD might be
split into FITS files (\texttt{file01.fits}, \texttt{file02.fits})
whose boundaries do not coincide with the boundaries of the
period, as shown here. Nevertheless, we will implement
[[calibrate.py]] with the constraint that each MPI process only
loads an integer number of offset/calibration baselines.}
\end{figure}
We introduce now a requirement on the way data are split into
offset/calibration periods: we recall that a \emph{calibration period}
is a time interval for which the gain $G$ is constant for all the
samples acquired during the same interval, while an \emph{offset
period} is the same, but for the offset $b$ (see
Eq.~\ref{eq:radiometer}). From now on, we allow the two periods to be
different, with the constraint that a calibration period must span an
integer number of offset periods. (This stems from the fact that
offset periods need to keep track of $1/f$ noise, which typically
shows faster variations than gain changes.) See Fig.~\ref{fig:periods} for
an example. In this way, when we implement [[calibrate.py]] in
Chapter~\ref{ch:daCapo}, we will assign an integer number of
calibration periods to each MPI process, which in turn implies that
each process will get an integer number of offset periods as well.
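The constraint can be stated concretely: since a calibration period is
made of an integer number $k$ of offset periods, the calibration-period
boundaries are simply every $k$-th offset-period boundary. A minimal
sketch (the helper names are hypothetical, used here only for
illustration):

```python
def offset_period_boundaries(num_samples, samples_per_period):
    """Start index of each offset period; the last period may be
    shorter than the others."""
    return list(range(0, num_samples, samples_per_period))

def calibration_period_boundaries(offset_boundaries, periods_per_cal_constant):
    """Each calibration period spans an integer number of offset
    periods, so its boundaries are a subset of the offset-period
    boundaries."""
    return offset_boundaries[::periods_per_cal_constant]

# E.g., a 10,000-sample TOD with 1536-sample offset periods and two
# offset periods per calibration constant
off = offset_period_boundaries(10_000, 1536)
cal = calibration_period_boundaries(off, 2)
```

With [[periods_per_cal_constant = 1]], as in the parameter file shown
earlier, the two sets of boundaries coincide.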
This is the most logical way to split data in order to implement the
\DaCapo{} algorithm using MPI. However, it complicates data loading
from disk, because the data might be split into files at boundaries
that do not correspond to the offset/calibration periods. The most
efficient solution would be to make each MPI process load the data
from disk (i.e., the first MPI process reads the first file, the
second one reads the second file, and so on). Then, the processes
should exchange data among themselves using MPI calls, until each process
has the data it needs: in the example shown in Fig.~\ref{fig:periods},
after MPI process \#1 has read \texttt{file01.fits}, it should ask
process \#2 for the first samples in \texttt{file02.fits}. But this
solution is complicated to implement: as each process must both send
and receive data, the possibility of deadlocks\footnote{A
\emph{deadlock} happens when some process A halts after having