-
Notifications
You must be signed in to change notification settings - Fork 0
/
05-content.tex
319 lines (262 loc) · 16.2 KB
/
05-content.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
\newcommand{\orgmode}{\texttt{org-mode}}
\newcommand{\Makefile}{\mintinline{shell}{Makefile}}
\newcommand{\context}{\mintinline{shell}{context}}
\newcommand{\CC}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}}
\def\CC{{C\nolinebreak[4]\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
\section{Executive summary}
QMCkl is a library of kernels needed for \ac{QMC} simulations. The current \ac{WP}
focuses on the pedagogical version of the library. Since deliverable D1.1, multiple kernels
were implemented and the design was improved thanks to the feedback of HPC experts and code
developers. A Python interface has been written. Autotools is now used as a build system,
enabling an easy installation on many different architectures, and enabling also an easy
packaging for Linux distributions and package managers such as Spack and Guix.
QMCkl has been introduced in the three \ac{QMC} codes of TREX, and a significant speedup
was measured.
\section{Introduction}
\subsection{Summary of report D1.1}
\Ac{QMCkl} aims at providing a high-performance
implementation of the main kernels of the \ac{QMC} methods.
The current Work Package, WP1, focuses on defining the \ac{API}, the tests,
and a \emph{pedagogical} presentation of the main algorithms.
In WP3, the \ac{HPC} experts use this repository as a reference to re-write
optimized versions of the functions proposed in this pedagogical version of the library.
To maximize portability to unknown architectures, we have chosen to provide a \ac{API} that
is compatible with the C programming language. The foundations of the library are written in C, but Fortran
was chosen to express the kernels because it is more convenient for \ac{QMC} scientists than C to express linear algebra.
The dependencies on external software are kept minimal to facilitate the installation, and the
BSD 3-clause license was chosen to make the adoption of the library easy for private software companies.
As this version of the library is intended to be mainly used as documentation,
we chose to use literate programming\cite{knuth_1992} in {\orgmode}
format.~\cite{schulte_2012,orgmode} The documentation
is first written, and the code illustrates the documentation. From the same source files,
the documentation is generated both in text and HTML formats, as well as the C code of the library.
This practice helps documentation to always be consistent with the code.
The documentation of the current status of the library is available
at \url{https://trex-coe.github.io/qmckl}, and the source code is
available on the GitHub repository at \url{https://github.com/trex-coe/qmckl}.
\subsection{QMCkl's objective}
\newcommand{\Nelec}{N_{\text{elec}}}
\newcommand{\Nelecup}{N_{\text{elec}}^\uparrow}
\newcommand{\Nelecdn}{N_{\text{elec}}^\downarrow}
\newcommand{\br}{\mathbf{r}_1,\dots,\mathbf{r}_{\Nelec}}
\newcommand{\brup}{\mathbf{r}_1,\dots,\mathbf{r}_{\Nelec^\uparrow}}
\newcommand{\brdn}{\mathbf{r}_{\Nelec^\uparrow+1},\dots,\mathbf{r}_{\Nelec}}
The most common form of wave function used in \ac{QMC} simulations is
expressed as a linear combination of Slater determinants $D_I$ multiplied by
a Jastrow correlation factor $\exp(J)$:
\begin{equation}
\Psi(\br) = \left( \sum_I c_I\, D_I(\br) \right) \exp \left( J(\br) \right).
\end{equation}
The Jastrow factor is a positive function of the electron-electron,
electron-nucleus, and electron-electron-nucleus distances.
The Slater determinants are expressed as products of $\uparrow$-spin
$\downarrow$-spin determinants:
\begin{eqnarray}
D_I(\br) = D_i^\uparrow(\brup)\, D_j^\downarrow(\brdn),
\end{eqnarray}
where $\Nelec = \Nelecup + \Nelecdn$ is the number of
electrons, the sum of the numbers of up- and down-spin electrons.
The determinants are obtained from the Slater matrix, built on the basis of \acp{MO},
\begin{equation}
D_i^\uparrow(\brup) = \det(S_i^\uparrow) = \left|
\begin{array}{ccc}
\phi_{i(1)}(\mathbf{r}_1) & \dots & \phi_{i(\Nelecup)}(\mathbf{r}_1) \\
\vdots & \ddots & \vdots \\
\phi_{i(1)}(\mathbf{r}_{\Nelecup}) & \dots & \phi_{i(\Nelecup)}(\mathbf{r}_{\Nelecup})
\end{array}
\right|
\end{equation}
and each Slater matrix differs from the other ones by the indices $\{
i(1), \dots, i(\Nelecup) \}$ of the \acp{MO} composing it.
The \acp{MO} $\phi$ are linear combinations of \acp{AO} $\chi$:
\begin{equation}
\phi_j(\mathbf{r}) = \sum_k C_{kj} \chi_k(\mathbf{r}),
\end{equation}
each \ac{AO} being a function centered on a nucleus with a radial and an angular part.
The angular part is expressed as spherical harmonics $Y_{lm}$ or as a polynomial, and
the angular part is generally a linear combination of Gaussian functions:
\begin{equation}
\chi_k(\mathbf{r}) = (x-X_A)^a (y-Y_A)^b (z-Z_A)^c \sum_l d_{kl}\, \exp \left(
-\gamma_{kl} |\mathbf{r}-\mathbf{R}_A|^2 \right)
\end{equation}
where $\mathbf{r} = (x,y,z)$ is the coordinate of an electron and
$\mathbf{R_A} = (X_A,Y_A,Z_A)$ is the coordinate of nucleus $A$.
At each step of the \ac{QMC} simulation, the wave function must be
evaluated at the electron positions. The dynamics of the electrons
also require the gradient of the wave function with respect to electron
coordinates, and the computation of the kinetic energy requires the Laplacian
of the wave function. The computation of the gradient and Laplacian of
a Slater determinant requires the adjugate of the Slater matrix:
\begin{equation}
\nabla_j D_i = \sum_k \left[\nabla S_i \right]_{jk} \left[\mathrm{adj}(S_i)\right]_{kj}.
\end{equation}
The quantities involved in the calculation of the wave function are the most
important kernels to implement in QMCkl, because they represent the main
computational bottleneck of \ac{QMC} simulations.
\section{Work done since D1.1}
\subsection{New functionalities}
\subsubsection{Implemented kernels}
The most heavily used kernels were implemented. These are:
\begin{itemize}
\item the \acp{AO} with their gradients and Laplacian
\item the \acp{MO} with their gradients and Laplacian
\item the Jastrow factor in the form implemented in CHAMP
\item adjugate of a matrix, using Shermann-Morrison-Woodbury or
specialized functions for small matrices
\end{itemize}
For the computation of the non-local component of the
\ac{ECP}, it is often needed to evaluate the \acp{AO} of
\acp{MO} at extra points. For these points, the gradients and
Laplacian are not required, so we also provide functions that return only
the values for these specific cases.
\subsubsection{Improvement of design}
In the previous design of the library, when a QMCkl function was called the computed
data was copied back into an array provided by the user. We have now
introduced the zero copy function, enabling the possibility to directly compute the
values inside user-allocated memory. However, this possibility
can be dangerous in some situations, because the user of the library has to allocate
the array as expected by the library (on CPU or on GPU). Therefore, these in-place functions
can be called by the user, but they are suggested as dangerous
optimizations to be done when the user is sure that everything works
well with the standard version of the function.
In the initial design, the \acp{AO} and \acp{MO} were computed at the
electron positions. We realized that we could generalize the
usage of QMCkl beyond QMC, for example, for the computation of
\acp{AO} or \acp{MO} on a grid. Such computations arise in the context
of numerical integration, such as in \ac{DFT} or transcorrelated methods,
and also in the context of three-dimensional visualization of \acp{MO},
the electron density, or other functionals such as the \ac{ELF}.\cite{silvi_1994}
In this context, the points on which the functions are evaluated no longer
correspond to the position of the electron. Hence, we have introduced the concept of
\emph{point}. The concept of \emph{electron} is introduced later for the
computation of multi-electron quantities such as the electron repulsion potential,
Slater determinants, or Jastrow factors.
In the initial design, the number of electrons (or points,
now) was fixed equal to the number of electrons in the system
multiplied by the number of random walkers. The fact that \acp{MO} need to be
computed both at the electron positions and at the extra points required
for \acp{ECP} raised a design problem. We have redesigned
the library so that the number of points can be different from one
call to the next one to simplify the use of the library.
TREXIO\cite{trexio} is the library and file format developed in \ac{WP}2.
To simplify even more the usage of QMCkl, we have introduced the
possibility of loading the input data from a TREXIO file in a single
call, avoiding loading all the input parameters one by one. Now, using
QMCkl has become extremely simple if the trial wave function can be
stored in a TREXIO file. Today, TREXIO is interfaced with all the
codes used in TREX and with the external codes we use, so this
modification was a significant improvement.
\subsubsection{Python interface}
We have introduced the automatic generation of a Python binding of
QMCkl using SWIG. This required the creation of \emph{safe} interfaces,
in which when a pointer to an array is passed to a function, the size
of the array is passed as well. This was quite an important change in
the API, but it also allows one to check that the arrays given by the user
are large enough to avoid writing in illegal memory domains.
\subsubsection{Portability}
Portability is an important pillar of QMCkl. Therefore, it is
continuously tested on different architectures (Linux, MacOS) with
different compilers (GNU, Intel, Nvidia, IBM). Today, all possible
combinations compile successfully and pass the tests.
%\subsection{Documentation}
%% TODO
% Examples.org
% Installation instructions
% Tutorials
\subsection{Building the library}
\subsubsection{Autotools}
After having organized training in Autotools\cite{Autotools} and CMake\cite{cmake} in WP6,
we have completely rewritten the QMCkl build process and we are now using
Autotools. As a testing framework is integrated in Autotools, we have
removed the dependency on the external testing framework munit.\cite{munit}
The Autotools scripts are written such that out-of-source builds can
be made. This is important because it enables the possibility to have
multiple build directories configured differently: with or without the
activation of HPC-optimized functions, with different compilers, etc.
In addition, the out-of-source possibility is recommended for creating RPM
or Debian packages, so we have decided to implement this possibility
from the very beginning.
We have also implemented different building options, such as
activating or not the creation of the Python binding, or the
documentation, which can be quite time consuming to build. We also have
added a simple flag to use the external QMCkl-dgemm library developed in WP3 for
fast small-matrix multiplications.
\subsubsection{Packaging}
We have prepared the environment such that packaging will be
simple. As we are now using Autotools, the integration into the Spack
package manager\cite{spack} is trivial. We have created a package for Spack, but
we are waiting for the library to be ready to be released before we
push the Spack package upstream.
Similarly, we have prepared packages for Guix\cite{guix} and Python-pip that
will be pushed in the next release.
Many of our users are using CMake and not Autotools for their code. So
we also provide a \texttt{FindQMCKL} file to help CMake detect
the presence of QMCkl on the system. In addition, we generate files for
\texttt{pkg-config} such that linking a code with QMCkl should be as
simple as possible.
\subsection{Usage of QMCkl in external software}
We have started to use QMCkl in our software. The first use was for
the TREXIO-tools package.
TREXIO-tools is a repository of tools to interact with TREXIO files.
This tool is often used by developers who write an interface of their code with
the TREXIO library. As the \acp{MO} are expected to be orthonormal,
this tool checks that the wave function written in the TREXIO file is
consistent by numerically computing the overlap matrix of \acp{MO} and
comparing it with the identity matrix. If this test fails, it indicates to the
developer that the data are inconsistent and that there is an error in the
definition of the basis set.
In the early days, this computation was performed in Python. Now, if
the QMCKl Python interface is detected on the system, this computation
is performed with the QMCkl Python binding. Orders of magnitudes were
gained in terms of performance, and now the number of grid points can be made large enough to have an acceptable precision in the
numerical integrals.
This example demonstrates that QMCkl can be very useful beyond the \ac{QMC}
framework, for performing three-dimensional visualizations of molecular
orbitals of large systems, for example, or for the computation of the
\acp{AO} on a grid for \ac{DFT} calculations. In this context, we are
initiating a collaboration with the Qubit Pharmaceuticals company,
which is interested in using QMCkl in their software.
\subsubsection{Feedback from HPC experts}
There has been a strong interaction between WP1 and WP3 for the
development of high-performance versions of the library. In this
collaboration, some important feedback has been given for redesigning
the internal data-structures, to facilitate the development of
high-performance functions.
For example, the creation of QMCkl-dgemm library needed us to make abstract the
definition of vectors, matrices and tensors in QMCkl such that standard
arrays are used in the absence of QMCkl-dgemm, but more complex data
structures such as tiled arrays can be used in HPC-optimized variants.
Another important feedback was the importance of having a simple and clear
implementation of algorithms such as the one proposed in this \ac{WP}.
The GPU experts first started by porting to GPU the \ac{MO} kernels
using the CPU-optimized HPC versions developed in WP3. The optimized CPU code
is much more complicated because it takes advantage of pointer arithmetic to help
CPU compilers produce efficient code. GPU compilers are not yet mature, and simple
patterns are better understood to produce efficient GPU code. Facing difficulties
for the \ac{MO} kernels, the HPC experts went back to the simple kernels
proposed in this \ac{WP}, and the GPU porting was less complicated.
\subsubsection{Feedback from users}
We have started the integration of QMCkl in QMC=Chem, CHAMP and
TurboRVB. For each code, unexpected problems appeared that required some
adjustments in the \ac{API}. For example, TurboRVB expects the basis functions
to be decontracted and expressed with spherical harmonics.
When using QMCkl in CHAMP, we realized that only a subset of \acp{MO}
needed to be computed. QMC=Chem doesn't need double precision numbers for the
\acp{AO} and \acp{MO}. Now that we have received feedback from three different
codes, we believe that a large amount of the use cases have been explored and that the
next users will be able to integrate QMCkl into their code without too much
effort. It should be noted that significant speedups were observed when using
the HPC-optimized version of the library developed in WP3. This will be detailed
in a WP3 report.
\section{Conclusion}
Today, the TREX QMC codes are taking advantage of QMCkl and have
observed a significant speedup: up to $1.5\times$ for TurboRVB where the computation of AOs was introduced, and $3.5\times$ for CHAMP where both the AOs and MOs are now computed with QMCkl.
In addition, QMCkl can be used beyond the QMC framework,
for more conventional methods such as \ac{DFT}. There is still more work to be
done in the library, such as the implementation of additional kernels for
periodic systems, different forms of Jastrow factors, or an efficient computation of the
adjugate of a large sparse matrix.
From the point of view of the codes, some modifications still have to be done to better use
the library. Indeed, the library performs best when a large number of points is given at once, and right now the codes make one call of the library per walker.
Finally, additional benchmarks need to be written to test the numerical accuracy of each kernel using Verificarlo.
This will enable increasing the performance of the library by reducing the expected precision in a controlled way.