\section{Integration Tools}
\label{sec:integration}
\subsection{Darshan}
Darshan is divided into two main parts: 1) \emph{darshan-runtime}, which contains the instrumentation for the characterization tool and produces a single log file at the end of each execution summarizing the I/O access patterns used by the application~\cite{darshan-doc}, and 2) \emph{darshan-util}, which analyzes the log files produced by darshan-runtime~\cite{darshan-doc}. The \Darshan{} focuses on \emph{darshan-runtime}~\cite{darshan-runtime}, as this is where Darshan records the I/O event data.

Darshan tracks the start, duration, and end time of an application run via the C function \emph{clock\_gettime()}, converts the result into seconds, and passes it to a struct that is later used to generate the summary log files~\cite{darshangithub}. Therefore, in order to retrieve the \emph{absolute timestamp} and include it in the I/O event data at run time, a time struct pointer was added to the function call that uses \emph{clock\_gettime()} in \emph{darshan-runtime}. This pointer was passed through all of Darshan's modules so the \emph{absolute timestamp} could be collected. This was the preferred method as it required minimal changes to Darshan's source code and introduced no additional overhead or latency between the function call and the recording of the \emph{absolute timestamp}.
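
A minimal sketch of this idea is shown below (a hypothetical helper, not Darshan's actual source): the wall-clock time is read once with \emph{clock\_gettime()}, converted to seconds as before, and the raw \texttt{struct timespec} is additionally handed back through the added pointer so the \emph{absolute timestamp} is captured without a second clock read.
\begin{verbatim}
#include <time.h>

/* Hypothetical helper: returns wall-clock time in seconds and,
 * if tspec is non-NULL, also stores the raw absolute timestamp. */
static double get_wtime_and_abs(struct timespec *tspec)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);   /* absolute (epoch) time */
    if (tspec)
        *tspec = ts;                      /* pass timestamp back to caller */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1.0e9;
}
\end{verbatim}
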
\begin{figure}
\centering
%\includegraphics[trim=0.2cm 8.5cm 12cm 0cm ,clip,width=0.9\linewidth]{figs/fig_kokkos_ldms.png}
\includegraphics[trim=0 1cm 1 1.5cm ,clip,width=1.2\linewidth]{figs/ldms-overview.png}
\caption{Overview of the LDMS event data collection. Application data is \emph{pushed} to the \emph{LDMS Streams} interface, transported over LDMS to the LDMS aggregator (right), and delivered to the streams subscriber (Tag A), where it is stored to a database.}
\label{f:LDMS Overview}
\end{figure}
\subsection{LDMS Streams}
\emph{LDMSD} refers to an LDMS daemon that provides the capability of data
collection, transport, and/or storage. An \emph{LDMSD's} configuration,
including plugins, determines its functionality and capabilities~\cite{ldmsgithubwiki}.
Daemons on the compute nodes run sampler plugins, and transport is achieved
through multi-hop \emph{aggregation}. In this work we utilized two levels of
LDMS aggregator daemons~\cite{ldmsgithubwiki}, with the second level using
the DSOS storage plugin to store the I/O event data.
%which can utilize storage plugins to store any sets of data into various
%formats so long as it's specified beforehand.
\begin{figure}
\centering
%\includegraphics[trim=0.2cm 8.5cm 12cm 0cm ,clip,width=0.9\linewidth]{figs/fig_kokkos_ldms.png}
\includegraphics[trim={3.5cm 0 0 0},clip,
width=1.15\linewidth]{figs/darshan-connector.png}
\caption{Overview of the \connector{} design and how it collects I/O data for each read, write, open, and close event per rank from Darshan. The LDMS library must be linked into the Darshan build in order to utilize the \emph{LDMS Streams} functionality and store plugins.}
\label{f:Darshan Connector}
\end{figure}
The \Darshan{} leverages the LDMS transport to support the injection and transport
of application I/O data. This requires a \emph{push-based} method to reduce the
memory consumed and the risk of data loss on the node, as well as to reduce the latency
between the time an event occurs and the time it is recorded. A \emph{pull-based}
method, in contrast, would require buffering to hold an unknown number of events between
subsequent pulls. The transported data format also needs to support \emph{variable-length}
events because the I/O event data varies in size.
We created an I/O-targeted LDMS streams store that utilizes the LDMS publish-subscribe
bus capability, \emph{LDMS Streams}, to publish I/O event data. \emph{LDMS Streams}
is intended for publishing and subscribing to an \emph{LDMS streams tag}.
%This leads to the LDMS \emph{publish-subscribe bus} capability, \emph{LDMS Streams}, which has been enhanced to support I/O event data. This capability is intended for publishing and subscribing to an \emph{LDMS streams tag}.
This tag needs to be specified in the LDMS daemons and \emph{plugins} in order to publish event data to \emph{LDMS Streams} and to receive the published \emph{LDMS Streams} data that match the tag. This process and the \emph{push-based} method can be seen in Figure~\ref{f:LDMS Overview}. Event data can be specified in either \texttt{string} or \texttt{JSON} format. The \emph{LDMS Streams} API was modified to support long application connections and message injections. \emph{LDMS Streams} uses best-effort delivery without reconnect or resend and does not cache its data, so published data can only be received after subscription. \emph{LDMS Streams} enables data
from any source to be injected into the LDMS transport.
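
To make the message flow concrete, the sketch below formats a small I/O event as a JSON string and hands it to a publish routine tagged with a stream name. This is an illustration only: \texttt{stream\_publish()} is a hypothetical stand-in for the \emph{LDMS Streams} publish call, and \texttt{"darshan\_stream"} is a placeholder tag.
\begin{verbatim}
#include <stdio.h>

/* Hypothetical stand-in for the LDMS Streams publish call; the real
 * API name and the stream tag used by the connector may differ. */
extern int stream_publish(const char *stream_tag,
                          const char *data, size_t len);

void publish_io_event(int rank, const char *op, double timestamp)
{
    char msg[256];
    /* Format the event as JSON, then push it to the stream tag that
     * the subscribing LDMS daemon is configured to match. */
    int n = snprintf(msg, sizeof(msg),
                     "{\"rank\":%d,\"op\":\"%s\",\"timestamp\":%.6f}",
                     rank, op, timestamp);
    stream_publish("darshan_stream", msg, (size_t)n);
}
\end{verbatim}
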
\subsection{Darshan Connector}
The most recent version of Darshan allows for full tracing of application I/O workloads using its DXT instrumentation module, which can be enabled and disabled as desired at runtime. DXT provides high-fidelity traces of an application's I/O workload, as opposed to Darshan's traditional I/O summary data, and currently traces the POSIX and MPI-IO layers~\cite{darshan-runtime}. Our design leverages the additional I/O tracing that Darshan's DXT module provides through the new \connector{} capability.
The \connector{} functionality collects both DXT data and Darshan's original I/O
data and optionally publishes a message in JSON format to the \emph{LDMS Streams}
interface, as seen in Figure~3. The \emph{absolute timestamp} is also included in
this message under the name \texttt{"timestamp"}. LDMS then transports the I/O event
data across the aggregators and stores it in a \emph{DSOS} database, where Grafana
can access and query it. The \connector{} currently uses a single unique
\emph{LDMS streams tag} for this data source. For the file-level access types
that DXT does not trace, or that have different name-value pairs, a value of \texttt{"N/A"}
or \texttt{"-1"} is given in the JSON message. For example, Darshan's POSIX module
(POSIX is an IEEE standard that establishes a set of guidelines for compatibility
and portability between operating systems) traces the highest offset byte read and
written per operation (e.g., \texttt{"max\_byte"}) and the number of times access
alternated between reads and writes (e.g., \texttt{"switches"}). The I/O operations shown in
Figure~\ref{f:CSV Header and Output} are \emph{"open"} and \emph{"close"}, which
are not applicable to these traces. Therefore, \texttt{"max\_byte"}
and \texttt{"switches"} are not reported for these operations, and their corresponding JSON
values are set to \texttt{"-1"} by the \connector{}.
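
The following hypothetical helper (not the actual \connector{} code) illustrates this placeholder convention: counters such as \texttt{"switches"} and \texttt{"max\_byte"} are emitted with their traced values for read and write records but fall back to the \texttt{"-1"} sentinel for operations where they are not reported, so every JSON message keeps the same set of keys.
\begin{verbatim}
#include <stdio.h>

/* Hypothetical sketch: "switches" and "max_byte" only apply to
 * read/write records, so open/close messages carry the "-1"
 * sentinel instead of dropping the keys. */
void append_posix_counters(char *buf, size_t len, const char *op,
                           long switches, long max_byte)
{
    int traced = (op[0] == 'r' || op[0] == 'w');  /* read or write */
    snprintf(buf, len, "\"switches\":%ld,\"max_byte\":%ld",
             traced ? switches : -1L,
             traced ? max_byte : -1L);
}
\end{verbatim}
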
Darshan has a large number of metrics that it uses for I/O tracing and
post-processing calculations. Currently, our framework
collects a subset of these metrics to publish to \emph{LDMS Streams},
as presented in Figure~\ref{f:CSV Header and Output}. These metrics
provide the ability to create new I/O behavior analyses and
visualizations that offer further insight into application I/O behavior
and reveal correlations between I/O performance variability and system behavior.
\begin{figure}
\centering
%\includegraphics[trim=0.2cm 8.5cm 12cm 0cm ,clip,width=0.9\linewidth]{figs/fig_kokkos_ldms.png}
\includegraphics[trim={0 3cm 0 0},clip, width=1.01\linewidth]{figs/darshan-csv-json.png}
\caption{Output of an MPI-IO Darshan test run in JSON
format (top image) and the CSV file header (bottom).
%The JSON message is published to the \emph{LDMS Streams}
%interface where it is then converted to CSV and stored to
%\emph{DSOS}.
The \texttt{name:value} pairs in light blue
indicate the metadata stored, while the light purple indicates
file-level access data not applicable to POSIX.
%and are given the values of \texttt{"N/A"} or \texttt{"-1"}.
The \texttt{"seg"} field is a list containing multiple \texttt{name:value} pairs.}
\label{f:CSV Header and Output}
\end{figure}
Table~\ref{table:metrics} lists the names and definitions of each
metric in the JSON message. Depending on the \texttt{"type"} input, the
absolute directories of the Darshan file output and of the executable are
recorded and published to \emph{LDMS Streams}. If \texttt{"type"} is
set to \texttt{"MET"} (i.e., "meta"), the absolute directories are
recorded; otherwise, if it is set to \texttt{"MOD"} (i.e., "module"),
these fields receive the value \texttt{"N/A"}. The \texttt{"type"} is set to
\texttt{"MET"} for open I/O events, which carry the Darshan I/O record
values that remain constant during the application execution, such as the
rank, file, and node name. The \texttt{"type"} is set to \texttt{"MOD"}
for all other I/O events to reduce the message size and latency when
sending the data through an HPC production system pipeline.
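
The sketch below illustrates this selection with a hypothetical helper (not the actual \connector{} implementation): path fields such as \texttt{"exe"} are filled only when the static \texttt{"MET"} message is built for an open event, and are replaced with \texttt{"N/A"} in the per-operation \texttt{"MOD"} messages.
\begin{verbatim}
#include <stdio.h>

/* Hypothetical sketch of the MET/MOD convention: absolute paths are
 * only published in the static "MET" (meta) message; per-operation
 * "MOD" (module) messages carry "N/A" to keep them small. */
void format_type_fields(char *buf, size_t len, int is_open_event,
                        const char *exe_path)
{
    const char *type = is_open_event ? "MET" : "MOD";
    const char *exe  = is_open_event ? exe_path : "N/A";
    snprintf(buf, len, "\"type\":\"%s\",\"exe\":\"%s\"", type, exe);
}
\end{verbatim}
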
\begin{table*}[h]
\centering
\resizebox{\textwidth}{!}{
\begin{tabular}{|l|l|}
\hline
\texttt{uuid} & User ID of the job run\\ \hline
\texttt{exe} & Absolute directory of the application executable\\ \hline
\texttt{module} & Name of the Darshan module data being collected\\ \hline
\texttt{ProducerName} & Name of the compute node the application is running on\\ \hline
\texttt{switches} & Number of times access alternated between read and write\\ \hline
\texttt{file} & Absolute directory of the filename where the operations are performed\\ \hline
\texttt{rank} & Rank of the process performing the I/O operation\\ \hline
\texttt{flushes} & Number of "flush" operations: HDF5 file flush
                   operations for the H5F module and dataset flush
                   operations for the H5D module\\ \hline
\texttt{record\_id} & Darshan file record ID of the file the dataset belongs to\\ \hline
\texttt{max\_byte} & Highest offset byte read and written per operation\\ \hline
\texttt{type} & The type of JSON data being published: \texttt{MOD} for gathering module data or \texttt{MET} for gathering static meta data\\ \hline
\texttt{job\_id} & The Job ID of the application run\\ \hline
\texttt{op} & Type of operation being performed (i.e. read, write, open, close)\\ \hline
\texttt{cnt} & The count of the operations performed per module per rank. Resets to 0 after each "close" operation\\ \hline
\texttt{seg} & A list containing metrics names per operation per rank\\ \hline
\texttt{seg:pt\_sel} & HDF5 number of different access selections\\ \hline
\texttt{seg:dur} & Duration of each operation performed for
the given rank (i.e. a rank takes "X" time to perform a r/w/o/c operation)\\ \hline
\texttt{seg:len} & Number of bytes read/written per operation per rank\\ \hline
\texttt{seg:ndims} & HDF5 number of dimensions in dataset's dataspace\\ \hline
\texttt{seg:reg\_hslab} & HDF5 number of regular hyperslabs\\ \hline
\texttt{seg:irreg\_hslab} & HDF5 number of irregular hyperslabs\\ \hline
\texttt{seg:data\_set} & HDF5 dataset name\\ \hline
\texttt{seg:npoints} & HDF5 number of points in dataset's dataspace\\ \hline
\texttt{seg:timestamp} & End time of given operation per rank (in epoch time)\\ \hline
\end{tabular}}
\caption{Metrics defined in the JSON message published to \emph{LDMS Streams} by the \connector{}.}
\label{table:metrics}
\end{table*}
%\section{Additional Components}
%This section will provide a high level overview on how and why the \Darshan approach implemented the DSOS and Grafana tools for storage, analysis and visualization.
\subsection{Storage: DSOS Database}
DSOS is built on the Scalable Object Store (SOS) database~\cite{sosgithub} and is
intended to address the domain-specific needs of large-scale HPC monitoring. It was
chosen as the preferred monitoring database at Sandia because it supports interaction
via a command-line interface, which enables fast query testing and data examination.
DSOS also provides scalable data ingest and the ability to query large volumes of data,
both of which are required for the amount of data being ingested and stored.
%However, a different storage that had similar capabilities could be used instead.
To sort through the published \emph{LDMS Streams} data, combinations of the job ID, rank,
and timestamp are used to create joint indices, where each index provides different
query performance. An example is the \texttt{job\_rank\_time} index, which
orders the data by job, rank, and timestamp. This index can provide better performance
when searching for a specific job within the database.
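
As a simple illustration of the ordering such a joint index imposes (this is not the DSOS implementation), the comparator below sorts event records by job ID first, then rank, then timestamp, which is the ordering that \texttt{job\_rank\_time} provides:
\begin{verbatim}
#include <stdlib.h>

struct io_event { long job_id; int rank; double timestamp; };

/* Order records by (job_id, rank, timestamp), mirroring the ordering
 * a joint job_rank_time index imposes on the stored data. */
static int cmp_job_rank_time(const void *a, const void *b)
{
    const struct io_event *x = a, *y = b;
    if (x->job_id != y->job_id)
        return (x->job_id > y->job_id) - (x->job_id < y->job_id);
    if (x->rank != y->rank)
        return x->rank - y->rank;
    return (x->timestamp > y->timestamp) - (x->timestamp < y->timestamp);
}

/* Usage: qsort(events, n, sizeof(struct io_event), cmp_job_rank_time); */
\end{verbatim}
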
\subsection{Analysis and Visualization: HPC Web Services}
The HPC Web Services~\cite{ClusterAV} is an infrastructure that provides the
analysis and visualization components of our approach.
%Any data queries start from a front-end application and transferred to a back-end application that are running on an HPC cluster. In this case
The front-end for the service website is Grafana~\cite{grafana-website}, and the
back-end consists of Python analysis modules. The HPC Web Services also provide
instant analysis, in which data can be analyzed and viewed in real time as opposed
to the traditional method of querying previously analyzed results from a separate database.
\begin{figure}
\centering
%\includegraphics[trim=0.2cm 8.5cm 12cm 0cm ,clip,width=0.9\linewidth]{figs/fig_kokkos_ldms.png}
\includegraphics[trim=3cm 2cm 0cm 0cm, clip,width=1.2\linewidth]{figs/darshan-integration.png}
\caption{Overview of the \Darshan{}, where the \connector{} intercepts the I/O behavior that Darshan is collecting and utilizes the various tools to publish, store, analyze, and view runtime I/O behavior.}
\label{f:FrameworkOverview}
\end{figure}
Grafana is an open-source visualization application that provides various charts,
graphs, and alerts for supported data sources. It supports multiple data formats
but is best suited for time-series data. It has storage plugins for many database
technologies in order to query and render data from multiple data sources. The
\Darshan{} implemented a storage plugin for the DSOS database in order to query this
data and visualize it on the Grafana web interface~\cite{grafana-website}
using the HPC Web Services infrastructure. An overview of this integration
can be seen in Figure~\ref{f:FrameworkOverview}.
Python analysis modules are used to produce meaningful visualizations of the data queried
from the DSOS database. With these modules, the queried data is converted into a
pandas DataFrame to allow complex calculations, transformations,
and aggregations to be applied more easily. The type of analysis module is specified in the Grafana
web interface. Applying these Python modules to our \Darshan{} data will demonstrate
how runtime I/O data can provide further insight into application
I/O behavior, patterns, and performance variability, and into any correlations these may have with
the behavior of system components.
%\RED{
%\begin{itemize}
%\item Explain how LDMS is integrated into Darshan. Give an overview on how we implement LDMS Streams into the Darshan DXT section to collect run time I/O data and push to a JSON file that then gets aggregated by LDMS and stored to SOS. Then explain how this stored data is then queried and displayed in a Grafana dashboard.
%\item Add a pic of current Darshan LDMS integration setup (.png)
%\end{itemize}