Autotuning under Tight Budget Constraints: A Transparent Design of Experiments Approach
A large amount of resources is spent writing, porting, and optimizing scientific and industrial High Performance Computing applications, which makes autotuning techniques fundamental to lowering the cost of leveraging the improvements in execution time and power consumption provided by the latest software and hardware platforms. Despite the need for economy, most autotuning techniques
still require large budgets of costly experimental measurements to provide good
results, while rarely providing exploitable knowledge after optimization. The
contribution of this paper is a user-transparent autotuning technique based on
Design of Experiments that operates under tight budget constraints by
significantly reducing the measurements needed to find good optimizations. Our
approach enables users to make informed decisions on which optimizations to
pursue and when to stop. We present an experimental evaluation of our approach
and show it is capable of leveraging user decisions to find the best global
configuration of a GPU Laplacian kernel using half of the measurement budget
used by other common autotuning techniques. We show that our approach is also
capable of finding speedups of up to $50\times$, compared to \texttt{gcc}'s \texttt{-O3}, for some kernels from the SPAPT benchmark suite, using up to $10\times$ fewer measurements than random sampling.
Optimizing code for objectives such as performance and power consumption is fundamental to the success and cost-effectiveness of industrial and scientific endeavors in High Performance Computing (HPC). A considerable amount of highly specialized time and effort is spent in porting and optimizing code for GPUs, FPGAs and other hardware accelerators. Experts are also needed to leverage bleeding edge software improvements in compilers, languages, libraries and frameworks. The objective of techniques for the automatic configuration and optimization of HPC applications, or autotuning, is to decrease the cost and time needed to adopt efficient hardware and software. Typical autotuning targets include algorithm selection, source-to-source transformations and compiler configuration.
Autotuning can be studied as a search problem where the objective is to minimize software or hardware metrics. The exploration of the search spaces defined by code and compiler configurations and optimizations presents interesting challenges. Such spaces grow exponentially with the number of parameters, and are also difficult to explore extensively due to the often prohibitive costs of hardware utilization, program compilation and execution times. Developing autotuning strategies capable of producing good optimizations while minimizing resource utilization is therefore essential. The capability of acquiring knowledge about an optimization problem is also crucial in an autotuning strategy, since this knowledge can decrease the cost of subsequent optimizations of the same application or for the same hardware.
It is common and usually effective to use search meta-heuristics such as genetic algorithms and simulated annealing in autotuning. These strategies attempt to exploit local search space properties, but are generally incapable of exploiting global structures. Seymour et al. \cite{seymour2008comparison}, Knijnenburg et al. \cite{knijnenburg2003combined}, and Balaprakash et al. \cite{balaprakash2011can,balaprakash2012experimental} report that these strategies are not more effective than a naive uniform random sample of the search space, and usually rely on a large number of measurements or restarts to achieve performance improvements. Search strategies based on gradient descent are also commonly used in autotuning, and also rely on a large number of measurements, but their effectiveness diminishes significantly in search spaces with complex local structures. Automated machine learning autotuning strategies \cite{beckingsale2017apollo,falch2017machine,balaprakash2016automomml} are promising for building models for predicting important parameters, but still rely on a sizable data set for training.
In summary, search strategies based on meta-heuristics, gradient descent and machine learning require a large number of measurements to be effective, and are usually incapable of providing knowledge about search spaces to users. Since these strategies are not transparent, at the end of each autotuning session it is difficult to decide if and where further exploration is warranted, and often impossible to know which parameters are responsible for the observed improvements. After exploring a search space, deducing any of the space’s global properties confidently is impossible, since the space was automatically explored with unknown biases.
The contribution of this paper is an autotuning strategy that leverages existing knowledge about a problem by using an initial performance model that is refined iteratively using performance measurements, statistical analysis, and user input. Our strategy places a heavy weight on decreasing autotuning costs by using a Design of Experiments (DoE) methodology to minimize the number of experiments needed to find optimizations. Each iteration uses Analysis of Variance (ANOVA) tests and linear model regressions to identify promising subspaces and parameter significance. An architecture- and problem-specific performance model is built iteratively and with user input, which enables making informed decisions about which regions of the search space are worth exploring.
We evaluate the performance of our approach by optimizing a Laplacian Kernel for
GPUs, where the search space, the global optimum, and a performance model
approximation are known. The budget of measurements was tightly constrained in this experiment. The speedups and the reduction in budget utilization achieved by our approach in this setting motivated a more comprehensive performance evaluation.
We chose the Search Problems in Automatic Performance Tuning
(SPAPT) \cite{balaprakash2012spapt} benchmark suite for this evaluation, where
we obtained diverse results. Out of the 17 SPAPT kernels benchmarked, no speedup
could be found for three kernels, but uniform random sampling performed well on
all others. For eight of the kernels, our approach found speedups of up to $50\times$, compared to \texttt{gcc}'s \texttt{-O3} with no code transformations, while using up to $10\times$ fewer measurements than random sampling.
The rest of this paper is organized as follows. Section Background presents related work on source-to-source transformation, which is the main optimization target in SPAPT kernels, on autotuning systems and on search space exploration strategies. Section Autotuning with Design of Experiments discusses the implementation of our approach in detail. Section Design of Experiments discusses the DoE, ANOVA, and linear regression methodology we used to develop our approach. Section Performance Evaluation presents the results on the performance evaluation on the GPU Laplacian Kernel and on the SPAPT benchmark suite. Section Conclusion discusses our conclusions and future work.
This section presents the background and related work on source-to-source transformation, autotuning systems and search space exploration strategies.
Our approach can be applied to any autotuning domain that expresses optimization as a search problem, although the performance evaluations we present in Section Performance Evaluation were obtained in the domain of source-to-source transformation. Several frameworks, compilers and autotuners provide tools to generate and optimize architecture-specific code \cite{hartono2009annotation,videau2017boast,tiwari2009scalable,yi2007poet,ansel2009petabricks}. We used BOAST \cite{videau2017boast} and Orio \cite{hartono2009annotation} to perform source-to-source transformations targeting parallelization on CPUs and GPUs, vectorization, loop transformations such as tiling and unrolling, and data structure size and copying.
John Rice's Algorithm Selection framework \cite{rice1976algorithm} is the precursor of autotuners in various problem domains. In 1997, the PHiPAC system \cite{bilmes1997optimizing} used code generators and search scripts to automatically generate high performance code for matrix multiplication. Since then, systems have approached different domains with a variety of strategies. Dongarra et al. \cite{dongarra1998automatically} introduced the ATLAS project, which optimizes dense matrix multiplication routines. The OSKI \cite{vuduc2005oski} library provides automatically tuned kernels for sparse matrices. The FFTW \cite{frigo1998fftw} library provides tuned C subroutines for computing the Discrete Fourier Transform. Periscope \cite{gerndt2010automatic} is a distributed online autotuner for parallel systems and single-node performance. In an effort to provide a common representation of multiple parallel programming models, the INSIEME compiler project \cite{jordan2012multi} implements abstractions for OpenMP, MPI and OpenCL, and generates optimized parallel code for heterogeneous multi-core architectures.
A different approach is to combine generic search algorithms and problem representation data structures in a single system that enables the implementation of autotuners for different domains. The PetaBricks \cite{ansel2009petabricks} project provides a language, compiler and autotuner, enabling the definition and selection of multiple algorithms for the same problem. The ParamILS framework \cite{hutter2009paramils} applies stochastic local search algorithms to algorithm configuration and parameter tuning. The OpenTuner framework \cite{ansel2014opentuner} provides ensembles of techniques that search the same space in parallel, while exploration is managed by a multi-armed bandit strategy.
Figure fig:sampling_comparison shows the contour of a search space defined by a function of two factors, and compares samples of this space drawn by different exploration strategies.
Measuring the kernels optimized in the performance evaluations of Section Performance Evaluation can take over 20 minutes per configuration, including the time for code transformation, compilation, and execution. Measurements in other problem domains can take much longer to complete. This strengthens the motivation to consider search space exploration strategies capable of operating under tight budget constraints. Such strategies have been developed and improved by statisticians for a long time, and are collectively referred to as DoE.
The D-Optimal sampling strategies shown on the two rightmost bottom panels of Figure fig:sampling_comparison are based on the DoE methodology, and leverage previous knowledge about search spaces for an efficient exploration. These strategies provide transparent analyses that enable focusing on interesting subspaces. In the next section we describe our approach to autotuning based on the DoE methodology.
An experimental design determines a selection of experiments whose objective is to identify the relationships between factors and responses. While factors and responses can refer to different concrete entities in other domains, in computer experiments factors can be configuration parameters for algorithms and compilers, and responses can be the execution time or memory consumption of a program. Each possible value of a factor is called a level. The effect of a factor on the measured response, without the factor’s interactions with other factors, is the main effect of that factor. Experimental designs can be constructed with different goals, such as identifying the main effects or building an analytical model for the response.
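To make these terms concrete, the following sketch builds a full factorial design for two hypothetical 2-level factors (the factor names and measured times are invented for illustration) and computes the main effect of one factor as the difference between the mean responses at its high and low levels.

# Illustrative only: two hypothetical 2-level factors and made-up measurements
design <- expand.grid(unroll = c(-1, 1), vectorize = c(-1, 1))
design$time <- c(10.2, 7.9, 9.8, 7.5)  # hypothetical response measurements
# Main effect of "unroll": difference between mean responses at its two levels
main_effect_unroll <- mean(design$time[design$unroll == 1]) -
                      mean(design$time[design$unroll == -1])
main_effect_unroll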
In this section we discuss in detail our iterative DoE approach to autotuning. Figure fig:doe_anova_strategy presents an overview of our approach, with numbered steps. In step 1 we define the factors and levels that compose the search space of the target problem, in step 2 we select an initial performance model, and in step 3 we generate an experimental design. We run the experiments in step 4 and then, as we discuss in the next section, we identify significant factors with an ANOVA test in step 5. This enables selecting and fitting a new performance model in steps 6 and 7. The new model is used in step 8 for predicting levels for each significant factor. We then go back to step 3, generating a new design for the new problem subspace with the remaining factors. Informed decisions made by the user at each step guide the outcome of each iteration.
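As a rough illustration of how these steps can be chained, the sketch below outlines one iteration of this loop in R. It is a schematic, not our exact implementation: search_space is a hypothetical data frame of candidate configurations, run_experiments() is a hypothetical helper that compiles and measures each selected configuration, and the model formulas are placeholders.

library(AlgDesign)
# Schematic of one iteration of the approach in Figure fig:doe_anova_strategy
model <- ~ x1 + x2 + I(x2^2) + x3                            # step 2: initial model terms
design <- optFederov(model, data = search_space,             # step 3: D-Optimal design
                     nTrials = 20)$design
design$Y <- run_experiments(design)                          # step 4: measure (hypothetical helper)
anova_fit <- aov(update(model, Y ~ .), data = design)        # step 5: ANOVA test
summary(anova_fit)                                           # user inspects factor significance
refined <- Y ~ x2 + I(x2^2) + x3                             # step 6: keep only significant terms
fit <- lm(refined, data = design)                            # step 7: fit the refined model
predicted <- predict(fit, newdata = search_space)            # step 8: predict over candidates
best <- search_space[which.min(predicted), ]
# Fix the predicted levels and restrict search_space before the next iteration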
Step 1 of our approach is to define target factors and which of their levels are worth exploring. Then, the user must select an initial performance model in step 2. Compilers typically expose many 2-level factors in the form of configuration flags, and the performance model for a single flag can only be a linear term, since there are only 2 values to measure. Interactions between flags and numerical factors such as block sizes in CUDA programs or loop unrolling amounts are also common. Deciding which levels to include for these kinds of factors requires more careful analysis. For example, if we suspect the performance model has a quadratic term for a certain factor, the design should include at least three factor levels. The ordering between the levels of other compiler parameters, such as \texttt{-O(0,1,2,3)}, is not obviously translated to a number. Factors like these are named categorical, and must be treated differently when constructing designs in step 3 and analyzing results in step 5.
We decided to use D-Optimal designs because their construction techniques enable mixing categorical and numerical factors in the same screening design, while biasing sampling according to the performance model. This enables the autotuner to exploit global search space structures if we use the right model. When constructing a D-Optimal design in step 3 the user can require that specific points in the search space are included, or that others are not. Algorithms for constructing D-Optimal designs are capable of adapting to these requirements by optimizing a starting design. Before settling on D-Optimal designs, we explored other design construction techniques such as the Plackett-Burman \cite{plackett1946design} screening designs shown in the next section, the contractive replacement technique of Addelman-Kempthorne \cite{addelman1961some} and the direct generation algorithm by Grömping and Fontana \cite{ulrike2018algorithm}. These techniques have strong requirements on design size and level mixing, so we opted for a more flexible technique that would enable exploring a more comprehensive class of autotuning problems.
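As a small sketch of this flexibility, the code below mixes a numerical factor with a categorical compiler flag in the same candidate set, removes configurations assumed to be invalid, and constructs a D-Optimal design with AlgDesign's optFederov; the factor names, the validity rule, and the model are invented for illustration.

library(AlgDesign)
# Hypothetical mixed search space: a numeric tiling factor and a categorical -O flag
candidates <- expand.grid(tile = seq(2, 64, by = 2),
                          opt  = factor(c("-O0", "-O1", "-O2", "-O3")))
# Suppose domain knowledge tells us large tiles are invalid at -O0 (illustrative rule)
candidates <- subset(candidates, !(opt == "-O0" & tile > 32))
# D-Optimal design for a model with a quadratic term in tile and a categorical term
design <- optFederov(~ tile + I(tile^2) + opt,
                     data = candidates, nTrials = 12)$design
design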
After the design is constructed in step 3, we run each selected experiment in step 4. This step can run in parallel since experiments are independent. Not all target programs run successfully in their entire input range, making runtime failures common in this step. The user can decide whether to construct a new design using the successfully completed experiments or to continue to the analysis step if enough experiments succeed.
After running the ANOVA test in step 5, the user should apply domain knowledge to analyze the ANOVA table and determine which factors are significant. Certain factors might not appear significant and should not be included in the regression model. Selecting the model after the ANOVA test in step 6 also benefits from domain knowledge.
A central assumption of ANOVA is the homoscedasticity of the response, which can be interpreted as requiring the observed error on measurements to be independent of factor levels and of the number of measurements. Fortunately, there are statistical tests and corrections for lack of homoscedasticity. Our approach uses the homoscedasticity check and correction by power transformations from the \texttt{car} package \cite{fox2011car} of the R language.
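As a minimal sketch, assuming a previously fitted lm() model named fit and its data frame data with a strictly positive response Y, such a check and correction could look like:

library(car)
ncvTest(fit)                       # score test for non-constant error variance
lambda <- powerTransform(fit)      # estimate a Box-Cox power transformation
summary(lambda)
# If the test suggests heteroscedasticity, refit on the transformed response
data$Y_t <- bcPower(data$Y, coef(lambda, round = TRUE))
fit_t <- update(fit, Y_t ~ ., data = data)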
We fit the selected model to our design's data in step 7, and use the fitted model in step 8 to find levels that minimize the response. The choice of the method used to find these levels depends on factor types and on the complexity of the model and search space. If factors have discrete levels, neighborhood exploration might be needed to find the valid levels that minimize the response around the predicted levels. Constraints might also place predicted levels in an undefined or invalid region of the search space. This presents a challenge, because the borders of valid regions would have to be explored.
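The sketch below illustrates one simple way to handle discrete levels, assuming a fitted model fit whose only predictor is a hypothetical factor threads with power-of-two levels: the continuous prediction is snapped to the nearest valid levels, and the fitted model is evaluated on that small neighborhood.

# Hypothetical: valid levels of a discrete factor (e.g., thread counts)
valid_levels <- 2^(0:10)
predicted <- 93.7   # continuous level suggested by minimizing the fitted model
# Snap to the nearest valid levels and evaluate the model on this neighborhood
neighborhood <- valid_levels[order(abs(valid_levels - predicted))][1:3]
candidates <- data.frame(threads = neighborhood)
candidates$predicted_Y <- predict(fit, newdata = candidates)
best <- candidates[which.min(candidates$predicted_Y), ]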
In step 8 we also fix factor levels to those predicted to achieve the best performance. The user can decide the level of trust placed on the prediction at this step, for example by keeping other levels available. This step therefore reduces the dimension of the problem by eliminating factors and decreasing the size of the search space. If we identified significant parameters correctly, we will have restricted further search to better regions. In the next section we present a simple fictional application of our approach that illustrates the fundamentals of the DoE methodology, screening designs, and D-Optimal designs.
In this section we first present the assumptions of a traditional DoE methodology using an example of 2-level screening designs, an efficient way to identify main effects. We then discuss techniques for the construction of efficient designs for factors with arbitrary numbers and types of levels, and present D-Optimal designs, the technique used in this paper.
Screening designs parsimoniously identify the main effects of 2-level factors in the initial stages of studying a problem. While interactions are not considered at this stage, identifying main effects early enables focusing on a smaller set of factors in subsequent experiments. An especially efficient construction technique for screening designs was presented by Plackett and Burman \cite{plackett1946design} in 1946, and is available in the FrF2 package \cite{gromping2014frf2} of the R language \cite{team2018rlanguage}. Despite having strong restrictions on the number of factors supported, Plackett-Burman designs enable the identification of the main effects of up to $n - 1$ two-level factors using only $n$ experiments, where $n$ must be a multiple of 4.
Assuming a linear relationship between factors and the response is fundamental for running ANOVA tests using a Plackett-Burman design. Consider the linear relationship
\begin{equation}
Y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon,
\label{eq:linear_assumption}
\end{equation}
where $Y$ is the measured response, $x_1, \dots, x_k$ are the factor levels, $\beta_0, \dots, \beta_k$ are the coefficients to be estimated, and $\varepsilon$ is the experimental error.
The next example illustrates the screening methodology. Suppose we wish to minimize a performance metric $Y$ that depends on eight factors $x_1, \dots, x_8$, each taking values in the interval $[-1, 1]$, and that the true relationship, unknown to the experimenter, is
\[
Y = -1.5x_1 + 1.3x_3 + 1.6x_1x_3 + 1.35x_8^2 + 3.1x_5 - 1.4x_7 + \varepsilon,
\]
where $\varepsilon$ is Gaussian measurement noise. Suppose also that, for the purpose of this example, the computation of $Y$ is done by a very expensive black-box procedure. Note that factors $x_2$, $x_4$, and $x_6$ have no effect on the response, and that $x_8$ contributes only through a quadratic term.
To study this problem efficiently we decided to construct a Plackett-Burman
design, which minimizes the experiments needed to identify significant factors.
The analysis of this design will enable decreasing the dimension of the problem.
Table \ref{tab:plackett} presents the Plackett-Burman design we generated. It contains high and low values for each factor, chosen to be $1$ and $-1$, for the eight factors of interest and for three additional ``dummy'' factors $d_1, d_2, d_3$ required to complete the 12-experiment design, together with the computed response $Y$.
library(FrF2)
library(xtable)
set.seed(3138989)
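# True response used in this example: only x1, x3, x5, x7, and x8 affect Y, with a
# quadratic term in x8, an interaction between x1 and x3, and Gaussian noise (sd = 0.6).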
get_cost <- function(data) {
return((-1.5 * as.numeric(data$x1)) + (1.3 * as.numeric(data$x3)) +
(1.6 * as.numeric(data$x1) * as.numeric(data$x3)) +
(1.35 * as.numeric(data$x8) * as.numeric(data$x8)) +
(3.1 * as.numeric(data$x5)) + (-1.4 * as.numeric(data$x7)) +
rnorm(nrow(data), sd = 0.6))
}
objective_data <- expand.grid(seq(-1, 1, 0.2),
seq(-1, 1, 0.2),
seq(-1, 1, 0.2),
seq(-1, 1, 0.2),
seq(-1, 1, 0.2))
names(objective_data) <- c("x1", "x3", "x5",
"x7", "x8")
objective_data$Y <- get_cost(objective_data)
options(warn = -1)
design <- pb(12, factor.names = c("x1", "x2", "x3",
"x4", "x5", "x6",
"x7", "x8", "d1",
"d2", "d3"))
options(warn = 0)
design$Y <- get_cost(design)
names(design) <- c("$x_1$", "$x_2$", "$x_3$",
"$x_4$", "$x_5$", "$x_6$",
"$x_7$", "$x_8$", "$d_1$",
"$d_2$", "$d_3$", "$Y$")
cap <- "Randomized Plackett-Burman design for factors $x_1, \\dots, x_8$, using 12 experiments and ``dummy'' factors $d_1, \\dots, d_3$, and computed response $\\mathbf{Y}$"
tab <- xtable(design, caption = cap, label = "tab:plackett")
align(tab) <- "ccccccccccccc"
print(tab, booktabs = TRUE,
include.rownames = FALSE,
caption.placement = "top",
size = "\\scriptsize",
table.placement="b",
sanitize.text.function = function(x){x})
So far, we have performed steps 1, 2, and 3 from
Figure fig:doe_anova_strategy. We use our initial assumption in
Equation \eqref{eq:linear_assumption} to identify the most significant factors by
performing an ANOVA test, which is step 5 from
Figure fig:doe_anova_strategy. The results are shown in
Table \ref{tab:anova_linear}, where the significance of each factor is
interpreted from the F-test and the corresponding $\text{Pr}(>\text{F})$ values. We see in Table \ref{tab:anova_linear} that factors $x_1$, $x_3$, $x_5$, $x_7$, and $x_8$ are the most significant, and that the remaining factors can be dropped from subsequent steps.
library(extrafont)
names(design) <- c("x1", "x2", "x3",
"x4", "x5", "x6",
"x7", "x8", "d1",
"d2", "d3", "Y")
regression <- lm(Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = design)
par(family = 'serif')
MEPlot(regression, main = NULL, pch = 19,
lwd = 0, cex.xax = 2.9, cex.main = 3.1,
cex.axis = 1)
library(xtable)
options(warn = -1)
names(design) <- c("x1", "x2", "x3",
"x4", "x5", "x6",
"x7", "x8", "d1",
"d2", "d3", "Y")
regression <- aov(Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = design)
s_regression <- as.data.frame(summary.aov(regression)[[1]])
s_regression <- s_regression[1:8, c("F value", "Pr(>F)")]
s_regression$stars <- symnum(s_regression[ , "Pr(>F)"], na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("$***$", "$**$", "$*$", "$\\cdot$", " "))
names(s_regression) <- c("F value", "$\\text{Pr}(>\\text{F})$", "Signif.")
rownames(s_regression) <- c("$x_1$", "$x_2$", "$x_3$",
"$x_4$", "$x_5$", "$x_6$",
"$x_7$", "$x_8$")
cap <- "Shortened ANOVA table for the fit of the naive model, with significance intervals from the \\texttt{R} language"
x <- xtable(s_regression, caption = cap, digits = 3, display = c("s", "f", "f", "s"), label = "tab:anova_linear")
align(x) <- xalign(x)
options(warn = 0)
print(x, size = "\\small",
math.style.exponents = TRUE,
booktabs = TRUE,
sanitize.text.function = function(x){x},
table.placement="t",
caption.placement = "top")
Moving forward to steps 6, 7, and 8 in Figure fig:doe_anova_strategy, we will
build a linear model using factors $x_1$, $x_3$, $x_5$, $x_7$, and $x_8$, which the ANOVA test identified as significant.
We now proceed to the prediction step, where we wish to identify the levels of factors $x_1$, $x_3$, $x_5$, $x_7$, and $x_8$ that minimize the response predicted by the fitted model. Table \ref{tab:linear_prediction_comparison} compares the configuration predicted to be the best by the linear model with the true global minimum of the search space.
library(xtable)
library(dplyr)
names(design) <- c("x1", "x2", "x3",
"x4", "x5", "x6",
"x7", "x8", "d1",
"d2", "d3", "Y")
design <- lapply(design, function(x){return(as.numeric(as.character(x)))})
regression <- lm(Y ~ x1 + x3 + x5 + x7 + x8, data = design)
prediction <- predict(regression, newdata = objective_data)
comparison_data <- objective_data[prediction == min(prediction), ]
comparison_data <- bind_rows(comparison_data, objective_data[objective_data$Y == min(objective_data$Y), ])
rownames(comparison_data) <- c(" Lin.", "Min.")
names(comparison_data) <- c("$x_1$", "$x_3$", "$x_5$",
"$x_7$", "$x_8$", "$Y$")
comparison_data[ , "$x_2$"] <- c("--", -0.2)
comparison_data[ , "$x_4$"] <- c("--", 0.6)
comparison_data[ , "$x_6$"] <- c("--", 0.4)
comparison_data <- comparison_data[ , c("$x_1$", "$x_2$", "$x_3$",
"$x_4$", "$x_5$", "$x_6$",
"$x_7$", "$x_8$", "$Y$")]
names(comparison_data) <- c("$\\bm{x_1}$", "$x_2$", "$\\bm{x_3}$",
"$x_4$", "$\\bm{x_5}$", "$x_6$",
"$\\bm{x_7$}", "$\\bm{x_8}$", "$Y$")
cap <- "Comparison of the response $Y$ predicted by the linear model and the true global minimum. Factors used in the model are bolded"
x <- xtable(comparison_data, caption = cap, digits = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 3), label = "tab:linear_prediction_comparison")
align(x) <- xalign(x)
options(warn = 0)
print(x,
size = "\\footnotesize",
math.style.exponents = TRUE,
booktabs = TRUE,
include.rownames = TRUE,
sanitize.text.function = function(x){x},
tabular.environment = "tabularx",
width = "\\columnwidth",
table.placement="b",
caption.placement = "top")
Using 12 measurements and a simple linear model, the configuration predicted to be the best already comes close to the true global minimum, but it misses the optimal level of $x_8$, whose purely quadratic effect cannot be captured by a 2-level screening design.
We can improve upon this result if we introduce some information about the problem and use a more flexible design construction technique. Next, we will discuss the construction of efficient designs using problem-specific formulas and continue the optimization of our example.
The application of DoE to autotuning problems requires design construction techniques that support factors of arbitrary types and numbers of levels. Autotuning problems typically combine factors such as binary flags, integer and floating point numerical values, and unordered enumerations of abstract values. Because Plackett-Burman designs only support 2-level factors, we had to restrict factor levels to interval extremities in our example. We have seen that this restriction makes it difficult to measure the significance of quadratic terms. We will now show how to optimize our example further by using D-Optimal designs, which increase the number of levels we can efficiently screen and enable detecting the significance of more complex terms.
To construct a D-Optimal design it is necessary to choose an initial model,
which can be done based on previous experiments or on expert knowledge of the
problem. Once a model is selected, algorithmic construction is performed by
searching for the set of experiments that minimizes D-Optimality, a measure of
the variance of the estimators for the regression coefficients associated with
the selected model. This search is usually done by swapping experiments from the
current candidate design with experiments from a pool of possible experiments,
according to certain rules, until some stopping criterion is met. In the
example in this section and in our approach we use Fedorov’s
algorithm \cite{fedorov1972theory} for constructing D-Optimal designs,
implemented in R
in the AlgDesign
package \cite{wheeler2014algdesign}.
In our example, suppose that in addition to using our previous screening results
we decide to hire an expert in our problem’s domain. The expert confirms our
initial assumptions from the screening step, and suggests that the factor $x_8$ has a quadratic effect on the response and that factors $x_1$ and $x_3$ interact.
Using our previous screening and the domain knowledge provided by our expert, we
choose a new performance model and use it to construct a D-Optimal design using
Fedorov’s algorithm. Since we need enough degrees of freedom to fit our model,
we construct the design with 12 experiments shown in Table \ref{tab:d_optimal}.
Note that the design includes intermediate levels of $x_8$, in addition to the extreme levels $-1$ and $1$, which makes it possible to estimate the quadratic term $x_8^2$.
library(xtable)
library(dplyr)
library(AlgDesign)
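# Construct a 12-experiment D-Optimal design from the candidate grid for the refined
# model, which adds a quadratic term in x8 and an x1:x3 interaction to the linear terms.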
output <- optFederov(~ x1 + x3 + x5 + x7 + x8 + I(x8 ^ 2) + x1:x3,
nTrials = 12,
data = objective_data)
dopt_design <- output$design
dopt_regression <- lm(Y ~ x1 + x3 + x5 + x7 + x8 + I(x8 ^ 2) + x1:x3, data = dopt_design)
dopt_prediction <- predict(dopt_regression, newdata = objective_data)
dopt_data <- objective_data[dopt_prediction == min(dopt_prediction), ]
names(dopt_data) <- c("$x_1$", "$x_3$", "$x_5$",
"$x_7$", "$x_8$", "$Y$")
names(dopt_design) <- c("$x_1$", "$x_3$", "$x_5$",
"$x_7$", "$x_8$", "$Y$")
cap <- "D-Optimal design constructed for the factors $\\left\\{x_1,x_3,x_5,x_7,x_8\\right\\}$ and computed response $Y$"
x <- xtable(dopt_design, caption = cap, digits = c(1, 1, 1, 1, 1, 1, 3), label = "tab:d_optimal")
align(x) <- xalign(x)
options(warn = 0)
print(x,
size = "\\footnotesize",
math.style.exponents = TRUE,
booktabs = TRUE,
include.rownames = FALSE,
sanitize.text.function = function(x){x},
table.placement="t",
caption.placement = "top")
s_regression <- as.data.frame(summary.aov(dopt_regression)[[1]])
s_regression <- s_regression[1:7, c("F value", "Pr(>F)")]
rownames(s_regression) <- c("$x_1$", "$x_3$", "$x_5$",
"$x_7$", "$x_8$", "I($x_8^2$)",
"$x_1$:$x_3$")
s_regression$stars <- symnum(s_regression[ , "Pr(>F)"], na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("$***$", "$**$", "$*$", "$\\cdot$", " "))
names(s_regression) <- c("F value", "$\\text{Pr}(>\\text{F})$", "Signif.")
cap <- paste("Shortened ANOVA table for the fit of the refined model (", attributes(s_regression[ , "Signif."])$legend, ")", sep = "")
x <- xtable(s_regression, caption = cap, digits = 3, display = c("s", "f", "f", "s"))
align(x) <- xalign(x)
options(warn = 0)
print(x, size = "\\small",
math.style.exponents = TRUE,
booktabs = TRUE,
sanitize.text.function = function(x){x},
table.placement="ht",
caption.placement = "top")
We now fit this model using the results of the experiments in our design. Table \ref{tab:correct_fit} shows the model fit table and compares the estimated and real model coefficients. This example illustrates that the DoE approach can achieve close model estimations using few resources, provided the approach is able to use user input to identify significant factors, and knowledge about the problem domain to tweak the model.
s_regression <- as.data.frame(coef(summary.lm(dopt_regression)))
s_regression <- s_regression[, c("Estimate", "t value", "Pr(>|t|)")]
rownames(s_regression) <- c("Intercept", "$x_1$", "$x_3$", "$x_5$",
"$x_7$", "$x_8$", "$x_8^2$","$x_1x_3$")
s_regression$Significance <- symnum(s_regression[ , "Pr(>|t|)"], na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
names(s_regression) <- c("Estimated", "t value", "Pr$(>|\\text{t}|)$", "Signif.")
s_regression$Real <- c(0, -1.5, 1.3, 3.1, -1.4, 0, 1.35, 1.6)
s_regression <- s_regression[ , c("Real", "Estimated", "t value", "Pr$(>|\\text{t}|)$", "Signif.")]
cap <- "Correct model fit comparing real and estimated coefficients, with significance intervals from the \\texttt{R} language"
x <- xtable(s_regression, caption = cap, digits = 3, display = c("s", "f", "f", "f", "f", "s"), label = "tab:correct_fit")
align(x) <- xalign(x)
options(warn = 0)
print(x, size = "\\small",
math.style.exponents = TRUE,
booktabs = TRUE,
sanitize.text.function = function(x){x},
table.placement="b",
caption.placement = "top")
Table \ref{tab:prediction_comparisons} compares the global minimum of this
example with the predictions made by our initial linear model from the screening
step, and our improved model. Using screening, D-Optimal designs, and domain
knowledge, we found a configuration whose response lies very close to the true global minimum, using a total of only 24 measurements.
dopt_data[ , "$x_2$"] <- c("--")
dopt_data[ , "$x_4$"] <- c("--")
dopt_data[ , "$x_6$"] <- c("--")
tab_dopt_data <- dopt_data[ , c("$x_1$", "$x_2$", "$x_3$",
"$x_4$", "$x_5$", "$x_6$",
"$x_7$", "$x_8$", "$Y$")]
names(tab_dopt_data) <- c("$\\bm{x_1}$", "$x_2$", "$\\bm{x_3}$",
"$x_4$", "$\\bm{x_5}$", "$x_6$",
"$\\bm{x_7$}", "$\\bm{x_8}$", "$Y$")
dopt_comparison_data <- bind_rows(tab_dopt_data, comparison_data)
rownames(dopt_comparison_data) <- c(" Quad.", "Lin.", "Min.")
cap <- "Comparison of the response $Y$ predicted by our models and the true global minimum. Factors used in the models are bolded"
x <- xtable(dopt_comparison_data, caption = cap, digits = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 3), label = "tab:prediction_comparisons")
align(x) <- xalign(x)
options(warn = 0)
print(x,
size = "\\footnotesize",
math.style.exponents = TRUE,
booktabs = TRUE,
include.rownames = TRUE,
sanitize.text.function = function(x){x},
table.placement="ht",
caption.placement = "top")
We are able to explain the performance improvements we obtained in each step of
the process, because we finish steps with a performance model and a performance
prediction. Each factor is included or removed using information obtained in
statistical tests, or expert knowledge. If we need to optimize this problem
again, for a different architecture or with larger input, we could start
exploring the search space with a less naive model. We could also continue the
optimization of this problem by exploring additional levels of the factors kept in the model, around the predicted best configuration.
The process of screening for factor significance using ANOVA and fitting a new model using acquired knowledge is equivalent to steps 5, 6, and 7 in Figure fig:doe_anova_strategy. In the next section we evaluate the performance of our DoE approach in two scenarios.
In this section we present performance evaluations of our approach in two scenarios that differ on search space size and complexity.
We first evaluated the performance of our approach in a Laplacian Kernel
implemented using BOAST \cite{videau2017boast} and targeting the Nvidia K40c
GPU. The objective was to minimize the time to compute each pixel by finding the
best level combination for the factors listed in Table
tab:gpu_laplacian_factors. Considering only factors and levels, the size of the search space is given by the product of the numbers of levels of each factor, although not every combination of levels corresponds to a valid kernel configuration.
We applied domain knowledge to construct an initial performance model for this kernel, combining terms for the factors in Table tab:gpu_laplacian_factors.
This performance model was used by the Iterative Linear Model (LM) algorithm and by our D-Optimal Design approach (DLMT). LM is almost identical to our approach, described in Section Autotuning with Design of Experiments, but it uses a fixed-size random sample of the search space instead of generating D-Optimal designs. We compared the performance of our approach with the following algorithms: uniform Random Sampling (RS); Latin Hypercube Sampling (LHS); Greedy Search (GS); Greedy Search with Restart (GSR); and Genetic Algorithm (GA). Each algorithm performed at most 125 measurements over 1000 repetitions, without user intervention.
\begin{table}[b]
\caption{Factors and levels of the GPU Laplacian kernel search space}
\label{tab:gpu_laplacian_factors}
\centering
\scriptsize
\begin{tabular}{lll}
\toprule
Factor & Levels & Short Description \\
\midrule
\texttt{vector\_length} & & Size of support arrays \\
\texttt{load\_overlap} & \textit{true}, \textit{false} & Load overlaps in vectorization \\
\texttt{temporary\_size} & & Byte size of temporary data \\
\texttt{elements\_number} & & Size of equal data splits \\
\texttt{y\_component\_number} & & Loop tile size \\
\texttt{threads\_number} & $2^5, \dots, 2^{10}$ & Size of thread groups \\
\texttt{lws\_y} & $2^0, \dots, 2^{10}$ & Block size in the $y$ dimension \\
\bottomrule
\end{tabular}
\end{table}
Since we measured the entire valid search space, we could use the slowdown relative to the global minimum to compare algorithm performance. Table \ref{tab:gpu_laplacian_compare_budget} shows the mean, minimum, and maximum slowdowns in comparison to the global minimum for each algorithm, together with the mean and maximum budget used by each algorithm. Figure fig:gpu_laplacian_comparison_histogram presents histograms with the counts of the slowdowns found in each of the 1000 repetitions. Arrows point to the maximum slowdown found by each algorithm. Note that GS's maximum slowdown was left out of range to ease the comparison between the other algorithms.
library(xtable)
df_all_methods <- read.csv("./dopt_anova_experiments/data/complete_1000.csv", strip.white = T, header = T)
df_all_methods$method <- factor(df_all_methods$method, levels = c("RS","LHS","GS","GSR","GA","LM", "LMB", "LMBT", "RQ", "DOPT", "DLM", "DLMT"))
df_all_methods <- subset(df_all_methods, method %in% c("RS", "LHS", "GS", "GSR", "GA", "LM", "DLMT"))
summaries <- data.frame(RS = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "RS", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "RS",]$point_number),
max(df_all_methods[df_all_methods$method == "RS",]$point_number)),
LHS = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "LHS", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "LHS",]$point_number),
max(df_all_methods[df_all_methods$method == "LHS",]$point_number)),
GS = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "GS", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "GS",]$point_number),
max(df_all_methods[df_all_methods$method == "GS",]$point_number)),
GSR = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "GSR", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "GSR",]$point_number),
max(df_all_methods[df_all_methods$method == "GSR",]$point_number)),
GA = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "GA", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "GA",]$point_number),
max(df_all_methods[df_all_methods$method == "GA",]$point_number)),
LM = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "LM", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "LM",]$point_number),
max(df_all_methods[df_all_methods$method == "LM",]$point_number)),
DLMT = c(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "DLMT", ]$slowdown)))[ , 1],
mean(df_all_methods[df_all_methods$method == "DLMT",]$point_number),
max(df_all_methods[df_all_methods$method == "DLMT",]$point_number)))
rownames(summaries) <- c(rownames(as.data.frame(unclass(summary(df_all_methods[df_all_methods$method == "RS", ]$slowdown)))), "Mean Budget", "Max. Budget")
summaries <- t(summaries)
summaries <- summaries[ , c("Mean", "Min.", "Max.", "Mean Budget", "Max. Budget")]
cap <- "Slowdown and budget used by 7 optimization methods on the Laplacian Kernel, using a budget of 125 points with 1000 repetitions"
x <- xtable(summaries, caption = cap, digits = 2, label = "tab:gpu_laplacian_compare_budget")
align(x) <- xalign(x)
display(x) <- display(x)
print(x,
size = "\\footnotesize",
booktabs = TRUE,
add.to.row = list(pos = as.list(c(6)), command = c("\\rowcolor{red!25}")),
math.style.exponents = TRUE,
table.placement = "b",
caption.placement = "top")
All algorithms performed relatively well on this kernel, with only GS failing to find slowdowns smaller than $4\times$ in some runs. As expected, the other search algorithms had results similar to RS. LM was able to find slowdowns close to the global minimum in most runs, although some runs stalled at noticeably larger slowdowns. Our approach (DLMT) consistently found slowdowns within $1\%$ of the global minimum while using at most half of the allotted budget, as shown in Table \ref{tab:gpu_laplacian_compare_budget}.
We implemented a simple approach for the prediction step in this problem, choosing the best value of our fitted models on the complete set of valid level combinations. This was possible for this problem since all valid combinations were known. For problems where the search space is too large to be generated, we would have to either adapt this step to run the prediction on a sample, or minimize the model using the differentiation strategies mentioned in Section Screening & Plackett-Burman Designs.
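For illustration, a sampled variant of this prediction step could look like the sketch below, where sample_valid_configurations() is a hypothetical helper that draws valid configurations from the constrained space and fit is the fitted model of the current iteration.

# Sketch: prediction over a random sample when the full space cannot be enumerated
candidates <- sample_valid_configurations(n = 100000)   # hypothetical helper
candidates$predicted_Y <- predict(fit, newdata = candidates)
best_candidate <- candidates[which.min(candidates$predicted_Y), ]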
\begin{figure}[t]
  \centering
  \includegraphics[width=\columnwidth]{./img/comparison_histogram.pdf}
  \caption{Histograms of the slowdowns relative to the global minimum found by each algorithm on the Laplacian kernel, over 1000 repetitions. Arrows point to the maximum slowdown found by each algorithm.}
  \label{fig:gpu_laplacian_comparison_histogram}
\end{figure}
This kernel provided ideal conditions for using our approach, where the performance model is approximately known and the complete valid search space is small enough to be used for prediction. The global minimum also appears to not be isolated in a region of points with bad performance, since our approach was able to exploit space geometry. We will now present a performance evaluation of our approach in a larger and more comprehensive benchmark.
The SPAPT \cite{balaprakash2012spapt} benchmark suite provides parametrized CPU kernels from different HPC domains. The kernels shown in Table tab:spapt_apps are implemented using the code annotation and transformation tools provided by Orio \cite{hartono2009annotation}. Search space sizes are larger than in the Laplacian Kernel example. Kernel factors are either integers in an interval, such as loop unrolling and register tiling amounts, or binary flags that control parallelization and vectorization.
\begin{figure}[t]
  \centering
  \includegraphics[width=\columnwidth]{./img/iteration_best_comparison.pdf}
  \caption{Speedup found by each run of RS and DLMT on the SPAPT kernels, plotted against the iteration where it was found. Ellipses estimate where 95\% of the underlying distribution lies, and dashed lines mark the \texttt{-O3} baseline.}
  \label{fig:iteration_best_comparison}
\end{figure}
We used the Random Sampling (RS) implementation available in Orio and integrated an implementation of our approach (DLMT) to the system. We omitted the other Orio algorithms because other studies using SPAPT kernels \cite{balaprakash2011can,balaprakash2012experimental} showed that their performance is similar to RS regarding budget usage. The global minima are not known for any of the problems, and search spaces are too large to allow complete measurements. Therefore, we used the performance of each application compiled with \texttt{gcc}’s \texttt{-O3}, with no code transformations, as a baseline for computing the speedups achieved by each strategy. We performed 10 autotuning repetitions for each kernel using RS and DLMT, using a budget of at most 400 measurements. DLMT was allowed to perform only 4 of the iterations shown in Figure fig:doe_anova_strategy. Experiments were performed using Grid5000 \cite{balouek2013adding}, on Debian Jessie, using an Intel Xeon E5-2630v3 CPU and \texttt{gcc} version \texttt{6.3.0}.
\begin{table}[t]
\caption{SPAPT kernels used in the performance evaluation, with the number of tuning factors and the size of each search space}
\label{tab:spapt_apps}
\centering
\scriptsize
\begin{tabular}{llrr}
\toprule
Kernel & Operation & Factors & Size \\
\midrule
\texttt{atax} & Matrix transp. \& vector mult. & 18 & $2.6 \times 10^{16}$ \\
\texttt{dgemv3} & Scalar, vector \& matrix mult. & 49 & $3.8 \times 10^{36}$ \\
\texttt{gemver} & Vector mult. \& matrix add. & 24 & $2.6 \times 10^{22}$ \\
\texttt{gesummv} & Scalar, vector, \& matrix mult. & 11 & $5.3 \times 10^{9}$ \\
\texttt{hessian} & Hessian computation & 9 & $3.7 \times 10^{7}$ \\
\texttt{mm} & Matrix multiplication & 13 & $1.2 \times 10^{12}$ \\
\texttt{mvt} & Matrix vector product \& transp. & 12 & $1.1 \times 10^{9}$ \\
\texttt{tensor} & Tensor matrix mult. & 20 & $1.2 \times 10^{19}$ \\
\texttt{trmm} & Triangular matrix operations & 25 & $3.7 \times 10^{23}$ \\
\texttt{bicg} & Subkernel of BiCGStab & 13 & $3.2 \times 10^{11}$ \\
\texttt{lu} & LU decomposition & 14 & $9.6 \times 10^{12}$ \\
\texttt{adi} & Matrix sub., mult., \& div. & 20 & $6.0 \times 10^{15}$ \\
\texttt{jacobi} & 1-D Jacobi computation & 11 & $5.3 \times 10^{9}$ \\
\texttt{seidel} & Matrix factorization & 15 & $1.3 \times 10^{14}$ \\
\texttt{stencil3d} & 3-D stencil computation & 29 & $9.7 \times 10^{27}$ \\
\texttt{correlation} & Correlation computation & 21 & $4.5 \times 10^{17}$ \\
\bottomrule
\end{tabular}
\end{table}
The time to measure each kernel varied from a few seconds to up to 20 minutes. In testing, some transformations caused the compiler to enter an internal optimization process that did not stop for over 12 hours. We did not study why these cases took so long, and instead implemented an execution timeout of 20 minutes, treating cases that took longer than that to compile and run as runtime failures.
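A measurement wrapper with such a timeout can be sketched as follows; the command string and the output file are hypothetical, and any configuration that does not finish within the limit is treated as a failure.

# Sketch: treat configurations that exceed a 20-minute limit as runtime failures
measure_with_timeout <- function(command, timeout_seconds = 20 * 60) {
  status <- system(command, timeout = timeout_seconds)
  if (status != 0) {
    return(NA)  # compilation/execution failure or timeout
  }
  as.numeric(readLines("runtime.txt"))  # hypothetical file written by the kernel run
}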
As in the previous example, we automated factor elimination based on ANOVA tests so that a comprehensive evaluation could be performed. We also did not tailor initial performance models, which were the same for all kernels. Initial models had a linear term for each factor with two or more levels, plus quadratic and cubic terms for factors with sufficient levels. Although this automation and the identical initial models might have limited the improvements at each step of our approach, our results show that it still succeeded in decreasing the budget needed to find significant speedups for some kernels.
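For illustration, a generic initial model of this kind can be built programmatically, as in the sketch below; factor_space is a hypothetical data frame with one numeric column per tuning factor.

# Sketch: build a generic initial model with linear, quadratic, and cubic terms
make_initial_model <- function(factor_space) {
  terms <- unlist(lapply(names(factor_space), function(f) {
    n_levels <- length(unique(factor_space[[f]]))
    if (n_levels < 2) return(NULL)                        # constant factors add no term
    t <- f                                                # linear term
    if (n_levels >= 3) t <- c(t, paste0("I(", f, "^2)"))  # quadratic term
    if (n_levels >= 4) t <- c(t, paste0("I(", f, "^3)"))  # cubic term
    t
  }))
  as.formula(paste("Y ~", paste(terms, collapse = " + ")))
}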
Figure fig:iteration_best_comparison presents the speedup found by each run of RS and DLMT, plotted against the algorithm iteration where that speedup was found. We divided the kernels into 3 groups according to the results. The group where no algorithm found any speedups contains 3 kernels and is marked with "[0]" and blue headers. The group where both algorithms found similar speedups, in similar iterations, contains 6 kernels and is marked with "[=]" and orange headers. The group where DLMT found similar speedups using a significantly smaller budget than RS contains 8 kernels and is marked with "[+]" and green headers. Ellipses delimit an estimate of where 95\% of the underlying distribution lies, and a dashed line marks the \texttt{-O3} baseline. In comparison to RS, our approach significantly decreased the average number of iterations needed to find speedups for the 8 kernels in the green group.
Figure fig:split_histograms shows the search space exploration performed by RS and DLMT. It uses the same color groups as Figure fig:iteration_best_comparison, and shows the distribution of the speedups found during all repetitions of the experiments. Histogram areas corresponding to DLMT are usually smaller because it always stopped at 4 iterations, while RS always performed 400 measurements. This is particularly visible in \texttt{lu}, \texttt{mvt}, and \texttt{jacobi}. We also observe that the quantity of configurations with high speedups found by DLMT is higher, even for kernels in the orange group. This is noticeable in \texttt{gemver}, \texttt{bicgkernel}, \texttt{mm}, and \texttt{tensor}, and means that DLMT spent less of the budget exploring configurations with small speedups or slowdowns, in comparison with RS.
By analyzing the significant performance parameters identified by our automated approach for every kernel, we were able to identify interesting relationships between parameters and performance. In \texttt{bicgkernel}, for example, DLMT identified a linear relationship for OpenMP and scalar replacement optimizations, and quadratic relationships between register and cache tiling, and loop unrolling. This is an example of the transparency in the optimization process that can be achieved with a DoE approach.
Our approach used a generic initial performance model for all kernels, but since it iteratively eliminates factors and model terms based on ANOVA tests, it was still able to exploit global search space structures for kernels in the orange and green groups. Even in this automated setting, the results with SPAPT kernels illustrate the ability our approach has to reduce the budget needed to find good speedups by efficiently exploring search spaces.
The contribution of this paper is a transparent DoE approach for program autotuning under tight budget constraints. We discussed the underlying concepts that enable our approach to reduce significantly the measurement budget needed to find good optimizations consistently over different kernels exposing configuration parameters of source-to-source transformations. We have made efforts to make our results, figures and analyses reproducible by hosting all our scripts and data publicly \cite{bruel2018ccgrid19}.
Our approach outperformed six other search heuristics, always finding a slowdown within 1\% of the global optimum of the search space defined by the optimization of a Laplacian kernel for GPUs, while using at most half of the allotted budget. In a more comprehensive evaluation, using kernels from the SPAPT benchmark suite, our approach was able to find the same speedups as RS while using up to $10\times$ fewer measurements.
We presented a completely automated version of our approach in this paper so that we could perform a thorough performance evaluation on comprehensive benchmarks. Despite using the same generic performance model for all kernels, our approach was able to find good speedups by eliminating insignificant model terms at each iteration. This means that our approach can still improve the performance of applications using unspecialized models that incorporate only general knowledge about algorithm performance. We would incur some budget overhead in this case while insignificant terms are removed.
In future work we will explore the impact of user input and expert knowledge in the selection of the initial performance model and in the subsequent elimination of factors using ANOVA tests. We expect that tailored initial performance models and assisted factor elimination will improve the solutions found by our approach and decrease the budget needed to find them.
Our current strategy eliminates completely from the model the factors with low significance detected by ANOVA tests. In future work we will also explore the effect of adding random experiments with randomized factor levels. We expect this will decrease the impact of removing factors wrongly detected to have low significance.
Decreasing the number of experiments needed to find optimizations is a desirable property for autotuners in problem domains other than source-to-source transformation. We intend to evaluate the performance of our approach in domains such as High-Level Synthesis and compiler configuration for FPGAs, where search spaces can get as large as $10^{126}$, and where we already have some experience \cite{bruel2017autotuninghls}.
Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations. This work was partly funded by CAPES, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil, funding code 001.