Methods and benchmarks for fast, approximate Gaussian kernel density estimation. This repo is for archiving research results. Want to use fast KDE in your own projects? See github.com/uwdata/fast-kde.
This repo contains code (src/
) and benchmark (benchmarks/
) scripts for the IEEE VIS 2021 Short Paper Fast & Accurate Gaussian Kernel Density Estimation. All code is written as ESM modules. Running benchmarks requires Node.js version 13.2 or higher.
If you use or build on this repo in academic work, please use this citation:
@inproceedings{2021-Heer-FastKDE,
title = {Fast \& Accurate Gaussian Kernel Density Estimation},
author = {Jeffrey Heer},
booktitle = {IEEE VIS Short Papers},
year = {2021},
url = {http://idl.cs.washington.edu/papers/fast-kde},
}
To build a bundle (ESM module or minified UMD):
- Run
yarn
to install dependencies. - Run
yarn build
to build the bundles.
Compiled bundles will be written to the dist
directory.
- Run
yarn
to install dependencies. - Run
yarn benchmarks
to run benchmarks. - Wait until the benchmarks complete...
Benchmark result datasets will be written to the results
directory as JSON files.
Convenience methods for performing 1D or 2D density estimation.
# kde.density1d()
Creates a new 1D density estimate helper with default settings.
Example
// perform 1D estimation with bandwidth = 1 over domain [0, 10]
// use default grid size (512) and method (kdeDeriche1d)
// returns an array of [ { x, value }, ... ] points
kde.density1d()
.bandwidth(1)
.extent([0, 10])
.points([1, 2, 5, 5, 6, 9])
# density1d(data)
Computes a 1D Gaussian kernel density estimate using the current settings for the given data and returns a Float64Array of gridded density values. To instead produce an array of objects containing coordinate values and density estimates, use density1d.points().
# density1d.points(data)
Computes a 1D Gaussian kernel density estimate using the current settings for the given data and returns an array of objects containing the grid coordinate value (x
) and density value (value
).
# density1d.x([x])
Get or set the x coordinate getter for the input data. Defaults to the identity function, which assumes a flat array of values.
# density1d.grid([grid])
Get or set the grid function for binning the input data into a one-dimensional grid, such as grid1d_linear (the default) or grid1d_simple.
# density1d.size([size])
Get or set the size (default 512) of the binned data grid.
# density1d.bandwidth([bandwidth])
Get or set the bandwidth (standard deviation) of the Gaussian kernel. If set to null
or zero (the default), the normal reference density heuristic will be used to select a bandwidth automatically.
# density1d.extent([extent])
Get or set the extent ([x0, x1]) of the data domain over which to perform density estimation. Any data points outside this extent will be ignored. The value defaults to [0, 1]. Note that the estimation helper does not automatically select a domain based on the input data.
# density1d.method([method])
Get or set the density estimation method to use. One of kdeDeriche1d (the default), kdeBox1d, kdeExtBox1d, or kdeCDF1d.
# kde.density2d()
Creates a new 2D density estimate helper with default settings.
Example
// perform 2D estimation with bandwidth = 1 over domain [[0, 10], [0, 10]]
// use default grid size ([256, 256]) and method (kdeDeriche2d)
// returns an array of [ { x, y, value }, ... ] points
kde.density2d()
.bandwidth([1, 1])
.extent([[0, 10], [0, 10]])
.points([[1, 1], [1, 2], [5, 4], [5, 3], [6, 2], [8, 7]])
# density2d(data)
Computes a 2D Gaussian kernel density estimate for the given data using the current settings and returns a Float64Array of gridded density values. To instead produce an array of objects containing coordinate values and density estimates, use density2d.points().
# density2d.points(data)
Computes a 2D Gaussian kernel density estimate for the given data using the current settings and returns an array of objects containing the grid coordinate (x
, y
) and density (value
) values.
# density2d.x([x])
Get or set the x coordinate getter for the input data. Defaults to index 0 (i.e., the first element if individual data points are provided as arrays).
# density2d.y([y])
Get or set the y coordinate getter for the input data. Defaults to index 1 (i.e., the second element if individual data points are provided as arrays).
# density2d.grid([grid])
Get or set the grid function for binning the input data into a two dimensional grid, such as grid2d_linear (the default) or grid2d_simple.
# density2d.size([size])
Get or set the size (default [256, 256]) of the binned data grid.
# density2d.bandwidth([bandwidth])
Get or set the bandwidths (standard deviations) of the Gaussian kernels as a two element array. If an entry is set to null
or zero (the defaults), the normal reference density heuristic will be used to select a bandwidth automatically for that dimension.
# density2d.extent([extent])
Get or set the extents ([[x0, x1], [y0, y1]]) of the data domain over which to perform density estimation. Any data points outside this extent will be ignored. The value defaults to [[0, 1], [0, 1]]. Note that the estimation helper does not automatically select domains based on the input data.
# density2d.method([method])
Get or set the density estimation method to use. One of kdeDeriche2d (the default), kdeBox2d, kdeExtBox2d, or kdeCDF2d.
# kde.kdeCDF1d(data, extent, size, bandwidth)
Computes a 1D Gaussian kernel density estimate for an array of numeric data values via direct calculation of the Gaussian cumulative distribution function. This method can be very slow in practice and is provided only for comparison and testing purposes; see kdeDeriche1d instead. The extent parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The size parameter indicates the number of bins (grid steps) to use. The bandwidth parameter indicates the width (standard deviation) of the kernel function. Returns a Float64Array of gridded density estimates.
# kde.kdeBox1d(data, extent, size, bandwidth, grid1d)
Computes a 1D Gaussian kernel density estimate for an array of numeric data values using the standard box filtering method. This method is not recommended for normal use, see kdeDeriche1d instead. The extent parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The size parameter indicates the number of bins (grid steps) to use. The bandwidth parameter indicates the width (standard deviation) of the kernel function. The grid1d parameter indicates the binning method to use, such as grid1d_linear or grid1d_simple. Returns a Float64Array of gridded density estimates.
# kde.kdeExtBox1d(data, extent, size, bandwidth, grid1d)
Computes a 1D Gaussian kernel density estimate for an array of numeric data values using the extended box filtering method, which smooths away the quantization error of standard box filtering. This method is not recommended for normal use, see kdeDeriche1d instead. The extent parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The size parameter indicates the number of bins (grid steps) to use. The bandwidth parameter indicates the width (standard deviation) of the kernel function. The grid1d parameter indicates the binning method to use, such as grid1d_linear or grid1d_simple. Returns a Float64Array of gridded density estimates.
# kde.kdeDeriche1D(data, extent, size, bandwidth, grid1d)
Computes a 1D Gaussian kernel density estimate for an array of numeric data values using Deriche's recursive filter approximation. This method is recommended for normal use. The extent parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The size parameter indicates the number of bins (grid steps) to use. The bandwidth parameter indicates the width (standard deviation) of the kernel function. The grid1d parameter indicates the binning method to use, such as grid1d_linear or grid1d_simple. Returns a Float64Array of gridded density estimates.
# kde.kdeCDF2d(data, extent, size, bandwidth)
Computes a 2D Gaussian kernel density estimate for two arrays of numeric data values via direct calculation of the Gaussian cumulative distribution function. This method can be very slow in practice and is provided only for comparison and testing purposes; see kdeDeriche2d instead. The input data should be an array with two entries: a numeric x-value array and a numeric y-value array. The extent parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The size parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The bandwidth parameter should be a [bandwidthX, bandwidthY] array. Returns a Float64Array of gridded density estimates.
# kde.kdeBox2d(data, extent, size, bandwidth, grid2d)
Computes a 2D Gaussian kernel density estimate for two arrays of numeric data values using the standard box filtering method. This method is not recommended for normal use, see kdeDeriche2d instead. The input data should be an array with two entries: a numeric x-value array and a numeric y-value array. The extent parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The size parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The bandwidth parameter should be a [bandwidthX, bandwidthY] array. The grid2d parameter indicates the binning method to use, such as grid2d_linear or grid2d_simple. Returns a Float64Array of gridded density estimates.
# kde.kdeExtBox2d(data, extent, size, bandwidth, grid2d)
Computes a 2D Gaussian kernel density estimate for two arrays of numeric data values using the extended box filtering method, which smooths away the quantization error of standard box filtering. This method is not recommended for normal use, see kdeDeriche2d instead. The input data should be an array with two entries: a numeric x-value array and a numeric y-value array. The extent parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The size parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The bandwidth parameter should be a [bandwidthX, bandwidthY] array. The grid2d parameter indicates the binning method to use, such as grid2d_linear or grid2d_simple. Returns a Float64Array of gridded density estimates.
# kde.kdeDeriche1D(data, extent, size, bandwidth, grid2d)
Computes a 2D Gaussian kernel density estimate for two arrays of numeric data values using Deriche's recursive filter approximation. This method is recommended for normal use. The input data should be an array with two entries: a numeric x-value array and a numeric y-value array. The extent parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The size parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The bandwidth parameters should be a [bandwidthX, bandwidthY] array. The grid2d parameter indicates the binning method to use, such as grid2d_linear or grid2d_simple. Returns a Float64Array of gridded density estimates.
Utility methods for grid construction and bandwidth suggestion.
# kde.grid1d_linear(data, size, init, scale[, offset])
Bins a set of numeric data values into a 1D output grid of size size, using linear interpolation of point mass across adjacent bins. The init value indicates the minimum grid value for the data domain. The scale parameter is a numeric value for scaling data values to a [0, 1] domain. The optional offset parameter (default 0) indicates the number of bins by which to offset the start of the data domain, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.
# kde.grid1d_linear(data, size, init, scale[, offset])
Bins a set of numeric data values into a 1D output grid of size size, where the full weight of a point is assigned to the nearest enclosing bin. The init value indicates the minimum grid value for the data domain. The scale parameter is a numeric value for scaling data values to a [0, 1] domain. The optional offset parameter (default 0) indicates the number of bins by which to offset the start of the data domain, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.
# kde.grid2d_linear(data, size, init, scale, offset)
Bins a set of numeric data values into a 2D output grid of size size, using linear interpolation of point mass across adjacent bins. The input data should be an array with two entries: a numeric x-value array and a numeric y-value array. The size parameter sets the grid dimensions and should be an [sizeX, sizeY] array containing two integers. The init parameter ([initX, initY]) indicates the minimum grid values for the data domains. The scale parameter ([scaleX, scaleY]) provides numeric values for scaling data values to a [0, 1] domain. The offset parameter ([offsetX, offsetY]) indicates the number of bins by which to offset the start of the data domains, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.
# kde.grid2d_simple(data, size, init, scale, offset)
Bins a set of numeric data values into a 2D output grid of size size, where the full weight of a point is assigned to the nearest enclosing bin. The input data should be an array with two entries: a numeric x-value array and a numeric y-value array. The size parameter sets the grid dimensions and should be an [sizeX, sizeY] array containing two integers. The init parameter ([initX, initY]) indicates the minimum grid values for the data domains. The scale parameter ([scaleX, scaleY]) provides numeric values for scaling data values to a [0, 1] domain. The offset parameter ([offsetX, offsetY]) indicates the number of bins by which to offset the start of the data domains, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.
# kde.nrd(data)
Calculates a suggested bandwidth for an array of numeric data values, using Scott's normal reference distribution (NRD) heuristic.