umapr wraps the Python implementation of UMAP to make the algorithm accessible from within R. It uses the great reticulate package.
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction algorithm. It is similar to t-SNE but computationally more efficient. UMAP was created by Leland McInnes and John Healy (github, arxiv).
Recently, two new UMAP R packages have appeared. These packages provide more features than umapr does and are more actively developed:
- umap, which provides the same Python wrapping function as umapr and also an R implementation, removing the need for the Python version to be installed. It is available on CRAN.
- uwot, which also provides an R implementation, removing the need for the Python version to be installed.
Angela Li, Ju Kim, Malisa Smith, Sean Hughes, Ted Laderas
umapr is a project that was first developed at rOpenSci Unconf 2018.
First, you will need to install Python and the UMAP package. Instructions are available here.
Then, you can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("ropenscilabs/umapr")
Here is an example of running UMAP on the iris data set.
library(umapr)
library(tidyverse)
# select only numeric columns
df <- as.matrix(iris[ , 1:4])
# run UMAP algorithm
embedding <- umap(df)
umap returns a data.frame with two attached columns called "UMAP1" and "UMAP2". These columns represent the UMAP embeddings of the data, which are column-bound to the original data frame.
# look at result
head(embedding)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width UMAP1 UMAP2
#> 1 5.1 3.5 1.4 0.2 5.647059 -6.666872
#> 2 4.9 3.0 1.4 0.2 4.890193 -8.130815
#> 3 4.7 3.2 1.3 0.2 4.397037 -7.546669
#> 4 4.6 3.1 1.5 0.2 4.412886 -7.633424
#> 5 5.0 3.6 1.4 0.2 5.707233 -6.863213
#> 6 5.4 3.9 1.7 0.4 6.442851 -5.726554
# plot the result
embedding %>%
  mutate(Species = iris$Species) %>%
  ggplot(aes(UMAP1, UMAP2, color = Species)) +
  geom_point()
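Because the embedding columns are bound to the original data, they can be pulled apart again by name. The following is a minimal sketch in base R that uses only the six rows printed above rather than a fresh UMAP run. The object names embedding_head, coords and original are illustrative and not part of umapr.

```r
# First six rows of the result, copied from the printed head(embedding) output
embedding_head <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0, 5.4),
  Sepal.Width  = c(3.5, 3.0, 3.2, 3.1, 3.6, 3.9),
  Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7),
  Petal.Width  = c(0.2, 0.2, 0.2, 0.2, 0.2, 0.4),
  UMAP1 = c(5.647059, 4.890193, 4.397037, 4.412886, 5.707233, 6.442851),
  UMAP2 = c(-6.666872, -8.130815, -7.546669, -7.633424, -6.863213, -5.726554)
)

umap_cols <- c("UMAP1", "UMAP2")

# Embedding coordinates only
coords <- embedding_head[, umap_cols]

# Original variables only (everything that is not an embedding column)
original <- embedding_head[, setdiff(names(embedding_head), umap_cols)]

names(coords)
names(original)
```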
There is a function called run_umap_shiny() which brings up a Shiny app for exploring how the UMAP plots look when colored by different variables.
run_umap_shiny(embedding)
There are a few important parameters. These are fully described in the UMAP Python documentation.
The n_neighbors argument can range from 2 to n-1, where n is the number of rows in the data.
neighbors <- c(4, 8, 16, 32, 64, 128)
neighbors %>%
  map_df(~umap(as.matrix(iris[, 1:4]), n_neighbors = .x) %>%
    mutate(Species = iris$Species, Neighbor = .x)) %>%
  mutate(Neighbor = as.integer(Neighbor)) %>%
  ggplot(aes(UMAP1, UMAP2, color = Species)) +
  geom_point() +
  facet_wrap(~ Neighbor, scales = "free")
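Since the neighbor count has to stay between 2 and n-1, it can help to filter candidate values before looping over them. A small sketch in base R, assuming a hypothetical helper valid_neighbors() that is not part of umapr:

```r
# Hypothetical helper (not part of umapr): keep only neighbor counts
# that fall in the valid range [2, n - 1] for the given data
valid_neighbors <- function(candidates, data) {
  n <- nrow(data)
  candidates[candidates >= 2 & candidates <= n - 1]
}

candidates <- c(1, 4, 8, 16, 32, 64, 128, 256)

# iris has 150 rows, so 1 and 256 are dropped and
# 4, 8, 16, 32, 64 and 128 are kept
valid_neighbors(candidates, iris)
```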
The min_dist argument can range from 0 to 1.
dists <- c(0.001, 0.01, 0.05, 0.1, 0.5, 0.99)
dists %>%
  map_df(~umap(as.matrix(iris[, 1:4]), min_dist = .x) %>%
    mutate(Species = iris$Species, Distance = .x)) %>%
  ggplot(aes(UMAP1, UMAP2, color = Species)) +
  geom_point() +
  facet_wrap(~ Distance, scales = "free")
The metric argument sets the distance function, and many different distance functions are available.
metrics <- c("euclidean", "manhattan", "canberra", "cosine", "hamming", "dice")
metrics %>%
  map_df(~umap(as.matrix(iris[, 1:4]), metric = .x) %>%
    mutate(Species = iris$Species, Metric = .x)) %>%
  ggplot(aes(UMAP1, UMAP2, color = Species)) +
  geom_point() +
  facet_wrap(~ Metric, scales = "free")
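To get a feel for how much the choice of metric changes what "distance" means, the same pair of rows can be compared under several metrics. This sketch uses dist() from base R's stats package, which supports euclidean, manhattan and canberra among others. It only illustrates the metrics themselves and makes no claim about how the Python UMAP package computes them internally.

```r
# First two iris flowers, numeric columns only
x <- as.matrix(iris[1:2, 1:4])

# Metrics supported by stats::dist() that also appear in the list above
metric_names <- c("euclidean", "manhattan", "canberra")

# Distance between the two rows under each metric
pair_dist <- sapply(metric_names, function(m) as.numeric(dist(x, method = m)))

round(pair_dist, 4)
```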
t-SNE and UMAP are both non-linear dimensionality reduction methods, in contrast to PCA. Because t-SNE is relatively slow, PCA is sometimes run first to reduce the dimensions of the data.
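The PCA preprocessing step can be sketched with prcomp() from base R. Keeping two components and centering and scaling the data are assumptions made for illustration, not necessarily the settings used in the comparison described here.

```r
# Numeric columns of iris, as in the examples above
df <- as.matrix(iris[, 1:4])

# PCA with centering and scaling (illustrative settings)
pca <- prcomp(df, center = TRUE, scale. = TRUE)

# Keep the first two principal components as the reduced data
reduced <- pca$x[, 1:2]

dim(reduced)
```

The reduced matrix would then be passed to a t-SNE implementation in place of df.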
We compared UMAP to PCA and t-SNE alone, as well as to t-SNE run on data preprocessed with PCA. In each case, the data were subset to include only complete observations. The code to reproduce these findings is available in timings.R.
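Two ingredients of such a comparison, dropping incomplete observations and timing a method, can be sketched in base R. This is not the code from timings.R. It uses the airquality data set that ships with R, chosen here only because it contains missing values, and times PCA as a stand-in for any of the methods.

```r
# airquality ships with R and contains missing values
dat <- airquality

# Keep only rows without any missing values
complete <- dat[complete.cases(dat), ]
c(all_rows = nrow(dat), complete_rows = nrow(complete))

# Time one method on the complete observations (PCA as a stand-in)
timing <- system.time(prcomp(as.matrix(complete), scale. = TRUE))
timing[["elapsed"]]
```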
The first data set is the same iris data set used above (149 observations of 4 variables):
Next we tried a cancer data set, made up of 699 observations of 10 variables:
Third we tried a soybean data set. It is made up of 531 observations and 35 variables:
Finally, we used a large single-cell RNA sequencing data set, with 561 observations (cells) of 55186 variables (over 30 million elements)!
PCA is orders of magnitude faster than t-SNE or UMAP (not shown). UMAP, though, is a substantial improvement over t-SNE both in terms of memory and time taken to run.