January 1 2020: I used to say that “the UMAP paper does not go into much implementation detail.” That was true of the first version of the paper, but the second version adds an appendix that covers most of the material here. This is no coincidence, as I was added as a co-author, and this was my main contribution.
If you are coming to UMAP from t-SNE and don’t know much about topology or fuzzy sets (and I certainly don’t), you may find yourself hankering for some insight into how UMAP achieves its results and what connection there is with t-SNE and related methods.
Here are some details I have taken from scouring the Python source code and from asking UMAP creator Leland McInnes. In what follows, I assume that you are already familiar with how t-SNE works.
Broadly, the UMAP implementation uses a similar approach to LargeVis. LargeVis in turn uses concepts from t-SNE. t-SNE itself is a modification of the original Stochastic Neighbor Embedding (SNE) method.
Different papers use different symbols and nomenclature, so to make sure we’re all on the same page, I will restate the relevant definitions from SNE and t-SNE before we take a look at LargeVis and UMAP.
(Asymmetric) SNE
Given $N$ observations of some high dimensional data, for any pair, $\mathbf{x_i}$ and $\mathbf{x_j}$, SNE defines the similarity (aka an affinity or weight) between them, $v_{j|i}$, using a Gaussian kernel function:

$$v_{j|i} = \exp\left(-r_{ij}^2 / 2\sigma_i^2\right)$$
where $r_{ij}$ is the distance between $\mathbf{x_i}$ and $\mathbf{x_j}$ and $\sigma_i$ must be determined by some method (we’ll get back to that). The notation of $v_{j|i}$ rather than $v_{ij}$ is to indicate that this quantity is not symmetric, i.e. $v_{j|i} \neq v_{i|j}$. I’ve borrowed this notation from the conditional versus joint probability definitions used in symmetric SNE (see below) but we’ll also need it for quantities other than probabilities. The notation $r_{ij}$ indicates that the distances are symmetric ($r_{ij} = r_{ji}$) and this convention will be used for other symmetric values.
The weights are normalized to form probability distributions:

$$p_{j|i} = \frac{v_{j|i}}{\sum_k v_{k|i}}$$
$\sigma_i$ is chosen by finding the value that results in the probability distribution $p_{\cdot|i}$ having a specific perplexity. The perplexity has to be chosen by the user, but is interpreted as being a continuous version of the number of nearest neighbors, and is generally chosen to take values between 5 and 50.
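To make the calibration concrete, here is a rough numpy sketch of the usual bisection over the kernel precision. The perplexity of a distribution is two to the power of its Shannon entropy, which is why the search targets $\log_2$ of the desired perplexity. The function name and tolerances are mine, not those of any particular t-SNE implementation:

```python
import numpy as np

def calibrate_point(dist_sq, target_perplexity, n_iter=100, tol=1e-5):
    """Find the conditional probabilities p_{j|i} for one point by bisection.

    dist_sq: squared distances from point i to every other point.
    Returns probabilities whose perplexity is close to target_perplexity.
    """
    # work with the precision beta = 1 / (2 * sigma^2)
    lo, hi, beta = 0.0, np.inf, 1.0
    target_entropy = np.log2(target_perplexity)
    p = None
    for _ in range(n_iter):
        w = np.exp(-beta * dist_sq)      # Gaussian weights v_{j|i}
        p = w / w.sum()                  # normalize to probabilities
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            # distribution too flat: raise beta (i.e. shrink sigma)
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:
            hi = beta
            beta = (lo + hi) / 2.0
    return p
```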
$p_{j|i}$ is a conditional probability, and is interpreted as meaning “the probability that you would pick item $j$ as being similar to item $i$, given that you’ve already picked $i$”.
In the output space of the embedded coordinates, the similarity between the points $\mathbf{y_i}$ and $\mathbf{y_j}$ is also defined as a Gaussian:

$$w_{ij} = \exp\left(-d_{ij}^2\right)$$
where $d_{ij}$ is the Euclidean distance between $\mathbf{y_i}$ and $\mathbf{y_j}$. There is no $\sigma_i$ in this weight definition so these weights are symmetric. The output probabilities, $q_{j|i}$, are calculated from the $w_{ij}$ in the same way that we go from $v_{j|i}$ to $p_{j|i}$, again creating $N$ probability distributions. Due to normalizing by rows, the $q_{j|i}$ are asymmetric despite the symmetric weights they are generated from.
The SNE cost function is the sum of the Kullback-Leibler divergences of the $N$ distributions:

$$C_{SNE} = \sum_i KL\left(P_i \| Q_i\right) = \sum_{ij} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
In all of the above (and in what follows), weights and probabilities when $i = j$ are not defined. I don’t want to clutter the notation further, so assume they are excluded from any sums.
Symmetric SNE
In Symmetric SNE, the input probability matrix is symmetrized by averaging $p_{j|i}$ and $p_{i|j}$ and then re-normalized over all pairs of points, to create a single (joint) probability distribution, $p_{ij}$:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
The output probabilities, $q_{ij}$, are now defined by normalizing the output weights over all pairs, again creating a single probability distribution:

$$q_{ij} = \frac{w_{ij}}{\sum_{kl} w_{kl}}$$
The cost function for SSNE is then:

$$C_{SSNE} = KL\left(P \| Q\right) = \sum_{ij} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
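In code, the SSNE quantities are only a few lines of numpy. This is a sketch of my own (it assumes a dense matrix of pre-calibrated conditional probabilities, which is not how a real implementation would store them):

```python
import numpy as np

def ssne_cost(P_cond, Y):
    """KL divergence cost of symmetric SNE.

    P_cond: N x N matrix of p_{j|i} (rows sum to 1, zero diagonal).
    Y: N x d matrix of output coordinates.
    """
    N = P_cond.shape[0]
    # p_ij = (p_{j|i} + p_{i|j}) / 2N: a single joint distribution
    P = (P_cond + P_cond.T) / (2.0 * N)
    # Gaussian output weights w_ij = exp(-d_ij^2)
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)
    # q_ij: normalized over all pairs
    Q = W / W.sum()
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```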
t-SNE
For the purposes of this discussion, t-SNE only differs from symmetric SNE by its weight function:

$$w_{ij} = \frac{1}{1 + d_{ij}^2}$$
SNE Optimization
Optimization in t-SNE proceeds by:
- Calibrate the input probabilities, $p_{ij}$, according to the desired perplexity (this only needs to be done once).
- Iteratively:
  - Calculate all pairwise output distances, $d_{ij}$, from the $\mathbf{y_i}$ and $\mathbf{y_j}$.
  - Calculate the weights, $w_{ij}$.
  - Calculate the output probabilities, $q_{ij}$.
  - Use gradient descent to update the $\mathbf{y_i}$ and $\mathbf{y_j}$.
Each iteration is therefore $O(N^2)$, fundamentally because of the need to calculate pairwise distances: the t-SNE gradient requires the $q_{ij}$ to be calculated, and the normalization step that converts $w_{ij}$ to $q_{ij}$ requires all of the $w_{ij}$ to be calculated, so you also need all the distances.
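To see where the $O(N^2)$ cost comes from, here is a naive numpy version of one exact gradient evaluation. This is a sketch for illustration only (the symmetrized $P$ is assumed to be precomputed):

```python
import numpy as np

def tsne_gradient(P, Y):
    """Exact t-SNE gradient: O(N^2) in both time and memory.

    P: N x N symmetrized input probabilities (fixed across iterations).
    Y: N x d output coordinates.
    """
    # every iteration needs ALL pairwise output distances...
    diff = Y[:, None, :] - Y[None, :, :]           # N x N x d
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))    # Student-t weights w_ij
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                # ...because of this normalization
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
    return 4.0 * np.einsum('ij,ijk->ik', (P - Q) * W, diff)
```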
Approaches like Barnes-Hut t-SNE and others (e.g. FIt-SNE) attempt to improve on this by taking advantage of the structure of the t-SNE gradient:
- The attractive part of the gradient depends on $p_{ij}$, which is constant and only large for neighbors that are close in the input space. Therefore it’s only necessary to calculate the attractive gradient for the nearest neighbors of $\mathbf{x_i}$. In Barnes-Hut t-SNE, the number of nearest neighbors used is three times whatever the perplexity is. For larger datasets, a perplexity of 50 is common, so usually you are looking for the 150 nearest neighbors of each point.
- The repulsive part of the gradient is dependent on $q_{ij}$, which changes with each iteration, so the improvements here focus on grouping together points which are distant in the output space and treating them as a single point for the purposes of the gradient calculation.
As you will be able to tell from perusing the publications linked to above, these approaches are increasing in sophistication and complexity.
LargeVis
LargeVis takes a different approach: it re-uses a lot of the same definitions as t-SNE, but makes sufficient modifications so that it’s possible to use stochastic gradient descent.
Also, rather than talk about probabilities, LargeVis uses the language of graph theory. Each observation in our dataset is now considered to be a vertex or node and the similarity between them is the weight of the edge between the two vertices. Conceptually we’re still talking about elements in a matrix, but I will start slipping into the language of “edges” and “vertices”.
The key change is the cost function, which is now a log-likelihood to be maximized:

$$L_{LV} = \sum_{\left(i, j\right) \in E} p_{ij} \log w_{ij} + \gamma \sum_{\left(i, j\right) \notin E} \log\left(1 - w_{ij}\right)$$
and $w_{ij}$ is the same as in t-SNE (the authors try some alternative definitions, but they aren’t as effective).
The new concepts here are $E$ and $\gamma$. $\gamma$ is a user-defined positive scalar to weight repulsive versus attractive forces. Its default in the reference implementation is 7.
$E$ is the set of edges with a non-zero weight. This is the graph theory way to talk about nearest neighbors in the input space. Just as with Barnes-Hut t-SNE, we find a set of nearest neighbors for each point and only define input weights and probabilities for pairs of points which are nearest neighbors. As with the official Barnes-Hut t-SNE implementation, the LargeVis reference implementation uses a default perplexity of 50, and the default number of nearest neighbors is 3 times the perplexity.
This cost function therefore consists of two disjoint contributions: nearest neighbors in the input space contribute to the attractive part of the cost function (the first part). Everything else contributes to the second, repulsive part.
The key advantage of this cost function over the KL divergence is that it doesn’t contain $q_{ij}$. With no output normalization, we don’t need to calculate all the output pairwise distances. So this cost function is amenable to stochastic gradient descent techniques.
The LargeVis sampling strategy
To calculate a stochastic gradient, LargeVis does the following:
- Samples an edge in $E$, i.e. chooses $i$ and $j$ such that $p_{ij} > 0$. This is called a “positive edge” in the LargeVis paper. $\mathbf{y_i}$ and $\mathbf{y_j}$ are used to calculate the attractive part of the gradient.
- Samples one of the $N$ vertices, let’s call it $k$. As datasets grow larger, the probability that $\left(i, k\right)$ is in $E$ grows smaller, but that’s not actually checked for (only that $k \neq i$). This is used to calculate the repulsive part of the gradient. These are called the “negative samples” in the LargeVis paper.
- Repeat the negative sample step a number of times. The default in LargeVis is to sample 5 negatives for each positive edge.
The coordinates $\mathbf{y_i}$, $\mathbf{y_j}$ and the various $\mathbf{y_k}$ are then updated according to the gradients. This concludes one iteration of the SGD.
The attractive and repulsive gradients for LargeVis are respectively:
$$ \frac{\partial L_{LV}}{\partial \mathbf{y_i}}^+ = \frac{-2}{1 + d_{ij}^2}p_{ij} \left(\mathbf{y_i - y_j}\right) \\ \frac{\partial L_{LV}}{\partial \mathbf{y_i}}^- = \frac{2\gamma}{\left(0.1 + d_{ij}^2\right)\left(1 + d_{ij}^2\right)} \left(\mathbf{y_i - y_j}\right) $$ The value of 0.1 that appears in the repulsive gradient is there to prevent division by zero.
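In code, those two expressions are just the following (a numpy sketch of my own; the function names are mine):

```python
import numpy as np

def lv_grad_attractive(y_i, y_j, p_ij):
    """Attractive LargeVis gradient with respect to y_i."""
    diff = y_i - y_j
    d2 = diff @ diff
    return (-2.0 / (1.0 + d2)) * p_ij * diff

def lv_grad_repulsive(y_i, y_k, gamma=7.0, eps=0.1):
    """Repulsive LargeVis gradient with respect to y_i (eps = 0.1 avoids division by zero)."""
    diff = y_i - y_k
    d2 = diff @ diff
    return (2.0 * gamma / ((eps + d2) * (1.0 + d2))) * diff
```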
Sampling of edges and vertices is not uniform. For the attractive gradient, the authors note that the factor of $p_{ij}$ that appears means that the magnitude of the gradient can differ hugely between samples, to the extent that choosing an appropriate learning rate can be difficult. Instead they sample the edges proportionally to $p_{ij}$ and then, for the gradient calculation, treat each edge as if the weights were all equal. The attractive gradient used as part of the LargeVis SGD is therefore:

$$\frac{\partial L_{LV}}{\partial \mathbf{y_i}}^+ = \frac{-2}{1 + d_{ij}^2} \left(\mathbf{y_i - y_j}\right)$$
As $p_{ij}$ doesn’t appear in the repulsive part of the gradient, it would seem that uniform sampling would work for the negative sampling. However, vertices are sampled using a “noisy” distribution proportional to their degree ^ 0.75, where the degree of a vertex is the sum of the weights of the edges incident to it. There doesn’t seem to be a theoretical reason to use the degree ^ 0.75: it’s based on results from the field of word embeddings. The LargeVis authors reference a skip-gram paper, but the same power also shows up in GloVe. In both cases it is justified purely empirically. The uwot version of LargeVis (lvish) samples the negative vertices uniformly, and it doesn’t seem to cause any problems.
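Putting the sampling details together, one epoch of LargeVis-style SGD looks roughly like the sketch below. This is my own illustration, not the reference implementation (which, as far as I know, uses alias tables for the sampling and asynchronous updates across threads). The updates follow the sign of the gradients above, since the log-likelihood is being maximized:

```python
import numpy as np

rng = np.random.default_rng(42)

def largevis_epoch(Y, edges, edge_weights, lr=1.0, gamma=7.0, n_negative=5):
    """One epoch of LargeVis-style SGD (illustrative sketch only).

    Y: N x d embedding, updated in place.
    edges: sequence of (i, j) pairs with non-zero input weight.
    edge_weights: the corresponding affinities (any common scaling is fine,
    because they are only used as sampling probabilities).
    """
    n = Y.shape[0]
    # positive edges are sampled proportionally to their weight...
    edge_probs = edge_weights / edge_weights.sum()
    # ...and negative vertices proportionally to degree ** 0.75
    degree = np.zeros(n)
    for (i, j), w in zip(edges, edge_weights):
        degree[i] += w
        degree[j] += w
    neg_probs = degree ** 0.75
    neg_probs /= neg_probs.sum()

    for _ in range(len(edges)):
        i, j = edges[rng.choice(len(edges), p=edge_probs)]
        # attractive update: p_ij is dropped from the gradient because it
        # was already accounted for by the weighted edge sampling
        diff = Y[i] - Y[j]
        grad = (-2.0 / (1.0 + diff @ diff)) * diff
        Y[i] += lr * grad
        Y[j] -= lr * grad
        # repulsive updates from the "noisy" negative samples
        for _ in range(n_negative):
            k = rng.choice(n, p=neg_probs)
            if k == i or k == j:
                continue
            diff = Y[i] - Y[k]
            d2 = diff @ diff
            Y[i] += lr * (2.0 * gamma / ((0.1 + d2) * (1.0 + d2))) * diff
```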
UMAP (at last)
The UMAP cost function is the cross-entropy of two fuzzy sets, which can be represented as symmetric weight matrices:

$$C_{UMAP} = \sum_{ij} \left[ v_{ij} \log\left(\frac{v_{ij}}{w_{ij}}\right) + \left(1 - v_{ij}\right) \log\left(\frac{1 - v_{ij}}{1 - w_{ij}}\right) \right]$$
The $v_{ij}$ are symmetrized input affinities, and are not probabilities. The graph interpretation of them as weights of edges in a graph still applies, though. These are arrived at differently to t-SNE and LargeVis. The unsymmetrized UMAP input weights are given by:

$$v_{j|i} = \exp\left[-\left(r_{ij} - \rho_i\right) / \sigma_i\right]$$
where the $r_{ij}$ are the input distances, $\rho_i$ is the distance to the nearest neighbor (ignoring zero distances where neighbors are duplicates) and $\sigma_i$ is analogous to the $\sigma_i$ in the perplexity calibration used in SNE. In this case, $\sigma_i$ is determined such that $\sum_j v_{j|i} = \log_2 k$, where $k$ is the number of nearest neighbors.
January 1 2020: I assume there is a connection here with the local scaling advocated for self-tuning spectral clustering, given that spectral decomposition of the affinity graph is the default initialization method for UMAP.
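The search for $\sigma_i$ can be done by bisection, much like the perplexity calibration. Below is a simplified sketch of my own; the routine that does this in the UMAP source has considerably more edge-case handling:

```python
import numpy as np

def calibrate_umap_point(knn_dists, n_iter=64, tol=1e-5):
    """Find rho_i and sigma_i for one point from its k nearest neighbor distances.

    knn_dists: array of the k smallest distances r_ij, sorted ascending.
    Returns (rho, sigma) with sum_j exp(-(r_ij - rho) / sigma) ~= log2(k).
    """
    k = len(knn_dists)
    target = np.log2(k)
    nonzero = knn_dists[knn_dists > 0.0]
    rho = nonzero[0] if len(nonzero) > 0 else 0.0
    lo, hi, sigma = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        # neighbors closer than rho contribute a weight of 1
        val = np.sum(np.exp(-np.maximum(knn_dists - rho, 0.0) / sigma))
        if abs(val - target) < tol:
            break
        if val > target:
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:
            lo = sigma
            sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return rho, sigma
```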
These weights are symmetrized by a slightly different method to SNE:

$$v_{ij} = v_{j|i} + v_{i|j} - v_{j|i} v_{i|j}$$
or as a matrix operation:

$$V_{sym} = V + V^T - V \circ V^T$$
where $V$ is the matrix of the $v_{j|i}$, $^T$ indicates the transpose, and $\circ$ is the Hadamard (i.e. entry-wise) product. This effectively carries out a fuzzy set union.
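As a matrix operation this is a one-liner (shown here on a dense numpy array; the UMAP source does the equivalent on sparse matrices):

```python
import numpy as np

def fuzzy_set_union(V):
    """Fuzzy set union symmetrization of directed weights V[i, j] = v_{j|i}."""
    return V + V.T - V * V.T  # elementwise (Hadamard) product
```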
The output weights are given by:

$$w_{ij} = \frac{1}{1 + a d_{ij}^{2b}}$$
where $a$ and $b$ are determined by a non-linear least squares fit based on the min_dist and spread parameters that control the tightness of the squashing function. By setting $a = 1$ and $b = 1$ you get t-SNE weighting back. The current UMAP defaults result in $a = 1.929$ and $b = 0.7915$. April 7 2019: Actually, I got this wrong. The UMAP defaults use min_dist = 0.1, spread = 1, which results in $a = 1.577$ and $b = 0.8951$. If you use min_dist = 0.001, spread = 1, then you get the $a = 1.929$, $b = 0.7915$ result quoted above.
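As far as I can tell, the fit works by generating a target curve that is exactly 1 out to min_dist and then decays exponentially with the given spread, and then fitting $a$ and $b$ to it by least squares. Something along these lines (a sketch using scipy, not the actual UMAP routine):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(spread=1.0, min_dist=0.1):
    """Fit a and b in the output weight 1 / (1 + a * d^(2b))."""
    def weight(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))

    d = np.linspace(0.0, 3.0 * spread, 300)
    # target: flat at 1 up to min_dist, then exponential decay
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(weight, d, target)
    return a, b

# fit_ab(1.0, 0.1)    -> roughly a = 1.58, b = 0.90
# fit_ab(1.0, 0.001)  -> roughly a = 1.93, b = 0.79
```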
The attractive and repulsive UMAP gradient expressions are, respectively:
$$ \frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^+ = \frac{-2abd_{ij}^{2\left(b - 1\right)}}{1 + ad_{ij}^{2b}} v_{ij} \left(\mathbf{y_i - y_j}\right) \\ \frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^- = \frac{2b}{\left(0.001 + d_{ij}^2\right)\left(1 + ad_{ij}^{2b}\right)}\left(1 - v_{ij}\right)\left(\mathbf{y_i - y_j}\right) $$
(April 13 2020: In previous versions of this document I had completely messed up this expression by omitting a factor of 2 in the repulsive gradient equation and missed out some important $a$s and $b$s. This also affected the SGD version two equations below. Thank you to Dmitry Kobak for spotting this.)
While more complex-looking than the LargeVis gradient, there are obvious similarities, which become clearer if you set $a = 1$ and $b = 1$ to get back the t-SNE/LargeVis output weight function. The 0.001 term in the denominator of the repulsive gradient plays the same role as the 0.1 in the LargeVis gradient (preventing division by zero).
UMAP uses the same sampling strategy as LargeVis, where sampling of positive edges is proportional to the weight of the edge (in this case $v_{ij}$), and then the value of the gradient is calculated by assuming that $v_{ij} = 1$ for all edges. So for SGD purposes, the attractive gradient for UMAP is:

$$\frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^+ = \frac{-2abd_{ij}^{2\left(b - 1\right)}}{1 + ad_{ij}^{2b}} \left(\mathbf{y_i - y_j}\right) = \frac{-2abd_{ij}^{2b}}{d_{ij}^{2}\left(1 + ad_{ij}^{2b}\right)} \left(\mathbf{y_i - y_j}\right)$$
The final expression might be more computationally convenient because it saves on an extra power calculation.
The repulsive part of the gradient contains a $\left(1 - v_{ij}\right)$ term, but because $v_{ij} = 0$ for most pairs of points, that term effectively disappears, leaving:

$$\frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^- = \frac{2b}{\left(0.001 + d_{ij}^2\right)\left(1 + ad_{ij}^{2b}\right)} \left(\mathbf{y_i - y_j}\right)$$
Unlike LargeVis, negative sampling in UMAP uses a uniform distribution.
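For one sampled positive edge, the UMAP update is therefore along these lines. Again this is a sketch of my own: the real implementation also clips the gradient coefficients and decays the learning rate over the course of the optimization, which I ignore here.

```python
import numpy as np

rng = np.random.default_rng(42)

def umap_sgd_step(Y, i, j, a, b, lr=1.0, n_negative=5):
    """Update the embedding Y for one sampled positive edge (i, j).

    Uses the simplified gradients above: v_ij is treated as 1 for the
    sampled edge and 0 for the negative samples.
    """
    n = Y.shape[0]
    # attractive update along the positive edge
    diff = Y[i] - Y[j]
    d2 = max(diff @ diff, 1e-12)  # guard against coincident points
    coeff = -2.0 * a * b * d2 ** (b - 1.0) / (1.0 + a * d2 ** b)
    Y[i] += lr * coeff * diff
    Y[j] -= lr * coeff * diff
    # repulsive updates from uniformly sampled negative vertices
    for _ in range(n_negative):
        k = rng.integers(n)
        if k == i:
            continue
        diff = Y[i] - Y[k]
        d2 = diff @ diff
        coeff = 2.0 * b / ((0.001 + d2) * (1.0 + a * d2 ** b))
        Y[i] += lr * coeff * diff
```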
It’s worth considering that, although LargeVis uses $p_{ij}$ and UMAP uses $v_{ij}$ in their cost functions, the difference isn’t that important, because sampling proportionally to $p_{ij}$ is exactly the same as sampling proportionally to $v_{ij}$. In fact, if you look at the LargeVis reference implementation (or lvish in uwot), the input affinities are symmetrized, but not normalized into probabilities. Nonetheless, because the affinities are only used to construct the sampling probabilities, and the strength of the repulsions is set directly by the $\gamma$ parameter in the repulsive part of the gradient, you are effectively using $p_{ij}$ in the LargeVis cost function that is being optimized, not $v_{ij}$.
Minor UMAP Variations
There are some extra parameters in UMAP that make some minor changes if modified from their default values:
- local_connectivity affects $\rho_i$ by using the distance to the local_connectivity-th non-zero near neighbor (or by interpolating between distances if local_connectivity is non-integral).
- set_op_mix_ratio changes the form of the symmetrization from fuzzy set union to fuzzy set intersection, which is just $v_{ij} = v_{j|i} v_{i|j}$, and can also blend between the two (a sketch of the blend is given after this list).
- gamma works exactly like the $\gamma$ value in LargeVis, up-weighting the repulsive contribution of the gradient.
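Here is a sketch of how the set_op_mix_ratio blend between union and intersection can be written (a dense numpy illustration of my own, not the sparse version used in the UMAP source):

```python
import numpy as np

def symmetrize(V, set_op_mix_ratio=1.0):
    """Blend fuzzy set union (ratio = 1) and intersection (ratio = 0).

    V: N x N matrix of directed weights v_{j|i}.
    """
    union = V + V.T - V * V.T
    intersection = V * V.T
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection
```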
2 August 2018: The following parameter no longer appears in the reference UMAP implementation:
- bandwidth affects $v_{j|i}$ by multiplying the value of $\sigma_i$:
  $$v_{j|i} = \exp\left[-\left(r_{ij} - \rho_i\right) / \left(b_w \sigma_i\right)\right]$$
  where bandwidth is represented as $b_w$ (to avoid confusion with the output weight parameter $b$). The value of $\sigma_i$ is determined using calculations of $v_{j|i}$ without $b_w$, before recalculating the $v_{j|i}$ using bandwidth.
I’m not sure how useful any of these are when changed from the defaults.
My thanks to Dmitry Kobak for some very helpful discussions and typo-spotting.