
January 1 2020: I used to say that “the UMAP paper does not go into much implementation detail.” That was true of the first version of the paper, but the second version adds an appendix that covers most of the material here. This is no coincidence, as I was added as a co-author, and this was my main contribution.

If you are coming to UMAP from t-SNE and don’t know much about topology or fuzzy sets (and I certainly don’t), you may find yourself hankering for some insight into how UMAP achieves its results and what connection there is with t-SNE and related methods.

Here are some details I have taken from scouring the Python source code and from asking UMAP creator Leland McInnes. In what follows, I assume that you are already familiar with how t-SNE works.

Broadly, the UMAP implementation uses a similar approach to LargeVis. LargeVis in turn uses concepts from t-SNE. t-SNE itself is a modification of the original Stochastic Neighbor Embedding (SNE) method.

Different papers use different symbols and nomenclature, so to make sure we’re all on the same page, I will restate the relevant definitions from SNE and t-SNE before we take a look at LargeVis and UMAP.

(Asymmetric) SNE

Given $N$ observations of some high dimensional data, for any pair, $\mathbf{x_i}$ and $\mathbf{x_j}$, SNE defines the similarity (aka an affinity or weight) between them, $v_{j|i}$, using a Gaussian kernel function:

$$v_{j|i} = \exp(-\beta_{i} r_{ij}^2)$$

where $r_{ij}$ is the distance between $\mathbf{x_i}$ and $\mathbf{x_j}$ and $\beta_{i}$ must be determined by some method (we’ll get back to that). The notation $v_{j|i}$, rather than $v_{ij}$, indicates that this quantity is not symmetric, i.e. $v_{j|i} \neq v_{i|j}$. I’ve borrowed this notation from the conditional versus joint probability definitions used in symmetric SNE (see below), but we’ll also need it for quantities other than probabilities. The $r_{ij}$ notation indicates that the distances are symmetric, and this convention will be used for other symmetric values.

The weights are normalized to form $N$ probability distributions:

$$p_{j|i} = \frac{v_{j|i}}{\sum_{k}^{N} v_{k|i}}$$

$\beta_{i}$ is chosen by finding the value that results in the probability distribution having a specific perplexity. The perplexity has to be chosen by the user, but it is interpreted as a continuous version of the number of nearest neighbors, and is generally chosen to take values between 5 and 50.

$p_{j|i}$ is a conditional probability, and is interpreted as meaning “the probability that you would pick item $j$ as being similar to item $i$, given that you’ve already picked $i$”.
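
To make the calibration step concrete, here’s a minimal sketch (not the actual t-SNE or UMAP code) of finding $\beta_{i}$ for a single point by binary search on the perplexity; `r2` is assumed to hold the squared distances from point $i$ to the other points:

```python
import numpy as np

def calibrate_beta(r2, target_perplexity=30.0, n_iter=64, tol=1e-5):
    """Binary search for beta_i so that the perplexity of p_{j|i} matches
    the target. r2 holds the squared distances from point i to the other
    points (the i = j term is excluded)."""
    target_entropy = np.log(target_perplexity)  # perplexity = exp(Shannon entropy)
    lo, hi, beta = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        v = np.exp(-beta * r2)                   # unnormalized weights v_{j|i}
        p = v / v.sum()                          # conditional probabilities p_{j|i}
        entropy = -np.sum(p * np.log(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:             # too flat: sharpen by raising beta
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:                                    # too sharp: lower beta
            hi = beta
            beta = (lo + hi) / 2.0
    return beta
```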

In the output space of the embedded coordinates, the similarity between the points $\mathbf{y_i}$ and $\mathbf{y_j}$ is also defined as a Gaussian:

$$w_{ij} = \exp(-d_{ij}^2)$$

where $d_{ij}$ is the Euclidean distance between $\mathbf{y_i}$ and $\mathbf{y_j}$. There is no $\beta$ in this weight definition, so these weights are symmetric. The output probabilities, $q_{j|i}$, are calculated from $w_{ij}$ in the same way that we go from $v_{j|i}$ to $p_{j|i}$, again creating $N$ probability distributions. Due to normalizing by rows, the $q_{j|i}$ are asymmetric despite the symmetric weights they are generated from.

The SNE cost function is the sum of the Kullback-Leibler divergences of the $N$ distributions:

$$C_{SNE} = \sum_{i}^{N} \sum_{j}^{N} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

In all of the above (and in what follows), weights and probabilities when $i = j$ are not defined. I don’t want to clutter the notation further, so assume they are excluded from any sums.

Symmetric SNE

In Symmetric SNE, the input probability matrix is symmetrized by averaging $p_{j|i}$ and $p_{i|j}$ and then re-normalized over all pairs of points, to create a single (joint) probability distribution, $p_{ij}$:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

The output probabilities, $q_{ij}$, are now defined by normalizing the output weights over all pairs, again creating a single probability distribution:

$$q_{ij} = \frac{w_{ij}}{\sum_{k}^{N} \sum_{l}^{N} w_{kl}}$$

The cost function for SSNE is then:

$$C_{SSNE} = \sum_{i}^{N} \sum_{j}^{N} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
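
As an illustration only (a dense sketch, not how any real implementation does it), the symmetrized $p_{ij}$, the $q_{ij}$ and the SSNE cost could be computed like this, assuming `P_cond` holds the conditional probabilities $p_{j|i}$ and `Y` is the embedding:

```python
import numpy as np

def ssne_cost(P_cond, Y, tsne=False):
    """P_cond[i, j] holds p_{j|i} (zero on the diagonal); Y is the N x 2
    embedding. Set tsne=True to use the t-SNE output weight instead."""
    N = Y.shape[0]
    P = (P_cond + P_cond.T) / (2.0 * N)                      # joint p_{ij}

    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)    # all d_{ij}^2
    W = 1.0 / (1.0 + D2) if tsne else np.exp(-D2)            # output weights
    np.fill_diagonal(W, 0.0)                                 # exclude i = j
    Q = W / W.sum()                                          # joint q_{ij}

    mask = P > 0                                             # 0 log 0 -> 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```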

t-SNE

For the purposes of this discussion, t-SNE only differs from symmetric SNE by its weight function:

$$w_{ij} = \frac{1}{1 + d_{ij}^2}$$

SNE Optimization

Optimization in t-SNE proceeds by:

  • Calibrate the input probabilities, $p_{ij}$, according to the desired perplexity (this only needs to be done once).
  • Iteratively:
      • Calculate all pairwise output distances, $d_{ij}$, from the $\mathbf{y_i}$ and $\mathbf{y_j}$.
      • Calculate the weights, $w_{ij}$.
      • Calculate the output probabilities, $q_{ij}$.
      • Use gradient descent to update the $\mathbf{y_i}$ and $\mathbf{y_j}$.

This is fundamentally $O(N^2)$ because of the need to calculate pairwise distances: the t-SNE gradient requires the $q_{ij}$ to be calculated, and the normalization step that converts $w_{ij}$ to $q_{ij}$ requires all of the $w_{ij}$ to be calculated, so you also need all the distances.
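
To make that $O(N^2)$ cost visible, here’s a sketch of one full (non-stochastic) gradient descent iteration, using the standard t-SNE gradient $4 \sum_j \left(p_{ij} - q_{ij}\right) w_{ij} \left(\mathbf{y_i - y_j}\right)$; every step below touches all $N^2$ pairs:

```python
import numpy as np

def tsne_iteration(P, Y, learning_rate=100.0):
    """One naive t-SNE update. P is the joint, symmetric p_{ij} matrix
    (zero diagonal); Y is the N x 2 embedding. learning_rate is arbitrary."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)   # all d_{ij}^2
    W = 1.0 / (1.0 + D2)                                     # all w_{ij}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                          # needs every w_{ij}

    # gradient_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
    G = (P - Q) * W
    grad = 4.0 * ((np.diag(G.sum(axis=1)) - G) @ Y)
    return Y - learning_rate * grad
```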

Approaches like Barnes-Hut t-SNE and others (e.g. FIt-SNE) attempt to improve on this by taking advantage of the t-SNE gradient:

  • The attractive part of the gradient depends on $p_{ij}$, which is constant and only large for neighbors that are close in the input space. Therefore it’s only necessary to calculate the attractive gradient for nearest neighbors of $\mathbf{x_i}$. In Barnes-Hut t-SNE, the number of nearest neighbors used is three times whatever the perplexity is. For larger datasets, a perplexity of 50 is common, so usually you are looking for the 150 nearest neighbors of each point.
  • The repulsive part of the gradient is dependent on $q_{ij}$, which changes with each iteration, so the improvements here focus on grouping together points which are distant in the output space and treating them as a single point for the purposes of the gradient calculation.

As you will be able to tell from perusing the publications linked to above, these approaches are increasing in sophistication and complexity.

LargeVis

LargeVis takes a different approach: it re-uses a lot of the same definitions as t-SNE, but makes sufficient modifications so that it’s possible to use stochastic gradient descent.

Also, rather than talk about probabilities, LargeVis uses the language of graph theory. Each observation in our dataset is now considered to be a vertex or node and the similarity between them is the weight of the edge between the two vertices. Conceptually we’re still talking about elements in a matrix, but I will start slipping into the language of “edges” and “vertices”.

The key change is the cost function, which is now a maximum likelihood function:

$$L_{LV} = \sum_{\left(i, j\right) \in E} p_{ij} \log w_{ij} + \gamma \sum_{\left(i, j\right) \in \bar{E}} \log \left(1 - w_{ij} \right)$$

$p_{ij}$ and $w_{ij}$ are the same as in t-SNE (the authors try some alternative $w_{ij}$ definitions, but they aren’t as effective).

The new concepts here are $\gamma$ and $E$. $\gamma$ is a user-defined positive scalar to weight repulsive versus attractive forces. Its default in the reference implementation is 7.

$E$ is the set of edges with a non-zero weight. This is the graph theory way to talk about nearest neighbors in the input space. Just as with Barnes-Hut t-SNE, we find a set of nearest neighbors for each point $\mathbf{x_i}$ and only define input weights and probabilities for pairs of points which are nearest neighbors. As with the official Barnes-Hut t-SNE implementation, the LargeVis reference implementation uses a default perplexity of 50, and the default number of nearest neighbors is 3 times the perplexity.

This cost function therefore consists of two disjoint contributions: nearest neighbors in the input space contribute to the attractive part of the cost function (the first part). Everything else contributes to the second, repulsive part.

The key advantage of this cost function over the KL divergence is that it doesn’t contain $q_{ij}$. With no output normalization, we don’t need to calculate all the output pairwise distances, so this cost function is amenable to stochastic gradient descent techniques.
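
For intuition, here’s a brute-force sketch of evaluating the LargeVis objective (only feasible for tiny datasets, because the second sum runs over every non-neighbor pair; the whole point of the sampling strategy below is to avoid ever doing this):

```python
import numpy as np

def largevis_objective(pos_edges, Y, gamma=7.0, eps=1e-12):
    """pos_edges: dict mapping (i, j) -> p_ij for the nearest-neighbor edges E
    (store both (i, j) and (j, i) for each undirected edge). Y is the embedding."""
    N = Y.shape[0]
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + D2)                      # t-SNE-style output weights

    attractive = sum(p * np.log(W[i, j] + eps) for (i, j), p in pos_edges.items())
    repulsive = sum(np.log(1.0 - W[i, j] + eps)
                    for i in range(N) for j in range(N)
                    if i != j and (i, j) not in pos_edges)
    return attractive + gamma * repulsive     # log-likelihood to be maximized
```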

The LargeVis sampling strategy

To calculate a stochastic gradient, LargeVis does the following:

  • Samples an edge in $E$, i.e. chooses $i$ and $j$ such that $p_{ij} \neq 0$. This is called a “positive edge” in the LargeVis paper. $i$ and $j$ are used to calculate the attractive part of the gradient.
  • Samples one of the $N$ vertices, let’s call it $k$. As datasets grow larger, the probability that the edge $(i, k)$ is in $E$ grows smaller, but that’s not actually checked for (only that $k \neq i$). This is used to calculate the repulsive part of the gradient. These are called the “negative samples” in the LargeVis paper.
  • Repeat the negative sample step a number of times. The default in LargeVis is to sample 5 negatives for each positive edge.

The coordinates of $i$, $j$ and the various $k$ are then updated according to the gradients. This concludes one iteration of the SGD.

The attractive and repulsive gradients for LargeVis are respectively:

$$\frac{\partial L_{LV}}{\partial \mathbf{y_i}}^+ = \frac{-2}{1 + d_{ij}^2}p_{ij} \left(\mathbf{y_i - y_j}\right) \\ \frac{\partial L_{LV}}{\partial \mathbf{y_i}}^- = \frac{2\gamma}{\left(0.1 + d_{ij}^2\right)\left(1 + d_{ij}^2\right)} \left(\mathbf{y_i - y_j}\right)$$

The value of 0.1 that appears in the repulsive gradient is there to prevent division by zero.

Sampling of edges and vertices is not uniform. For the attractive gradient, the authors note that the factor of $p_{ij}$ that appears means that the magnitude of the gradient can differ hugely between samples, to the extent that choosing an appropriate learning rate can be difficult. Instead they sample the edges proportionally to $p_{ij}$ and then, for the gradient calculation, treat each edge as if the weights were all equal. The attractive gradient used as part of LargeVis SGD is therefore:

$$\frac{\partial L_{LV}}{\partial \mathbf{y_i}}^+ = \frac{-2}{1 + d_{ij}^2} \left(\mathbf{y_i - y_j}\right)$$

As $p_{ij}$ doesn’t appear in the repulsive part of the gradient, it would seem that uniform sampling would work for the negative sampling. However, vertices are sampled using a “noisy” distribution proportional to their degree raised to the power 0.75, where the degree of a vertex is the sum of the weights of the edges incident to it. There doesn’t seem to be a theoretical reason to use the 0.75 exponent. It’s based on results from the field of word embeddings: the LargeVis authors reference a skip-gram paper, but the same power also shows up in GloVe. In both cases it is justified purely empirically. The uwot version of LargeVis (lvish) samples the negative edges uniformly, and it doesn’t seem to cause any problems.
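
Putting the sampling strategy and the gradients together, one LargeVis-style SGD step might be sketched as follows; the edge and vertex samplers are shown as plain `np.random.choice` calls for brevity, whereas the reference implementation uses alias tables:

```python
import numpy as np

def largevis_sgd_step(Y, edges, edge_probs, noise_probs, gamma=7.0,
                      n_negative=5, alpha=1.0, rng=None):
    """edges is a list of (i, j) pairs, edge_probs the corresponding p_ij
    (normalized to sum to 1), noise_probs the per-vertex degree^0.75
    distribution (also normalized). Coordinates are moved along the
    gradient expressions quoted above."""
    if rng is None:
        rng = np.random.default_rng()

    # Positive edge, sampled proportionally to p_ij; p_ij then treated as 1.
    i, j = edges[rng.choice(len(edges), p=edge_probs)]
    diff = Y[i] - Y[j]
    grad = (-2.0 / (1.0 + diff @ diff)) * diff           # attractive: pulls i toward j
    Y[i] += alpha * grad
    Y[j] -= alpha * grad

    # Negative samples from the noisy degree^0.75 distribution;
    # only k != i is checked, not whether (i, k) is a real edge.
    for _ in range(n_negative):
        k = rng.choice(len(noise_probs), p=noise_probs)
        if k == i:
            continue
        diff = Y[i] - Y[k]
        d2 = diff @ diff
        grad = (2.0 * gamma / ((0.1 + d2) * (1.0 + d2))) * diff  # repulsive
        Y[i] += alpha * grad                             # push i away from k
        Y[k] -= alpha * grad
    return Y
```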

UMAP (at last)

The UMAP cost function is the cross-entropy of two fuzzy sets, which can be represented as symmetric weight matrices:

$$C_{UMAP} = \sum_{ij} \left[ v_{ij} \log \left( \frac{v_{ij}}{w_{ij}} \right) + (1 - v_{ij}) \log \left( \frac{1 - v_{ij}}{1 - w_{ij}} \right) \right]$$

The $v_{ij}$ are symmetrized input affinities, and are not probabilities. The interpretation of them as weights of edges in a graph still applies, though. These are arrived at differently to t-SNE and LargeVis. The unsymmetrized UMAP input weights are given by:

$$v_{j|i} = \exp \left[ -\left( r_{ij} - \rho_{i} \right) / \sigma_{i} \right]$$

where $r_{ij}$ are the input distances, $\rho_{i}$ is the distance to the nearest neighbor (ignoring zero distances where neighbors are duplicates) and $\sigma_{i}$ is analogous to $\beta_{i}$ in the perplexity calibration used in SNE. In this case, $\sigma_{i}$ is determined such that $\sum_{j} v_{j|i} = \log_{2} k$, where $k$ is the number of nearest neighbors.

January 1 2020: I assume there is a connection here with the local scaling advocated for self-tuning spectral clustering, given that spectral decomposition of the affinity graph is the default initialization method for UMAP.
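
Here’s a sketch of that calibration for a single point, assuming `r` holds the distances to the $k$ nearest neighbors of point $i$ (distances below $\rho_{i}$ are clamped to give a weight of 1, which is what the UMAP implementation does as far as I can tell):

```python
import numpy as np

def calibrate_sigma(r, n_iter=64, tol=1e-5):
    """Binary search for sigma_i so that sum_j v_{j|i} = log2(k), with
    rho_i the smallest non-zero neighbor distance (duplicates ignored)."""
    k = len(r)
    rho = np.min(r[r > 0.0])
    target = np.log2(k)
    lo, hi, sigma = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        v = np.exp(-np.maximum(r - rho, 0.0) / sigma)  # unsymmetrized weights
        total = v.sum()
        if abs(total - target) < tol:
            break
        if total > target:                  # weights too heavy: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                               # weights too light: grow sigma
            lo = sigma
            sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return sigma
```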

These weights are symmetrized by a slightly different method to SNE:

$$v_{ij} = \left(v_{j|i} + v_{i|j}\right) - v_{j|i}v_{i|j}$$

or as a matrix operation:

$$V_{symm} = V + V^{T} - V \circ V^{T}$$

where $T$ indicates the transpose and $\circ$ is the Hadamard (i.e. entry-wise) product. This effectively carries out a fuzzy set union.
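
In code this is a one-liner; a sketch (the real implementations work on sparse matrices, but the dense version reads the same):

```python
import numpy as np

def fuzzy_set_union(V):
    """V[i, j] holds v_{j|i}; returns the symmetrized v_{ij}."""
    VT = V.T
    return V + VT - V * VT   # '*' is the elementwise (Hadamard) product for arrays
```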

The output weights are given by:

$$w_{ij} = 1 / \left(1 + ad_{ij}^{2b}\right)$$

where $a$ and $b$ are determined by a non-linear least squares fit based on the min_dist and spread parameters that control the tightness of the squashing function. By setting $a = 1$ and $b = 1$ you get the t-SNE weighting back. The current UMAP defaults result in $a = 1.929$ and $b = 0.7915$. April 7 2019: Actually, I got this wrong. The UMAP defaults use min_dist = 0.1, spread = 1, which results in $a = 1.577$ and $b = 0.8951$. If you use min_dist = 0.001, spread = 1, then you get $a = 1.929$ and $b = 0.7915$.
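
A sketch of how such a fit can be done with `scipy.optimize.curve_fit` (this mirrors my reading of the `find_ab_params` helper in the UMAP source, but don’t treat it as a faithful copy):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist=0.1, spread=1.0):
    """Fit a and b so that 1 / (1 + a * d^(2b)) approximates a curve that is
    ~1 below min_dist and decays exponentially (with scale `spread`) above it."""
    def squash(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))

    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(squash, d, target)
    return a, b

# fit_ab(0.1, 1.0) gives roughly a = 1.58, b = 0.90 (cf. the values quoted above)
```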

The attractive and repulsive UMAP gradient expressions are, respectively:

$$ \frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^+ = \frac{-2abd_{ij}^{2\left(b - 1\right)}}{1 + ad_{ij}^{2b}} v_{ij} \left(\mathbf{y_i - y_j}\right) \\ \frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^- = \frac{2b}{\left(0.001 + d_{ij}^2\right)\left(1 + ad_{ij}^{2b}\right)}\left(1 - v_{ij}\right)\left(\mathbf{y_i - y_j}\right) $$

(April 13 2020: In previous versions of this document I had completely messed up this expression by omitting a factor of 2 in the repulsive gradient equation and missed out some important $a$s and $b$s. This also affected the SGD version two equations below. Thank you to Dmitry Kobak for spotting this.)

While more complex-looking than the LargeVis gradient, there are obvious similarities, which become clearer if you set $a = 1$ and $b = 1$ to get back the t-SNE/LargeVis output weight function. The 0.001 term in the denominator of the repulsive gradient plays the same role as the 0.1 in the LargeVis gradient (preventing division by zero).

UMAP uses the same sampling strategy as LargeVis, where sampling of positive edges is proportional to the weight of the edge (in this case $v_{ij}$), and then the value of the gradient is calculated by assuming that $v_{ij} = 1$ for all edges. So for SGD purposes, the attractive gradient for UMAP is:

$$\frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^+ = \frac{-2abd_{ij}^{2\left(b - 1\right)}}{1 + ad_{ij}^{2b}}\left(\mathbf{y_i - y_j}\right) = \frac{-2abd_{ij}^{2b}}{d_{ij}^2 \left(1 + ad_{ij}^{2b}\right)}\left(\mathbf{y_i - y_j}\right)$$

The final expression might be more computationally convenient because it saves on an extra power calculation.

The repulsive part of the gradient contains a $1 - v_{ij}$ term, but because $v_{ij} = 0$ for most pairs of points, that term effectively disappears, leaving:

$$\frac{\partial C_{UMAP}}{\partial \mathbf{y_i}}^- = \frac{2b}{\left(0.001 + d_{ij}^2\right)\left(1 + ad_{ij}^{2b}\right)} \left(\mathbf{y_i - y_j}\right)$$

Unlike LargeVis, negative sampling in UMAP uses a uniform distribution.
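
For completeness, here’s a sketch of one UMAP-style SGD step using the expressions above: positive edges sampled proportionally to $v_{ij}$, uniform negative sampling, and only the head point moved for negative samples (which is how I read the implementation, but treat that detail as an assumption):

```python
import numpy as np

def umap_sgd_step(Y, edges, edge_probs, a, b, n_negative=5, alpha=1.0, rng=None):
    """Coordinates are moved along the gradient expressions quoted above,
    so the positive edge pulls i and j together and each negative sample
    pushes i away. edge_probs is proportional to v_ij, normalized to sum to 1."""
    if rng is None:
        rng = np.random.default_rng()
    N = Y.shape[0]

    # Positive edge sampled proportionally to v_ij; v_ij then treated as 1.
    i, j = edges[rng.choice(len(edges), p=edge_probs)]
    diff = Y[i] - Y[j]
    d2 = diff @ diff
    if d2 > 0.0:                                  # guard d^(2(b-1)) at d = 0
        grad = (-2.0 * a * b * d2 ** (b - 1.0)) / (1.0 + a * d2 ** b) * diff
        Y[i] += alpha * grad
        Y[j] -= alpha * grad

    # Negative samples drawn uniformly; the 1 - v_ij term treated as 1.
    for _ in range(n_negative):
        k = rng.integers(N)
        if k == i:
            continue
        diff = Y[i] - Y[k]
        d2 = diff @ diff
        grad = (2.0 * b) / ((0.001 + d2) * (1.0 + a * d2 ** b)) * diff
        Y[i] += alpha * grad
    return Y
```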

It’s worth considering that, although LargeVis uses $p_{ij}$ and UMAP uses $v_{ij}$ in their cost functions, the difference isn’t that important, because sampling proportionally to $p_{ij}$ is exactly the same as sampling proportionally to $v_{ij}$. In fact, if you look at the LargeVis reference implementation (or lvish in uwot), the input affinities are symmetrized, but not divided by $N$. Nonetheless, because the affinities are only used to construct the sampling probabilities, the presence of the $\gamma$ parameter in the repulsive part of the gradient means that you are effectively using $p_{ij}$ in the LargeVis cost function that is being optimized, not $v_{ij}$.

Minor UMAP Variations

There are some extra parameters in UMAP that make some minor changes if modified from their default values:

  • local_connectivity affects $\rho_{i}$ by using the distance to the local_connectivity-th non-zero near neighbor (or by interpolating between distances if local_connectivity is non-integral).

  • set_op_mix_ratio This changes the form of the symmetrization from fuzzy set union to fuzzy set intersection, which is just $v_{j|i}v_{i|j}$, and can also blend between the two (see the sketch after this list).

  • gamma This works exactly like the value in LargeVis, up-weighting the repulsive contribution of the gradient.
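
As a sketch of what I understand set_op_mix_ratio to do (a linear blend between the fuzzy union and the fuzzy intersection, with 1 giving the pure union used by default; treat the exact blending formula as an assumption):

```python
import numpy as np

def mixed_symmetrization(V, set_op_mix_ratio=1.0):
    """V[i, j] holds v_{j|i}. mix = 1 gives the fuzzy set union (the default),
    mix = 0 the fuzzy set intersection; intermediate values interpolate."""
    VT = V.T
    fuzzy_union = V + VT - V * VT
    fuzzy_intersection = V * VT
    mix = set_op_mix_ratio
    return mix * fuzzy_union + (1.0 - mix) * fuzzy_intersection
```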

2 August 2018: The following parameter no longer appears in the reference UMAP implementation:

  • bandwidth affects $v_{j|i}$ by multiplying the value of $\sigma_{i}$:

$$v_{j|i} = \exp \left[ -\left( r_{ij} - \rho_{i} \right) / \beta \sigma_{i} \right]$$

where bandwidth is represented as $\beta$. The value of $\sigma_{i}$ is determined using calculations of $v_{j|i}$ without $\beta$, before recalculating the $v_{j|i}$ using bandwidth.

I’m not sure how useful any of these are when changed from the defaults.

My thanks to Dmitry Kobak for some very helpful discussions and typo-spotting.