UMAP on some simple datasets

The distill publication How to Use t-SNE Effectively provides some interactive examples of t-SNE on a variety of simple datasets and demonstrates the effect of changing its hyperparameters, mainly the perplexity, which acts like a continuous version of the n_neighbors parameter in UMAP.

Below, I will repeat some results using UMAP instead of t-SNE, which might highlight some differences between the two methods. Sadly, there won’t be any fancy in-browser interactive demos.

The datasets used in the distill publication are available translated to R in the snedata package.

UMAP results are all run with the n_neighbors parameter set to match the perplexity used in t-SNE, with all other parameters left at their default values. There are two minor exceptions, both arising because in UMAP a point is a neighbor of itself, which isn’t the case with perplexity. This doesn’t make much difference for large perplexities, but for the low perplexity values of 2 and 5, I use n_neighbors = 3 and n_neighbors = 6, respectively.
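As a rough sketch of the setup described above, the perplexity-to-n_neighbors mapping could be written as a small helper. The function name `perplexity_to_n_neighbors` and the particular list of perplexities are illustrative assumptions, not part of uwot or snedata. The embedding step only runs if both packages are installed.

```r
# Hypothetical helper (not part of uwot): map a t-SNE perplexity to the
# n_neighbors value used in this article.
perplexity_to_n_neighbors <- function(perplexity) {
  # UMAP counts a point as its own neighbor, so the two low perplexities
  # (2 and 5) are bumped by one; larger values are used unchanged.
  ifelse(perplexity %in% c(2, 5), perplexity + 1, perplexity)
}

perplexities <- c(2, 5, 30, 50, 100) # illustrative values only
n_neighbors <- perplexity_to_n_neighbors(perplexities)

# Run UMAP once per n_neighbors value, leaving everything else at defaults.
if (requireNamespace("uwot", quietly = TRUE) &&
    requireNamespace("snedata", quietly = TRUE)) {
  grid2d <- snedata::grid_data(n = 20)
  embeddings <- lapply(n_neighbors, function(k) {
    uwot::umap(grid2d, n_neighbors = k)
  })
}
```

Each element of `embeddings` would then be a two-column matrix of coordinates that can be plotted with the colors stored in the dataset.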

I won’t repeat the t-SNE results here. You should refer back to the distill page to compare. I have kept the same order of datasets and used the same colors, so it should be straightforward to keep track of things.

Grid

A 2D grid with regularly spaced points.

grid2d <- snedata::grid_data(n = 20)

Like t-SNE, UMAP tends to expand denser regions of data, so there is a bigger gap between points in the middle of the grid.

2 Clusters

Two 2D Gaussian clusters of equal variance, and 50 points each.

gauss2d <- snedata::two_clusters_data(n = 50, dim = 2)

Setting n_neighbors too low clearly gives results which are too local.

Cluster Densities

In this example, one of the clusters (the yellow one) is much denser (and hence smaller) than the other.

gauss2d_scale <- snedata::two_different_clusters_data(n = 75, scale = 10, dim = 2)

Again like t-SNE, UMAP does not reproduce the relative cluster densities.

Cluster Size

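# Two 50D Gaussian clusters: blue at the origin, orange shifted by 10 in every dimension.
x100a <- snedata::gaussian_data(n = 100, dim = 50, color = "blue")
x100b <- snedata::gaussian_data(n = 100, dim = 50, color = "orange")
x100b[, 1:50] <- x100b[, 1:50] + 10
x1000b <- snedata::gaussian_data(n = 1000, dim = 50, color = "orange")
x1000b[, 1:50] <- x1000b[, 1:50] + 10
# Equal sizes (100 + 100) versus unequal sizes (100 + 1000).
x200 <- rbind(x100a, x100b)
x1100 <- rbind(x100a, x1000b)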

As an aside, what about two clusters with the same density but different numbers of points? Below is an example where the two clusters are the same size (100 points each), followed by one where the orange cluster contains 1000 points:

From this example you can see that UMAP will display clusters with more members as being larger. This can have implications for the visualization if you have a minority class that you are most interested in.

Distances Between Clusters

In this example, we are back to Gaussians with the same variances, but now one of them (the green one) is much further away than the other two.

gauss_3clusters <- snedata::three_clusters_data(n = 50, dim = 2)

There’s not really any value of n_neighbors where the correct relative distances are reproduced. On the other hand, at least we don’t see any strange distortion of the size of the green cluster at high values of n_neighbors, whereas the t-SNE results start showing distortions at high perplexity.

We then repeat with a larger number of points in each cluster:

gauss_3clusters200 <- snedata::three_clusters_data(n = 200, dim = 2)

Results are very consistent with a sensible value of n_neighbors, but it’s clear that UMAP does not reproduce relative distances in this case.

Random Noise

A single high-dimensional Gaussian:

gauss100d <- snedata::gaussian_data(n = 500, dim = 100, color = "#003399")

Again, we see t-SNE-like behavior: the density of points in the projection is more even than the linear projection provided by PCA. It’s also clear that low values of n_neighbors could mislead you into seeing large numbers of small clusters that aren’t really there.
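For readers who want to reproduce the comparison with the linear projection, here is a minimal sketch. It uses a base R stand-in for the dataset (assuming `snedata::gaussian_data` draws unit-variance normal data, which I have not checked) and `stats::prcomp` for the PCA. The UMAP step only runs if uwot is installed, and the choice of n_neighbors = 30 is arbitrary.

```r
set.seed(1)

# Stand-in for the 500-point, 100-dimensional Gaussian (an assumption:
# independent standard normal coordinates).
gauss <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)

# Linear projection onto the first two principal components.
pca2d <- stats::prcomp(gauss, rank. = 2)$x

# UMAP embedding of the same data, for a side-by-side plot.
if (requireNamespace("uwot", quietly = TRUE)) {
  umap2d <- uwot::umap(gauss, n_neighbors = 30)
}
```

Plotting `pca2d` and `umap2d` next to each other should make the difference in point density visible.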

Elongated Shapes

An ellipsoidal cluster:

gauss_long <- snedata::long_gaussian_data(n = 100, dim = 50, color = "#003399")

Once again, UMAP behaves pretty well here as long as n_neighbors is sufficiently high.

Now, here are two ellipsoidal clusters:

gauss_2long <- snedata::long_cluster_data(n = 75)

The density distortion effect is also apparent here, causing the clusters to curve.

Topology

Containment

In this dataset, there are two 50D Gaussian clusters centered on the same location, but as the PCA plot on the top left row shows, the blue cluster has a much smaller variance and so is “contained” inside the yellow cluster.

subset50d <- snedata::subset_clusters_data(n = 75, dim = 50)

At last, a difference from the t-SNE results. With t-SNE, the containment relationship can be displayed with a suitable choice of perplexity, at the cost of the yellow cluster gaining a more ring-like shape. UMAP, however, stubbornly refuses to show anything of the sort, with the blue cluster expanded to overlap the yellow cluster even at the higher values of n_neighbors.

Linked Rings

2D linked rings, embedded in 3D (one is at right angles to the other).

linked_rings <- snedata::link_data(n = 100)

The t-SNE results show the rings being separate at low perplexity, and only linked at high perplexities. The UMAP results are always unlinked except at a very low value of n_neighbors, and even then this seems to be an artifact of the number of epochs and the random seed. If you set n_epochs higher, then the rings will invariably be unlinked.
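To try the effect of n_epochs yourself, something like the following sketch may help. The `make_linked_rings` function is a hypothetical base R stand-in for `snedata::link_data`, not its actual implementation, and the values n_neighbors = 3 and n_epochs = 2000 are only examples. The UMAP calls are skipped if uwot is not installed.

```r
# Hypothetical stand-in for snedata::link_data(): two unit circles in 3D,
# linked, with the second ring at right angles to the first.
make_linked_rings <- function(n = 100) {
  theta <- seq(0, 2 * pi, length.out = n + 1)[-1]
  ring1 <- cbind(x = cos(theta), y = sin(theta), z = 0)
  ring2 <- cbind(x = 1 + cos(theta), y = 0, z = sin(theta))
  rbind(ring1, ring2)
}

rings <- make_linked_rings(100)

if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)
  # Default number of epochs versus a much longer optimization.
  short_run <- uwot::umap(rings, n_neighbors = 3)
  long_run <- uwot::umap(rings, n_neighbors = 3, n_epochs = 2000)
}
```

Repeating both runs with a few different seeds is a simple way to check how often the rings end up linked in each case.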

Trefoil Knot

trefoil <- snedata::trefoil_data(n = 150)

Results here are quite similar to the t-SNE results. At low values of n_neighbors, the knot is unfolded into a circle, and at higher values, the folded form appears. As with the linked rings, the results you get at low values of n_neighbors are more consistent at higher values of n_epochs.

What are we to make of all this? Mainly, that the UMAP results are a bit more consistent than those of t-SNE, in the sense that changing n_neighbors doesn’t lead to very different results in the way that changing perplexity does for t-SNE, although these differences are mainly restricted to the three-cluster and containment examples. You may see this as an advantage of t-SNE. Personally, I am a bit skeptical that you would see this sort of thing in real-world datasets.

It’s also worth noting that, like t-SNE:

Inter-cluster distances are not reproduced.
Relative cluster sizes are not reproduced.
Denser regions of space are expanded more than less dense parts.