Python Comparison
and some nearest neighbor comparisons
Source:vignettes/articles/pycompare.Rmd
pycompare.Rmd
This page is a gallery of images comparing the output of
uwot
version 0.1.3 to the Python UMAP
package version 0.3.8. In an ideal world, this wouldn’t be a very
exciting document, as it would exist to demonstrate that
uwot
results resemble those from the Python
implementation.
However, there are a couple of examples that highlight some
differences between the implementations. It’s still not that exciting,
but there may be some information of use to anyone using
uwot
and possibly some implications for users of t-SNE
packages. To distinguish between the general technique of UMAP and the
its specific implementation in the Python package of the same name, I’ll
refer to the Python implementation with the formatting
UMAP
.
For details on the datasets, see the UMAP
examples gallery. The uwot
results have been re-run for
the images here using different random seeds, but they should resemble
the uwot
output on that page.
The datasets originated in R, mostly being loaded via the snedata package. I then exported them to CSV, before reading them into Python. UMAP version 0.3.8 was used to generate the embeddings, using all default settings. The embeddings were then written to CSV and read back into R for visualization.
The uwot
output also used default settings with the
following exceptions: pca = 100
was used to reduce the
dimensionality to a maximum of 100 columns (data was mean-centered as
part of this process, but not rescaled), because the Annoy nearest
neighbor search can be very slow for high-dimensional data.
UMAP
’s nearest neighbor
search seems less affected by dimensionality. Also,
min_dist = 0.1
, which is the default for the Python UMAP.
uwot
uses a default of min_dist = 0.01
. Why? I
would love to tell you, but I think I just made a mistake when I set the
uwot
defaults. Therefore, the defaults will probably change
in some later version of uwot
. Fortunately, this doesn’t
make a perceptible practical difference, but I will use the Python
default of min_dist = 0.1
in the following results.
There is also a potential difference in initialization. Both packages
use a spectral initialization if possible, but differ if the input graph
contains more than one component: UMAP
attempts a
meta-embedding of the two separate components, where uwot
abandons a spectral approach and falls back to using PCA (followed by
rescaling the output to a standard deviation of 1 for each axis). This
affects the following datasets: iris
, coil20
,
coil100
, norb
and tasic2018
.
Below the results on the left are for the uwot
output.
The image on the right is from UMAP
. I present most of
these results without comment because they seem to be very similar to
each other. There are two exceptions, for the small NORB dataset
(norb
) and the gene expression dataset
macosko2015
, which will be discussed in their section.
Results
norb
Here’s a result that seems worth commenting on. The uwot
results show slightly less structure, a large blue loop seems to have
been broken up.
I suspect that the PCA preprocessing to 100 dimensions as used by
uwot
is too aggressive here and might be throwing away too
much of the variance in the dataset. Compared to the ground truth 15
nearest neighbors using FNN, Annoy results
with pca = 100
are 82% accurate. Without PCA, this accuracy
goes up to 99% and the uwot
results are much closer to the
UMAP
results. This is the image below, on the left.
Meanwhile, if you use the PCA results as input to UMAP
, the
output now looks a bit more like the uwot
results. That’s
the image on the lower right.
Unfortunately, the very high dimensionality of norb
makes running Annoy on it directly very slow indeed: with a
single-thread, it took 4 and a half hours, versus around 5 minutes in
the PCA case. We will revisit nearest neighbor accuracies later.
Note also that norb
is one of the datasets where the
graph contains two separate components, so the initialization of the
output coordinates is different between uwot
and
UMAP
, with UMAP
’s “meta-embedding” approach
better at retaining local structure.
macosko2015
This result shows the largest deviation of uwot
from
UMAP
. That big cyan cluster is of an obviously different
shape in the two plots. The source of the difference here seems to be
nearest neighbor results. It turns out that both uwot
and
UMAP
have trouble find good approximations to the nearest
neighbors with this dataset, but UMAP
does a lot better. We
will get into more detail about the nearest neighbor accuracies across
the datasets in the next section. For now, below are two further plots.
The left-hand plot is the uwot
results when it uses the
more accurate nearest neighbor data calculated by UMAP
: the
uwot
results now resemble the UMAP
results,
which is reassuring. The right hand plot uses exact nearest neighbor
results calculated via the FNN
.
Even though the UMAP
nearest neighbor data is better
than that which uwot
produced, we can still see an obvious
difference in the two plots, so it’s clear that this data set presents a
challenge for the approximate nearest neighbor methods considered
here.
cifar10
Based on results with this dataset in the UMAP
examples gallery, I wasn’t expecting these results to be a feast for
the eyes. After seeing the difference in results with
macosko2015
, I am just relieved that these two results look
disappointing in a similar way.
Nearest Neighbor Accuracy
The norb
and macosko2015
results suggest
that the use of PCA and the nearest neighbor search methods present the
biggest source of difference between uwot
and
UMAP
. Here are some more details.
For small datasets, which in both implementations means less than
4,096 observations, both uwot
and UMAP
calculate the exact nearest neighbors. For larger datasets,
uwot
uses the Annoy method, and
UMAP
uses pynndescent, which
uses random projection trees, followed by nearest neighbor descent.
For the larger datasets, I used the FNN package to
generate the exact 15 nearest neighbors and then compared results for
uwot
and UMAP
. For UMAP
, I used
default settings. For uwot
, I applied
pca = 100
as that is what I currently use for high
dimensional datasets. The third column shows the amount of variance
explained by only keeping 100 components.
datasets | uwot |
UMAP |
PCA 100 var |
---|---|---|---|
coil100 |
89.1% | 99.7% | 88.4% |
mnist |
86.2% | 97.5% | 91.5% |
fashion |
74.3% | 97.9% | 91.2% |
kuzushiji |
78.3% | 95.2% | 84.6% |
norb |
81.7% | 98.3% | 95.6% |
tasic2018 |
48.0% | 97.7% | 67.7% |
macosko2015 |
11.2% | 41.8% | 37.3% |
cifar10 |
67.4% | 84.1% | 90.1% |
UMAP
does a pretty incredible job with nearly every
dataset, and it’s fast too. And it doesn’t need to use PCA to
reduce the dimensionality. The obvious exception is
macosko2015
. I was unable to find a combination of
parameters with pynndescent
that produced good results,
e.g. by increasing the number of trees in the RP tree initialization or
the number of iterations in the nearest neighbor descent phase. With
n_trees=250
, I was able to get to 54.9% accuracy, but I
started to get out of memory errors much beyond this setting, and
nearest neighbor descent tends to converge after 5-6 iterations anyway.
Doing better will may require more manipulation of the sampling rate and
candidate list size used in nearest neighbor descent, which can quickly
cause a large increase in memory and run time if you aren’t careful.
Also, you can’t fiddle with these parameters in UMAP
version 0.3.8 anyway.
uwot
results… well, they’re less impressive than
UMAP
’s. The tasic2018
results are pretty bad,
and the macosko2015
results can be fairly described as
horrendous. The accuracies show a rough correlation with the percentage
of variance explained by 100 components. Results for
tasic2018
and especially macosko2015
show that
100 components may not be enough to extract the meaningful variation
from the datasets, which might explain the inability to find the true
nearest neighbors. The uwot
defaults for the Annoy search
are somewhat arbitrary, having been copied from LargeVis. That package
used nearest neighbor descent to improve the accuracy, and increasing
Annoy’s n_trees
parameter can give a small improvement to
some of the datasets, but the increase in computation time is probably
not worth the small increase in accuracy I’ve seen when trying this. It
seems that in cases where the current defaults don’t work well, it’s
because too much information is thrown away by the PCA
pre-processing.
tasic2018
and macosko2015
are both
transcriptomics datasets and clearly represent an interesting challenge
to Annoy, while macosko2015
troubles even pynndescent. The
other datasets are image datasets. Possibly the way the data has been
scaled and prepared for these biology datasets makes life harder for
approximate nearest neighbors? It’s hard to know where to start when it
comes to unpicking what causes macosko2015
to perform so
differently, but here is a histogram of the nearest neighbor distances
for macosko2015
and tasic2018
(top row), which
misbehave with Annoy, and on the bottom row are the histograms for
mnist
and fashion
, which are better
behaved:
The biomodal distribution of macosko2015
and
tasic2018
do seem more similar to each other than they do
to mnist
or fashion
. That said,
tasic2018
behaves perfectly well with the pynndescent
method, so I see no obvious answers here.
Conclusions
The good news is that for most datasets, the differences between
uwot
’s UMAP implementation and that of the Python
UMAP
implementation makes very little difference. The bad
news is that there are a couple of cases where we see
differences. The biggest offender is macosko2015
but there
neither method is doing a perfect job. Nonetheless, UMAP
is
able to calculate nearest neighbors more accurately and more quickly
than uwot
.
One source of the difference is the use of Annoy as the nearest neighbor calculation method. It does seem to be solidly slower than pynndescent, but in my experience, it can reach high accuracies. even if it takes a while. March 8 2020: On Windows, there used to be an issue where an Annoy index that was larger than 2GB on disk couldn’t be read back in. Make sure you are using a version of RcppAnnoy 0.0.15 or later to avoid this.
The time to search an Annoy index seems to start scaling quite badly
with dimensionality > 1000, so if you can get away with picking
500-1000 random features and use that, you could try that. With 1,152
random features (i.e. 1/16th the pixels innorb
), visual
results aren’t that much worse with this approach and the
nearest neighbor search runs 100 times faster (50 minutes vs 20
seconds):
Otherwise, if you have a high-dimensional dataset, you should
consider PCA. I don’t have any solid advice for the number of components
to retain if you do use PCA for preprocessing. Less than the usual t-SNE
default of 50 is probably unwise. I use pca = 100
in most
examples, but as we’ve seen, that can still sufficiently perturbs the
pairwise distances in the input data to give some differences in the
output. It would be nice if you could use more components to get a
better trade-off of accuracy and speed, but even with fast partial PCA
routines used by uwot
(courtesy of irlba), extracting
more than 100 components can be quite time-consuming.
What about t-SNE?
uwot
is not the only dimensionality reduction to use PCA
or Annoy for approximate nearest neighbor search. LargeVis uses it, the
t-SNE package Fit-SNE
has Annoy as an option and openTSNE seems to
have at least considered it (see e.g. https://github.com/pavlin-policar/openTSNE/issues/28).
It is also standard practice in t-SNE to reduce dimensionality: Rtsne, a wrapper around the
de-facto standard Barnes Hut t-SNE routine. will only keep 50 components
via PCA by default.
Does the PCA issue effect t-SNE? Well, a little bit. Below are
results for norb
and macosko2015
(openTSNE’s
documentation is where I discovered this dataset, as it happens) with
t-SNE results from Rtsne
, using
perplexity = 15
. The images on the left use PCA to reduce
the initial dimensionality down to 100, while the ones on the right use
the raw input data.
norb
results seem fairly unaffected, but we also see for
macosko2015
that there’s a change to the main large
cluster. Usually, you don’t set perplexity as low as 15, so are there
also differences when the perplexity is set to a more typical value,
like 50? Also, the default PCA dimensionality setting in
Rtsne
is 50, not 100, so here are those t-SNE plots
repeated with perplexity = 50
and either
pca = TRUE, partial_pca = TRUE, initial_dims = 50
on the
left, and pca = FALSE
on the right:
Again, the norb
dataset is less affected than
macosko2015
but for the latter there is a definite effect
of preprocessing with PCA. This is something to bear in mind before
applying PCA to your data even with t-SNE. On balance, I’d probably live
with the effect of applying PCA to your data, because just like with
uwot
, Rtsne
input processing can take a few
hours with very high dimensional datasets.
Finally, some people consider results obtained with PCA to be
more reliable than those using the “raw” input, because PCA can
be considered a way to denoise data, where the lower-variance components
are assumed to be irrelevant. I might be a bit biased, but of all the
UMAP plots of macosko2015
, I think the default
uwot
one shows the best separation of clusters, so maybe
there’s something to it. But this is also the reason I tabulated the
amount of variance 100 components explained for each dataset. Where most
of the datasets keep a large majority of the variance with 100
components, in the case of macosko2015
, you would have to
be prepared to argue why it was ok to throw away over 60% of variance in
the data. I’m sure it can be done, but you would need a solid, ideally
biology-based justification.
A future version of uwot
will hopefully contain an
implementation of nearest neighbor descent optimization to reduce the
gap in nearest neighbor accuracy with UMAP
.