Skip to contents

uwot relies on the underlying compiler and C++ standard library on your machine and this can result in differences in output even with the same input data, arguments, packages and R version. If you require reproducibility between machines, it is strongly suggested that you stick with the same OS and compiler version on all of them (e.g. a fixed LTS of a Linux distro and gcc version). Otherwise, the following can help:

  • Use the tumap method instead of umap. This avoid the use of std::pow in gradient calculations. This also has the advantage of being faster to optimize. However, this gives larger clusters in the output, and you don’t have the ability to control that with a and b (or spread and min_dist) parameters.
  • For umap, it’s better to provide a and b directly with a fixed precision rather than allowing them to be calculated via the spread and min_dist parameters. For default UMAP, use a = 1.8956, b = 0.8006.
  • Use approx_pow = TRUE, which avoids the use of the std::pow function.
  • Use init = "spca" rather than init = "spectral" (although the latter is the default and preferred method for UMAP initialization).
  • If n_sgd_threads is set larger than 1, then even if you use set.seed, results of the embeddings are not repeatable, This is because there is no locking carried out on the underlying coordinate matrix, and work is partitioned by edge not vertex and a given vertex may be processed by different threads. The order in which reads and writes occur is of course at the whim of the thread scheduler. This is the same behavior as LargeVis.

In summary, your chances of reproducibility are increased by using:

mnist_umap <- umap(mnist, a = 1.8956, b = 0.8006, approx_pow = TRUE, init = "spca", batch = TRUE)
# or
mnist_tumap <- tumap(mnist, init = "spca", batch = TRUE)