uwot relies on the underlying compiler and C++ standard
library on your machine and this can result in differences in output
even with the same input data, arguments, packages and R version. If you
require reproducibility between machines, it is strongly suggested that
you stick with the same OS and compiler version on all of them (e.g. a
fixed LTS of a Linux distro and gcc version). Otherwise, the following
can help:
- Use the `tumap` method instead of `umap`. This avoids the use of `std::pow` in gradient calculations and is also faster to optimize. However, it gives larger clusters in the output, and you can’t control that with the `a` and `b` (or `spread` and `min_dist`) parameters.
- For `umap`, it’s better to provide `a` and `b` directly with a fixed precision rather than allowing them to be calculated via the `spread` and `min_dist` parameters. For default UMAP, use `a = 1.8956, b = 0.8006`.
- Use `approx_pow = TRUE`, which avoids the use of the `std::pow` function.
- Use `init = "spca"` rather than `init = "spectral"` (although the latter is the default and preferred method for UMAP initialization).
- If `n_sgd_threads` is set larger than `1`, then even if you use `set.seed`, results of the embeddings are not repeatable. This is because no locking is carried out on the underlying coordinate matrix, and work is partitioned by edge rather than by vertex, so a given vertex may be processed by different threads. The order in which reads and writes occur is at the whim of the thread scheduler. This is the same behavior as LargeVis.
- Use `rng_type = "deterministic"`, which will make vertex sampling during the optimization deterministic. Note that this will not affect the use of a random number generator in other parts of the algorithm, such as approximate nearest neighbor search and initialization. This may give slightly less accurate results due to the lack of random sampling, but the trade-off may be worth it (and it’s also a bit faster).
- For random number generation, you can provide a `seed` parameter. This doesn’t do anything other than call `set.seed` for you inside the routine, but you may find it convenient to fix the seed in the call to `umap`.
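
One way to see whether these settings make a difference on your own machine is to run the same call twice and compare the results. The sketch below is only an illustration: `check_repeatable` is a hypothetical helper, not part of uwot, and the call assumes an installed uwot version that supports the `seed` and `rng_type = "deterministic"` arguments described above. It only tests repeatability on a single machine; comparing across machines would need the embeddings to be saved (for example with `saveRDS`) and compared afterwards.

```r
# Hypothetical helper (not part of uwot): call `embed_fn` several times and
# report whether every run returns exactly the same result as the first.
check_repeatable <- function(embed_fn, n_runs = 2) {
  first <- embed_fn()
  for (i in seq_len(n_runs - 1)) {
    if (!identical(first, embed_fn())) {
      return(FALSE)
    }
  }
  TRUE
}

# Only attempt the uwot call if the package is installed.
if (requireNamespace("uwot", quietly = TRUE)) {
  X <- as.matrix(iris[, 1:4])
  repeatable <- check_repeatable(function() {
    uwot::umap(
      X,
      a = 1.8956, b = 0.8006,      # fixed-precision curve parameters
      approx_pow = TRUE,           # avoid std::pow
      init = "spca",               # instead of the default "spectral"
      n_sgd_threads = 1,           # not larger than 1
      rng_type = "deterministic",  # deterministic vertex sampling
      seed = 42                    # calls set.seed inside umap
    )
  })
  print(repeatable)
}
```

Using `identical` rather than `all.equal` is a deliberately strict choice here: it flags any bitwise difference, which is the kind of discrepancy that different compilers or thread schedules can introduce.
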
In summary, your chances of reproducibility are increased by using: