Up: Documentation Home.

The spectral direction method uses an approximation to the Hessian of the cost function of t-SNE and related methods, which can then be used in a Newton-like optimization. For methods that use symmetrized affinity (or probability) matrices, a good approximation to the Hessian of the attractive part of the cost function is just the graph Laplacian multiplied by 4, where the graph Laplacian is formed from the input affinity/probability matrix.
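In code, this approximation is small enough to state directly. A minimal sketch, with illustrative names rather than the actual smallvis internals, assuming a symmetrized affinity matrix P:

# Approximate Hessian of the attractive part of the cost: 4 times the
# graph Laplacian of the affinity matrix P (degree matrix minus P).
attractive_hessian <- function(P) {
  L <- diag(rowSums(P)) - P  # graph Laplacian formed from the affinities
  4 * L
}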

Being a Newton-like method, this requires a Cholesky factorization, followed by forward/back-substitution to solve the usual \(\mathbf{B}\mathbf{p} = -\mathbf{g}\) equation. The Cholesky factorization is \(O(N^3)\), while the forward/back-substitution is “only” \(O(N^2)\), so for this solution to be scalable, the Cholesky factorization is carried out only once at the beginning of the optimization, and the graph Laplacian is sparsified. Sparsification isn’t a bad approximation for the attractive part of the cost function because its elements are linearly proportional to the probabilities, which are by construction extremely close to zero outside the k-nearest neighbors of any observation, where k is the perplexity.
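A sketch of how the one-off factorization and per-iteration solve fit together (assumed code, not the smallvis internals; note that the graph Laplacian is singular, since it annihilates the all-ones vector, so a small ridge is added here to keep the factorization positive definite):

# Factorize once at the start: B = t(R) %*% R, R upper triangular. O(N^3)
B <- attractive_hessian(P) + 1e-10 * diag(nrow(P))  # ridge for positive definiteness
R <- chol(B)
# Each iteration, given the current gradient g, solve B p = -g. O(N^2)
p <- backsolve(R, backsolve(R, -g, transpose = TRUE))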

smallvis doesn’t care about scalability, so its implementation of the above doesn’t enforce sparsity. A very similar approach to spectral directions was described by van der Maaten (PDF), which also didn’t attempt to enforce sparsity (it should also be noted that the van der Maaten approach doesn’t attempt to enforce positive definiteness, which could lead to the optimization failing). In that paper, the Cholesky factorization didn’t dominate the computation time: the dataset used was a subset of MNIST with n = 5,000, similar in size to the largest datasets considered here. Therefore we can probably get away with ignoring sparsity. However, the results in that paper were much less detailed than those presented in the spectral directions paper (sadly, no visualization of the coordinates resulting from delta-bar-delta versus the partial Hessian approach is shown).

Datasets

As usual, see the Datasets page.

Settings

For the backtracking line search, the default armijo setting from mize of c1 = 1e-4 was used, rather than c1 = 0.1, which would match the spectral directions paper. That probably doesn’t make much difference.
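For reference, the Armijo (sufficient decrease) condition that backtracking checks is \(f(x + \alpha p) \le f(x) + c_1 \alpha g^T p\). A minimal sketch of the idea, not mize’s implementation (the shrink factor rho = 0.5 is an arbitrary choice, and p is assumed to be a descent direction):

# Shrink alpha until the Armijo condition holds.
backtrack <- function(f, x, g, p, alpha = 1, c1 = 1e-4, rho = 0.5) {
  fx <- f(x)
  slope <- sum(g * p)  # directional derivative; negative for a descent direction
  while (f(x + alpha * p) > fx + c1 * alpha * slope) {
    alpha <- rho * alpha
  }
  alpha
}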

To try and keep the optimization within a reasonable resource usage compared to the delta-bar-delta default, the convergence settings for mize were set to max_gr = 1000 (maximum number of gradient evaluations) and step_tol = 1e-6 (stop if the final \(\alpha\) for any iteration is less than 1e-6). This still results in more resource usage than DBD, because the line search carries out function evaluations as well as gradient evaluations, whereas DBD only evaluates the function when logging every 100 iterations.

Initialization was from a scaled PCA using the first two principal components, without any input data scaling. The van der Maaten experiments also initialized from PCA, presumably without any scaling. No early exaggeration was used: neither the spectral directions nor the van der Maaten experiments used it. The target perplexity was 40.
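A rough equivalent of the scaled PCA initialization (assuming the scaling is to a small standard deviation per dimension, here 1e-4; a hypothetical helper, not the smallvis internals):

# First two principal components, rescaled so each column has sd = 1e-4.
scaled_pca_init <- function(X, sdev = 1e-4) {
  Y <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:2]
  scale(Y, center = FALSE, scale = apply(Y, 2, sd) / sdev)
}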

Example invocations are given below:

# Spectral Direction with default Wolfe line search
iris_specd <- smallvis(iris, Y_init = "spca",
                       opt = list("specd", step_tol = 1e-6, max_gr = 1000),
                       scale = FALSE, ret_extra = c("DX", "DY"), perplexity = 40)

# Spectral Direction with backtracking line search
iris_specd_bt <- smallvis(iris, Y_init = "spca",
                          opt = list("specd", step_tol = 1e-6, max_gr = 1000,
                                     line_search = "backtracking",
                                     step_next_init = 1, step0 = 1),
                          scale = FALSE, ret_extra = c("DX", "DY"), perplexity = 40)

Evaluation

For each initialization, the mean neighbor preservation of the 40 nearest neighbors was calculated using the quadra package. The number reported is the mean over all results and is labelled as mnp@40 in the plots. 40 was chosen for these results to match the perplexity.
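The idea behind the measure is sketched below (a naive base-R version, not the quadra implementation; X and Y are assumed to be numeric matrices of input data and embedded coordinates):

# Mean neighbor preservation at k: for each point, the fraction of its k
# nearest input-space neighbors that are also among its k nearest
# embedded-space neighbors, averaged over all points.
mnp <- function(X, Y, k = 40) {
  nn_idx <- function(D, k) apply(D, 1, function(d) order(d)[2:(k + 1)])
  nnX <- nn_idx(as.matrix(dist(X)), k)  # input-space k-NN (self excluded)
  nnY <- nn_idx(as.matrix(dist(Y)), k)  # embedding k-NN
  mean(sapply(seq_len(ncol(nnX)),
              function(i) length(intersect(nnX[, i], nnY[, i])) / k))
}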

Results

For each dataset, four results are shown below. On the first row, the left image shows the results of using the spectral direction with a Wolfe line search; the right-hand image shows the results of using the spectral direction with a backtracking line search. On the bottom row, the left and right-hand images are those from using the default mize L-BFGS settings with a Wolfe line search, and the default smallvis DBD settings, respectively. These used the same initialization, input scaling and perplexity. The L-BFGS results were not restricted by a maximum number of gradient evaluations or a step tolerance, so they have an unfair advantage (although they still had to stop after 1000 iterations). Although perplexity scaling via smallvis_perpstep gave better results for L-BFGS than using a single perplexity, those results are not used for comparison here.

Iris

[Images: iris specd (Wolfe), iris specd (BT); iris L-BFGS, iris DBD]

s1k

[Images: s1k specd (Wolfe), s1k specd (BT); s1k L-BFGS, s1k DBD]

Olivetti Faces

[Images: oli specd (Wolfe), oli specd (BT); oli L-BFGS, oli DBD]

Frey Faces

[Images: frey specd (Wolfe), frey specd (BT); frey L-BFGS, frey DBD]

COIL-20

[Images: coil20 specd (Wolfe), coil20 specd (BT); coil20 L-BFGS, coil20 DBD]

MNIST (6,000)

[Images: mnist specd (Wolfe), mnist specd (BT); mnist L-BFGS, mnist DBD]

Fashion (6,000)

[Images: fashion specd (Wolfe), fashion specd (BT); fashion L-BFGS, fashion DBD]

Conclusions

As usual, for the smaller datasets, the results are all pretty consistent, and the Fashion MNIST sample is reliably non-separable in 2D. Differences on the MNIST digits are worth looking at, though: the spectral direction results look better than the L-BFGS results, although not quite as good as the DBD results.

Restricting the line search to backtracking seems to have basically no effect on the results; in fact, the MNIST results are visually better with backtracking. Tentatively, I would recommend the backtracking line search with spectral directions, which at least agrees with the original spectral directions paper.

These experiments didn’t use early exaggeration. It’s possible that early exaggeration is more useful than spectral directions: the Barnes-Hut version of t-SNE, despite also being written by van der Maaten, doesn’t use the partial Hessian approach but does use early exaggeration. Indeed, that paper stresses the importance of using a larger exaggeration for bigger datasets than for smaller ones. Possibly for large datasets, the Cholesky decomposition starts to dominate the calculation time. Or, as noted by Yang and co-workers in their experiments, the spectral direction method is unstable when used with early exaggeration.

Up: Documentation Home.