There are lots of different approaches in the literature to deriving the gradient of t-SNE and related methods. The one that helped me the most is the one given by Lee and co-workers in the JSE paper.

The following discussion attempts to do something similar, but takes things even more slowly. I am also a lot less rigorous and have tried to use less notation, while being more generic: I avoid making any assumptions about the cost function or the weighting function.

The only mathematical ability you should need for this is basic partial differentiation, plus the chain rule for partial derivatives, which happens to be:

Chain rule for partial derivatives

Say we have a function \(x\) of \(N\) variables \(y_1, y_2 \dots y_i \dots y_N\), and each \(y\) is in turn a function of \(M\) variables \(z_1, z_2, \dots z_j \dots z_M\). Then the partial derivative of \(x\) with respect to one of the \(z\), say \(z_j\), is:

\[\frac{\partial x}{\partial z_j} = \sum_i^N \frac{\partial x}{\partial y_i}\frac{\partial y_i}{\partial z_j} \]
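
For instance, with \(N = 2\) this is just:

\[\frac{\partial x}{\partial z_j} = \frac{\partial x}{\partial y_1}\frac{\partial y_1}{\partial z_j} + \frac{\partial x}{\partial y_2}\frac{\partial y_2}{\partial z_j} \]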

Also, if this isn’t obvious, the various commutative, associative and distributive properties of addition and multiplication mean that we can move around nested sigma notation:

\[\sum_i x_i \sum_j y_j \equiv \sum_j y_j \sum_i x_i \equiv \sum_i \sum_j x_i y_j \equiv \sum_{ij} x_i y_j \]

There are going to be a lot of nested summations at the beginning of the derivation, so I will be making use of that final shorthand, where multiple summation indices under a single sigma indicate a nested summation.

Notation

I assume you are familiar with the basic concepts of the approach of SNE and related methods. I’ll use the following notation:

  • \(\mathbf{y_i}\) is the \(i\)th embedded coordinate in the lower dimension.
  • \(P\) is the matrix of input probabilities, \(Q\) is the matrix of output probabilities.
  • \(p_{ij}\) means the \(\left(i, j\right)\)th element of the matrix \(P\).
  • I’ll also use \(i\), \(j\), \(k\) and \(l\) as indexes into various matrices.
  • The number of points being embedded is \(N\).

I’ll assume that there is an input probability matrix already created, and that the cost function involves \(P\) and \(Q\) and hence so does the gradient. At each stage of the optimization we need to get from the current set of coordinates \(\mathbf{y_1}, \mathbf{y_2} \dots \mathbf{y_i} \dots \mathbf{y_N}\), to a gradient matrix. The procedure is as follows:

  • Create the distance matrix, where the element \(d_{ij}\) represents the distance between point \(i\) and \(j\).
  • Transform the distances to create \(f_{ij}\).
  • Apply a weighting function to create a weight, \(w_{ij}\), such that the larger the weight, the smaller the distance between \(i\) and \(j\). Because of this inverse relationship between the weight and the distance, I will refer to this weight as a similarity as it makes it easier to remember that a big weight refers to a small distance.
  • Convert the weights into a probability, \(q_{ij}\). This is achieved by normalizing over a sum of weights. There are two approaches to defining this sum, which affects the interpretation of the probability. See below for more on this.

Once the output probabilities \(q_{ij}\) are calculated, you are generally in possession of enough data to calculate the gradient, with the exact form depending on the nature of the cost and similarity functions.
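
To make that procedure concrete, here is a minimal numpy sketch of the pipeline (the function name and structure are mine, not taken from any particular implementation), using the t-distribution kernel as the weighting function purely for illustration and the pair-wise normalization described in the next section:

```python
import numpy as np

def output_probabilities(Y):
    """Coordinates -> distances -> transformed distances -> weights -> Q."""
    diff = Y[:, None, :] - Y[None, :, :]  # y_i - y_j for every pair
    D = np.sqrt((diff ** 2).sum(axis=2))  # d_ij: Euclidean distances
    F = D ** 2                            # f_ij: squared distances
    W = 1.0 / (1.0 + F)                   # w_ij: t-distribution kernel (illustrative choice)
    np.fill_diagonal(W, 0.0)              # self-similarities are excluded
    Q = W / W.sum()                       # pair-wise normalization (point-wise would sum each row)
    return Q
```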

Before going further, let’s look at the two approaches to probability calculation.

Symmetric vs asymmetric embedding

Point-wise probabilities

The original SNE approach converted the weights into probabilities by:

\[q_{ij} = \frac{w_{ij}}{\sum_k^N w_{ik}}\]

That is, we consider all similarities involving point \(i\). Let’s call this the point-wise approach. A consequence of this is that \(q_{ij} \neq q_{ji}\) and hence this results in an asymmetric probability matrix, \(Q\). In fact (at least in the sneer implementation), each row of the matrix is a separate probability distribution, with each row summing to one. In the point-wise approach you are calculating \(N\) different divergences, with each point being responsible for a separate probability distribution.

The point-wise normalization to create \(N\) probabilities is the scheme used in what is now called Asymmetric SNE.

Pair-wise probabilities

Another way to convert the weights is:

\[q_{ij} = \frac{w_{ij}}{\sum_k^N \sum_l^N w_{kl}}\]

This normalizes by using all pairs of points, so we’ll call this the pair-wise approach (in the ws-SNE paper, it’s referred to as “matrix-wise”). The resulting matrix \(Q\) contains a single probability distribution, i.e. the grand sum of the matrix is one. Using this normalization, it’s still true that, in general, \(q_{ij} \neq q_{ji}\), but when creating the input probability matrix \(P\), \(p_{ij}\) and \(p_{ji}\) are averaged so that they are equal to each other. In the case of the output weights, the function that generates them always produces symmetric weights, so that \(w_{ij} = w_{ji}\) which naturally leads to \(q_{ij} = q_{ji}\). The resulting matrix is therefore symmetric without having to do any extra work.

This pair-wise scheme is used in what is called Symmetric SNE and t-distributed SNE.
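
In code, the difference between the two schemes is just the extent of the normalizing sum. A rough sketch, assuming a weight matrix `W` with zeros on the diagonal:

```python
import numpy as np

def pointwise_q(W):
    """ASNE-style: each row of Q is its own probability distribution."""
    return W / W.sum(axis=1, keepdims=True)

def pairwise_q(W):
    """SSNE/t-SNE-style: Q is a single distribution over all pairs."""
    return W / W.sum()
```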

Obviously these two schemes are very similar to each other, but it’s easy to get confused when looking at how different embedding methods are implemented. As to whether it makes much of a practical difference, in the JSE paper Lee and co-workers say that it has “no significant effect” on JSE, whereas in the t-SNE paper, van der Maaten and Hinton note that SSNE sometimes produced results that were “a little better” than ASNE. Not a ringing endorsement either way, but in my experiments with sneer, the symmetrized (pair-wise) normalization seems to produce better results.

Also, to avoid taking up too much space, from now on, I will omit the \(N\) from the upper range of the summation, and where there are nested summations, I will only use one \(\sum\) symbol, with the number of summation indices below the sigma indicating whether it’s nested, i.e.

\[q_{ij} = \frac{w_{ij}}{\sum_k^N \sum_l^N w_{kl}} \equiv \frac{w_{ij}}{\sum_{kl} w_{kl}} \]

Breaking down the cost function

With all that out of the way, let’s try and define the gradient with respect to the cost function.

We’ll start by defining a chain of dependent variables specifically for probability-based embeddings. A glance at the chain rule for partial derivatives above indicates that we’re going to be using a lot of nested summations of multiple terms, although mercifully most of them evaluate to 0 and disappear. But for now, let’s ignore the exact indices. To recap the variables we need to include and the order of their dependencies:

  • The cost function, \(C\) is normally a divergence of some kind, and hence expressed in terms of the output probabilities, \(q\).
  • The output probabilities, \(q\), are normalized versions of the similarity weights, \(w\).
  • The similarity weights are generated from a function of the distances, \(f\).
  • The \(f\) values are a function of the Euclidean distances, \(d\). Normally, this is the squared distance.
  • The distances are generated from the coordinates, \(\mathbf{y_i}\).

We’re going to chain those individual bits together via the chain rule for partial derivatives. The chain of variable dependencies is \(C \rightarrow q \rightarrow w \rightarrow f \rightarrow d \rightarrow \mathbf{y}\).

Gradient with pair-wise normalization

We can write the partial derivative relating the total error to the coordinate of point \(h\), \(\mathbf{y_h}\), as:

\[ \frac{\partial C}{\partial \mathbf{y_h}} = \sum_{ij} \frac{\partial C}{\partial q_{ij}} \sum_{kl} \frac{\partial q_{ij}}{\partial w_{kl}} \sum_{mn} \frac{\partial w_{kl}}{\partial f_{mn}} \sum_{pq} \frac{\partial f_{mn}}{\partial d_{pq}} \frac{\partial d_{pq}}{\partial \mathbf{y_h}} \]

I told you there would be a lot of nested summations. Let’s make some of them disappear.

The relationship between \(q\) and \(w\) depends on whether we are using a point-wise or pair-wise normalization. For now, let’s assume a pair-wise normalization. The difference is not enormous, so let’s come back to it later. No matter which normalization you use, the good news is that the relationship between \(w\), \(f\) and \(d\) is such that any cross terms are 0, i.e. unless \(k=m=p\) and \(l=n=q\) those derivatives evaluate to 0, which immediately gets us to:

\[ \frac{\partial C}{\partial \mathbf{y_h}} = \sum_{ij} \frac{\partial C}{\partial q_{ij}} \sum_{kl} \frac{\partial q_{ij}}{\partial w_{kl}} \frac{\partial w_{kl}}{\partial f_{kl}} \frac{\partial f_{kl}}{\partial d_{kl}} \frac{\partial d_{kl}}{\partial \mathbf{y_h}} \]

Also, either \(k=h\) or \(l=h\), otherwise \(\partial d_{kl}/\partial \mathbf{y_h}=0\) which leads to:

\[ \frac{\partial C}{\partial \mathbf{y_h}} = \sum_{ij} \frac{\partial C}{\partial q_{ij}} \sum_{l} \frac{\partial q_{ij}}{\partial w_{hl}} \frac{\partial w_{hl}}{\partial f_{hl}} \frac{\partial f_{hl}}{\partial d_{hl}} \frac{\partial d_{hl}}{\partial \mathbf{y_h}} + \sum_{ij} \frac{\partial C}{\partial q_{ij}} \sum_{k} \frac{\partial q_{ij}}{\partial w_{kh}} \frac{\partial w_{kh}}{\partial f_{kh}} \frac{\partial f_{kh}}{\partial d_{kh}} \frac{\partial d_{kh}}{\partial \mathbf{y_h}} \] Now to rearrange some of the grouping of the summations:

\[ \frac{\partial C}{\partial \mathbf{y_h}} = \sum_{l} \left( \sum_{ij} \frac{\partial C}{\partial q_{ij}} \frac{\partial q_{ij}}{\partial w_{hl}} \right) \frac{\partial w_{hl}}{\partial f_{hl}} \frac{\partial f_{hl}}{\partial d_{hl}} \frac{\partial d_{hl}}{\partial \mathbf{y_h}} + \sum_{k} \left( \sum_{ij} \frac{\partial C}{\partial q_{ij}} \frac{\partial q_{ij}}{\partial w_{kh}} \right) \frac{\partial w_{kh}}{\partial f_{kh}} \frac{\partial f_{kh}}{\partial d_{kh}} \frac{\partial d_{kh}}{\partial \mathbf{y_h}} \] At this point we can rename some of the indices: \(h\) becomes \(i\), \(i\) becomes \(k\), \(j\) becomes \(l\) and \(k\) and \(l\) become \(j\). This gives:

\[ \frac{\partial C}{\partial \mathbf{y_i}} = \sum_{j} \left( \sum_{kl} \frac{\partial C}{\partial q_{kl}} \frac{\partial q_{kl}}{\partial w_{ij}} \right) \frac{\partial w_{ij}}{\partial f_{ij}} \frac{\partial f_{ij}}{\partial d_{ij}} \frac{\partial d_{ij}}{\partial \mathbf{y_i}} + \sum_{j} \left( \sum_{kl} \frac{\partial C}{\partial q_{kl}} \frac{\partial q_{kl}}{\partial w_{ji}} \right) \frac{\partial w_{ji}}{\partial f_{ji}} \frac{\partial f_{ji}}{\partial d_{ji}} \frac{\partial d_{ji}}{\partial \mathbf{y_i}} \]

Also, let’s assume that both the distances and transformed distances will be symmetric, \(d_{ij} = d_{ji}\) and \(f_{ij} = f_{ji}\). The former is certainly true if we’re using the mathematical concept of a metric as a distance (and I’m unaware of anyone using anything other than the Euclidean distance anyway), and the latter can be made so by construction, by just moving any parameterization that would lead to asymmetric transformed distances into the weight calculation.

This lets us write:

\[ \frac{\partial C}{\partial \mathbf{y_i}} = \sum_{j} \left[ \left( \sum_{kl} \frac{\partial C}{\partial q_{kl}} \frac{\partial q_{kl}}{\partial w_{ij}} \right) \frac{\partial w_{ij}}{\partial f_{ij}} + \left( \sum_{kl} \frac{\partial C}{\partial q_{kl}} \frac{\partial q_{kl}}{\partial w_{ji}} \right) \frac{\partial w_{ji}}{\partial f_{ji}} \right] \frac{\partial f_{ij}}{\partial d_{ij}} \frac{\partial d_{ij}}{\partial \mathbf{y_i}} \]

\[ \frac{\partial C}{\partial \mathbf{y_i}} = \sum_{j} \left( k_{ij} + k_{ji} \right) \frac{\partial f_{ij}}{\partial d_{ij}} \frac{\partial d_{ij}}{\partial \mathbf{y_i}} \]

with

\[k_{ij} = \left[ \sum_{kl} \frac{\partial C}{\partial q_{kl}} \frac{\partial q_{kl}}{\partial w_{ij}} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

Without getting into any specifics of the functional form of the cost function or weighting function, there’s a further simplification we can make. The probabilities are always created by a straightforward normalization of the weights. We’re considering just the pair-wise normalization for now, and as we’ve seen, the relationship between \(w\) and \(q\) is:

\[q_{ij} = \frac{w_{ij}}{\sum_{kl} w_{kl}} = \frac{w_{ij}}{S}\]

where we’ve introduced \(S\), the sum of the weights over all pairs.

It’s important to realize that any particular weight, \(w_{ij}\), appears both in the expression for its equivalent probability, \(q_{ij}\) (where it appears in the numerator and denominator), and in the expressions for all the other probabilities, \(q_{kl}\), with \(\left(k, l\right) \neq \left(i, j\right)\). In the latter case, it appears only in the denominator, but it still leads to non-zero derivatives.

Thus, we have two forms of the derivative to consider: \[\frac{\partial q_{ij}}{\partial w_{ij}} = \frac{S - w_{ij}}{S^2} = \frac{1}{S} - \frac{q_{ij}}{S} \] and: \[\frac{\partial q_{kl}}{\partial w_{ij}} = -\frac{w_{kl}}{S^2} = -\frac{q_{kl}}{S} \]

They’re nearly the same expression; there’s just an extra \(1/S\) term to consider when \(k=i\) and \(l=j\).
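
If you’d like to convince yourself of those two expressions, a quick finite difference check with an arbitrary positive weight matrix is enough (just a sketch, nothing from any real implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.uniform(0.1, 1.0, size=(4, 4))
S = W.sum()
Q = W / S

eps = 1e-7
Wp = W.copy()
Wp[0, 1] += eps  # perturb a single weight, w_01
Qp = Wp / Wp.sum()

# dq_01/dw_01 should be 1/S - q_01/S; dq_23/dw_01 should be -q_23/S
print((Qp[0, 1] - Q[0, 1]) / eps, 1 / S - Q[0, 1] / S)
print((Qp[2, 3] - Q[2, 3]) / eps, -Q[2, 3] / S)
```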

Inserting these expressions into the one we had for \(k_{ij}\):

\[ k_{ij} = \left[ \sum_{kl} \frac{\partial C}{\partial q_{kl}} \frac{\partial q_{kl}}{\partial w_{ij}} \right] \frac{\partial w_{ij}}{\partial f_{ij}} = \frac{1}{S} \left[ \frac{\partial C}{\partial q_{ij}} - \sum_{kl} \frac{\partial C}{\partial q_{kl}} q_{kl} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

This is as far as we can get with \(k_{ij}\) without choosing a cost and weighting function, but we can simplify the distance part of the expression if we’re prepared to assume that we’re only going to want to use Euclidean distances in the output.

While there may be some exotic situations where the output distances should be non-Euclidean (a literary analysis of HP Lovecraft perhaps, or more seriously, the use of hyperbolic distances to model hierarchical data), the t-SNE literature always has \(d_{ij}\) represent Euclidean distances. In a \(K\)-dimensional output space, the distance between point \(\mathbf{y_i}\) and point \(\mathbf{y_j}\) is:

\[d_{ij} = \left[\sum_l^K\left (y_{il} - y_{jl} \right )^2\right ]^{1/2}\]

and the derivative can be written as:

\[\frac{\partial d_{ij}}{\partial \mathbf{y_i}} = \frac{1}{d_{ij}}\left(\mathbf{y_i} - \mathbf{y_j}\right) \]
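
If that derivative isn’t obvious, differentiate with respect to a single component \(y_{ik}\) of \(\mathbf{y_i}\):

\[\frac{\partial d_{ij}}{\partial y_{ik}} = \frac{1}{2}\left[\sum_l^K \left(y_{il} - y_{jl}\right)^2\right]^{-1/2} 2\left(y_{ik} - y_{jk}\right) = \frac{y_{ik} - y_{jk}}{d_{ij}} \]

and collect the \(K\) components back into a vector.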

Therefore, a completely generic expression for the gradient is:

\[ \frac{\partial C}{\partial \mathbf{y_i}} = \sum_{j} \left( k_{ij} + k_{ji} \right) \frac{\partial f_{ij}}{\partial d_{ij}} \frac{1}{d_{ij}}\left(\mathbf{y_i} - \mathbf{y_j}\right) \]

with

\[k_{ij} = \frac{1}{S} \left[ \frac{\partial C}{\partial q_{ij}} - \sum_{kl} \frac{\partial C}{\partial q_{kl}} q_{kl} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

If you want to get a tiny bit more specific, I am also unaware of any t-SNE-related methods that don’t transform the distances by simply squaring them:

\[f_{ij} = d_{ij}^{2}\]

Allow me to insult your intelligence by writing out the gradient:

\[\frac{\partial f_{ij}}{\partial d_{ij}} = 2d_{ij}\]

which cleans things up even more:

\[ \frac{\partial C}{\partial \mathbf{y_i}} = 2 \sum_{j} \left( k_{ij} + k_{ji} \right) \left(\mathbf{y_i} - \mathbf{y_j}\right) \] This is now looking more like the expected “points on springs” interpretation of the gradient, with \(k_{ij} + k_{ji}\) representing the force constant (stiffness) of each spring, and \(\mathbf{y_i} - \mathbf{y_j}\) the displacement.

The above equation is useful because, as long as you can define the derivative of a cost function with respect to \(q\) and the derivative of a similarity kernel with respect to \(f\), you can mix and match these terms and get the gradient of the cost function with respect to the embedded coordinates without too much trouble, which is all you need to optimize the coordinates with a standard gradient descent algorithm.
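
As a sketch of what that mixing and matching might look like in practice, here is a generic numpy implementation of the pair-wise gradient (the function and argument names are mine; it assumes \(f_{ij} = d_{ij}^2\), symmetric weights and zero self-probabilities in \(P\)):

```python
import numpy as np

def embedding_gradient(Y, P, dC_dq, weight_fn, dw_df):
    """Pair-wise gradient: dC/dy_i = 2 * sum_j (k_ij + k_ji) * (y_i - y_j)."""
    diff = Y[:, None, :] - Y[None, :, :]   # y_i - y_j for every pair
    F = (diff ** 2).sum(axis=2)            # f_ij = d_ij^2
    W = weight_fn(F)
    np.fill_diagonal(W, 0.0)               # exclude self-similarities
    S = W.sum()
    Q = np.maximum(W / S, 1e-12)           # clamp to keep dC_dq finite on the diagonal
    dCdQ = dC_dq(P, Q)
    # k_ij = (1/S) * [dC/dq_ij - sum_kl dC/dq_kl * q_kl] * dw_ij/df_ij
    K = (dCdQ - (dCdQ * Q).sum()) * dw_df(W, F) / S
    stiffness = K + K.T                    # k_ij + k_ji
    return 2.0 * (stiffness[:, :, None] * diff).sum(axis=1)
```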

Gradient with point-wise normalization

The above expression was derived for the pair-wise normalization. I said we’d get back to the point-wise version and now seems like a good time. The derivation proceeds as normal initially, where we cancel out all the cross terms associated with \(w\), \(f\) and \(d\):

\[ \frac{\partial C}{\partial \mathbf{y_h}} = \sum_{ij} \frac{\partial C}{\partial q_{ij}} \sum_{kl} \frac{\partial q_{ij}}{\partial w_{kl}} \frac{\partial w_{kl}}{\partial f_{kl}} \frac{\partial f_{kl}}{\partial d_{kl}} \frac{\partial d_{kl}}{\partial \mathbf{y_h}} \]

There is now one extra simplification we can make. Let’s refresh our memories over the form of the point-wise normalization:

\[q_{ij} = \frac{w_{ij}}{\sum_k w_{ik}}\]

i.e. it’s only over weights associated with point \(i\). In terms of cancelling cross terms in the derivative, we can see that unless \(k = i\), \(\partial q_{ij}/\partial w_{kl} = 0\), so we can replace \(k\) with \(i\):

\[ \frac{\partial C}{\partial \mathbf{y_h}} = \sum_{ij} \frac{\partial C}{\partial q_{ij}} \sum_{l} \frac{\partial q_{ij}}{\partial w_{il}} \frac{\partial w_{il}}{\partial f_{il}} \frac{\partial f_{il}}{\partial d_{il}} \frac{\partial d_{il}}{\partial \mathbf{y_h}} \]

The derivation then proceeds just as before, where \(h=l\) or \(h=i\), and we end up at the same gradient expression, but with a minor change to \(k_{ij}\):

\[ k_{ij} = \left[ \sum_{k} \frac{\partial C}{\partial q_{ik}} \frac{\partial q_{ik}}{\partial w_{ij}} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

We still need to expand \(\partial q_{ik}/\partial w_{ij}\), which also involves a slight change in notation for \(q_{ij}\):

\[q_{ij} = \frac{w_{ij}}{\sum_k w_{ik}} = \frac{w_{ij}}{S_i}\]

i.e. we now have to give an \(i\) subscript to \(S\), because the sum of the weights only includes those associated with point \(i\). The partial derivatives are:

\[\frac{\partial q_{ij}}{\partial w_{ij}} = \frac{1}{S_i} - \frac{q_{ij}}{S_i} \] and, for \(k \neq j\): \[\frac{\partial q_{ik}}{\partial w_{ij}} = -\frac{q_{ik}}{S_i}\]

and inserting this into \(k_{ij}\) gives:

\[k_{ij} = \frac{1}{S_i} \left[ \frac{\partial C}{\partial q_{ij}} - \sum_{k} \frac{\partial C}{\partial q_{ik}} q_{ik} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

Comparing this with the pair-wise version, you can see they are very similar.
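
In code, the only changes from the pair-wise sketch given earlier are that the normalization and the bracketed sum become row-wise. A sketch, under the same assumptions as before:

```python
import numpy as np

def embedding_gradient_pointwise(Y, P, dC_dq, weight_fn, dw_df):
    """Point-wise (ASNE-style) variant: row sums S_i replace the grand sum S."""
    diff = Y[:, None, :] - Y[None, :, :]
    F = (diff ** 2).sum(axis=2)
    W = weight_fn(F)
    np.fill_diagonal(W, 0.0)
    S = W.sum(axis=1, keepdims=True)       # S_i: one sum of weights per point
    Q = np.maximum(W / S, 1e-12)
    dCdQ = dC_dq(P, Q)
    # k_ij = (1/S_i) * [dC/dq_ij - sum_k dC/dq_ik * q_ik] * dw_ij/df_ij
    K = (dCdQ - (dCdQ * Q).sum(axis=1, keepdims=True)) * dw_df(W, F) / S
    stiffness = K + K.T
    return 2.0 * (stiffness[:, :, None] * diff).sum(axis=1)
```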

Summary

After all that, we can summarise with three equations. The probability-based embedding gradient is:

\[ \frac{\partial C}{\partial \mathbf{y_i}} = 2 \sum_{j} \left( k_{ij} + k_{ji} \right) \left(\mathbf{y_i} - \mathbf{y_j}\right) \]

where for a point-wise normalization: \[k_{ij} = \frac{1}{S_i} \left[ \frac{\partial C}{\partial q_{ij}} - \sum_{k} \frac{\partial C}{\partial q_{ik}} q_{ik} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

and for a pair-wise normalization: \[k_{ij} = \frac{1}{S} \left[ \frac{\partial C}{\partial q_{ij}} - \sum_{kl} \frac{\partial C}{\partial q_{kl}} q_{kl} \right] \frac{\partial w_{ij}}{\partial f_{ij}} \]

Plug in whatever combination of cost function derivative and weight function derivative you like, and off you go.
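
For example, plugging the Kullback-Leibler derivative and the t-distribution kernel derivative given in the sections below into the hypothetical embedding_gradient sketch from earlier recovers the familiar t-SNE gradient:

```python
# illustrative usage only: Y is the coordinate matrix, P the input probabilities
grad = embedding_gradient(
    Y, P,
    dC_dq=lambda P, Q: -P / Q,            # Kullback-Leibler divergence
    weight_fn=lambda F: 1.0 / (1.0 + F),  # t-distribution kernel
    dw_df=lambda W, F: -W ** 2,           # its derivative with respect to f
)
```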

Some Cost Functions and their Derivatives

Kullback-Leibler Divergence

Used by ASNE, SSNE, t-SNE and their variants:

\[C = D_{KL}(P||Q) = \sum_{ij} p_{ij}\log\left(\frac{p_{ij}}{q_{ij}}\right)\] \[\frac{\partial C}{\partial q_{ij}} = - \frac{p_{ij}}{q_{ij}}\]
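
In case the differentiation step isn’t obvious: only the \(-p_{ij}\log q_{ij}\) part of each term depends on \(q_{ij}\), so:

\[\frac{\partial C}{\partial q_{ij}} = \frac{\partial}{\partial q_{ij}}\left(p_{ij}\log p_{ij} - p_{ij}\log q_{ij}\right) = -\frac{p_{ij}}{q_{ij}}\]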

“Reverse” Kullback-Leibler Divergence

NeRV symmetrizes the KL divergence by also considering the cost when \(q_{ij}\) is the “reference” probability distribution:

\[C = D_{KL}(Q||P) = \sum_{ij} q_{ij}\log\left(\frac{q_{ij}}{p_{ij}}\right)\] \[\frac{\partial C}{\partial q_{ij}} = \log\left(\frac{q_{ij}}{p_{ij}}\right) + 1\]
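
Here the product rule is needed, since \(q_{ij}\) appears in both factors:

\[\frac{\partial}{\partial q_{ij}}\left(q_{ij}\log q_{ij} - q_{ij}\log p_{ij}\right) = \log q_{ij} + 1 - \log p_{ij} = \log\left(\frac{q_{ij}}{p_{ij}}\right) + 1\]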

Jensen-Shannon Divergence

The Jensen-Shannon Divergence, as defined in JSE is:

\[C = D_{JS}(P||Q) = \frac{1}{1-\kappa}D_{KL}(P||Z) + \frac{1}{\kappa}D_{KL}(Q||Z) \] where \(Z\) is a mixture of P and Q: \[Z = \kappa P + \left(1-\kappa \right)Q\]

Rather than invoke the chain rule to couple \(Z\) to \(Q\), I’m just going to write out the derivatives for the two parts of the JS divergence with respect to \(Q\) (ignoring the \(\kappa\) weighting for now):

\[\frac{\partial D_{KL}(P||Z)}{\partial q_{ij}} = - \left(1 - \kappa \right) \frac{p_{ij}}{z_{ij}} \] \[ \frac{\partial D_{KL}(Q||Z)}{\partial q_{ij}} = \log\left(\frac{q_{ij}}{z_{ij}}\right) + \kappa \left(\frac{p_{ij}}{z_{ij}}\right) \]

Once you add these derivatives together, weighted by the \(1/\left(1-\kappa\right)\) and \(1/\kappa\) factors in the cost function, terms cancel to leave a surprisingly simple derivative:

\[ \frac{\partial C}{\partial q_{ij}} = \frac{\partial D_{JS}(p_{ij}||q_{ij})}{\partial q_{ij}} = \frac{1}{\kappa}\log\left(\frac{q_{ij}}{z_{ij}}\right) \]
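
If the cancellation isn’t obvious, writing out the weighted sum makes it explicit:

\[\frac{\partial C}{\partial q_{ij}} = -\frac{1}{1-\kappa}\left(1-\kappa\right)\frac{p_{ij}}{z_{ij}} + \frac{1}{\kappa}\left[\log\left(\frac{q_{ij}}{z_{ij}}\right) + \kappa\frac{p_{ij}}{z_{ij}}\right] = -\frac{p_{ij}}{z_{ij}} + \frac{1}{\kappa}\log\left(\frac{q_{ij}}{z_{ij}}\right) + \frac{p_{ij}}{z_{ij}} = \frac{1}{\kappa}\log\left(\frac{q_{ij}}{z_{ij}}\right)\]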

\(\chi^2\) (Chi-square) Divergence

Im, Verma and Branson describe replacing the KL divergence with other members of the f-divergence family. These include the reverse KL divergence (which we looked at as part of NeRV) and the Jensen-Shannon divergence we just looked at in JSE. But there are other members of the f-divergence family. One is the \(\chi^2\) (chi-square) divergence:

\[C = \sum_{ij} \frac{\left(p_{ij} - q_{ij}\right)^2}{q_{ij}}\]

\[\frac{\partial C}{\partial q_{ij}} = -\frac{p_{ij}^2 - q_{ij}^2}{q_{ij}^2} = -\frac{p_{ij}^2}{q_{ij}^2} + 1\]
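
For the record, the quotient-rule step behind that derivative is:

\[\frac{\partial}{\partial q_{ij}}\frac{\left(p_{ij} - q_{ij}\right)^2}{q_{ij}} = \frac{-2\left(p_{ij} - q_{ij}\right)q_{ij} - \left(p_{ij} - q_{ij}\right)^2}{q_{ij}^2} = -\frac{\left(p_{ij} - q_{ij}\right)\left(p_{ij} + q_{ij}\right)}{q_{ij}^2} = -\frac{p_{ij}^2 - q_{ij}^2}{q_{ij}^2}\]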

Hellinger Distance

The other f-divergence that Im and co-workers look at is the Hellinger distance:

\[C = \sum_{ij} \left(\sqrt{p_{ij}} - \sqrt{q_{ij}}\right)^2\]

\[\frac{\partial C}{\partial q_{ij}} = -\frac{\sqrt{p_{ij}} - \sqrt{q_{ij}}}{\sqrt{q_{ij}}} = -\frac{\sqrt{p_{ij}}}{\sqrt{q_{ij}}} + 1\]
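
And similarly, the chain-rule step for the Hellinger distance:

\[\frac{\partial}{\partial q_{ij}}\left(\sqrt{p_{ij}} - \sqrt{q_{ij}}\right)^2 = 2\left(\sqrt{p_{ij}} - \sqrt{q_{ij}}\right)\left(-\frac{1}{2\sqrt{q_{ij}}}\right) = -\frac{\sqrt{p_{ij}} - \sqrt{q_{ij}}}{\sqrt{q_{ij}}}\]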

The gradients for the chi-square and Hellinger distance versions of t-SNE aren’t stated in that paper, so we will revisit these divergences when deriving some force constants for more complex divergences below.

\(\alpha\)-\(\beta\) Divergence

Cichocki and co-workers came up with the AB-divergence (also see their earlier paper), which generalizes a whole load of divergences, including the Kullback-Leibler and all the f-divergences described above. Narayan, Punjani and Abbeel used it to develop AB-SNE. As the name suggests, the AB-divergence has two free parameters, \(\alpha\) and \(\beta\):

\[C =\frac{1}{\alpha \beta} \sum_{ij} -p_{ij}^{\alpha} q_{ij}^{\beta} + \frac{\alpha}{\alpha + \beta}p_{ij}^{\alpha + \beta} + \frac{\beta}{\alpha + \beta}q_{ij}^{\alpha + \beta}\]

There are several special cases to deal with where this expression explodes, namely \(\alpha = 0\), \(\beta = 0\), when both \(\alpha = 0\) and \(\beta = 0\) at the same time, and \(\alpha + \beta = 0\), but the paper describes several alternative expressions for those cases.

\[\frac{\partial C}{\partial q_{ij}} = -\frac{1}{\alpha} q_{ij}^{\beta - 1} \left( p_{ij}^{\alpha} - q_{ij}^{\alpha} \right)\]

It’s not very clear from the cost function, but when we derive the gradient below, you can see that you get t-SNE back with \(\alpha = 1\) and \(\beta = 0\). The AB-SNE paper is of particular interest because it provides a meaningful guide on how to interpret changing \(\alpha\) and \(\beta\) rather than leaving you to fiddle with them randomly.
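
A quick check of that claim: setting \(\alpha = 1\) and \(\beta = 0\) in the derivative gives

\[\frac{\partial C}{\partial q_{ij}} = -\frac{1}{q_{ij}}\left(p_{ij} - q_{ij}\right) = -\frac{p_{ij}}{q_{ij}} + 1\]

which differs from the Kullback-Leibler derivative only by a constant, and constants vanish inside the bracketed part of \(k_{ij}\) because the \(q_{kl}\) sum to one.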

Some Similarity Kernels and their Derivatives

Exponential/Gaussian Kernel

The weighting function most commonly encountered is a Gaussian kernel, although because I’ve separated the distance transformation from the weighting, it’s actually exponential with respect to the transformed distances:

\[w_{ij} = \exp\left(-\beta_{i} f_{ij}\right)\]

Also, I’ve thrown in the exponential decay factor, \(\beta_{i}\), which is used in the Input Initialization for the input kernel only, as part of the perplexity-based calibration (aka entropic affinities). As discussed further in that section, the output kernel function normally just sets all the \(\beta\) values to one, so it effectively disappears from the output kernel gradient. However, some methods do make use of an output \(\beta\), so here is the general gradient:

\[\frac{\partial w_{ij}}{\partial f_{ij}} = -\beta_{i} \exp\left(-\beta_{i} f_{ij}\right) = -\beta_{i} w_{ij} \]

t-Distribution Kernel

This is used as the output kernel of t-SNE; it’s also sometimes referred to as the Cauchy distribution.

\[w_{ij} = \frac{1}{1 + f_{ij}}\] \[\frac{\partial w_{ij}}{\partial f_{ij}} = -\frac{1}{\left(1 + f_{ij}\right)^2} = -w_{ij}^2 \]

Heavy-tailed Kernel

A generalization of the exponential and t-distributed kernels, described in the HSSNE paper.

\[w_{ij} = \frac{1}{\left(1 + \alpha \beta_{i} f_{ij} \right)^{1 / \alpha}}\] \[\frac{\partial w_{ij}}{\partial f_{ij}} = - \frac{\beta_{i}}{\left(1 + \alpha \beta_{i} f_{ij}\right)^{\left(\alpha + 1\right) / \alpha}} = -\beta_{i} w_{ij} ^ \left(\alpha + 1\right) \]

The degree of heavy-tailedness is controlled by \(\alpha\): when \(\alpha = 1\), the kernel behaves like the t-distributed kernel, and as \(\alpha\) approaches 0, it approaches the exponential kernel. \(\alpha > 1\) allows output distances to stretch even more than in t-SNE. I am unaware of any research into whether there is a way to find an optimal value of \(\alpha\) for a given dataset, although the inhomogeneous t-SNE method mentioned below sidesteps the problem by optimizing a similar set of values along with the embedded coordinates.
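
The exponential limit can be seen by letting \(\alpha \to 0\) and using \(\lim_{a \to 0}\left(1 + a x\right)^{1/a} = \exp\left(x\right)\):

\[\lim_{\alpha \to 0} \left(1 + \alpha \beta_{i} f_{ij} \right)^{-1 / \alpha} = \exp\left(-\beta_{i} f_{ij}\right)\]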

I have also included the precision parameter \(\beta\) in the equation. This isn’t present in the original HSSNE paper, but it’s fairly easy to do the integration to work out how to include it. This shouldn’t be taken as a suggestion to use it as an input kernel: heavy-tailedness can easily make it impossible to reach lower values of perplexity. However, it does allow for an output kernel that behaves like t-SNE but has a free precision parameter, which would allow t-SNE to be used with the method described in multiscale JSE. See the sections on Input Initialization and Embedding Methods for how to do that in sneer.

Other groups have looked at heavy-tailed kernels with slightly different definitions of the heavy-tailedness parameter, which I will continue to represent as \(\alpha\). In parametric t-SNE, van der Maaten used a general t-distribution:

\[w_{ij} = \frac{1}{\left(1 + f_{ij} / \alpha \right)^{\left(\alpha + 1\right)/2}}\]

“twice” t-SNE (tt-SNE) uses:

\[w_{ij} = \frac{1}{\left(1 + f_{ij} / \alpha \right)^{\alpha / 2}}\]

Kobak, Linderman and co-workers defined the kernel as:

\[w_{ij} = \frac{1}{\left(1 + f_{ij} / \alpha \right)^\alpha}\]

Inhomogeneous t-SNE

Inhomogeneous t-SNE defines yet another heavy-tailed kernel, using nearly the same definition as in parametric t-SNE. I’ll swap to the notation that this paper uses for the degrees of freedom, \(\nu\):

\[w_{ij} = \left(1 + \frac{f_{ij}}{\nu_{i}}\right)^{-\left(\nu_{i} + 1\right)/2}\]

where \(\nu_{i} = \infty\) gives SNE-like behavior and \(\nu_{i} = 1\) gives t-SNE behavior. Th