k_occur returns a vector of the k-occurrences of a nearest neighbor graph
as defined by Radovanovic and co-workers (2010). The k-occurrence of an
object is the number of times it occurs among the k-nearest neighbors of
objects in a dataset.
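To make the definition concrete, here is a minimal base-R sketch of the same counting on a small, hypothetical index matrix (an illustration only, not the implementation used by k_occur):

# hypothetical 4 x 2 index matrix: row i lists the labels of the 2 nearest
# neighbors of object i
idx <- matrix(c(2, 3,
                1, 3,
                2, 1,
                2, 3), ncol = 2, byrow = TRUE)
# count how often each of the 4 objects appears anywhere in idx
# (assuming here that the labels refer to the rows of idx)
tabulate(as.vector(idx), nbins = nrow(idx))
# gives c(2, 3, 3, 0): objects 2 and 3 each occur three times,
# object 4 never occurs (an "anti-hub")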
Arguments
- idx
integer matrix containing the nearest neighbor indices, with integer labels starting at 1. Note that the integer labels do not have to refer to the rows of idx, for example if the nearest neighbor result comes from querying one set of objects with respect to another (for instance from running graph_knn_query()). You may also pass a nearest neighbor graph object (e.g. the output of running nnd_knn()), and the indices will be extracted from it, or a sparse matrix in the same format as that returned by prepare_search_graph().
- k
The number of closest neighbors to use. Must be between 1 and the number of columns in idx. By default, all columns of idx are used. Ignored if idx is sparse.
- include_self
logical indicating whether the label i in idx is considered a valid neighbor when found in row i. By default this is TRUE. Set this to FALSE when the labels in idx refer to the row indices of idx, as in the case of results from nnd_knn(); in that case you may not want to count the trivial case of an object being a neighbor of itself. In all other cases leave this set to TRUE (a sketch combining k and include_self follows this list).
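As a sketch of how the k and include_self arguments might be combined (assuming nnd_knn() is run on iris as in the Examples below; the argument values used here are purely illustrative):

iris_nnd <- nnd_knn(iris, k = 15)
# labels from nnd_knn() refer to the rows of idx, so drop the trivial
# self-neighbor and count occurrences among only the 10 closest neighbors
iris_ko10 <- k_occur(iris_nnd$idx, k = 10, include_self = FALSE)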
Value
a vector of length max(idx). Element i contains the number of times the
object with label i was found in the nearest neighbor lists of the objects
represented by the rows of idx.
Details
The k-occurrence can take values between 0 and the size of the dataset. The larger the k-occurrence for an object, the more "popular" it is. Very large values of the k-occurrence (much larger than k) indicate that an object is a "hub" and also imply the existence of "anti-hubs": objects that never appear among the k-nearest neighbors of other objects.
The presence of hubs can reduce the accuracy of nearest neighbor descent and other approximate nearest neighbor algorithms in terms of retrieving the exact k-nearest neighbors. However, the appearance of hubs can still be detected in these approximate results, so calculating the k-occurrences for the output of nearest neighbor descent is a useful diagnostic step.
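As an illustrative diagnostic along these lines (using iris_nbrs from the Examples below; the 2 * k threshold for flagging hubs is a heuristic choice, not a rule taken from the references):

k <- 15
ko <- k_occur(iris_nbrs$idx, k = k)
# with the default include_self = TRUE an object always counts itself,
# so a k-occurrence of 1 means it is never a neighbor of any other object
sum(ko == 1)    # number of anti-hubs
sum(ko > 2 * k) # objects far more "popular" than k: possible hubs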
References
Radovanovic, M., Nanopoulos, A., & Ivanovic, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11, 2487-2531. https://www.jmlr.org/papers/v11/radovanovic10a.html
Bratic, B., Houle, M. E., Kurbalija, V., Oria, V., & Radovanovic, M. (2019). The Influence of Hubness on NN-Descent. International Journal on Artificial Intelligence Tools, 28(06), 1960002. doi:10.1142/S0218213019600029
Examples
iris_nbrs <- brute_force_knn(iris, k = 15)
iris_ko <- k_occur(iris_nbrs$idx)
# items 42 and 107 are not among the 15 nearest neighbors of any other
# member of iris
which(iris_ko == 1) # they are only their own nearest neighbor
#> [1] 42 107
max(iris_ko) # most "popular" item appears on 29 15-nearest neighbor lists
#> [1] 29
which(iris_ko == max(iris_ko)) # it's iris item 64
#> [1] 64
# with k = 15, a maximum k-occurrence = 29 ~= 1.9 * k, which is not a cause
# for concern
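# a further (unevaluated) sketch, following the diagnostic suggested in
# Details: run the same check on an approximate result from nearest
# neighbor descent; exact values will vary from run to run
iris_nnd <- nnd_knn(iris, k = 15)
iris_nnd_ko <- k_occur(iris_nnd$idx)
max(iris_nnd_ko) # a value much larger than k would suggest hubness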