k_occur returns a vector of the k-occurrences of a nearest neighbor graph
as defined by Radovanovic and co-workers (2010). The k-occurrence of an
object is the number of times it occurs among the k-nearest neighbors of
objects in a dataset.
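To make the definition concrete, here is a minimal base-R sketch of the same counting on a small, hypothetical index matrix (an illustration only, not the implementation used by k_occur):

# hypothetical 4 x 2 index matrix: row i lists the labels of the 2 nearest
# neighbors of object i
idx <- matrix(c(2, 3,
                1, 3,
                2, 1,
                2, 3), ncol = 2, byrow = TRUE)
# count how often each of the 4 objects appears anywhere in idx
# (assuming here that the labels refer to the rows of idx)
tabulate(as.vector(idx), nbins = nrow(idx))
# gives c(2, 3, 3, 0): objects 2 and 3 each occur three times,
# object 4 never occurs (an "anti-hub")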
Arguments
- idx
integer matrix containing the nearest neighbor indices, with integer labels starting at 1. Note that the integer labels do not have to refer to the rows of idx, for example if the nearest neighbor result comes from querying one set of objects with respect to another (for instance from running graph_knn_query()). You may also pass a nearest neighbor graph object (e.g. the output of running nnd_knn()), and the indices will be extracted from it, or a sparse matrix in the same format as that returned by prepare_search_graph().
- k
The number of closest neighbors to use. Must be between 1 and the number of columns in idx. By default, all columns of idx are used. Ignored if idx is sparse.
- include_self
logical indicating whether the label i in idx is considered a valid neighbor when found in row i. By default this is TRUE. Set this to FALSE when the labels in idx refer to the row indices of idx, as in the case of results from nnd_knn(); in that case you may not want to count the trivial case of an object being a neighbor of itself. In all other cases leave this set to TRUE (a sketch combining k and include_self follows this list).
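As a sketch of how the k and include_self arguments might be combined (assuming nnd_knn() is run on iris as in the Examples below; the argument values used here are purely illustrative):

iris_nnd <- nnd_knn(iris, k = 15)
# labels from nnd_knn() refer to the rows of idx, so drop the trivial
# self-neighbor and count occurrences among only the 10 closest neighbors
iris_ko10 <- k_occur(iris_nnd$idx, k = 10, include_self = FALSE)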
Value
a vector of length max(idx). Element i contains the number of times the
object with label i was found in the nearest neighbor lists of the objects
represented by the rows of idx.
Details
The k-occurrence can take values between 0 and the size of the dataset. The larger the k-occurrence for an object, the more "popular" it is. Very large values of the k-occurrence (much larger than k) indicate that an object is a "hub" and also imply the existence of "anti-hubs": objects that never appear among the k-nearest neighbors of other objects.
The presence of hubs can reduce the accuracy of nearest neighbor descent and other approximate nearest neighbor algorithms in terms of retrieving the exact k-nearest neighbors. However, the appearance of hubs can still be detected in these approximate results, so calculating the k-occurrences for the output of nearest neighbor descent is a useful diagnostic step.
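As an illustrative diagnostic along these lines (using iris_nbrs from the Examples below; the 2 * k threshold for flagging hubs is a heuristic choice, not a rule taken from the references):

k <- 15
ko <- k_occur(iris_nbrs$idx, k = k)
# with the default include_self = TRUE an object always counts itself,
# so a k-occurrence of 1 means it is never a neighbor of any other object
sum(ko == 1)    # number of anti-hubs
sum(ko > 2 * k) # objects far more "popular" than k: possible hubs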
References
Radovanovic, M., Nanopoulos, A., & Ivanovic, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11, 2487-2531. https://www.jmlr.org/papers/v11/radovanovic10a.html
Bratic, B., Houle, M. E., Kurbalija, V., Oria, V., & Radovanovic, M. (2019). The Influence of Hubness on NN-Descent. International Journal on Artificial Intelligence Tools, 28(06), 1960002. doi:10.1142/S0218213019600029
Examples
iris_nbrs <- brute_force_knn(iris, k = 15)
iris_ko <- k_occur(iris_nbrs$idx)
# items 42 and 107 are not among the 15 nearest neighbors of any other
# member of iris
which(iris_ko == 1) # they are only their own nearest neighbor
#> [1] 42 107
max(iris_ko) # most "popular" item appears on 29 15-nearest neighbor lists
#> [1] 29
which(iris_ko == max(iris_ko)) # it's iris item 64
#> [1] 64
# with k = 15, a maximum k-occurrence = 29 ~= 1.9 * k, which is not a cause
# for concern
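# a further (unevaluated) sketch, following the diagnostic suggested in
# Details: run the same check on an approximate result from nearest
# neighbor descent; exact values will vary from run to run
iris_nnd <- nnd_knn(iris, k = 15)
iris_nnd_ko <- k_occur(iris_nnd$idx)
max(iris_nnd_ko) # a value much larger than k would suggest hubness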