k_occur
returns a vector of the k-occurrences of a nearest neighbor graph
as defined by Radovanovic and co-workers (2010). The k-occurrence of an
object is the number of times it occurs among the k-nearest neighbors of
objects in a dataset.
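At its simplest, the k-occurrence is a count over the entries of the index matrix. The following is a minimal sketch of that count, assuming idx is a 1-based integer matrix of neighbor indices; raw_k_occurrence() is a hypothetical helper, not part of this package, and it ignores the k and include_self arguments described below.

raw_k_occurrence <- function(idx) {
  # count how many times each label 1..max(idx) appears anywhere in idx
  tabulate(as.vector(idx), nbins = max(idx))
}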
Arguments
- idx: integer matrix containing the nearest neighbor indices, with integer labels starting at 1. Note that the integer labels do not have to refer to the rows of idx, for example if the nearest neighbor result is from querying one set of objects with respect to another (for instance from running graph_knn_query()). You may also pass a nearest neighbor graph object (e.g. the output of running nnd_knn()), and the indices will be extracted from it, or a sparse matrix in the same format as that returned by prepare_search_graph().
- k: The number of closest neighbors to use. Must be between 1 and the number of columns in idx. By default, all columns of idx are used. Ignored if idx is sparse.
- include_self: logical indicating whether the label i in idx is considered to be a valid neighbor when found in row i. By default this is TRUE. This can be set to FALSE when the labels in idx refer to the row indices of idx, as in the case of results from nnd_knn(). In that case you may not want to count the trivial case of an object being a neighbor of itself. In all other cases leave this set to TRUE. A short usage sketch follows this list.
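A hedged sketch of how these arguments can be combined; the choice of data and parameter values here is purely illustrative:

# neighbor indices for the numeric columns of iris
nbrs <- brute_force_knn(iris[, 1:4], k = 10)
# count occurrences over the 5 closest neighbors only, and don't treat an
# item found in its own row as a neighbor of itself
ko <- k_occur(nbrs$idx, k = 5, include_self = FALSE)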
Value
a vector of length max(idx)
, containing the number of times an
object in idx
was found in the nearest neighbor list of the objects
represented by the row indices of idx
.
Details
The k-occurrence can take values between 0 and the size of the dataset. The larger the k-occurrence for an object, the more "popular" it is. Very large values of the k-occurrence (much larger than k) indicate that an object is a "hub" and also imply the existence of "anti-hubs": objects that never appear as k-nearest neighbors of other objects.
The presence of hubs can reduce the accuracy of nearest neighbor descent and other approximate nearest neighbor algorithms in terms of retrieving the exact k-nearest neighbors. However, the appearance of hubs can still be detected in these approximate results, so calculating the k-occurrences for the output of nearest neighbor descent is a useful diagnostic step.
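As a sketch of that diagnostic step (the comparison of the maximum k-occurrence against a multiple of k is an illustrative rule of thumb, not a threshold defined by this package):

k <- 15
approx_nbrs <- nnd_knn(iris[, 1:4], k = k)
ko <- k_occur(approx_nbrs$idx, include_self = FALSE)
sum(ko == 0)      # anti-hubs: objects that are never anyone else's neighbor
which(ko > 5 * k) # candidate hubs: objects that are "too popular" relative to k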
References
Radovanovic, M., Nanopoulos, A., & Ivanovic, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11, 2487-2531. https://www.jmlr.org/papers/v11/radovanovic10a.html
Bratic, B., Houle, M. E., Kurbalija, V., Oria, V., & Radovanovic, M. (2019). The Influence of Hubness on NN-Descent. International Journal on Artificial Intelligence Tools, 28(06), 1960002. doi:10.1142/S0218213019600029
Examples
iris_nbrs <- brute_force_knn(iris, k = 15)
iris_ko <- k_occur(iris_nbrs$idx)
# items 42 and 107 are not among the 15 nearest neighbors of any other
# members of iris
which(iris_ko == 1) # they are only their own nearest neighbor
#> [1] 42 107
max(iris_ko) # most "popular" item appears on 29 15-nearest neighbor lists
#> [1] 29
which(iris_ko == max(iris_ko)) # it's iris item 64
#> [1] 64
# with k = 15, a maximum k-occurrence = 29 ~= 1.9 * k, which is not a cause
# for concern
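# For contrast, a sketch of how hubness tends to show up in higher-dimensional
# data; the random data and dimensions here are purely illustrative
set.seed(42)
high_dim <- matrix(rnorm(1000 * 100), nrow = 1000)
hd_nbrs <- brute_force_knn(high_dim, k = 15)
hd_ko <- k_occur(hd_nbrs$idx)
max(hd_ko) / 15 # hub-like items with k-occurrence well above k tend to appear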