Run queries against reference data to return randomly selected neighbors. This is not a useful query method on its own, but it can be used to provide the initialization for other methods that require it, such as graph_knn_query().
Usage
random_knn_query(
  query,
  reference,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  order_by_distance = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)

Arguments
- query
Matrix of n query items, with observations in the rows and features in the columns. Optionally, the data may be passed with the observations in the columns, by setting obs = "C", which should be more efficient. The reference data must be passed in the same orientation as query. Possible formats are base::data.frame(), base::matrix() or Matrix::sparseMatrix(). Sparse matrices should be in dgCMatrix format. Data frames will be converted to numerical matrix format internally, so if your data columns are logical and intended to be used with the specialized binary metrics, you should convert the data to a logical matrix first (otherwise you will get the slower dense numerical version).
- reference
Matrix of m reference items, with observations in the rows and features in the columns. The nearest neighbors to the queries are randomly selected from this data. Optionally, the data may be passed with the observations in the columns, by setting obs = "C", which should be more efficient. The query data must be passed in the same orientation and format as reference. Possible formats are base::data.frame(), base::matrix() or Matrix::sparseMatrix(). Sparse matrices should be in dgCMatrix format.
- k
Number of nearest neighbors to return.
- metric
Type of distance calculation to use. One of:
"braycurtis""canberra""chebyshev""correlation"(1 minus the Pearson correlation)"cosine""dice""euclidean""hamming""hellinger""jaccard""jensenshannon""kulsinski""sqeuclidean"(squared Euclidean)"manhattan""rogerstanimoto""russellrao""sokalmichener""sokalsneath""spearmanr"(1 minus the Spearman rank correlation)"symmetrickl"(symmetric Kullback-Leibler divergence)"tsss"(Triangle Area Similarity-Sector Area Similarity or TS-SS metric)"yule"
For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions:
"cosine-preprocess":cosinewith preprocessing."correlation-preprocess":correlationwith preprocessing.
For non-sparse binary data passed as a logical matrix, the following metrics have specialized variants which should be substantially faster than the non-binary variants (in other cases the logical data will be treated as a dense numeric vector of 0s and 1s); see the Examples for a short sketch:
- "dice"
- "hamming"
- "jaccard"
- "kulsinski"
- "matching"
- "rogerstanimoto"
- "russellrao"
- "sokalmichener"
- "sokalsneath"
- "yule"
- use_alt_metric
If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances if using metric = "euclidean"), then apply a correction at the end. Probably the only reason to set this to FALSE is if you suspect that some sort of numeric issue is occurring with your data in the alternative code path.
- order_by_distance
If TRUE (the default), then results for each item are returned by increasing distance. If you don't need the results sorted, e.g. you are going to pass the results as initialization to another routine like graph_knn_query(), set this to FALSE to save a small amount of computational time (see the Examples).
- n_threads
Number of threads to use.
- verbose
If TRUE, log information to the console.
- obs
Set to "C" to indicate that the input query and reference store each observation as a column (the orientation must be consistent between the two). The default "R" means that observations are stored in each row. Storing the data by row is usually more convenient, but internally your data will be converted to column storage. Passing it already column-oriented will save some memory and (a small amount of) CPU usage. See the Examples for a column-oriented sketch.
Value
An approximate nearest neighbor graph as a list containing:
- idx: an n by k matrix containing the nearest neighbor indices.
- dist: an n by k matrix containing the nearest neighbor distances.
Examples
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]
# For each item in iris_query find 4 random neighbors in iris_ref
# If you pass a data frame, non-numeric columns are removed
# set verbose = TRUE to get details on the progress being made
iris_query_random_nbrs <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "euclidean", verbose = TRUE
)
#> 16:34:20 Using alt metric 'sqeuclidean' for 'euclidean'
#> 16:34:20 Generating random k-nearest neighbor graph from reference with k = 4
#> 16:34:20 Finished
# Manhattan (l1) distance
iris_query_random_nbrs <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "manhattan"
)