Run queries against a "forest" of Random Projection Trees (Dasgupta and Freund, 2008), to return nearest neighbors taken from the reference data used to build the forest.
Usage
rpf_knn_query(
query,
reference,
forest,
k,
cache = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R"
)
Arguments
- query
Matrix of
n
query items, with observations in the rows and features in the columns. Optionally, the data may be passed with the observations in the columns, by settingobs = "C"
, which should be more efficient. Thereference
data must be passed in the same orientation asquery
. Possible formats arebase::data.frame()
,base::matrix()
orMatrix::sparseMatrix()
. Sparse matrices should be indgCMatrix
format. Dataframes will be converted tonumerical
matrix format internally, so if your data columns arelogical
and intended to be used with the specialized binarymetric
s, you should convert it to a logical matrix first (otherwise you will get the slower dense numerical version).- reference
Matrix of
m
reference items, with observations in the rows and features in the columns. The nearest neighbors to the queries are calculated from this data and should be the same data used to build theforest
. Optionally, the data may be passed with the observations in the columns, by settingobs = "C"
, which should be more efficient. Thequery
data must be passed in the same format and orientation asreference
. Possible formats arebase::data.frame()
,base::matrix()
orMatrix::sparseMatrix()
. Sparse matrices should be indgCMatrix
format.- forest
A random partition forest, created by
rpf_build()
, representing partitions of the data inreference
.- k
Number of nearest neighbors to return. You are unlikely to get good results if you choose a value substantially larger than the value of
leaf_size
used to build theforest
.- cache
if
TRUE
(the default) then candidate indices found in the leaves of the forest are cached to avoid recalculating the same distance repeatedly. This incurs an extra memory cost which scales withn_threads
. Set this toFALSE
to disable distance caching.- n_threads
Number of threads to use. Note that the parallelism in the search is done over the observations in
query
not the trees in theforest
. Thus a single observation will not see any speed-up from increasingn_threads
.- verbose
If
TRUE
, log information to the console.- obs
set to
"C"
to indicate that the inputdata
orientation stores each observation as a column. The default"R"
means that observations are stored in each row. Storing the data by row is usually more convenient, but internally your data will be converted to column storage. Passing it already column-oriented will save some memory and (a small amount of) CPU usage.
Value
the approximate nearest neighbor graph as a list containing:
idx
an n by k matrix containing the nearest neighbor indices.dist
an n by k matrix containing the nearest neighbor distances.
k
neighbors per observation are not guaranteed to be found. Missing data
is represented with an index of 0
and a distance of NA
.
References
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452 .
Examples
# Build a forest of 10 trees from the odd rows
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_odd_forest <- rpf_build(iris_odd, n_trees = 10)
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_even_nn <- rpf_knn_query(
query = iris_even, reference = iris_odd,
forest = iris_odd_forest, k = 15
)