Keep the best trees in a random projection forest

Reduce the size of a random projection forest by scoring each tree against a k-nearest neighbors graph. Only the n_trees best-scoring trees are retained, which allows for faster querying.

Usage

rpf_filter(nn, forest = NULL, n_trees = 1, n_threads = 0, verbose = FALSE)

Arguments

nn: Nearest neighbor data in the dense list format. This should be derived from the same data that was used to build the forest.
forest: A random projection forest, e.g. created by rpf_build(), representing partitions of the same underlying data reflected in nn. As a convenience, this parameter is ignored if the nn list contains a forest entry, e.g. from running rpf_knn() or nnd_knn() with ret_forest = TRUE. In that case the forest value is extracted from nn.
n_trees: The number of trees to retain. By default only the best-scoring tree is retained.
n_threads: Number of threads to use.
verbose: If TRUE, log information to the console.

Value

A forest containing the n_trees best-scoring trees.

Details

Trees are scored based on how well each leaf reflects the neighbors specified in the nearest neighbor data. Use the most accurate nearest neighbor data you can. It does not need to come directly from searching the forest: for example, the output of nnd_knn(), when used to optimize the neighbor data produced by an RP Forest, is a good choice.
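The refinement workflow described above might look like the following sketch. It assumes the functions come from the rnndescent package, that the forest is stored under the forest entry of the rpf_knn() result, and that nnd_knn() accepts a list of idx and dist matrices as init. The data set and the values of k and n_trees are illustrative only.

```r
library(rnndescent)

# Numeric columns of the built-in iris data (150 rows, 4 columns)
X <- as.matrix(iris[, 1:4])

# Approximate neighbors from a 10-tree forest; keep the forest for filtering
rpf_nn <- rpf_knn(X, k = 15, n_trees = 10, ret_forest = TRUE)

# Refine the forest's neighbor data with nearest neighbor descent.
# Assumption: init takes a list with idx and dist matrices.
nnd_nn <- nnd_knn(
  X,
  k = 15,
  init = list(idx = rpf_nn$idx, dist = rpf_nn$dist)
)

# Score each tree against the refined neighbors and keep the best two.
# nnd_nn carries no forest entry here, so the forest is passed explicitly.
best_forest <- rpf_filter(nnd_nn, forest = rpf_nn$forest, n_trees = 2)
```

Passing the forest explicitly keeps the scoring data (nnd_nn) and the trees being scored (rpf_nn$forest) clearly separate.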

Rather than relying solely on an RP Forest for approximate nearest neighbor querying, it is probably more cost-effective to use a small number of trees to initialize the neighbor list for a graph search via graph_knn_query().
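One way this could be set up is sketched below. It assumes that graph_knn_query() accepts a forest as its init argument and a list of idx and dist matrices as its reference_graph argument; check the graph_knn_query() documentation before relying on either. The split of iris into reference and query rows is purely illustrative.

```r
library(rnndescent)

# Illustrative split: odd rows are the reference data, even rows the queries
X <- as.matrix(iris[, 1:4])
odd <- seq_len(nrow(X)) %% 2 == 1
reference <- X[odd, ]
query <- X[!odd, ]

# Neighbor graph of the reference data, keeping the forest that produced it
ref_nn <- rpf_knn(reference, k = 15, n_trees = 10, ret_forest = TRUE)

# Retain only the best-scoring tree (the default n_trees = 1)
small_forest <- rpf_filter(ref_nn)

# Graph search over the reference neighbor graph.
# Assumption: a forest passed as init seeds the query neighbor lists.
query_nn <- graph_knn_query(
  query = query,
  reference = reference,
  reference_graph = list(idx = ref_nn$idx, dist = ref_nn$dist),
  k = 15,
  init = small_forest
)
```

Here the single retained tree only supplies starting candidates, and the graph search does the remaining work of improving them.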

Examples
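
A minimal sketch of typical usage, assuming the rnndescent package is attached. The iris data, k = 15, and the tree counts are illustrative choices rather than recommendations.

```r
library(rnndescent)

# Build a neighbor graph and a forest of 10 trees from the odd rows of iris
iris_num <- as.matrix(iris[, 1:4])
iris_odd <- iris_num[seq_len(nrow(iris_num)) %% 2 == 1, ]
iris_odd_nn <- rpf_knn(iris_odd, k = 15, n_trees = 10, ret_forest = TRUE)

# Keep the best 2 trees; the forest is taken from the forest entry of nn
iris_odd_filtered_forest <- rpf_filter(iris_odd_nn, n_trees = 2)

# Use the even rows as new data to search for
iris_even <- iris_num[seq_len(nrow(iris_num)) %% 2 == 0, ]

# Query the reference data using only the filtered forest
iris_even_nn <- rpf_knn_query(
  query = iris_even,
  reference = iris_odd,
  forest = iris_odd_filtered_forest,
  k = 15
)
```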