This function builds an approximate nearest neighbors graph with convenient
defaults, then prepares the index for querying new data, for later use with
rnnd_query()
. For more control over the process, please see the other
functions in the package.
Usage
rnnd_build(
data,
k = 30,
metric = "euclidean",
use_alt_metric = TRUE,
init = "tree",
n_trees = NULL,
leaf_size = NULL,
max_tree_depth = 200,
margin = "auto",
n_iters = NULL,
delta = 0.001,
max_candidates = NULL,
low_memory = TRUE,
weight_by_degree = FALSE,
n_search_trees = 1,
pruning_degree_multiplier = 1.5,
diversify_prob = 1,
prune_reverse = FALSE,
n_threads = 0,
verbose = FALSE,
progress = "bar",
obs = "R"
)
Arguments
- data
Matrix of
n
items to generate neighbors for, with observations in the rows and features in the columns. Optionally, input can be passed with observations in the columns, by settingobs = "C"
, which should be more efficient. Possible formats arebase::data.frame()
,base::matrix()
orMatrix::sparseMatrix()
. Sparse matrices should be indgCMatrix
format. Dataframes will be converted tonumerical
matrix format internally, so if your data columns arelogical
and intended to be used with the specialized binarymetric
s, you should convert it to a logical matrix first (otherwise you will get the slower dense numerical version).- k
Number of nearest neighbors to build the index for. You can specify a different number when running
rnnd_query
, but the index is calibrated using this value so it's recommended to setk
according to the likely number of neighbors you will want to retrieve. Optional ifinit
is specified, in which casek
can be inferred from theinit
data. If you do both, then the specified version ofk
will take precedence.- metric
Type of distance calculation to use. One of:
"braycurtis"
"canberra"
"chebyshev"
"correlation"
(1 minus the Pearson correlation)"cosine"
"dice"
"euclidean"
"hamming"
"hellinger"
"jaccard"
"jensenshannon"
"kulsinski"
"sqeuclidean"
(squared Euclidean)"manhattan"
"rogerstanimoto"
"russellrao"
"sokalmichener"
"sokalsneath"
"spearmanr"
(1 minus the Spearman rank correlation)"symmetrickl"
(symmetric Kullback-Leibler divergence)"tsss"
(Triangle Area Similarity-Sector Area Similarity or TS-SS metric)"yule"
For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions:
"cosine-preprocess"
:cosine
with preprocessing."correlation-preprocess"
:correlation
with preprocessing.
For non-sparse binary data passed as a
logical
matrix, the following metrics have specialized variants which should be substantially faster than the non-binary variants (in other cases the logical data will be treated as a dense numeric vector of 0s and 1s):"dice"
"hamming"
"jaccard"
"kulsinski"
"matching"
"rogerstanimoto"
"russellrao"
"sokalmichener"
"sokalsneath"
"yule"
- use_alt_metric
If
TRUE
, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances if usingmetric = "euclidean"
), then apply a correction at the end. Probably the only reason to set this toFALSE
is if you suspect that some sort of numeric issue is occurring with your data in the alternative code path.- init
Name of the initialization strategy or initial
data
neighbor graph to optimize. One of:"rand"
random initialization (the default)."tree"
use the random projection tree method of Dasgupta and Freund (2008).a pre-calculated neighbor graph. A list containing:
idx
ann
byk
matrix containing the nearest neighbor indices.dist
(optional) ann
byk
matrix containing the nearest neighbor distances. If the input distances are omitted, they will be calculated for you.'
If
k
andinit
are specified as arguments to this function, and the number of neighbors provided ininit
is not equal tok
then:if
k
is smaller, only thek
closest values ininit
are retained.if
k
is larger, then random neighbors will be chosen to fillinit
to the size ofk
. Note that there is no checking if any of the random neighbors are duplicates of what is already ininit
so effectively fewer thank
neighbors may be chosen for some observations under these circumstances.
- n_trees
The number of trees to use in the RP forest. A larger number will give more accurate results at the cost of a longer computation time. The default of
NULL
means that the number is chosen based on the number of observations indata
. Only used ifinit = "tree"
.- leaf_size
The maximum number of items that can appear in a leaf. This value should be chosen to match the expected number of neighbors you will want to retrieve when running queries (e.g. if you want find 50 nearest neighbors set
leaf_size = 50
) and should not be set to a value smaller than10
. Only used ifinit = "tree"
.- max_tree_depth
The maximum depth of the tree to build (default = 200). If the maximum tree depth is exceeded then the leaf size of a tree may exceed
leaf_size
which can result in a large number of neighbor distances being calculated. Ifverbose = TRUE
a message will be logged to indicate that the leaf size is large. However, increasing themax_tree_depth
may not help: it may be that there is something unusual about the distribution of your data set under your chosemetric
that makes a tree-based initialization inappropriate. Only used ifinit = "tree"
.- margin
A character string specifying the method used to assign points to one side of the hyperplane or the other. Possible values are:
"explicit"
categorizes all distance metrics as either Euclidean or Angular (Euclidean after normalization), explicitly calculates a hyperplane and offset, and then calculates the margin based on the dot product with the hyperplane."implicit"
calculates the distance from a point to each of the points defining the normal vector. The margin is calculated by comparing the two distances: the point is assigned to the side of the hyperplane that the normal vector point with the closest distance belongs to."auto"
(the default) picks the margin method depending on whether a binary-specificmetric
such as"bhammming"
is chosen, in which case"implicit"
is used, and"explicit"
otherwise: binary-specific metrics involve storing the data in a way that isn't very efficient for the"explicit"
method and the binary-specific metric is usually a lot faster than the generic equivalent such that the cost of two distance calculations for the margin method is still faster.
Only used if
init = "tree"
.- n_iters
Number of iterations of nearest neighbor descent to carry out. By default, this will be chosen based on the number of observations in
data
.- delta
The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. Default value of
0.001
means that at least 0.1% of the neighbor graph must be updated at each iteration.- max_candidates
Maximum number of candidate neighbors to try for each item in each iteration. Use relative to
k
to emulate the "rho" sampling parameter in the nearest neighbor descent paper. By default, this is set tok
or60
, whichever is smaller.- low_memory
If
TRUE
, use a lower memory, but more computationally expensive approach to index construction. If set toFALSE
, you should see a noticeable speed improvement, especially when using a smaller number of threads, so this is worth trying if you have the memory to spare.- weight_by_degree
If
TRUE
, then candidates for the local join are weighted according to their in-degree, so that if there are more thanmax_candidates
in a candidate list, candidates with a smaller degree are favored for retention. This prevents items with large numbers of edges crowding out other items and for high-dimensional data is likely to provide a small improvement in accuracy. Because this incurs a small extra cost of counting the degree of each node, and because it tends to delay early convergence, by default this isFALSE
.- n_search_trees,
the number of trees to keep in the search forest as part of index preparation. The default is
1
.- pruning_degree_multiplier
How strongly to truncate the final neighbor list for each item. The neighbor list of each item will be truncated to retain only the closest
d
neighbors, whered = k * pruning_degree_multiplier
, andk
is the original number of neighbors per item ingraph
. Roughly, values larger than1
will keep all the nearest neighbors of an item, plus the given fraction of reverse neighbors (if they exist). For example, setting this to1.5
will keep all the forward neighbors and then half as many of the reverse neighbors, although exactly which neighbors are retained is also dependent on any occlusion pruning that occurs. Set this toNULL
to skip this step.- diversify_prob
the degree of diversification of the search graph by removing unnecessary edges through occlusion pruning. This should take a value between
0
(no diversification) and1
(remove as many edges as possible) and is treated as the probability of a neighbor being removed if it is found to be an "occlusion". If itemp
andq
, two members of the neighbor list of itemi
, are closer to each other than they are toi
, then the nearer neighborp
is said to "occlude"q
. It is likely thatq
will be in the neighbor list ofp
so there is no need to retain it in the neighbor list ofi
. You may also set this toNULL
to skip any occlusion pruning. Note that occlusion pruning is carried out twice, once to the forward neighbors, and once to the reverse neighbors.- prune_reverse
If
TRUE
, prune the reverse neighbors of each item before the reverse graph diversification step usingpruning_degree_multiplier
. Because the number of reverse neighbors can be much larger than the number of forward neighbors, this can help to avoid excessive computation during the diversification step, with little overall effect on the final search graph. Default isFALSE
.- n_threads
Number of threads to use.
- verbose
If
TRUE
, log information to the console.- progress
Determines the type of progress information logged during the nearest neighbor descent stage when
verbose = TRUE
. Options are:"bar"
: a simple text progress bar."dist"
: the sum of the distances in the approximate knn graph at the end of each iteration.
- obs
set to
"C"
to indicate that the inputdata
orientation stores each observation as a column. The default"R"
means that observations are stored in each row. Storing the data by row is usually more convenient, but internally your data will be converted to column storage. Passing it already column-oriented will save some memory and (a small amount of) CPU usage.
Value
the approximate nearest neighbor index, a list containing:
graph
the k-nearest neighbor graph, a list containing:idx
an n by k matrix containing the nearest neighbor indices.dist
an n by k matrix containing the nearest neighbor distances.
Other list items are intended only for internal use by other functions such as
rnnd_query()
.
Details
The process of k-nearest neighbor graph construction using Random Projection Forests (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor Descent (Dong and co-workers, 2011) for refinement. Index preparation, uses the graph diversification method of Harwood and Drummond (2016).
References
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452 .
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487 .
Harwood, B., & Drummond, T. (2016). Fanng: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5713-5722).