Nearest neighbour (NN) search is the major problem
in this project. The nearest neighbour of one image is
another image which is most similar to it. Hence, e.g.
two images with the same content in different sizes
are identical.
The naive approach to perform a NN search is by linear search which involves the comparison of all
images with the current one. But due to the large
amount of images (>1.5mio) and still a lot of
dimensions per image, it's a very slow approach.
Although the number of dimensions is significantly reduced by using image descriptors instead of whole images, there are still to much dimensions for applying common geometrical datastructures to solve this problem (e.g. kD-Trees, see slides). Those difficulties are know by the term . Condensed, two major issues are:
Curse of
Dimensionality- In high dimensional space the volume grows
exponentially with increasing dimensions. A spherical
search range covers less of its surrounding cube.
- This results in spatial indexing becoming linear, starting already at a low dimensionality threshold (12-20).
- The average distance distribution becomes
narrower.
- The nearest neighbour of a point in high dimensional space is only very little nearer to it than any other point.
Due to these facts, an
nearest neighbour search is used. It does not find the
exact nearest neighbour but one which is very close to
it, though approximateclose enough. The algorithm of
choice is
(LSH). It finds the approximate nearest neighbours in
sublinear time.
locality sensitive hashingThe general idea of hashing is to avoid collisions. The idea of LSH is to exploit collisions for mapping points which are nearby (in geometrical sense) into the same bucket. Unfortunately, the data points (image descriptors) consist of a high number of dimensions but hash tables are one-dimensional. Therefore, data points are projected onto one dimension (i.e. a line) several times from different "view points". This requires several hash tables. By combined projection and hashing, each hash table contains all data points according to a particular projection. To perform a search query, all hypothetical buckets of the query point are determined. The data points contained therein are the approximate nearest neighbours of the query point. They are in random order and have to be sorted to determine the one with the nearest distance to the query point. For further details see presentation slides. |