Nearest neighbor problem
Nearest neighbour (NN) search is the major problem in this project. The nearest neighbour of one image is another image which is most similar to it. Hence, e.g. two images with the same content in different sizes are identical.

The naive approach to perform a NN search is by linear search which involves the comparison of all images with the current one. But due to the large amount of images (>1.5mio) and still a lot of dimensions per image, it's a very slow approach.

Although the number of dimensions is significantly reduced by using image descriptors instead of whole images, there are still to much dimensions for applying common geometrical datastructures to solve this problem (e.g. kD-Trees, see slides).

Those difficulties are know by the term Curse of Dimensionality. Condensed, two major issues are:
  • In high dimensional space the volume grows exponentially with increasing dimensions. A spherical search range covers less of its surrounding cube.
    • This results in spatial indexing becoming linear, starting already at a low dimensionality threshold (12-20).
  • The average distance distribution becomes narrower.
    • The nearest neighbour of a point in high dimensional space is only very little nearer to it than any other point.
See the slides for more details.

Due to these facts, an approximate nearest neighbour search is used. It does not find the exact nearest neighbour but one which is very close to it, though close enough. The algorithm of choice is locality sensitive hashing (LSH). It finds the approximate nearest neighbours in sublinear time.



lsh


The general idea of hashing is to avoid collisions. The idea of LSH is to exploit collisions for mapping points which are nearby (in geometrical sense) into the same bucket. Unfortunately, the data points (image descriptors) consist of a high number of dimensions but hash tables are one-dimensional. Therefore, data points are projected onto one dimension (i.e. a line) several times from different "view points".

lsh


This requires several hash tables. By combined projection and hashing, each hash table contains all data points according to a particular projection.

To perform a search query, all hypothetical buckets of the query point are determined. The data points contained therein are the approximate nearest neighbours of the query point. They are in random order and have to be sorted to determine the one with the nearest distance to the query point.

For further details see presentation slides.