The task definition
N.B. Here is demo app with source code that cover some moments from article http://comparerf8a6d5.herokuapp.com
When it comes to the development of various aggregators of basically any items (goods, service, real estate etc), the problem of duplicates of the same item is always there. For example, it can be duplicates of an apartment photo, product photo etc. It happens because of the necessity to aggregate the same types of catalog items from different providers.
As a result, such subtasks arise: automatic defining of image duplicates while importing new data for the current object and search of the present object by an image associated with it and others.
The following article covers the solution of the task of defining such duplicates.
General Idea
Obviously, comparing images by value of each pixel is quite time consuming, technically complicated and in general doesn’t make any sense. The estimation of complexity for such a comparison constitutes O(N2). Moreover, unlike other types of data such as archives, documents, and binary files, images can be changed in some way. For instance they can be a different size, have watermarks, or have a different quality in the event of compression like JPEG, but despite these changes be “similar” for the end user.
Thus it makes sense to try to build some bijective relation between an image and something that is convenient to work with, in particular to save, compare, search etc. As this idea is not new, so called perceptual hashes have been used for a long time already.
Perceptual Hashes
Hash functions are transformation functions that allow getting an imprint (stamp) of a fixed length for initial data. Working with modern classic cryptographic hash algorithms (e.g. MD5 or SHA1) results in having different initial data, no matter how little it differs from the other. We receive the hashvalues which are different between each other to the highest extent.
In this aspect, perceptual hash algorithms generate hashes used for comparison of initial data. The more similar the generated hashes are to each other (in certain space with certain measures), the more similar the initial data is.
In such a way, we receive a number of advantages:

Memory economy (saving hashes is cheaper than keeping the initial images)

Economy of computing power (we can organize hashes into different structures such as KDTree, MVPTree and others, and then perform the operations with them with complexity
O(N)
, a linear one) 
The opportunity to use the existing functionality (methods, libraries, tools) for data analysis.
Some opensource solutions
Typical images processing, which complicate the work with hashes
Very often, the same image obtained from different sources can be changed in the way that is reflected in the resulted hashes. The most widespread are

Scaling (including changing of images proportions)

Cutting edges

Change of brightness or contrast (including color corrections)

Various affine transformations (like rotations etc).
Each algorithm of hashes calculations has certain sensitivity towards different transformations (see below), but universal algorithm doesn’t exist yet (at least, no one has informed about it publicly).
Famous hashalgorithms
Algorithms for perpetual hashes calculations differ from each other with sensitivity to certain distortion changes: change of size, proportion between the sites, color characteristics (brightness, contrast, gamma), watermarks etc. Let’s review the most popular of them.
aHash
(Average Hash)
This algorithm is quite fast but not sensitive to such transformations like scaling of initial image, compressing and stretching, brightness and contrast. It is based on the average value, and, as a result, is sensitive to the operations that change this average value (for instance, change of levels or color balance).
1.To build aHash one should perform the following steps:
1. Decreasing the image size. The initial image is compressed to some reasonable size (usually it is 8x8 or 16x16 points which will show a 64 or 256 bites respectively). The image size influences the comparison accuracy and the speed of the algorithm. The larger the image, the higher the accuracy of the comparison, although in the last case it would take more time for computing. Scaling can be performed without following the proportions. Thus, received hash will correspond with all the images variants despite their proportions. If we consider the image a discrete signal, where high frequencies which provide specification of image and high frequencies show its structure. Decreasing the image size deletes the high frequency. The image obtained consists of mainly low frequencies, saving the whole image structure.
2. Image greyscaling. This move helps to decrease the cash size in three times as it describes the number of components from 3 values of RGB to one level of grey.
3. Computing the average value. Then the average on all the image points is calculated.
4. Simplifying the image. Every pixel gives a value of 0 if it is less than the average value and it gives a value of 1 when its value is greater than average. Thus, the image is converted into the set of the bits. It’s read line by line, and the set of values becomes the hash.
There is an example of aHash implementation in Python using Pil library
from PIL import Image class Hash(object): def __init__(self, image): self.image = Image.open(image) def ahash(self): im = self.image size = 16, 16 # resize im = im.resize(size, Image.ANTIALIAS) # convert to greyscale # L = R * 299/1000 + G * 587/1000 + B * 114/1000 im = im.convert('L') # Calc average value of pixels pixels = list(im.getdata()) average = sum(pixels) / len(pixels) result = '' for pixel in pixels: if pixel > average: result += '1' else: result += '0' return result
pHash
pHash mostly repeats the steps of aHash, but also adds another step, where discrete cosine transformation (DCT) takes place. It allows you to break the image into different parts of “importance” on the harmonicas of discrete signal. It influences the image quality (this transformation is also used when coding the images in the JPEG format).
The first two steps are almost the same as aHash except the larger image size is chosen (eg. 32x32) not to delete the high frequencies (this will happen later) but to simplify the DCT algorithm.
Then, one needs to greyscale the image and perform the DCTtransformation, which breaks the image into the basic of frequencies. In contrast to JPEG, pHash also use of 32x32 block size instead of 8x8. For 2D
matrixes this transformation appears the following way:
, where:
here:
M,N
 size of initial matrix
f(i,j)
 value of matrix (intensity of a pixel) in a row and column j
F(u,v)
 DCTcoefficient in a row k1
and column k2
of DCT matrix. These coefficients can be considered as weight coefficients of basic functions. e.g. matrix with 8x8 elements has 64 basic functions, like demonstrated on the picture:
Initial array of DCT contains integers within the range [1024, 1023]. Low frequencies are also essential for the majority of the images. They will be situated in the left upper corner of the DCTmatrix. After computing, DCT matrix is decreasing as high frequencies are filtered out, and thus we obtain a smaller block (8x8 or 16x16). Also, like in case of aHash algorithm, an average value is computed:
The next steps are the same as for aHash: the value of obtained matrix is converted to the values 1 or 0 depending whether the pixel has a lesser or bigger value than an average. The hash is built upon the image obtained.
dHash
Alike aHash and pHash, this algorithm is easy and fast to use, and moreover, functions faster and provides more precise results in comparison to the first two ones. While aHash is based on average value and pHash on frequency patterns, dHash follows the image gradient.
The first step, like in the cases with aHash and pHash algorithms, we decrease the image size. However, in this algorithm we use is not square but a rectangular matrix of an image with a size of 9x8 (mostly N + 1 x N).
Then the image is greyscaled. For each line we compute the difference between the next and previous pixels to arrive with a matrices with a size of 8x8 pixels:
The last step is hash calculation: if a value of a current pixel is less than the previous one (y[i, j] > 0
), the value of the hash is 1 or otherwise it is 0. In initial code these two steps are also merged into one, and the value of hash is obtained from the difference between the values of adjacent pixels.
This algorithm is the fastest of all mentioned above, and moreover, more precise.
Currently, we have a great number of algorithms used to compute the perceptual hashes except the ones we have described. These include:

various algorithms using wavelet transformations (separate elements are used in pHash)

algorithms using static moments

Radial variance hash
Expereince proves that to solve the task described in this article, it is enough to use aHash, dHash and pHash.
Possible approach to complete the task of image duplicates
The very first practical tests have shown that using a certain algorithm is not enough to claim that the result is precise, and that it is more efficient to use a classifier, which generates the differences between the hashes of two images under the check. The hashes should be built using different algorithms and initial data.
As a value for difference, we can use various measures:

Haming distance;

Euclidean distance;

Weighed Euclidean distance;

Chebishev distance;

The percentage of disagreement.
The choice of metrics and weights for classifying properties is a very important step as they influence the content and quantity of formed classes, as well as similarity degree of objects within a class.
As classifiers, one can also use different approaches:
static classifiers (regressive modules, in particular logistic regression);
neuron networks (perceptrons, Kochenen maps, etc.)
models based on fuzzy logics or set (fuzzy, clasterization, etc)
Received results
Using the tools described above, we have managed to obtain high quality results for our apartment aggregator http://roomlr.com. First, we decided to exploit various existing methods to the maximum extent. Various static methods turned out to be the easiest ones.
After some experiments, we decided to go with a logistic regression model as a classifier. This is indeed used to solve the foreseeing problem of continuous dependent variable, under the condition that this variable can take values within the interval 01.
According to this peculiarity, it is often used to foresee the probability of some event happening depending on the values of a certain number of predictors. That is what we need to understand not only whether we have a duplicate or not, but how confidently we can claim duplicates.
As predictors, we used the distances between hashes computed with the help of aHash, dHash and pHash.
Having experimented with it, we discovered an interesting moment. Significant improvement of the results reliability was achieved by taking not only the predictors such as distance between the hashes of original images, but also the predictors of distances computed for images which were distorted on purpose (scaling, cropping etc.).
Combining such predictors within one model allowed getting the result that solves the task.
Try and Play with it
We've made a small demo for you to play with. It's on http://comparerf8a6d5.herokuapp.com. Source code is on https://github.com/7WebPages/comparer. You are free to fork and improve results using multiple hashes and different metrics. As there only aHash with Hamming distance used.
It also includes examples of the formation of the predictors using the principles mentioned above. For their formation we use such transformaions of the original image as cropping, scaling, and combinations thereof:

scaling up to 128x128 size with keeping aspect ratio,

cropping vertically the top and bottom by 10%, and scaling up to 128x128 size with keeping aspect ratio,

cropping horizontally the top and bottom by 10%, and scaling up to 128x128 size with keeping aspect ratio,

cropping vertically and horizontally the top and bottom by 10%, and scaling up to 128x128 size with keeping aspect ratio,

transformations are the same as 14 but scaling without keeping aspect ratio.
Further for each pair of the compared images (X, Y) we form the vector of 8 elements.
The vectors are X = (x1, x2, x3, x4, x5, x6, x7, x8)
and Y = (y1, y2, y3, y4, y5, y6, y7, y8)
,
where the elements of the vector xi
and yi
are the aHash value (8x8), calculated for each of the corresponding 8converted images.
Then we calculate the resulting vector Z
as a Hamming distance between the corresponding elements of the vectors X
and Y
.
We form the training set like variety of 9component vectors (8 components  the elements of the vector Z
and 9th  is a sign: 1 if the original pair of images is a duplicate, 0  otherwise) for selection of the coefficients of logistic regression. After that, a logistic regression model with such components, we can use to detect duplicates, giving the input vector Z
and getting the output a sign of duplicate: 1yes, 0no.
For building and manipulating with logistic regression model we are used python scikitlearn library
@classmethod def predict(cls, vector): coefs = numpy.array( [ [ 0.30346249, 0.33800637, 0.30347395, 0.33800637, 0.05190433, 0.20001436, 0.07453074, 0.29136006 ] ] ) classifier = linear_model.LogisticRegression() classifier.coef_ = coefs classifier.intercept_ = numpy.array([ 1.98375232]) resutl = classifier.predict_proba(numpy.array(vector)) match = resutl[:, 1] > resutl[:, 0] return match[0]
See results below. On left size image is cropped and resized, on right side it's original version of image. There is some difference but we match them as duplicates using linear regression script.
 Demo: http://comparerf8a6d5.herokuapp.com.
 Source code: https://github.com/7WebPages/comparer.
 Real world usage: http://roomlr.com
Author: Anatolii, Senior Python Developer.
20150922