SIFT (Scale-Invariant Feature Transform) is one of the most popular feature matching algorithms, and it owes its popularity to several useful properties.

There are all kinds of primitive ways to do image matching; for some images, even comparing gray-scale values pixel by pixel works well. However, what if the features in the images are at different scales? What if they are at different orientations? What if the illumination differs between the images? What if part of a feature is occluded by something?

First let's look at this example (picture stolen from David G. Lowe's paper):

Say we want to find the toy train and the frog, shown in the pictures on the left, in the image on the right. All of the problems above appear; for example, more than half of the frog's body is occluded by that black object.

Using the SIFT method, we get this result:

A really effective method, and OMG! I didn't even notice the train in the green rectangle. 🙂

Now let’s talk about how this method works.

In general, the method includes these parts:

- Image pyramid
- Key points finding
- Refining key points
- Key points orientation assignment
- Local image descriptor

**IMAGE PYRAMID**

There is an article about the image pyramid on my blog, so I won't go into much detail here about what an image pyramid is and how to generate one.

Here we use a Laplacian pyramid, i.e., the differences between adjacent levels of a Gaussian pyramid. That is because in a Laplacian pyramid we can easily find the high-frequency information of the image, and the features of an image mostly live in these high-frequency parts.

In fact, Difference of Gaussian (DoG) is an approximation of Laplacian of Gaussian (LoG).

Here we actually need to build several pyramids, one for each scale of the image, and we call each pyramid an octave:

For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process repeated.

Why do we need these octaves? That is what makes the method "scale invariant": by doing this, we can find key points at different scales.

The number of octaves and scales depends on the size of your original image (Lowe suggests 4 octaves with 5 blur levels each). Furthermore, Lowe suggests doubling the size of the original image first, which produces more key points.

**KEY POINTS FINDING**

This step is easy: we have just built the DoG pyramids, and now we find the maxima/minima within each scale and across adjacent scales.

First, in each level of the DoG pyramid, we find pixels that are bigger (or smaller) than their eight adjacent pixels; then, for each of these candidate maxima/minima, we compare its value with those of its adjacent pixels in the levels above and below. The maxima and minima we finally keep have a greater or smaller value than all of their 8 + 9 + 9 = 26 neighbors.

For the first and last scale of each octave, there are not enough adjacent levels to compare against, so we simply do not consider these scales when finding local maxima/minima.

**REFINING KEY POINTS**

After finding maxima and minima, the next step is to perform a detailed fit to the nearby data for location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge.

First, using a Taylor expansion of the scale-space function, we can find the location of the extremum from the function and its derivatives; the function value at the extremum is useful for rejecting unstable extrema with low contrast.
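Sketching the equations from Lowe's paper, with D the DoG function and x = (x, y, σ)ᵀ the offset from the sample point:

```latex
D(\mathbf{x}) \approx D
  + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x}
  + \frac{1}{2}\,\mathbf{x}^{T}\,\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}},
\qquad
D(\hat{\mathbf{x}}) = D
  + \frac{1}{2}\,\frac{\partial D}{\partial \mathbf{x}}^{T}\hat{\mathbf{x}}
```

Setting the derivative of the expansion to zero gives the extremum offset x̂, and Lowe discards extrema with |D(x̂)| below a small threshold (0.03 in the paper, with pixel values in [0, 1]) as low-contrast.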

Secondly, we must eliminate edge responses, because features on edges are not good features; what we need are corners. This is related to the "Harris corner detector", and maybe I'll cover that in future articles.

In fact, if we want to find a "corner", we do nothing more than calculate the gradients along two different orientations of the image: **if** both of these gradients are big, it is a corner; **else if** only one gradient is big and the other is small, the point is on an edge; **else** the point is in a flat region.

**KEY POINTS ORIENTATION ASSIGNMENT**

By assigning a consistent orientation to each key point based on local image properties, the key point descriptor (will be covered in next step) can be represented relative to this orientation and therefore achieve invariance to image rotation.

Assume we are going to assign an orientation to the key point below:

We choose a region with the key point at its center; the size of this region depends on the scale at which the key point was found.

Using the formulas below, where L is the Gaussian-blurred image at the key point's scale, we can calculate the gradient magnitude m and orientation θ of each pixel in the region we chose:

m(x, y) = sqrt( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )

θ(x, y) = atan2( L(x, y+1) − L(x, y−1), L(x+1, y) − L(x−1, y) )

and an orientation histogram is formed from these orientations and magnitudes. The orientation histogram has 36 bins covering the 360-degree range of orientations. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times the scale of the key point.

Say the above is the histogram we just created. We can see the peak is in the "20–29" degree bin, so we simply assign the key point the orientation of that third bin; furthermore, there is another peak in the "300–309" degree bin, and its value is over 80% of the highest peak. In this case, we create a new key point at the same location and scale, assigned the orientation of that 31st bin.

**LOCAL IMAGE DESCRIPTOR**

The previous operations have assigned an image location, scale, and orientation to each key point. The next step is to compute a descriptor for the local image region that is highly distinctive yet is as invariant as possible to remaining variations.

To do this, we first take a 16*16 pixel window of gradient values around the key point and break it into sixteen 4*4 pixel windows. Of course, in order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the key point orientation.

Since we already know the gradient magnitudes and orientations of these pixels, just as in the previous step, we put the orientations in each 4*4 window into an 8-bin histogram (45 degrees per bin).

Additionally, a Gaussian weighting function with σ equal to one half the width of the descriptor window is used to assign a weight to the magnitude of each sample point. The purpose of this Gaussian window is to avoid sudden changes in the descriptor with small changes in the position of the window, and to give less emphasis to gradients that are far from the center of the descriptor, as these are most affected by mis-registration errors. This process is shown in the picture below; in the picture, assume we used an 8*8 pixel window, broken into four 4*4 pixel windows.

By now, we have a 4*4 array of histograms with 8 orientation bins in each, that is, a 4*4*8 = 128-element feature vector for each key point.

Finally, the feature vector is modified to reduce the effects of illumination change: it is normalized to unit length, and (in Lowe's paper) components larger than 0.2 are then clamped and the vector is normalized again.

If we run the whole process above on two images, say we get k1 key points in image1 and k2 key points in image2, then we have k1 128-element vectors for image1 and k2 for image2. What we do is simply compare each key point's vector in image1 with the key point vectors in image2 and find matching pairs of key points.

**OPENCV CODE**

It is super easy to do this in OpenCV; the SIFT class already does everything for you, so just use it!

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/nonfree/nonfree.hpp>
#include "opencv2/legacy/legacy.hpp"

using namespace cv;
using namespace std;

int main(int argc, char** argv)
{
    // read images
    Mat img_1c = imread("img1.jpg");
    Mat img_2c = imread("img2.jpg");
    Mat img_1, img_2;

    // convert images to gray scale
    cvtColor(img_1c, img_1, CV_BGR2GRAY);
    cvtColor(img_2c, img_2, CV_BGR2GRAY);

    SIFT sift(50, 5);
    vector<KeyPoint> key_points_1, key_points_2;
    Mat dummy;

    // run SIFT to find key points
    sift(img_1, Mat(), key_points_1, dummy);
    sift(img_2, Mat(), key_points_2, dummy);

    SiftDescriptorExtractor extractor;
    Mat descriptors_1, descriptors_2;

    // compute descriptors
    extractor.compute(img_1, key_points_1, descriptors_1);
    extractor.compute(img_2, key_points_2, descriptors_2);

    // use brute-force matching on the descriptor vectors
    BruteForceMatcher<L2<float> > matcher;
    vector<DMatch> matches;
    matcher.match(descriptors_1, descriptors_2, matches);

    // draw results
    Mat img_matches;
    drawMatches(img_1c, key_points_1, img_2c, key_points_2, matches, img_matches);
    imshow("sift_Matches", img_matches);
    waitKey(0);
    return 0;
}

You can also change the parameters when creating the SIFT object:

SIFT::SIFT(int nfeatures=0, int nOctaveLayers=3, double contrastThreshold=0.04, double edgeThreshold=10, double sigma=1.6);

**TEST RESULT**

(using 50 key points and 5 octave layers)

🙂
