WHAT IS CNN
A Convolutional Neural Network (CNN) consists of one or more convolutional and pooling layers, followed by one or more fully connected layers as in a standard neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling, which results in translation-invariant features. Another benefit of CNNs is that they are easier to train and have far fewer parameters than fully connected networks with the same number of hidden units.
The CNN was invented by Prof. Yann LeCun (NYU), and its effect on the world is profound: every big IT company is trying to build something with it.
In my interpretation, the idea of a CNN is to transform 2-D images into a 3-D space by convolving each image with several kernels. Say we have an M * N image, each kernel is m * n, and we use k kernels; after convolution we get k feature maps, each of size (M – m + 1) * (N – n + 1). Convolution decreases the size of each image, but at the same time it improves the features by increasing the influence of nearby pixels. Furthermore, the kernels used in convolution are actually learned by training. After training, I visualized the trained kernels and found that they look a lot like Gabor filters (you can see this later in the post). I misunderstood this point last week, so my first version of CNN generated a Gabor filter bank (about 200 random Gabor filters) and randomly chose several of them to convolve with the training images (I now call it “FakeCNN”; you can see it in the link above). My guess is that, just as with the Sparse Autoencoder, learned features are better than hand-crafted features and fit the dataset better.
I use the “valid” type of convolution, which means we keep only the middle part of the full convolution result; by doing this we avoid having to deal with the image boundary. As mentioned above, convolving an M * N image with an m * n kernel yields an (M – m + 1) * (N – n + 1) result.
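To make the size formula concrete, here is a minimal pure-Python sketch of a “valid” 2-D convolution. The function name `conv2d_valid`, the 5 * 4 image and the 2 * 2 kernel are hypothetical and chosen only for illustration; this is not the code used in my ConvNet.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of two lists of lists (no padding)."""
    M, N = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    # True convolution flips the kernel in both directions first.
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for i in range(M - m + 1):
        out_row = []
        for j in range(N - n + 1):
            acc = 0.0
            # Sum the element-wise products over the m x n window.
            for a in range(m):
                for b in range(n):
                    acc += image[i + a][j + b] * flipped[a][b]
            out_row.append(acc)
        out.append(out_row)
    return out


# Hypothetical 5 x 4 image and 2 x 2 kernel, purely for illustration.
image = [[float(r * 4 + c) for c in range(4)] for r in range(5)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
result = conv2d_valid(image, kernel)
print(len(result), len(result[0]))  # 4 3, i.e. (5 - 2 + 1) x (4 - 2 + 1)
```

The output only covers positions where the kernel fits entirely inside the image, which is why no boundary handling is needed.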
In practice, we mainly use three kinds of pooling: Max Pooling, Mean Pooling, and Stochastic Pooling. Max Pooling and Mean Pooling are very simple, so I’ll only talk about Stochastic Pooling. It was invented by M. Zeiler and R. Fergus (both NYU). Its main idea is that pooling should not capture only the strongest activation of the filter template with the input for each region; the other activations in the region should also have a chance to contribute.
Say we have the following activations in a 3 * 3 region, and we want to get one pooling result using the Stochastic Pooling method.
We calculate the probability of each element in the block by dividing it by the sum of all activations in the block.
Sorting the probabilities, we get
Now we randomly choose a number from 0 to 8, say 3; then the pooling result is 1.4, which has the fourth-largest probability (index 3, counting from 0). You can see that there are two 1.2s in the activations, both with probability 0.1, and if the random number we generate is 4 or 5, we will choose 1.2. This shows that an activation with a bigger probability has more chance of being chosen.
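As a rough sketch of the sampling step, the snippet below follows the multinomial formulation from the Zeiler and Fergus paper: each activation is picked with probability equal to its share of the region's total. This is an assumption about the general technique, not a copy of my implementation, and the function name `stochastic_pool` and the region values are hypothetical.

```python
import random


def stochastic_pool(activations, rng):
    """Pick one activation with probability proportional to its value.

    Assumes non-negative activations (e.g. after ReLU) with a positive sum.
    Returns (index, value) so the index can be reused in back-propagation.
    """
    total = sum(activations)
    probs = [a / total for a in activations]
    index = rng.choices(range(len(activations)), weights=probs, k=1)[0]
    return index, activations[index]


# Hypothetical flattened 3 x 3 region, purely for illustration.
region = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0]
rng = random.Random(0)
counts = [0] * len(region)
for _ in range(10000):
    idx, _value = stochastic_pool(region, rng)
    counts[idx] += 1
print(counts)  # zero activations are never picked; index 3 is picked most often
```

Over many draws, the activation 2.0 is chosen about twice as often as each 1.0, which is the "bigger probability, more chance" behaviour described above.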
During back-propagation, we put the value at the position of the activation we chose and set all other positions to 0, like this
This is similar to what we do in Max Pooling.
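A minimal sketch of that routing step, assuming the value sent back is the error term of the pooled unit and that the chosen index was remembered from the forward pass. The helper name `unpool_gradient` is hypothetical.

```python
def unpool_gradient(grad_value, chosen_index, region_size):
    """Route the pooled unit's gradient to the chosen position only."""
    grad = [0.0] * region_size
    grad[chosen_index] = grad_value
    return grad


# Hypothetical example: a 3 x 3 region (flattened) where index 3 was chosen.
routed = unpool_gradient(0.5, 3, 9)
print(routed)
```

For Max Pooling the same function applies, with `chosen_index` being the position of the maximum.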
At test time we don’t need to sample again; we simply multiply the activations element-wise by their probabilities and sum the products.
We take the result, 1.5033, as the pooling result.
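The test-time rule can be sketched as a probability-weighted average, following the probabilistic weighting described by Zeiler and Fergus. The function name `stochastic_pool_test` and the region values are hypothetical, so the number printed here is not the 1.5033 from my example.

```python
def stochastic_pool_test(activations):
    """Probability-weighted sum of the activations in one pooling region."""
    total = sum(activations)
    if total == 0:
        return 0.0  # an all-zero region pools to zero
    probs = [a / total for a in activations]
    # Element-wise multiply probabilities and activations, then sum.
    return sum(p * a for p, a in zip(probs, activations))


# Hypothetical flattened 3 x 3 region, purely for illustration.
region = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0]
pooled = stochastic_pool_test(region)
print(pooled)  # 1.5 = 0.25 * 1.0 + 0.5 * 2.0 + 0.25 * 1.0
```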
The fully connected and softmax parts are no different from the methods we introduced before. What is new is the convolution and pooling part.
The gradient can be obtained by:
A very popular non-linearity function is ReLU (Rectified Linear Unit), which is
f(x) = max(0, x)
and its derivative is also fairly simple:
f'(x) = 1, x > 0
f'(x) = 0, x <= 0
A smooth approximation to it is the softplus function; you can look it up on Wikipedia.
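The definitions above translate directly into code. In this small sketch the function names are my own choice, and softplus is written as ln(1 + e^x) in a rearranged form that avoids overflow for large x.

```python
import math


def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """f'(x) = 1 for x > 0, and 0 for x <= 0."""
    return 1.0 if x > 0 else 0.0


def softplus(x):
    """Smooth approximation of ReLU: ln(1 + exp(x)), computed stably."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


print(relu(-2.0), relu(3.0))            # 0.0 3.0
print(relu_grad(-2.0), relu_grad(3.0))  # 0.0 1.0
print(softplus(0.0))                    # ln(2), about 0.693
```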
MY CONVNET ARCHITECTURE
- 1 Conv layer
- 1 Pooling layer (Pooling dimension is 4 * 4)
- 2 fully connected layers (each has 200 hidden neurons)
- 1 Softmax regression layer
- Using 8 convolution kernels (13 * 13 size)
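As a quick sanity check of these numbers, the sketch below computes the feature-map sizes for a 28 * 28 MNIST image. It assumes “valid” convolution and non-overlapping pooling regions; the function name `feature_shapes` is hypothetical.

```python
def feature_shapes(image_size, kernel_size, pool_size, num_kernels):
    """Return (conv map side, pooled map side, total features per image)."""
    conv_size = image_size - kernel_size + 1  # "valid" convolution
    if conv_size % pool_size != 0:
        raise ValueError("pooling size must divide the convolved size")
    pooled_size = conv_size // pool_size      # non-overlapping pooling
    features = num_kernels * pooled_size * pooled_size
    return conv_size, pooled_size, features


conv_size, pooled_size, features = feature_shapes(28, 13, 4, 8)
print(conv_size, pooled_size, features)  # 16 4 128
```

Under these assumptions, each image reaches the first fully connected layer as a vector of 128 features.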
I implemented all three pooling methods; you can choose any one of them.
There is still something left to improve: the Stochastic Gradient Descent part. The current version just trains the network on randomly chosen small batches, and the learning rate is calculated by finding the biggest eigenvalue of the training dataset, say λ, and setting the learning rate slightly smaller than 1/λ. This method is from Prof. Yann LeCun’s Efficient BackProp. However, the training process is not fast enough. I don’t quite understand the method used in the UFLDL convnet yet, so I hope to figure it out and use it to improve this ConvNet demo.
I tested the code on the MNIST dataset and got an accuracy of 0.9828.
Here’s an image I got by visualizing the kernels in Matlab.
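The learning-rate rule can be sketched with power iteration. This sketch assumes that “the biggest eigenvalue of the training dataset” means the largest eigenvalue of the input second-moment matrix C = (1/P) XᵀX, which may differ from what my code does. The function name `largest_eigenvalue`, the toy samples and the 0.9 safety factor are all hypothetical.

```python
import math
import random


def largest_eigenvalue(samples, iterations=100, seed=0):
    """Power iteration on C = (1/P) * X^T X without forming C explicitly."""
    rng = random.Random(seed)
    dim = len(samples[0])
    v = [rng.random() + 0.1 for _ in range(dim)]  # strictly positive start
    eigenvalue = 0.0
    for _ in range(iterations):
        # w = C v, computed as the average of (x . v) * x over all samples.
        w = [0.0] * dim
        for x in samples:
            dot = sum(xi * vi for xi, vi in zip(x, v))
            for d in range(dim):
                w[d] += dot * x[d]
        w = [wd / len(samples) for wd in w]
        norm = math.sqrt(sum(wd * wd for wd in w))
        if norm == 0.0:
            return 0.0
        v = [wd / norm for wd in w]  # renormalise for the next step
        eigenvalue = norm            # ||C v|| approaches lambda_max for unit v
    return eigenvalue


# Toy data whose second-moment matrix is diag(2.0, 0.5).
samples = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
lam = largest_eigenvalue(samples)
learning_rate = 0.9 / lam  # "slightly smaller than 1/lambda"; 0.9 is a guess
print(lam, learning_rate)
```

Power iteration only needs matrix-vector products, so the matrix itself never has to be stored, which matters when the input dimension is large.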
Hey guys, I wrote an improved version 2, which supports multiple Conv and Pooling layers; check it out HERE.
And here’s version 3.0.
Enjoy it 🙂
Jake Bouvrie. Notes on Convolutional Neural Networks