Convolutional Neural Networks


A Convolutional Neural Network (CNN) consists of one or more convolutional layers and pooling layers, followed by one or more fully connected layers as in a standard neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling, which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.

The CNN was invented by Prof. Yann LeCun (NYU), and its effect on the world has been profound: every big IT company is trying to do something with it.


In my interpretation, the idea of a CNN is to transform 2-D images into a 3-D feature space, which is done by convolving the images with several kernels. Say we have an M * N image, the kernels are m * n, and we use k of them; after convolution we get k images of size (M – m + 1) * (N – n + 1). Convolution shrinks each image, but at the same time it improves the features by increasing the influence of nearby pixels. Furthermore, the kernels used in convolution are actually obtained by training: after training, I visualized the trained kernels and found that they look just like Gabor filters (you can see this in the following content). I misunderstood this point last week, so my first version of CNN generated a Gabor filter bank (about 200 random Gabor filters) and randomly chose several to convolve with the training images (I now call it “FakeCNN“, and you can see it in the above link). Thinking about it, perhaps just like with the Sparse Autoencoder, trained features fit the dataset better than hand-crafted ones.

1. Convolution.

Use the “valid” type, which means we only use the middle part of the full convolution result; by doing this, we avoid having to handle the boundary. As I mentioned above, the “valid” convolution of an M * N image and an m * n kernel results in an (M – m + 1) * (N – n + 1) image.
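As a concrete sketch (illustrative code only, not the exact implementation from my repo; plain C++ without OpenCV, and `conv2Valid` is a name I made up here):

```cpp
#include <cassert>
#include <vector>

typedef std::vector<std::vector<double> > Mat2d;

// "Valid" 2-D convolution: the m x n kernel only visits positions where it
// fits entirely inside the M x N image, so the output is
// (M - m + 1) x (N - n + 1) and no boundary handling is needed.
Mat2d conv2Valid(const Mat2d &img, const Mat2d &kernel) {
    int M = img.size(),    N = img[0].size();
    int m = kernel.size(), n = kernel[0].size();
    Mat2d out(M - m + 1, std::vector<double>(N - n + 1, 0.0));
    for (int i = 0; i <= M - m; ++i)
        for (int j = 0; j <= N - n; ++j)
            for (int p = 0; p < m; ++p)
                for (int q = 0; q < n; ++q)
                    // flip the kernel in both directions for true convolution
                    out[i][j] += img[i + p][j + q] * kernel[m - 1 - p][n - 1 - q];
    return out;
}
```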

2. Pooling.

In practice, there are mainly 3 kinds of pooling methods in use: Max Pooling, Mean Pooling, and Stochastic Pooling. Max Pooling and Mean Pooling are super easy, so I’ll talk about Stochastic Pooling. It was proposed by M. Zeiler and R. Fergus (both NYU); the main idea is that, instead of capturing only the strongest activation of the filter response in each region, we randomly pick one activation with probability proportional to its magnitude.

Say we have the following activations in a 3 * 3 region, and we want to get one pooling result using the Stochastic Pooling method.


We calculate the probability of each element in the block by dividing each activation by the sum of all activations in the region.


By sorting the probabilities, we get


Now we randomly choose a number from 0 to 8, say 3; then the pooling result is 1.4, which has the fourth largest probability (counting from 0). You can see that there are two 1.2s in the activations, and their probabilities are both 0.1, so if the random number we generate is 4 or 5, we will choose 1.2. This shows that an activation with a larger probability has more chance of being chosen.

During back-prop, we just put the activation we chose back in the right place and set the other values to 0, like this:


This is similar to what we do in the Max Pooling method.

At test time, we don’t repeat this sampling process; we simply do an element-wise multiplication of the probabilities and activations and sum the results.


And take this 1.5033 as the pooling result.
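Putting the two modes together, here is a minimal sketch (function names are mine, and activations are assumed non-negative, e.g. coming out of a ReLU). Training samples one activation with probability proportional to its value; testing takes the probability-weighted sum described above:

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Stochastic pooling over one region: activation a_i gets probability
// p_i = a_i / sum(a).
// Training: sample one activation according to p (u is uniform in [0,1)).
double stochasticPoolTrain(const std::vector<double> &a, double u) {
    double s = std::accumulate(a.begin(), a.end(), 0.0);
    double cum = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        cum += a[i] / s;           // cumulative probability
        if (u < cum) return a[i];  // this activation was sampled
    }
    return a.back();
}

// Testing: no sampling; return the probability-weighted sum of activations.
double stochasticPoolTest(const std::vector<double> &a) {
    double s = std::accumulate(a.begin(), a.end(), 0.0);
    double out = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        out += (a[i] / s) * a[i];  // p_i * a_i
    return out;
}
```

For back-prop, the sampled position is remembered and the incoming delta is routed only to it, just as in max pooling.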

3. Back-prop

The fully connected and softmax parts are no different from the methods introduced before. What’s new is the convolution and pooling part.


And the gradients can be obtained by:


4. Non-Linearity

A very popular non-linearity function is ReLU (Rectified Linear Unit), which is

f(x) = max(0, x)

and its derivative is also fairly simple:

f'(x) = 1, x > 0
f'(x) = 0, x <= 0

There is a smooth approximation to it called the softplus function, f(x) = ln(1 + e^x); you can check it on Wikipedia.
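Both are one-liners; a minimal sketch (the subgradient choice f'(0) = 0 follows the definition above):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

double relu(double x)     { return std::max(0.0, x); }
double reluGrad(double x) { return x > 0.0 ? 1.0 : 0.0; }

// Smooth approximation: softplus(x) = ln(1 + e^x); its derivative is the
// logistic sigmoid 1 / (1 + e^-x).
double softplus(double x) { return std::log(1.0 + std::exp(x)); }
```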


The structure of the demo network is:

  • 1 Conv layer
  • 1 Pooling layer (pooling dimension is 4 * 4)
  • 2 fully connected layers (each with 200 hidden neurons)
  • 1 Softmax regression layer
  • 8 convolution kernels (13 * 13 size)
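With these settings on 28 * 28 MNIST digits, the layer sizes work out as follows (a quick sanity check using the (M – m + 1) * (N – n + 1) rule from the convolution section; helper names are mine):

```cpp
#include <cassert>

// Shape arithmetic for the demo network on 28 x 28 MNIST images.
int convOutSize(int imgSize, int kernelSize) { return imgSize - kernelSize + 1; }
int poolOutSize(int convSize, int poolDim)   { return convSize / poolDim; }

int featureCount(int imgSize, int kernelSize, int poolDim, int numKernels) {
    int c = convOutSize(imgSize, kernelSize);  // 28 - 13 + 1 = 16
    int p = poolOutSize(c, poolDim);           // 16 / 4 = 4
    return numKernels * p * p;                 // 8 * 4 * 4 = 128 inputs to FC
}
```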

I implemented all 3 pooling methods; you can choose any one to use.

There is still something left to improve: the Stochastic Gradient Descent part. The current version just uses randomly chosen small batches to train the network, and the learning rate is calculated by finding the biggest eigenvalue of the training dataset, say λ, and setting the learning rate slightly smaller than 1/λ. This method is from Prof. Yann LeCun’s Efficient BackProp. However, the training process is not fast enough; I didn’t quite understand the method used in the UFLDL convnet, so I hope I can understand that and improve this ConvNet demo.
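My reading of that heuristic, as a sketch: in Efficient BackProp the λ in question is the largest eigenvalue of the covariance matrix of the inputs (the exact details may differ from what my code does), and power iteration is one simple way to estimate it. The `largestEigenvalue` helper below is illustrative and assumes a symmetric matrix:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef std::vector<std::vector<double> > Mat2d;

// Power iteration: repeatedly multiply a unit vector by A; for a symmetric
// matrix, the norm of A*v converges to the largest-magnitude eigenvalue.
double largestEigenvalue(const Mat2d &A, int iters) {
    size_t n = A.size();
    std::vector<double> v(n, 1.0 / std::sqrt((double)n));  // unit start vector
    double lambda = 0.0;
    for (int it = 0; it < iters; ++it) {
        std::vector<double> Av(n, 0.0);
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                Av[i] += A[i][j] * v[j];
        double norm = 0.0;
        for (size_t i = 0; i < n; ++i) norm += Av[i] * Av[i];
        norm = std::sqrt(norm);
        lambda = norm;                         // ||A v|| with ||v|| = 1
        for (size_t i = 0; i < n; ++i) v[i] = Av[i] / norm;
    }
    return lambda;
}
```

The learning rate would then be set slightly below 1/λ, e.g. eta = 0.9 / λ.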



I tested the code on the MNIST dataset and got 0.9828 accuracy.

Here’s an image I got by visualizing the kernels in Matlab.

They look like Gabor filters, right?

Hey guys, I have released version 2, which supports a multi-layer Conv and Pooling process; check it HERE.

And here is version 3.0.

Enjoy it 🙂


Jake Bouvrie. Notes on Convolutional Neural Networks

This entry was posted in Algorithm, Machine Learning, OpenCV.


  1. Alex
    Posted May 5, 2014 at 11:53 pm | Permalink

    hi, if the input is color-image, how to design the filters and the whole net?

    • Eric
      Posted May 5, 2014 at 11:58 pm | Permalink

      Hey Alex,

      You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image.

      I believe there are other ways of dealing with multi-channel images, but I have not tried them before.


  2. Bob
    Posted May 22, 2014 at 3:23 am | Permalink

    Hi Eric,

    You wrote: vector layer. Is it wrong? The vector has no argument. Please check it out. Thanks.

    • Eric
      Posted May 22, 2014 at 12:04 pm | Permalink

      Yeah, you’re right; it means different kernels in the conv layer, so it is better to call it “kernel”. Thanks.

  3. zhenghx
    Posted July 21, 2014 at 5:40 am | Permalink

    Thanks for your free code; it is very clear, but I was puzzled by some code.

    in the function
    read_Mnist(string filename, vector &vec)
    the code is: c) = (int) temp;
    Why do you change “temp” from (uchar) to (int)? I think “temp” will overwrite “tpmat” because “tpmat” is defined as “CV_8UC1”.
    When I use your code, the training time is days (2 days until now) and I just get 0.30 accuracy.
    How long did you train to get 0.9828 accuracy?

    • Eric
      Posted July 22, 2014 at 12:40 am | Permalink

      Thanks for mentioning it.
      First, the MNIST dataset guarantees each of its elements to be in (0, 255), so it’s OK to use int; btw, it’s just because I’m accustomed to using “int” with CV_8UCx format things :p
      Second, it indeed took me days of training on this net; you can also try the following net instead:
      If you have further problems with this, let me know, and I’ll check whether the version I put on GitHub is buggy.

  4. Daniel
    Posted August 19, 2014 at 12:24 am | Permalink

    Hi Eric,

    First of all, I’d like to thank you for your codes and explanation. It’s been great help for me to map theory and practice.

    I have a similar symptom to zhenghx’s. First, I changed nothing (Num HiddenNeaurons = 200) and it gave me 0.110 accuracy within 6 hours. I then tried Num HiddenNeaurons = 500, which ran for about 2 days and yielded

    learning step: 199998, Cost function value = 0.123866, randomNum = 47623
    learning step: 199999, Cost function value = 0.106693, randomNum = 9355
    correct: 3969, total: 10000, accuracy: 0.3969
    Totally used time: 199141 second

    Could you let me know what setting you had for 0.98 accuracy? I got this code from the github last weekend.

    Thank you very much!

    • Eric
      Posted August 27, 2014 at 11:58 pm | Permalink

      Hey Daniel,

      First, try to use Max Pooling by:
      int Pooling_Methed = POOL_MAX;
      It was months ago that I got that 0.98 accuracy result, I’m not sure how many kernels I used but I’m sure it was Max Pooling. (Sorry)

      Second, by debugging my newest version of CNN, I found some bugs that also exist in my old versions (including this one), so when the version which I’m working on is bug-free, I’ll also edit these versions of code.

      Thank you.

  5. Peter
    Posted October 21, 2014 at 11:50 am | Permalink

    Yeah, it is a very useful post for me. I want to ask about stochastic pooling:
    I don’t understand your explanation of why you just randomly choose a number from 0 to 8.
    More details please.

    • Eric
      Posted October 21, 2014 at 1:43 pm | Permalink

      Hi Peter,

      In that specific case, we have 9 elements in total, so what I meant was just to randomly choose one element in the 3*3 matrix, say the 4th, find out which interval the 4th largest probability element falls in, and take the value of that interval as the result. The result is not chosen by any restriction like largest value or largest probability, which is why this method is called “stochastic” pooling; however, with this method, elements with larger probability do have more chance of being chosen. Did I explain it more clearly this time?


      • Ray L
        Posted June 4, 2015 at 12:37 am | Permalink

        Actually I do not think you choose elements according to their probabilities with your method. You choose a random “rank” of the elements, which is just equivalent to choosing a random element in this pool.

  6. 赵元兴
    Posted December 29, 2014 at 9:48 pm | Permalink

    Hello, I’m here to ask for your advice again. I implemented the UFLDL structure and trained it on MNIST to a 98% recognition rate, but when I visualize the first-layer filters I don’t get Gabor-like templates. What could be the reason? Thanks.

    • Eric
      Posted December 30, 2014 at 1:55 am | Permalink


      • 赵元兴
        Posted December 30, 2014 at 5:25 am | Permalink

        This is my result; it doesn’t even look like pen strokes. These are 20 convolution kernels of size 9*9, normalized to 0~255 and magnified 5x.

        • Eric
          Posted December 30, 2014 at 4:43 pm | Permalink

          I think the kernels you trained actually make sense: you can see something like Morlet wavelets, black-white, or black-white-black and white-black-white patterns, at different positions and orientations, corresponding to edge detectors and line detectors at those positions and orientations. Maybe some parameters still need tuning, such as the number of kernels or the regularization.

          • 赵元兴
            Posted December 30, 2014 at 10:09 pm | Permalink


  7. 何浪
    Posted January 7, 2015 at 10:00 am | Permalink


    • Eric
      Posted January 10, 2015 at 1:28 am | Permalink


  8. 何浪
    Posted January 12, 2015 at 2:56 am | Permalink


    • Eric
      Posted January 12, 2015 at 4:03 am | Permalink

      I haven’t read Deng Li’s papers, so I’m not sure. I don’t know the specific methods used for regression on speech data, but my understanding is that the convolutional network here may not be ideal, because given the nature of 2-D convolution, the trained kernels are descriptors of 2-D features (both x and y orientation). For audio, maybe try a 1-D convolutional network; is that kind of network related to recurrent neural networks? I’m not familiar with this area, so you could look into it. Also, I still think 50 samples is too few.

  9. 何浪
    Posted January 12, 2015 at 3:30 am | Permalink


  10. 何浪
    Posted January 14, 2015 at 7:22 am | Permalink


    • Eric
      Posted January 15, 2015 at 3:53 am | Permalink


  11. John Bell
    Posted January 22, 2015 at 5:29 pm | Permalink

    Hello Eric,

    I want to get to know CNNs as best I can. Looking through the code, it is not yet implemented on the GPU. I am thinking of making a branch in git and contributing. Do you think that would work? Are changes needed first?



    • Eric
      Posted January 22, 2015 at 7:38 pm | Permalink

      Hi Daniel,

      Here’s a CUDA version of my code which is implemented by zhxfl, you may want to check it, thanks.

      • Daniel
        Posted January 23, 2015 at 10:40 pm | Permalink

        Awesome. I’ll check it out and maybe bring in the OpenCL methods.

        Thank you

  12. 史剑
    Posted March 23, 2015 at 8:45 am | Permalink


    • Eric
      Posted April 13, 2015 at 4:03 pm | Permalink

      1. Failing to open MNIST may be because my code was written on a Unix system; you may need to change the file-reading paths to the Windows folder format.
      2. How big are your convolution kernels? How many hidden nodes are in the fully connected layer? Also, I suspect mean pooling caused the non-convergence; I find max pooling works much better than mean pooling.
      3. I still think it’s the mean pooling. Also, on small datasets softmax doesn’t seem to have much advantage, but once the data becomes more complex and larger (with more classes), softmax shows its advantage.

  13. Santuk
    Posted March 26, 2015 at 12:01 am | Permalink

    Hi Eric,

    I tried to run the code, with some changes
    (all changes were made to compile the original code. Lists are here :
    Line 37 : changed POOL_MAX to POOL_STOCHASTIC
    Line 117 and etc : cast sqrt()’s and pow()’s parameters to (double), therefore not causing ‘ambiguous call to overloaded function’ error )

    But whenever I call something like trainX[0].cols while debugging, a ‘Debug Assertion Failed!’ message box comes out, which points out that Microsoft Visual Studio 10.0\VC\include\vector ‘s vector subscript is out of range.

    Is this a sort of ‘ambiguous function call’? Can you help me to run your code?

    • Eric
      Posted April 13, 2015 at 4:07 pm | Permalink

      Hi Santuk,

      I’m not very familiar with MSVS; I don’t know if there’s an MS version of vector, so maybe use std::vector wherever we need a vector?
      Or maybe you’re using another dataset whose image size differs from MNIST’s?

  14. Santuk
    Posted March 29, 2015 at 4:09 am | Permalink

    Hello, Eric!

    I’m just trying to run your code (thereby learning CNN), but whenever I run this, I get error :
    debug assertion failed: from VC\include\vector.h, whenever the debugger meets trainX[0].cols or testX[0].cols and so on.

    Can you help me running this code?

  15. Vidya
    Posted June 16, 2015 at 6:36 am | Permalink

    Hi Eric,

    Could you let me know how to change the output layer to make CNN to space displacement neural network (SDNN) ?


  16. daijie
    Posted August 4, 2015 at 4:33 am | Permalink

    oh,i am so sorry.

    • Eric
      Posted August 10, 2015 at 2:31 pm | Permalink

      Hello. In the stochastic pooling part, picking 1.4 was just a random example; which value is actually picked depends, of course, on the result of the random number generator :)

  17. suzy
    Posted December 4, 2015 at 2:01 pm | Permalink

    Hi, Eric
    First, I want to thank you for this tutorial; it really helped me a lot.
    Do you have this code in MATLAB by any chance? If you do, will you please send it to my email?
    I can’t understand the mechanism very well from the C++ code that you provided.
    Thank you in advance.

  18. Ari
    Posted March 18, 2016 at 7:44 pm | Permalink

    Hi Eric,

    Thank you so much for this wonderful job. I really appreciate that you’ve posted your codes here. I wonder if you have the code in Python as well as I’m using opencv python.

    Thank you,

  19. Ganesh S
    Posted May 20, 2016 at 11:55 am | Permalink

    Hi Eric,
    Thank you for sharing your knowledge :). In your code, trainY & testY are 1-by-n Mats; I mean to say the labels are 1, 2, 3... etc. How do I train your network if my label vectors are, say, [1,0,0], [0,1,0] and [0,0,1] for a 3-class problem? I tried to see where you convert the integer labels to binary vector form but was not able to find it. Could you please let me know how to do that? Thanks in advance.

  20. Nitin
    Posted May 27, 2016 at 8:07 am | Permalink

    Mr. Eric

    I cannot understand how to create my own test and training data, say if I want to train a classifier for face/text detection.

    Even in this post ::
    I can’t figure out how to make the dataset in ubyte format.

  21. Narmada
    Posted July 2, 2016 at 10:39 pm | Permalink

    I get a ‘POOL_STOCHASTIC is an undeclared identifier’ error message; is it declared in any header file?
