### WHAT IS CNN

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers, pooling layers and then followed by one or more fully connected layers as in a standard neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.

It was invented by **Prof. Yann LeCun** (NYU), and its effect on the world has been profound: every big IT company is trying to do something with it.

### THE IDEA

In my interpretation, the idea of a CNN is to transform 2-D images into a 3-D space, which is done by convolving the images with several kernels. Say we have an image of size **M * N**, the kernels are of size **m * n**, and we use **k** kernels; after convolution we get **k** images, each of size **(M - m + 1) * (N - n + 1)**. Convolution decreases the size of each image, but at the same time it improves the features by increasing the influence of nearby pixels. Furthermore, the kernels used in the convolution are learned by training. After training, I visualized the trained kernels and found that they look just like Gabor filters (you can see them further down). I **misunderstood** this point last week, so my first version of the CNN generated a Gabor filter bank (about 200 random Gabor filters) and randomly chose several of them to convolve with the training images (I now call it “**FakeCNN**”, and you can see it at the link above). Thinking about it, I suppose it is just like with the Sparse Autoencoder: learned features are better than hand-crafted features and fit the dataset better.

**1. Convolution.**

We use the “valid” type, which means we keep only the middle part of the full convolution result; by doing this we avoid having to deal with the image boundary. As mentioned above, convolving an **M * N** image with an **m * n** kernel gives a result of size **(M - m + 1) * (N - n + 1)**.
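As a minimal sketch of the “valid” convolution described above, here is a plain-Python version (chosen for brevity; it is not the C++/OpenCV code from the repository). The function name `conv2d_valid` and the toy image and kernel are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of an M x N image with an m x n kernel.

    The kernel is only placed where it fits entirely inside the image,
    so the output has shape (M - m + 1) x (N - n + 1).
    """
    M, N = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    out = [[0.0] * (N - n + 1) for _ in range(M - m + 1)]
    for i in range(M - m + 1):
        for j in range(N - n + 1):
            total = 0.0
            for u in range(m):
                for v in range(n):
                    # Flipping the kernel is what makes this a convolution
                    # rather than a cross-correlation.
                    total += image[i + u][j + v] * kernel[m - 1 - u][n - 1 - v]
            out[i][j] = total
    return out


# Toy 3 x 3 image and 2 x 2 kernel (hypothetical values).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 0]]
result = conv2d_valid(image, kernel)
print(len(result), len(result[0]))  # output is (3 - 2 + 1) x (3 - 2 + 1)
```

With k kernels you would simply call this once per kernel and stack the k resulting maps.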

**2. Pooling.**

In practice, there are three main kinds of pooling: Max Pooling, Mean Pooling, and Stochastic Pooling. Max Pooling and Mean Pooling are super easy, so I'll only talk about Stochastic Pooling. It was invented by **M. Zeiler and R. Fergus** (both NYU). The main idea is to capture not only the strongest activation of the filter template with the input in each region, but to let every activation in the region be picked with a probability proportional to its value.

Say we have the following activations in a 3 * 3 region, and we want to get one pooling result using the Stochastic Pooling method.

We calculate the probability of each element in the block by dividing it by the sum of all activations in the block.

Sorting the probabilities, we get

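Now we randomly choose a number from 0-8, say 3; the pooling result is then 1.4, which has the fourth-largest probability (index 3 when counting from 0). You can see that there are two 1.2s in the activations, each with probability 0.1, so if the random number we generate is 4 or 5, we choose 1.2. Note that for activations with a larger probability to really have a better chance of being chosen, the draw itself must be weighted by the probabilities: in Zeiler and Fergus's formulation the location is sampled from the multinomial distribution formed by the probabilities, for example by drawing a uniform number in [0, 1) and picking the element whose cumulative-probability interval contains it.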

During back-prop, we keep the activation we chose in its original position and set all other values to 0, like this:

This is similar to what we do in the Max Pooling method.

At test time, we don't sample again; we simply multiply the activations element-wise by their probabilities and sum the products,

and take the result, 1.5033, as the pooling result.
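The three stochastic pooling steps (sampling during training, routing the gradient during back-prop, and the probability-weighted average at test time) can be sketched in plain Python as follows. This follows the sampling scheme of Zeiler and Fergus, where the draw is weighted by the probabilities; the function names are made up, and the 3 * 3 region uses hypothetical activations, not the values from the example above.

```python
import random


def pooling_probabilities(region):
    """Flatten one pooling region and normalize its non-negative activations."""
    flat = [a for row in region for a in row]
    total = sum(flat)
    if total == 0:
        # All-zero region: fall back to a uniform distribution.
        return flat, [1.0 / len(flat)] * len(flat)
    return flat, [a / total for a in flat]


def stochastic_pool_train(region, rng=random):
    """Training: sample one position, weighted by its probability."""
    flat, probs = pooling_probabilities(region)
    index = rng.choices(range(len(flat)), weights=probs, k=1)[0]
    return index, flat[index]


def stochastic_pool_backward(region_shape, index, upstream_grad):
    """Back-prop: route the gradient to the sampled position only."""
    rows, cols = region_shape
    grad = [[0.0] * cols for _ in range(rows)]
    grad[index // cols][index % cols] = upstream_grad
    return grad


def stochastic_pool_test(region):
    """Testing: probability-weighted average of the activations."""
    flat, probs = pooling_probabilities(region)
    return sum(a * p for a, p in zip(flat, probs))


# Hypothetical activations; their probabilities are 0.1, 0.2, 0.3 and 0.4.
region = [[1.0, 0.0, 2.0],
          [0.0, 3.0, 0.0],
          [0.0, 0.0, 4.0]]
index, value = stochastic_pool_train(region, random.Random(0))
grad = stochastic_pool_backward((3, 3), index, 1.0)
average = stochastic_pool_test(region)  # close to 3.0 for this region
```

Zero activations have probability 0 and are never sampled, while the largest activation (4.0 here) is the most likely pick but not the only possible one.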

**3. Back-prop**

The fully connected and softmax parts are no different from the methods introduced before. What is new is the Convolution and Pooling part.

And the gradient is obtained by:
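For reference, one common way to write these gradients is sketched below. The notation is an assumption and follows the usual textbook derivation (for example in Bouvrie's notes listed in the references), not necessarily the exact form used in the code. Let $x$ be the input image, $W^{k}$ the $k$-th kernel with bias $b^{k}$, $z^{k}_{i,j} = \sum_{u,v} W^{k}_{u,v}\, x_{i+u,\,j+v} + b^{k}$ the “valid” response (written here as a cross-correlation), $a^{k} = f(z^{k})$ the activation, and $\delta^{k}_{\text{pool}}$ the error arriving at the output of the pooling layer.

$$
\delta^{k} = \operatorname{up}\!\left(\delta^{k}_{\text{pool}}\right) \odot f'\!\left(z^{k}\right)
$$

$$
\frac{\partial J}{\partial W^{k}_{u,v}} = \sum_{i,j} \delta^{k}_{i,j}\, x_{i+u,\,j+v},
\qquad
\frac{\partial J}{\partial b^{k}} = \sum_{i,j} \delta^{k}_{i,j}
$$

Here $\operatorname{up}(\cdot)$ spreads each pooled error back over its pooling region: to the position of the maximum for Max Pooling, evenly (divided by the region size) for Mean Pooling, and to the sampled position for Stochastic Pooling.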

**4. Non-Linearity**

A very popular non-linearity function is **ReLU (Rectified Linear Unit)**, which is

**f(x) = max(0, x)**

and its derivative is also fairly simple:

**f'(x) = 1, x > 0**

**f'(x) = 0, x <= 0**

There is also a smooth approximation of it called the **softplus** function; you can look it up on Wikipedia.
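A small Python sketch of ReLU, its derivative, and softplus, for illustration only. The function names are made up, and softplus is written in a rearranged but mathematically equivalent form so that it does not overflow for large inputs.

```python
import math


def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_derivative(x):
    """f'(x) = 1 for x > 0 and 0 for x <= 0."""
    return 1.0 if x > 0 else 0.0


def softplus(x):
    """Softplus: log(1 + exp(x)), a smooth approximation of ReLU."""
    # Rewritten as max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_derivative(x):
    """The derivative of softplus is the logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), relu_derivative(x), softplus(x), softplus_derivative(x))
```

Far from zero the two functions nearly coincide; near zero softplus is smooth where ReLU has a kink.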

### MY CONVNET ARCHITECTURE

- 1 Conv layer
- 1 Pooling layer (Pooling dimension is 4 * 4)
- 2 fully connected layers (each with 200 hidden neurons)
- 1 Softmax regression layer
- Using 8 convolution kernels (13 * 13 size)
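To see what these numbers mean for the layer sizes, here is a quick Python calculation. It assumes 28 * 28 MNIST inputs, a “valid” convolution, and non-overlapping 4 * 4 pooling (stride equal to the pooling size); the helper names are made up.

```python
def conv_output_size(input_size, kernel_size):
    """Side length after a 'valid' convolution."""
    return input_size - kernel_size + 1


def pool_output_size(input_size, pool_size):
    """Side length after non-overlapping pooling (assumes exact division)."""
    if input_size % pool_size != 0:
        raise ValueError("pooling size must divide the feature-map size")
    return input_size // pool_size


image_size = 28   # MNIST digits are 28 x 28 pixels
kernel_size = 13
num_kernels = 8
pool_size = 4

conv_size = conv_output_size(image_size, kernel_size)    # 28 - 13 + 1 = 16
pooled_size = pool_output_size(conv_size, pool_size)     # 16 / 4 = 4
num_features = num_kernels * pooled_size * pooled_size   # 8 * 4 * 4 = 128
print(conv_size, pooled_size, num_features)
```

Under these assumptions each image reaches the first fully connected layer as a 128-dimensional feature vector.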

I implemented all three pooling methods; you can choose any of them.

There is still something left to improve: the Stochastic Gradient Descent part. The current version just trains the network on randomly chosen small batches, and the learning rate is calculated by finding the largest eigenvalue of the training dataset, say **λ**, and setting the learning rate slightly smaller than **1/λ**. This method is from Prof. Yann LeCun's **Efficient BackProp**. However, the training process is not fast enough. I don't quite understand the method used in the UFLDL convnet yet, so I hope to figure it out and use it to improve this ConvNet demo.
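One way to get such a **λ** is power iteration. The Python sketch below rests on assumptions: it takes λ to be the largest eigenvalue of the covariance matrix of the inputs, the function names and toy data are made up, and the factor 0.9 is just one arbitrary choice of “slightly smaller”. It is not the code from the repository.

```python
import random


def covariance_matrix(samples):
    """Covariance matrix of a list of equally long feature vectors."""
    count, dim = len(samples), len(samples[0])
    mean = [sum(s[d] for s in samples) / count for d in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for s in samples:
        centered = [s[d] - mean[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += centered[a] * centered[b] / count
    return cov


def largest_eigenvalue(matrix, iterations=200, seed=0):
    """Power iteration for a symmetric positive semi-definite matrix."""
    dim = len(matrix)
    rng = random.Random(seed)
    # A random start is almost surely not orthogonal to the leading eigenvector.
    vector = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[a][b] * vector[b] for b in range(dim))
                   for a in range(dim)]
        norm = sum(p * p for p in product) ** 0.5
        if norm == 0.0:
            return 0.0
        vector = [p / norm for p in product]
        # Once the vector is normalized, the norm of the product
        # converges to the largest eigenvalue.
        eigenvalue = norm
    return eigenvalue


# Toy data with hypothetical values: variance 4 along x, 1 along y.
samples = [[2.0, 1.0], [-2.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
lam = largest_eigenvalue(covariance_matrix(samples))
learning_rate = 0.9 / lam  # slightly smaller than 1 / lambda
```

For image data the covariance matrix is large, so in practice only matrix-vector products would be formed rather than the full matrix.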

### SOURCE CODE

**https://github.com/xingdi-eric-yuan/single-layer-convnet**

### TEST RESULT

I tested the code on the MNIST dataset and got **0.9828** accuracy.

Here's an image I got by visualizing the kernels in Matlab.

They look like Gabor filters, right?

Hey guys, I've made a version 2, which supports multiple Conv and Pooling layers; check it out **HERE**.

And **here**'s version 3.0.

Enjoy it 🙂

### References:

Jake Bouvrie. Notes on Convolutional Neural Networks

## 39 Comments

hi, if the input is color-image, how to design the filters and the whole net?

Hey Alex,

You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image.

I believe there’re other ways that deal with multi-channel images, but I have not tried before.

🙂

Hi Eric,

You wrote: vector layer. Is that wrong? The vector has no argument. Please check it out. Thanks.

Yeah you’re right, it means different kernels in conv layer, better to call it “kernel”, thanks.

Thanks for your free code, it is very clear, but I was puzzled by some of it.

in the function

read_Mnist(string filename, vector<Mat> &vec)

the code is

tpmat.at<uchar>(r, c) = (int) temp;

Why do you cast “temp” from (uchar) to (int)? I think “temp” will overwrite “tpmat”, because “tpmat” is defined as “CV_8UC1”.

When I use your code, training has taken days (2 days so far) and I only get 0.30 accuracy.

How long did you train to get 0.9828 accuracy?

Hi,

Thanks for mentioning.

First, the MNIST dataset guarantees that each of its elements is in [0, 255], so it's OK to use int. Btw, it's just because I'm accustomed to using “int” with CV_8UCx-format things :p

Second, it did indeed take me days to train this net; you can also try the following net instead:

https://github.com/xingdi-eric-yuan/multi-layer-convnet

If you have further problem with this, let me know, and I’ll check if this version I put on github is buggy.

Thanks.

Hi Eric,

First of all, I’d like to thank you for your codes and explanation. It’s been great help for me to map theory and practice.

I have a similar symptom as zhenghx. First, I had nothing changed (Num HiddenNeaurons = 200) and it gave me 0.110 accuracy within 6 hours. I then tried Num HiddenNeaurons = 500 which ran for about 2 days – yielded

learning step: 199998, Cost function value = 0.123866, randomNum = 47623

learning step: 199999, Cost function value = 0.106693, randomNum = 9355

correct: 3969, total: 10000, accuracy: 0.3969

Totally used time: 199141 second

Could you let me know what setting you had for 0.98 accuracy? I got this code from the github last weekend.

Thank you very much!

Hey Daniel,

First, try to use Max Pooling by:

int Pooling_Methed = POOL_MAX;

It was months ago that I got that 0.98 accuracy result, I’m not sure how many kernels I used but I’m sure it was Max Pooling. (Sorry)

Second, by debugging my newest version of CNN, I found some bugs that also exist in my old versions (including this one), so when the version which I’m working on is bug-free, I’ll also edit these versions of code.

Thank you.

yeah, it is very useful post for me, i want to ask about stochastic pooling,

I dont understand your explain why just randomly choose number from 0 to 8

more details please

thanks!

Hi Peter,

In that specific case we have 9 elements in total, so what I meant was just to randomly choose one element in the 3*3 matrix, say, 4, find out which interval the element with the 4th largest probability falls in, and take the value of that interval as the result. The result is not chosen by any restriction like largest value, largest probability, or something like it, which is why this method is called “stochastic” pooling; however, with this method, elements with larger probability do have a better chance of being chosen. Did I explain it more clearly this time?

Thanks,

Eric

Actually I do not think you choose elements according to their probabilities with your method. You choose a random “rank” of the elements, which is just equivalent to choosing a random element in this pool.

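Hello, I'm back with another question. I implemented the UFLDL architecture and reached 98% recognition accuracy on MNIST, but when I visualize the first-layer filters I don't get Gabor-like templates. What could be the reason? Thanks.

What do your trained kernels look like? If you trained on MNIST, it's perfectly normal that they don't look like Gabor filters: Gabor-like kernels come from training on natural images, while kernels trained on MNIST look like pen strokes.

http://photo.163.com/junhun-2008/#m=2&aid=195835179&pid=9137011744 This is my result; it doesn't even seem to look like strokes... These are 20 convolution kernels of size 9*9, normalized to 0~255 and enlarged 5 times.

I think the kernels you trained make a lot of sense. You can see shapes resembling Morlet wavelets: black-white, or black-white-black, white-black-white, and so on, at different positions and orientations, which correspond to edge detectors and line detectors at different positions and orientations. Some parameters may still need tuning, such as the number of kernels or the regularization.

Thanks a lot, I'll keep tuning the parameters. Since I wrote the program myself, I keep worrying that I made a mistake somewhere or that the precision isn't sufficient. By the way, the kernels you published look really nice~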

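I have 50 training samples, each 1534-dimensional, with labels of size 50*1. I also have 50 test samples, also 1534-dimensional, with labels of size 50*1. How can I use a convolutional neural network on them? Thanks.

Hi He Lang, in my experience you have too few training samples relative to the 1534-dimensional features, so running a convolutional neural network on this data will inevitably overfit. I suggest increasing the number of samples (I think at least thousands or tens of thousands), or applying PCA to the samples to reduce the dimensionality. I don't know what kind of data you have; if it isn't images, a convolutional network may not work very well.

Thanks for your patient answer. The 50 is the number of samples, and 1534 is the dimensionality of the features extracted from them. I want to use a convolutional neural network for regression, then compare the predicted values with the true values to get the root-mean-square error. I read a speech recognition paper by Li Deng, but I couldn't follow the details, so I'm still confused.

I haven't read Li Deng's paper, so I'm not sure. I don't know the specific methods used for regression on speech data, but as I understand it, the convolutional network here may not be ideal: given the nature of 2-D convolution, the kernels we train are descriptors of 2-D features (both x and y orientation). For audio, you could try a 1-D convolutional network. Does that kind of network have any relation to recurrent neural networks? I'm not familiar with this area, so you may want to look into it. Also, I still think 50 samples is too few...

Or could you tell me how to feed my data into the convolutional neural network for training and testing, and which parameters I need to change? Thanks.

Code for a 1-D convolutional neural network, or could you just describe the idea? Thanks.

I haven't implemented it myself. I think it's basically a matter of replacing the 2-D convolution with a 1-D one, like an ordinary convolution: convolving a signal of length m with a kernel of length n gives an output of length m-n+1. Forward propagation is easy to understand; the back-propagation formulas may take some thought or some reading. The non-linearity layer after the convolutional layer should still be needed. After that it's no different from the 2-D case: fully connected layers and then the output.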

Hello Eric,

I want to get to know CNN as best i can. Looking through the code, it is not yet implemented with the GPU. I am thinking of making a branch in git and contributing. Do you think that it would work? Are changes needed first?

Regards,

Daniel

Hi Daniel,

Here’s a CUDA version of my code which is implemented by zhxfl, you may want to check it, thanks.

https://github.com/zhxfl/CUDA-CNN

Awesome. I’ll check it out and maybe bring in the OpenCL methods.

Thank you

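Hello, first of all thank you for your code.

I only started learning about CNNs a month ago. I downloaded your code and ran it in VS2013 with OpenCV already set up, but it can't open the MNIST dataset. Why is that?

Also, I wrote my own CNN: input layer + two convolutional layers with (mean) pooling + fully connected layer + softmax. The input images are 28*28. With 140 images the error converges quickly, but with 700 images it doesn't converge. The images are for character recognition, so they differ quite a lot from each other. What could the possible causes be? Admittedly, the longest I've trained is only two days.

One more question: if an activation function is applied after the pooling layer, especially the sigmoid, the gradient vanishes during back-propagation and even small sample sets fail to converge. Is that normal? Also, using softmax as the last layer works worse than an ordinary output layer.

Thanks.

1. MNIST may fail to open because my code was written on a Unix system; you may need to change the file-reading part to use Windows-style paths.

2. How big are your convolution kernels? How many hidden nodes are in the fully connected layer? Also, I suspect mean pooling may be what prevents convergence; I find max pooling works much better than mean pooling.

3. I still think it's a mean pooling problem. Also, softmax doesn't seem to have much of an advantage on small datasets, but once the data gets more complex and larger (and there are more classes), softmax shows its advantage.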

Hi Eric,

I tried to run the code, with some changes

(all changes were made to compile the original code. Lists are here :

Line 37 : changed POOL_MAX to POOL_STOCHASTIC

Line 117 and etc : cast sqrt()’s and pow()’s parameters to (double), therefore not causing ‘ambiguous call to overloaded function’ error )

But whenever I evaluate something like trainX[0].cols while debugging, a ‘Debug Assertion Failed!’ message box comes up, pointing out that in Microsoft Visual Studio 10.0\VC\include\vector the vector subscript is out of range.

Is this a sort of ‘ambiguous function call’? Can you help me to run your code?

Thanks.

Hi Santuk,

I’m not very familiar with MSVS, I don’t know if there’s any MS’s version of vector, so maybe use std::vector whenever we need vector?

Or maybe you’re using any other dataset which has different image size with MNIST?

Hello, Eric!

I’m just trying to run your code (thereby learning CNN), but whenever I run this, I get error :

debug assertion failed : from vc\include\vector.h, whenever debugger meets trainX[0].cols or textX[0].cols or so on.

Can you help me running this code?

Thanks!

Hi Eric,

Could you let me know how to change the output layer to make CNN to space displacement neural network (SDNN) ?

Thanks

oh，i am so sorry.

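I don't understand why the randomly generated number led to choosing 1.4; from a probability standpoint it seems very unreasonable.

Please explain, thanks. If you can't read Chinese, you can use Baidu Translate. Thanks.

Hi, in the stochastic pooling part, choosing 1.4 was just an arbitrary example. Which one actually gets chosen of course depends on the output of the random number generator :)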

Hi, Eric

First I wanna thank you for this tutorial, it really helped me a lot.

Do you have this code in MATLAB by any chance? if you do, will you please send it to my email?

I cant understand the mechanism very well with the c++ code that you provided.

Thank you in advance.

Hi Eric,

Thank you so much for this wonderful job. I really appreciate that you’ve posted your codes here. I wonder if you have the code in Python as well as I’m using opencv python.

Thank you,

Ari

Hi Eric,

Thank you for sharing your knowledge :). In your code, trainY & testY are 1 by n mats; I mean to say the labels are 1, 2, 3… etc. How do I train your network if my label vectors are, say, [1,0,0], [0,1,0] and [0,0,1] for a 3-class problem? I tried to find where you convert the integer labels to binary vector form but was not able to find it. Could you please let me know how to do that? Thanks in advance.

Mr. Eric

I cannot understand how do I create my own test and training data, say if I want to train a classifier for face/text detection.

Even in this post :: http://yann.lecun.com/exdb/mnist/

I can’t figure out how to make the dataset in ubyte format.

I get a “POOL_STOCHASTIC is undeclared identifier” error message. Is it declared in any header file?
