I chose Prof. Yann LeCun's Deep Learning as one of the three courses I am attending this semester. Since I have no background in machine learning, I have also been watching Prof. Andrew Ng's Machine Learning lectures online these days.

The lectures begin with linear regression. I had heard about simple linear regression years ago in a statistics class, but all I knew was something like the Newton-Raphson method, and I had never really implemented it. So I summarized the method and implemented it using C++ and Armadillo (a C++ linear algebra library). This article is a study note to remind me of this knowledge, and it may also help others who are learning it.

**WHAT IS LINEAR REGRESSION**

Given some variables X and corresponding results Y, linear regression is an approach to find, or guess, the relationship between X and Y. If X contains only one-dimensional variables, then linear regression can be represented by the following graph, which we are all very familiar with.

Whenever someone talks about linear regression, they will most likely say something like "Suppose your friend wants to sell a house…". Given all those points above, we can guess the relationship between the area of a house and its price using linear regression.

Furthermore, when our X is multi-dimensional, especially more than 2-dimensional, it is hard to represent the problem with a graph, but the problem is still similar to the 1-d case: we are just guessing the θs in **θ_{0} + X_{1}θ_{1} + X_{2}θ_{2} + … + X_{j}θ_{j} = Y**, where j is the dimension of X.

**METHOD 1: GRADIENT DESCENT**

In most cases, when people say gradient descent they mean **batch gradient descent**, in which the method looks at **every** example in the entire training set on every step. We use the LMS (Least Mean Squares) algorithm here; its update rule is:

**θ_{j} := θ_{j} − α · ∂J(θ)/∂θ_{j}**

In the above formula, **α** is called the learning rate, which partly determines the length of each iteration step; **J(θ)** is called the cost function, which is defined as:

**J(θ) = (1/2) Σ_{i=1}^{m} (h_{θ}(x^{(i)}) − y^{(i)})^{2}**

In **J(θ)** we can see **h_{θ}(x)**, which is the predicted value **y'** calculated using the current **θ**. Following the LMS algorithm, what we do is try to find a group of **θ** that makes the sum of squared errors minimal.

In the **θ_{j}** update formula above, we must work out the partial derivative of **J(θ)**. Geometrically, a partial derivative can be represented by the slope of a curve; since it can be either positive or negative, and its magnitude varies from point to point, it is the other factor that decides the step length of each iteration.

After doing the calculus, we get the following formula:

**θ_{j} := θ_{j} + α Σ_{i=1}^{m} (y^{(i)} − h_{θ}(x^{(i)})) x_{j}^{(i)}**

in which j indexes the features and m is the number of training samples. We keep iterating until **θ** converges; converge means its value does not change much after further training.

There is another kind of gradient descent called **stochastic gradient descent**, an algorithm that repeatedly runs through the training set and, each time it encounters a training example, updates the parameters according to the gradient of the error with respect to that single training example only. It can be represented by the following formula:

**θ_{j} := θ_{j} + α (y^{(i)} − h_{θ}(x^{(i)})) x_{j}^{(i)}** (for each training example i)

If the training set is very large, we can use this method and update **θ_{j}** after every single example. We may get a **θ** that is good enough from only part of the examples, so we can stop early instead of running through the whole training set on every step.

Additionally, there are several things that need to be pointed out.

Firstly, it is important and tricky to choose a good learning rate **α**. If it is too small, it will take a very long time for **θ** to converge; furthermore, if we move only a tiny step each time, we may fall into small local minima created by noise. If **α** is too big, **θ** may never converge, because every step strides over the minimum we are looking for.

Secondly, when we have multi-dimensional training examples X, we can converge faster by using a trick called feature scaling: if the ranges of the features are not in the same order of magnitude, we can normalize them.

Thirdly, it is notable that gradient descent works well when the cost function has only one global optimum and no other local optima; in that case, gradient descent always converges to the global minimum. However, if there are local minima, it is not a good idea to use the above method, because it may get stuck in one of them.

**METHOD 2: THE NORMAL EQUATIONS**

We can use a simple equation to calculate **θ** directly, that is:

**θ = (X^{T}X)^{−1} X^{T} y**

It is easy to implement and easy to remember. I will not go into unnecessary detail about this method, but there are still some things I need to point out.

1. If for some reason **X^{T}X** has no inverse (for example, when features are linearly dependent), just use the **pinv()** function in Matlab/Octave/Armadillo, which returns the pseudoinverse of the matrix.

2. Since the time complexity of matrix inversion is O(N^{3}), this method will be really slow when X is very large.

**IMPLEMENTATION**

The full implementation of all three methods (batch gradient descent, stochastic gradient descent, and the normal equation) using C++ and Armadillo:

```cpp
#include <vector>
#include <cmath>
#include <armadillo>

using namespace std;
using namespace arma;

struct normal {
    double scale;
    double translate;
};

colvec vec2colvec(vector<double>& vec){
    int length = vec.size();
    colvec A(length);
    for(int i = 0; i < length; i++){
        A(i) = vec[i];
    }
    return A;
}

// Each inner vector of vec is one feature column; build a
// (samples x features) matrix.
mat vec2mat(vector<vector<double> >& vec){
    int col = vec.size();
    int row = vec[0].size();
    mat A(row, col);
    for(int i = 0; i < col; i++){
        for(int j = 0; j < row; j++){
            A(j, i) = vec[i][j];
        }
    }
    return A;
}

// Compute the scale and translate needed to map one feature
// into [newmin, newmax].
normal getScale(vector<double>& vec, double newmin, double newmax){
    double max = vec[0];
    double min = vec[0];
    for(size_t i = 1; i < vec.size(); i++){
        if(vec[i] > max) max = vec[i];
        if(vec[i] < min) min = vec[i];
    }
    normal result;
    result.translate = (max - min) / 2 - max;
    result.scale = (max - min) / (newmax - newmin);
    return result;
}

colvec BatchGradientDescent(vector<vector<double> >& vecX, vector<double>& vecY){
    int nfeatures = vecX.size() + 1;
    int nsamples = vecX[0].size();

    // add X0 = 1
    vector<double> tempvec(nsamples, 1.0);
    vecX.insert(vecX.begin(), tempvec);

    /*
    // optional feature scaling (X0 should not be normalized):
    vector<normal> nor;
    normal tpnor;
    tpnor.scale = 1.0;
    tpnor.translate = 0.0;
    nor.push_back(tpnor);
    for(int i = 1; i < nfeatures; i++){
        nor.push_back(getScale(vecX[i], -1.0, 1.0));
    }
    for(int i = 1; i < nfeatures; i++){
        for(int j = 0; j < nsamples; j++){
            vecX[i][j] += nor[i].translate;
            vecX[i][j] /= nor[i].scale;
        }
    }
    */

    colvec y = vec2colvec(vecY);
    mat x = vec2mat(vecX);

    double lrate = 0.0001;                 // learning rate

    colvec theta(nfeatures, fill::zeros);  // initialize theta to zeros
    colvec thetatemp;
    int counter = 0;
    while(1){
        rowvec thetaT = theta.t();
        colvec descent(nfeatures);
        for(int j = 0; j < nfeatures; j++){
            double sum = 0.0;
            for(int i = 0; i < nsamples; i++){
                colvec xi = x.row(i).t();
                double h = as_scalar(thetaT * xi);
                sum += (y(i) - h) * xi(j);
            }
            descent(j) = lrate * sum;
        }
        thetatemp = theta + descent;

        // converged when every theta(i) changed by a tiny relative amount;
        // the small epsilon guards against dividing by theta(i) == 0
        int converge = 0;
        for(int i = 0; i < nfeatures; i++){
            if(fabs(theta(i) - thetatemp(i)) / (fabs(theta(i)) + 1e-12) > 0.00001) break;
            else ++converge;
        }
        if(converge == nfeatures || ++counter > 100000){
            return thetatemp;
        }
        theta = thetatemp;
    }
}

colvec StochasticGradientDescent(vector<vector<double> >& vecX, vector<double>& vecY){
    int nfeatures = vecX.size() + 1;
    int nsamples = vecX[0].size();

    // add X0 = 1
    vector<double> tempvec(nsamples, 1.0);
    vecX.insert(vecX.begin(), tempvec);

    // (optional feature scaling: same commented-out block as in
    // BatchGradientDescent above)

    colvec y = vec2colvec(vecY);
    mat x = vec2mat(vecX);

    double lrate = 0.0001;                 // learning rate

    colvec theta(nfeatures, fill::zeros);  // initialize theta to zeros
    colvec thetatemp(nfeatures);
    int counter = 0;
    while(1){
        thetatemp = theta;
        for(int j = 0; j < nfeatures; j++){
            for(int i = 0; i < nsamples; i++){
                double thetaj = theta(j);
                rowvec thetaT = theta.t();
                colvec xi = x.row(i).t();
                double h = as_scalar(thetaT * xi);
                // update theta(j) using this single example only
                theta(j) += lrate * (y(i) - h) * xi(j);
                if(fabs(theta(j) - thetaj) / (fabs(theta(j)) + 1e-12) <= 0.00001) break;
            }
        }
        int converge = 0;
        for(int i = 0; i < nfeatures; i++){
            if(fabs(theta(i) - thetatemp(i)) / (fabs(theta(i)) + 1e-12) > 0.00001) break;
            else ++converge;
        }
        if(converge == nfeatures || ++counter > 100000){
            return theta;
        }
    }
}

colvec NormalEquation(vector<vector<double> >& vecX, vector<double>& vecY){
    // add X0 = 1
    vector<double> tempvec(vecX[0].size(), 1.0);
    vecX.insert(vecX.begin(), tempvec);

    colvec y = vec2colvec(vecY);
    mat x = vec2mat(vecX);
    mat xT = x.t();
    // theta = pinv(X^T X) * X^T * y
    return pinv(xT * x) * xT * y;
}
```

Happy Chinese New Year !!!

:p
