Back Propagation Neural Networks

I used to think that the BP neural network algorithm belongs to supervised learning; however, after learning about the Sparse Autoencoder algorithm, I realized it can also be used for unsupervised learning (using the unlabeled data itself as both input and output). BP neural networks are the basis of many more advanced neural network algorithms; the method is simple, but powerful.


A neural network is built from a certain number of computational units; each of these units is called a "neuron". The reason for using these biological words is that neuroscientists have told us how our brains work, and we are trying to mimic part of a brain to make machines more powerful.

First, as usual, we establish an idealized model that describes the neural system. Our definition of a neuron looks like this:

[Figure: a single neuron with inputs X1, X2, X3, a "+1" bias unit, and outputs O1, O2, O3]

A neuron is a unit with inputs and outputs. Each input carries a weight that decides how that input value is distributed, i.e. what fraction of the input value the neuron receives. For the neuron above, the total input is

W1 * X1 + W2 * X2 + W3 * X3 + 1 * b,

where b is the bias. Sometimes we also treat this bias b as W0 and set X0 to always have the value +1, so the input can also be written as W^T X. What about the output? The output is output = f(input); we call this f() the activation function. Here I'll use the sigmoid function as the activation function (you can use tanh() as well, as I said in the Logistic Regression post), so O1 = O2 = O3 = sigmoid(W^T X).


If we connect multiple neurons, we get a neural network. A simple example looks like this:



In this particular case, we say the network has 3 layers: 1 input layer (L1), 1 hidden layer (L2), and 1 output layer (L3). Both the input layer and the hidden layer have a "+1" unit; we call these bias units. We can either count the bias as one of the neurons in its layer or not; this depends on how you implement the neural network algorithm. At the rightmost of this network, the output of the output layer is hW,b(x), meaning the output value calculated using the current weights W and b; this is exactly the same as in Linear Regression and Logistic Regression.


Our goal is to obtain, by training, a set of weights that minimizes the error between the value calculated through this network and the actual value in a dataset, so the training process does nothing but iteratively update the weight values. The training algorithm is as follows.

  1. Randomly initialize weights.
  2. Implement forward propagation to get hW,b(x) for any x.
  3. Implement code to compute cost function J(θ).
  4. Implement back propagation to compute partial derivatives.
  5. Do gradient checking.
  6. Use gradient descent or other advanced optimization method with back propagation to try to minimize J(θ) as a function of parameters θ.

Step by step, let's begin building a BP neural network algorithm!

First we initialize our weight matrices with random values in (-a, a), where a is some small number, say 0.12.

Then we implement forward propagation. For example, to calculate the activations of the example network above, this is how we do it.


Forward propagation means we calculate each activation a from left to right, layer by layer, where f is the activation function (sigmoid or tanh). We use the input x directly as a1, so if we view each layer's activations as a vector, we can simplify the calculation to:


and hW,b(x) is the activation vector of the rightmost layer.

The next step is to implement a function that calculates the value of the cost function; here is the formula.


Here, m is the number of training samples. The first part is the average sum-of-squares error term, and the second part is called the regularization term (weight decay), which tends to decrease the magnitude of the weights and helps prevent overfitting.
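The original formula image is missing here; judging from the description above and the getCostFunction implementation below (which also divides the weight-decay term by the sample count m), it presumably has the standard form:

```latex
J(W,b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\,\bigl\lVert h_{W,b}(x^{(i)}) - y^{(i)} \bigr\rVert^{2}
\;+\; \frac{\lambda}{2m}\sum_{l}\sum_{j}\sum_{k}\bigl(W^{(l)}_{kj}\bigr)^{2}
```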

Then it's time to implement the function that calculates the partial derivatives of the cost function; here is how we do it using back propagation:

  • Perform a feedforward pass, computing the activations for layers L2, L3, and so on up to the output layer Ln.
  • For each output unit i in layer n (the output layer), set


  • From right to left, for l = n-1, n-2, n-3, …, 2, set


  • Compute the desired partial derivatives, which are given as:




  • Last, we compute the partial derivatives of J with respect to W and b by dividing by the number of samples and adding the regularization term:
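The equation images for these backprop steps are missing. For a sigmoid (or tanh) activation they presumably take the standard form, consistent with the tempDelta and bigDelta computation in getdJ below, where δ is the error term of each layer and Δ accumulates the gradient over the m samples:

```latex
\delta^{(n)}_i = \bigl(a^{(n)}_i - y_i\bigr)\, f'\!\bigl(z^{(n)}_i\bigr),
\qquad
\delta^{(l)} = \Bigl(\bigl(W^{(l)}\bigr)^{T} \delta^{(l+1)}\Bigr) \odot f'\!\bigl(z^{(l)}\bigr),
\qquad
\frac{\partial J}{\partial W^{(l)}} = \frac{1}{m}\sum_{i=1}^{m} \delta^{(l+1)}\bigl(a^{(l)}\bigr)^{T} + \lambda\, W^{(l)}
```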


Because we have already computed the activations of the network, we can easily calculate f'(z): for the sigmoid function, f'(z) = f(z) * (1 - f(z)) = a * (1 - a), so the derivative comes directly from the activation a with one multiplication (this is exactly what dsigmoid() does in the code below).

Before we train the weight matrices, we can use the gradient checking method to verify that our cost function and partial-derivative function work correctly. For example, take the cost function J(θ): if we pick a small Epsilon, compute J(θ + Epsilon) and J(θ - Epsilon), and divide the difference of these two values by two Epsilon, the result approximates the derivative of the cost function; in fact, in the limit as Epsilon goes to zero, this is exactly the mathematical definition of the derivative:


By perturbing every value in the weight matrices this way and comparing the result with the derivative we got from back propagation, we can tell whether our functions work correctly. In practice, it's good to set this Epsilon to 1e-4. It is important to remember that once we have confirmed the algorithm works, we should disable the gradient-checking code: it is a very slow calculation, and we don't need to run it in every iteration once we are already sure our functions give the right outputs.

Now, with all of the above functions in place, the last thing we need to do is implement gradient descent (or a similar optimization algorithm) and update the weight parameters to minimize the cost function. The update rule (matching the training loop in the code, which does weightsMatrix[j] -= lrate * dJ[j]) is:

W := W - α * ∂J/∂W

in which α is the learning rate.



I implemented it in C++ with Armadillo (a C++ linear algebra library). You can set parameters as you like, such as the number of hidden layers, how many neurons each hidden layer has, and so on. Play with it and enjoy!

// Back Propagation Neural Networks

#include <iostream>
#include <vector>
#include <math.h>
#include <time.h>
#include <armadillo>
using namespace arma;
using namespace std;

#define elif else if
#define MAX_ITER 10000

double lrate = 0.1;
double lambda = 0.0;

int numHiddenLayerNode;
int numOutputNodes;
int numHiddenLayers = 1;

colvec vec2colvec(vector<double>& vec){
    int length = vec.size();
    colvec A(length);
    for(int i=0; i<length; i++){
        A(i) = vec[i];
    }
    return A;
}

rowvec vec2rowvec(vector<double>& vec){
    colvec A = vec2colvec(vec);
    return A.t();
}

mat vec2mat(vector<vector<double> >&vec){
    int cols = vec.size();
    int rows = vec[0].size();
    mat A(rows, cols);
    for(int i = 0; i<rows; i++){
        for(int j=0; j<cols; j++){
            A(i, j) = vec[j][i];
        }
    }
    return A;
}

colvec log(colvec vec){
    for(int i = 0; i < vec.size(); i++){
        vec(i) = log(vec(i));
    }
    return vec;
}

rowvec log(rowvec vec){
    for(int i = 0; i < vec.size(); i++){
        vec(i) = log(vec(i));
    }
    return vec;
}

double sigmoid(double z){
    return 1.0 / (1.0 + exp(-z));
}

rowvec sigmoid(rowvec z){
    for(int i=0; i<z.size(); i++){
        z(i) = sigmoid(z(i));
    }
    return z;
}

colvec sigmoid(colvec z){
    rowvec temp = z.t();
    return (sigmoid(temp)).t();
}

//note: the argument is the activation a = f(z), since f'(z) = a * (1 - a)
double dsigmoid(double z){
    return z * (1.0 - z);
}

colvec dsigmoid(colvec a){
    colvec one = ones<colvec>(a.size());
    return a % (one - a);
}

rowvec dsigmoid(rowvec a){
    rowvec one = ones<rowvec>(a.size());
    return a % (one - a);
}

vector<colvec> getActivation(mat x, vector<mat>& weightsMatrix, int m){

    vector<colvec> a;
    colvec temp1(x.n_rows);
    a.push_back(temp1);
    for(int i=0; i<numHiddenLayers; i++){
        colvec temp(numHiddenLayerNode);
        a.push_back(temp);
    }
    colvec temp2(numOutputNodes);
    a.push_back(temp2);
    colvec one = ones<colvec>(1);
    for(int i = 0; i < a.size(); i++){
        if(i == 0) a[i] = x.col(m);
        else{
            colvec xtemp = a[i - 1];
            xtemp = join_cols(one, xtemp);
            a[i] = weightsMatrix[i - 1] * xtemp;
            a[i] = sigmoid(a[i]);
        }
    }
    return a;
}

//h(xi) is just last vector of a
colvec gethm(vector<colvec> a){
    return a[a.size() - 1];
}

double getCostFunction(mat x, vector<mat>& weightsMatrix, mat y, double lambda){

    int nsamples = x.n_cols;
    double sum = 0.0;
    for(int m = 0; m < nsamples; m++){
        vector<colvec> a = getActivation(x, weightsMatrix, m);
        colvec hx = gethm(a);
        colvec err = hx - y.col(m);
        err = err % err;
        double temp = 0.0;
        for(int i=0; i<err.size(); i++){
            temp += err(i);
        }
        sum += temp / 2;
    }
    sum = sum / (double)nsamples;
    double temp = lambda / 2 / nsamples;
    double sum2 = 0.0;
    for(int i = 0; i < weightsMatrix.size() - 1; i++){
        for(int j = 0; j < weightsMatrix[i].n_cols; j++){
            for(int k = 0; k < weightsMatrix[i].n_rows; k++){
                sum2 += weightsMatrix[i](k, j) * weightsMatrix[i](k, j);
            }
        }
    }
    return sum + temp * sum2;
}

vector<mat> getdJ(mat x, mat y, vector<mat>& weightsMatrix, double lambda){
    //bigDelta holds temp variables for calculating dJ;
    //initialize every element of bigDelta to zero.
    vector<mat> bigDelta;
    for(int i=0; i<weightsMatrix.size(); i++){
        mat temp = zeros<mat>(weightsMatrix[i].n_rows, weightsMatrix[i].n_cols);
        bigDelta.push_back(temp);
    }
    vector<mat> dJ;
    for(int i=0; i<weightsMatrix.size(); i++){
        mat temp = zeros<mat>(weightsMatrix[i].n_rows, weightsMatrix[i].n_cols);
        dJ.push_back(temp);
    }
    int nsamples = x.n_cols;
    //use backProp method
    for(int m = 0; m < nsamples; m++){
        vector<colvec> a = getActivation(x, weightsMatrix, m);
        vector<colvec> tempDelta;
        for(int i=0; i<a.size(); i++){
            colvec temp = zeros<colvec>(a[i].size());
            tempDelta.push_back(temp);
        }
        //no tempDelta[0]
        for(int l = tempDelta.size() - 1; l > 0; l --){
            if(l == tempDelta.size() - 1){
                tempDelta[l] = (a[l] - y.col(m)) % dsigmoid(a[l]);
            }else{
                mat mult = weightsMatrix[l].t() * tempDelta[l + 1];
                tempDelta[l] = mult.rows(1, mult.n_rows - 1) % dsigmoid(a[l]);
            }
        }
        for(int l = 0; l < bigDelta.size(); l++){
            colvec tp = ones<colvec>(1);
            tp = join_cols(tp, a[l]);
            bigDelta[l] += tempDelta[l + 1] * tp.t();
        }
    }
    for(int l = 0; l < bigDelta.size(); l++){
        dJ[l] = bigDelta[l] / (double)nsamples;
        dJ[l] = dJ[l] + lambda * weightsMatrix[l];
        //do not regularize the bias column (column 0)
        for(int j = 0; j < dJ[l].n_rows; j++){
            dJ[l](j, 0) = dJ[l](j, 0) - lambda * weightsMatrix[l](j, 0);
        }
    }
    return dJ;
}

colvec calculateY(colvec x, vector<mat> weightsMatrix){

    colvec result(x);
    colvec tp = ones<colvec>(1);
    for(int i=0; i<weightsMatrix.size(); i++){
        result = join_cols(tp, result);
        result = weightsMatrix[i] * result;
        result = sigmoid(result);
    }
    return result;
}

void bpnn(vector<vector<double> >&vecX, vector<vector<double> >&vecY, vector<vector<double> >& testX, vector<vector<double> >& testY){

    int nsamples = vecX.size();
    int nfeatures = vecX[0].size();
    //change vecX and vecY into matrix or vector.
    mat y = vec2mat(vecY);
    mat x = vec2mat(vecX);
    numHiddenLayerNode = nfeatures * 5;
    numOutputNodes = vecY[0].size();
    //build weights matrices and randomly initialize them.
    vector<mat> weightsMatrix;
    mat tempmat;
    double init_epsilon = 0.12;
    //input --> first hidden layer:
    tempmat = randu<mat>(numHiddenLayerNode, nfeatures + 1);
    weightsMatrix.push_back(tempmat);
    //hidden layer --> hidden layer :
    for(int i=0; i< numHiddenLayers - 1; i++){
        tempmat = randu<mat>(numHiddenLayerNode, numHiddenLayerNode + 1);
        weightsMatrix.push_back(tempmat);
    }
    //last hidden layer --> output layer:
    tempmat = randu<mat>(numOutputNodes, numHiddenLayerNode + 1);
    weightsMatrix.push_back(tempmat);
    //scale weights from [0, 1) into (-init_epsilon, init_epsilon)
    for(int i=0; i<weightsMatrix.size(); i++){
        weightsMatrix[i] = weightsMatrix[i] * (2 * init_epsilon) - init_epsilon;
    }
    //till now, we finished building weights matrices.
    //Gradient Checking (remember to disable this part after you're sure the 
    //cost function and dJ function are correct)
    vector<mat> dJ = getdJ(x, y, weightsMatrix, lambda);
    double epsilon = 1e-4;
    for(int i=0; i<weightsMatrix.size(); i++){
        cout<<"################ Weight Layer "<<i<<endl;
        for(int j=0; j<weightsMatrix[i].n_rows; j++){
            for(int k=0; k<weightsMatrix[i].n_cols; k++){
                double memo = weightsMatrix[i](j, k);
                weightsMatrix[i](j, k) = memo + epsilon;
                double value1 = getCostFunction(x, weightsMatrix, y, lambda);
                weightsMatrix[i](j, k) = memo - epsilon;
                double value2 = getCostFunction(x, weightsMatrix, y, lambda);
                double tp = (value1 - value2) / (2 * epsilon);
                cout<<j<<", "<<k<<", "<<tp<<", "<<dJ[i](j, k)<<", "<<tp / dJ[i](j, k)<<endl;
                weightsMatrix[i](j, k) = memo;
            }
        }
    }
    //gradient descent training loop
    int converge = 0;
    double lastcost = 0.0;
    while(converge < MAX_ITER){
        vector<mat> dJ = getdJ(x, y, weightsMatrix, lambda);
        for(int j = 0; j < weightsMatrix.size(); j++){
            weightsMatrix[j] -= lrate * dJ[j];
        }
        double cost = getCostFunction(x, weightsMatrix, y, lambda);
        cout<<"learning step: "<<converge<<", Cost function value = "<<cost<<endl;
        if(fabs((cost - lastcost) ) <= 5e-6 && converge > 0) break;
        lastcost = cost;
        ++ converge;
    }
    //evaluate accuracy on the test set
    int correct = 0;
    for(int i=0; i<testX.size(); i++){
        colvec tpcol = vec2colvec(testX[i]);
        colvec result = calculateY(tpcol, weightsMatrix);
        if((result(0) >= 0.5 && testY[i][0] == 1) || (result(0) < 0.5 && testY[i][0] == 0))
            ++ correct;
    }
    cout<<"right: "<<correct<<", total: "<<testX.size()<<", accuracy: "<<double(correct) / (double)(testX.size())<<endl;
}

int main(int argc, char** argv)
{
    long start, end;
    //read training X from .txt file
    FILE *streamX, *streamY;
    streamX = fopen("trainX.txt", "r");
    int numofX = 30;
    vector<vector<double> > vecX;
    double tpdouble;
    int counter = 0;
    while(1){
        if(fscanf(streamX, "%lf", &tpdouble)==EOF) break;
        if(counter / numofX >= vecX.size()){
            vector<double> tpvec;
            vecX.push_back(tpvec);
        }
        vecX[counter / numofX].push_back(tpdouble);
        ++ counter;
    }
    fclose(streamX);

    cout<<vecX.size()<<", "<<vecX[0].size()<<endl;

    //read training Y from .txt file
    int numofY = 1;
    streamY = fopen("trainY.txt", "r");
    vector<vector<double> > vecY;
    counter = 0;
    while(1){
        if(fscanf(streamY, "%lf", &tpdouble)==EOF) break;
        if(counter / numofY >= vecY.size()){
            vector<double> tpvec;
            vecY.push_back(tpvec);
        }
        vecY[counter / numofY].push_back(tpdouble);
        ++ counter;
    }
    fclose(streamY);

    //sanity checks: every sample must have the same number of features,
    //and X and Y must have the same number of samples.
    for(int i = 1; i<vecX.size(); i++){
        if(vecX[i].size() != vecX[i - 1].size()) return 0;
    }
    if(vecX.size() != vecY.size()) return 0;

    //read test X from .txt file
    streamX = fopen("testX.txt", "r");
    vector<vector<double> > vecTX;
    counter = 0;
    while(1){
        if(fscanf(streamX, "%lf", &tpdouble)==EOF) break;
        if(counter / numofX >= vecTX.size()){
            vector<double> tpvec;
            vecTX.push_back(tpvec);
        }
        vecTX[counter / numofX].push_back(tpdouble);
        ++ counter;
    }
    fclose(streamX);

    //read test Y from .txt file
    streamY = fopen("testY.txt", "r");
    vector<vector<double> > vecTY;
    counter = 0;
    while(1){
        if(fscanf(streamY, "%lf", &tpdouble)==EOF) break;
        if(counter / numofY >= vecTY.size()){
            vector<double> tpvec;
            vecTY.push_back(tpvec);
        }
        vecTY[counter / numofY].push_back(tpdouble);
        ++ counter;
    }
    fclose(streamY);

    start = clock();
    bpnn(vecX, vecY, vecTX, vecTY);
    end = clock();
    cout<<"used time: "<<((double)(end - start)) / CLOCKS_PER_SEC<<endl;

    return 0;
}


I used the Wisconsin Breast Cancer dataset, which has 284 training samples and 284 test samples, with 30 features per sample. You can download the dataset here:

trainX.txt   trainY.txt

testX.txt     testY.txt

Here's the test result. Training could stop sooner, because the accuracy is already good enough by learning step 1,000.







  1. Eric
    Posted October 3, 2015 at 2:06 pm | Permalink

    Mr Yang could you explain me what is the physical meaning of the results?. Thanks

  2. Eric
    Posted October 14, 2015 at 12:17 am | Permalink

    Mr Yang
    Could you tell me how to modify this code instead of test data it could make the process with input data and over it make forecast?

