
Softmax Regression

February 28, 2014

WHAT IS SOFTMAX

Softmax regression usually appears as one module in a deep learning network, and most likely as the last module, the output module. What is it? It is a generalized version of logistic regression. Just like logistic regression, it belongs to supervised learning; its advantage is that the class label y can take more than two values. In other words, compared with logistic regression, the number of classes is no longer restricted to two. However, someone may recall that we can use the "one vs. others" method to deal with multi-class problems. Let's hold that question for a while and look at the formulae of softmax regression.

First, let's look at its hypothesis h_\theta(x):

h_\theta(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 \mid x^{(i)}; \theta) \\
P(y^{(i)} = 2 \mid x^{(i)}; \theta) \\
\vdots \\
P(y^{(i)} = k \mid x^{(i)}; \theta)
\end{bmatrix}
= \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}
\begin{bmatrix}
e^{\theta_1^T x^{(i)}} \\
e^{\theta_2^T x^{(i)}} \\
\vdots \\
e^{\theta_k^T x^{(i)}}
\end{bmatrix}

in which k means that, given a test input x, the output y can take any value in the set {1, 2, …, k}, and we want to estimate the probability of the class label taking on each of the k different possible values. So h_\theta(x) will be a k-dimensional vector.

However, when the products \theta_j^T x^{(i)} are large, the exponentials e^{\theta_j^T x^{(i)}} become very large and may overflow. To prevent overflow, we can simply subtract some large constant value from each of the \theta_j^T x^{(i)} terms before computing the exponential; the probabilities are unchanged. In practice, for each example, you can use the maximum of the \theta_j^T x^{(i)} terms as that constant:

P(y^{(i)} = j \mid x^{(i)}; \theta)
= \frac{e^{\theta_j^T x^{(i)} - \psi}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)} - \psi}}
= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}},
\qquad \psi = \max_{l} \theta_l^T x^{(i)}
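
Here is a minimal sketch of that trick in Armadillo (the function name stable_softmax is mine, not part of the program below): scores are stored one example per column, the per-column maximum is subtracted before exponentiating, and each column of the result sums to 1.

#include <armadillo>

// Numerically stable column-wise softmax.
// scores(j, i) holds theta_j^T x^{(i)}; the returned matrix holds
// P(y^{(i)} = j | x^{(i)}; theta) in the same layout.
arma::mat stable_softmax(const arma::mat& scores){
    // subtract the per-column maximum so exp() cannot overflow
    arma::mat shifted = scores - arma::repmat(arma::max(scores, 0), scores.n_rows, 1);
    arma::mat e = arma::exp(shifted);
    return e / arma::repmat(arma::sum(e, 0), e.n_rows, 1);
}

The update_CostFunction_and_Gradient and calculateY functions in the source code below apply exactly this subtraction with repmat(max(M, 0), nclasses, 1).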

The cost function and its gradient follow:

J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{ y^{(i)} = j \} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right]

where 1\{\cdot\} is the indicator function, equal to 1 when its condition holds and 0 otherwise. Adding the weight decay term gives

J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{ y^{(i)} = j \} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right] + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2

and the gradient with respect to \theta_j is

\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{ y^{(i)} = j \} - P(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_j

In this, the λ part is called weight decay, which penalizes large values of the parameters. After implementing these parts, we can use gradient descent or other advanced optimization methods to obtain the optimal θ by minimizing J(θ) with respect to θ.
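
Concretely, a batch gradient descent step updates every row of θ at once, with α the learning rate (lrate in the code below):

\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta), \qquad j = 1, \dots, k

which corresponds to the single line weightsMatrix -= lrate * grad in the source code.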

SOURCE CODE

// Softmax regression

#include <iostream>
#include <vector>
#include <cstdio>
#include <ctime>
#include <cmath>
#include <cfloat>
#include <armadillo>
using namespace arma;
using namespace std;

#define elif else if
#define MAX_ITER 100000

double cost = 0.0;
mat grad;
double lrate = 0.1;
double lambda = 0.0;
int nclasses = 2;

colvec vec2colvec(vector<double>& vec){
    int length = vec.size();
    colvec A(length);
    for(int i=0; i<length; i++){
        A(i) = vec[i];
    }
    return A;
}

rowvec vec2rowvec(vector<double>& vec){
    colvec A = vec2colvec(vec);
    return A.t();
}

mat vec2mat(vector<vector<double> >&vec){
    int cols = vec.size();
    int rows = vec[0].size();
    mat A(rows, cols);
    for(int i = 0; i<rows; i++){
        for(int j=0; j<cols; j++){
            A(i, j) = vec[j][i];
        }
    }
    return A;
}

void update_CostFunction_and_Gradient(mat x, rowvec y, mat weightsMatrix, double lambda){

    int nsamples = x.n_cols;
    int nfeatures = x.n_rows;
    //calculate cost function
    mat theta(weightsMatrix);
    mat M = theta * x;                         // M(j, i) = theta_j^T x^{(i)}
    mat temp = repmat(max(M, 0), nclasses, 1); // per-sample maximum, for numerical stability
    M = M - temp;
    M = arma::exp(M);
    temp = repmat(sum(M, 0), nclasses, 1);
    M = M / temp;                              // M(j, i) = P(y^{(i)} = j | x^{(i)}; theta)

    mat groundTruth = zeros<mat> (nclasses, nsamples);
    for(int i=0; i<y.size(); i++){
        groundTruth(y(i), i) = 1;
    }
    temp = groundTruth % (arma::log(M));
    cost = -accu(temp) / nsamples;
    cost += accu(arma::pow(theta, 2)) * lambda / 2;

    //calculate gradient
    temp = groundTruth - M;
    temp = temp * x.t();
    grad = - temp / nsamples;
    grad += lambda * theta;
}

rowvec calculateY(mat x, mat weightsMatrix){

    mat theta(weightsMatrix);
    mat M = theta * x;
    mat temp = repmat(max(M, 0), nclasses, 1);
    M = M - temp;
    M = arma::exp(M);
    temp = repmat(sum(M, 0), nclasses, 1);
    M = M / temp;
    M = arma::log(M);
    rowvec result = zeros<rowvec>(M.n_cols);
    for(int i=0; i<M.n_cols; i++){
        double maxele = -DBL_MAX;   // log-probabilities are negative doubles, so do not use INT_MIN
        int which = 0;
        for(int j=0; j<M.n_rows; j++){
            if(M(j, i) > maxele){
                maxele = M(j, i);
                which = j;
            }
        }
        result(i) = which;
    }
    return result;
}

void softmax(vector<vector<double> >&vecX, vector<double> &vecY, vector<vector<double> >& testX, vector<double>& testY){

    int nsamples = vecX.size();
    int nfeatures = vecX[0].size();
    //change vecX and vecY into matrix or vector.
    rowvec y = vec2rowvec(vecY);
    mat x = vec2mat(vecX);

    double init_epsilon = 0.12;
    mat weightsMatrix = randu<mat>(nclasses, nfeatures);
    weightsMatrix = weightsMatrix * (2 * init_epsilon) - init_epsilon;
    grad = zeros<mat>(nclasses, nfeatures);

/*
    //Gradient Checking (remember to disable this part after you're sure the 
    //cost function and dJ function are correct)
    update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
    mat dJ(grad);
    cout<<"test!!!!"<<endl;
    double epsilon = 1e-4;
    for(int i=0; i<weightsMatrix.n_rows; i++){
        for(int j=0; j<weightsMatrix.n_cols; j++){
            double memo = weightsMatrix(i, j);
            weightsMatrix(i, j) = memo + epsilon;
            update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
            double value1 = cost;
            weightsMatrix(i, j) = memo - epsilon;
            update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
            double value2 = cost;
            double tp = (value1 - value2) / (2 * epsilon);
            cout<<i<<", "<<j<<", "<<tp<<", "<<dJ(i, j)<<endl;
            weightsMatrix(i, j) = memo;
        }
    }
    */

    int converge = 0;
    double lastcost = 0.0;
    while(converge < MAX_ITER){
        update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
        weightsMatrix -= lrate * grad;

        cout<<"learning step: "<<converge<<", Cost function value = "<<cost<<endl;
        if(fabs((cost - lastcost) ) <= 5e-6 && converge > 0) break;
        lastcost = cost;
        ++ converge;
    }
    cout<<"############result#############"<<endl;

    rowvec yT = vec2rowvec(testY);
    mat xT = vec2mat(testX);
    rowvec result = calculateY(xT, weightsMatrix);
    rowvec error = yT - result;
    int correct = error.size();
    for(int i=0; i<error.size(); i++){
        if(error(i) != 0) --correct;
    }
    cout<<"correct: "<<correct<<", total: "<<error.size()<<", accuracy: "<<double(correct) / (double)(error.size())<<endl;

}

int main(int argc, char** argv)
{
    clock_t start, end;
    //read training X from .txt file
    FILE *streamX, *streamY;
    streamX = fopen("trainX.txt", "r");
    int numofX = 30;
    vector<vector<double> > vecX;
    double tpdouble;
    int counter = 0;
    while(1){
        if(fscanf(streamX, "%lf", &tpdouble)==EOF) break;
        if(counter / numofX >= vecX.size()){
            vector<double> tpvec;
            vecX.push_back(tpvec);
        } 
        vecX[counter / numofX].push_back(tpdouble);
        ++ counter;
    }
    fclose(streamX);

    cout<<vecX.size()<<", "<<vecX[0].size()<<endl;

    //read training Y from .txt file
    streamY = fopen("trainY.txt", "r");
    vector<double> vecY;
    while(1){
        if(fscanf(streamY, "%lf", &tpdouble)==EOF) break;
        vecY.push_back(tpdouble);
    }
    fclose(streamY);

    for(int i = 1; i<vecX.size(); i++){
        if(vecX[i].size() != vecX[i - 1].size()) return 0;
    }
    if(vecX.size() != vecY.size()) return 0;

    streamX = fopen("testX.txt", "r");
    vector<vector<double> > vecTX;
    counter = 0;
    while(1){
        if(fscanf(streamX, "%lf", &tpdouble)==EOF) break;
        if(counter / numofX >= vecTX.size()){
            vector<double> tpvec;
            vecTX.push_back(tpvec);
        } 
        vecTX[counter / numofX].push_back(tpdouble);
        ++ counter;
    }
    fclose(streamX);

    streamY = fopen("testY.txt", "r");
    vector<double> vecTY;
    while(1){
        if(fscanf(streamY, "%lf", &tpdouble)==EOF) break;
        vecTY.push_back(tpdouble);
    }
    fclose(streamY);

    start = clock();
    softmax(vecX, vecY, vecTX, vecTY);
    end = clock();
    cout<<"Training used time: "<<((double)(end - start)) / CLOCKS_PER_SEC<<" second"<<endl;
    return 0;
}
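
Assuming Armadillo (with a BLAS/LAPACK backend) is installed, the program can typically be built with something like g++ softmax.cpp -o softmax -O2 -larmadillo, and it expects trainX.txt, trainY.txt, testX.txt and testY.txt in the working directory.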

TEST RESULT

I used the Wisconsin Breast Cancer dataset, which has 284 training samples and 284 test samples, with 30 features in each sample. You can download the dataset here:

trainX.txt   trainY.txt

testX.txt     testY.txt

Here’s the result.

[screenshot of the console output: the cost at each learning step, the test accuracy, and the training time]

 

You can see that the speed is very fast. That is because I used vectorization during the training process, which means minimizing the number of loops: check my update_CostFunction_and_Gradient function and you will see the trick.
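
To make the contrast concrete, here is a small sketch (the function names are mine, and it assumes the same using namespace arma context as the program above) of the gradient computed first with explicit loops over classes and samples, and then with the single matrix expression used in update_CostFunction_and_Gradient:

// Looped version: one class row and one sample at a time.
mat gradient_loops(const mat& groundTruth, const mat& probs, const mat& x, double lambda, const mat& theta){
    int nclasses = groundTruth.n_rows;
    int nsamples = x.n_cols;
    mat grad = zeros<mat>(nclasses, x.n_rows);
    for(int j = 0; j < nclasses; j++){
        for(int i = 0; i < nsamples; i++){
            // x.col(i) is one feature vector; the term in front of it is a scalar
            grad.row(j) -= (groundTruth(j, i) - probs(j, i)) * x.col(i).t() / nsamples;
        }
    }
    return grad + lambda * theta;
}

// Vectorized version: the same quantity as one matrix product.
mat gradient_vectorized(const mat& groundTruth, const mat& probs, const mat& x, double lambda, const mat& theta){
    return -(groundTruth - probs) * x.t() / (double)x.n_cols + lambda * theta;
}

Both return the same nclasses-by-nfeatures gradient, but the vectorized version hands the heavy work to the underlying BLAS instead of running k·m tiny updates.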

SOFTMAX VS MULTI-BINARY CLASSIFIERS

Now let's return to the earlier question about softmax versus multiple binary classifiers. As Prof. Andrew Ng says, which algorithm to use depends on whether the classes are mutually exclusive, that is, whether a sample can belong to only one class or to several at once. For example:

  • Case 1. classes are [1, 2, 3, 4, 5, 6]
  • Case 2. classes are [1, 2, 3, odd, even, positive]

For case 1 the classes are mutually exclusive, so a softmax regression classifier would be fine; in the second case a sample can carry several labels at the same time (3 is both odd and positive, for instance), so it is better to train multiple binary logistic regression classifiers, one per class, as sketched below.
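
A minimal sketch of the difference, with my own helper names and the same Armadillo context as above: the softmax outputs for one sample are coupled and always sum to 1, while k independent logistic (sigmoid) outputs are separate probabilities, so several of them can be close to 1 at the same time.

// scores: k x 1 column holding theta_j^T x for one sample
colvec softmax_probs(const colvec& scores){
    colvec e = exp(scores - scores.max());   // stable softmax: outputs sum to 1
    return e / accu(e);
}

colvec independent_logistic_probs(const colvec& scores){
    return 1.0 / (1.0 + exp(-scores));       // one sigmoid per class: outputs are independent
}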

🙂
