Sparse Coding


Sparse coding is one of the best-known unsupervised learning methods of the past decade. It is a dictionary learning process whose goal is to find a dictionary such that any training input vector can be represented as a linear combination of the dictionary's vectors. To better capture the structures and patterns inherent in the input vectors, we use an over-complete dictionary; moreover, we want the linear combination to be sparse, meaning only a very small fraction of its values are non-zero. This leads to the sparse coding cost function:

J(D, h) = || x − D h ||_2^2 + λ || h ||_1

Here D is our desired dictionary and h is the latent representation, i.e. the "linear combination". We want to reconstruct the original input x as well as possible, so the L2-norm term minimizes the reconstruction error; we also want the latent representation to contain many zeros, so the L1-norm term serves as a sparsity penalty.
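As a sketch of what this cost looks like in code (plain C++ without Armadillo; the function name and the row-major dictionary layout are my own choices here, not the post's implementation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Evaluate J(D, h) = ||x - D*h||_2^2 + lambda * ||h||_1 for a dense
// dictionary D stored row-major (n x k), input x (length n), code h (length k).
double sparse_coding_cost(const std::vector<double>& D, int n, int k,
                          const std::vector<double>& x,
                          const std::vector<double>& h, double lambda) {
    double err = 0.0;
    for (int i = 0; i < n; ++i) {
        double recon = 0.0;
        for (int j = 0; j < k; ++j) recon += D[i * k + j] * h[j];
        double d = x[i] - recon;
        err += d * d;                                   // squared reconstruction error
    }
    double l1 = 0.0;
    for (int j = 0; j < k; ++j) l1 += std::abs(h[j]);   // sparsity penalty
    return err + lambda * l1;
}
```

With an identity dictionary and h = x the reconstruction error vanishes and only the sparsity penalty remains, which is an easy sanity check.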


This seems simple, but when implementing it a problem arises: the L1-norm is not differentiable at zero, so how can we use gradient descent to solve the sparse coding problem? I'll introduce two solutions here.

1. Iterative Shrinkage and Thresholding Algorithm (a.k.a. ISTA)

Because the L1-norm term depends only on h, the h update works as follows: first update h normally using the gradient of the reconstruction error,

h ← h − α ∇_h || x − D h ||_2^2

then, if an entry of h would change sign because of the L1-norm gradient step, clamp it to 0:

h_i ← sign(h_i) · max(|h_i| − αλ, 0)

This "if…else…" step is also known as the shrinkage function.
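The shrinkage step can be sketched in plain C++ (scalar version per coefficient; `alpha` is the learning rate and `lambda` the sparsity weight — names assumed here, not taken from the post's code):

```cpp
#include <cassert>
#include <cmath>

// Soft-thresholding: shrink a coefficient toward zero by `threshold`;
// if the L1 step would flip its sign, clamp it to exactly 0.
double shrink(double h, double threshold) {
    if (h >  threshold) return h - threshold;
    if (h < -threshold) return h + threshold;
    return 0.0;  // sign would change, so clamp to zero
}

// One ISTA update for a single coefficient:
// take a gradient step on the reconstruction error, then shrink.
double ista_step(double h, double grad, double alpha, double lambda) {
    return shrink(h - alpha * grad, alpha * lambda);
}
```

Note that the clamp-to-zero branch is exactly what produces the many exact zeros we want in h; a plain subgradient step would leave tiny non-zero values instead.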

2. Smooth L1-Norm

Here we simply use a smooth version of the L1-norm that is differentiable at zero.

|| h ||_1 ≈ Σ_i sqrt(h_i^2 + ε)

This method is from Prof. Andrew Ng's UFLDL website, and the code in this post uses it as well. Of course, there are other ways of dealing with this problem; you can find a whole bunch of methods for improving sparse coding online.
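A minimal sketch of the smooth penalty and its gradient in plain C++ (`eps` is the smoothing constant ε):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Smooth L1 penalty: sum over i of sqrt(h_i^2 + eps), differentiable everywhere.
double smooth_l1(const std::vector<double>& h, double eps) {
    double s = 0.0;
    for (double v : h) s += std::sqrt(v * v + eps);
    return s;
}

// Its gradient w.r.t. h_i is h_i / sqrt(h_i^2 + eps) -- well-defined at 0,
// unlike the true L1 subgradient.
std::vector<double> smooth_l1_grad(const std::vector<double>& h, double eps) {
    std::vector<double> g(h.size());
    for (std::size_t i = 0; i < h.size(); ++i)
        g[i] = h[i] / std::sqrt(h[i] * h[i] + eps);
    return g;
}
```

As ε → 0 the penalty approaches the true L1 norm, and the gradient approaches sign(h_i) away from zero.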


The overall training procedure is:

  1. Initialize the dictionary D randomly.
  2. Repeat until convergence:
  •         Select a random mini-batch.
  •         Initialize h.
  •         Find the h that minimizes J(D, h) for the D found in the previous step.
  •         Solve for the D that minimizes J(D, h) for the h found in the previous step.
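The alternating scheme above can be illustrated with a deliberately tiny version of the problem: a scalar x, D, and h with the smooth L1 penalty, optimized by plain gradient steps. All names, step sizes, and constants are illustrative choices of mine, not the post's Armadillo implementation:

```cpp
#include <cassert>
#include <cmath>

// Toy alternating minimization for J(D,h) = (x - D*h)^2 + lambda*sqrt(h^2+eps)
// in one dimension, just to show the two-phase structure of the training loop.
struct SparseCode1D {
    double D = 0.5, lambda = 0.1, eps = 1e-8, alpha = 0.1;

    double reconstruct(double h) const { return D * h; }

    // Phase 1: gradient steps on h with D held fixed.
    double solve_h(double x, int iters) const {
        double h = 0.0;
        for (int i = 0; i < iters; ++i) {
            double r = D * h - x;                            // residual
            double g = 2.0 * r * D                           // reconstruction grad
                     + lambda * h / std::sqrt(h * h + eps);  // smooth L1 grad
            h -= alpha * g;
        }
        return h;
    }

    // Phase 2: gradient steps on D with h held fixed.
    void solve_D(double x, double h, int iters) {
        for (int i = 0; i < iters; ++i) {
            double r = D * h - x;
            D -= alpha * 2.0 * r * h;
        }
    }
};
```

Running the outer loop a few dozen times drives the reconstruction D*h close to x while the penalty keeps h small; the real implementation does the same thing with matrices and mini-batches.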



My implementation of sparse coding using C++ and Armadillo. 


I used this image as the test image.


First I added random noise to this image; the standard deviation of the noise is equal to 1/3 of the standard deviation of the original image. Then I extracted all overlapping 8 × 8 sub-images from the noisy image as my training dataset.
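Patch extraction can be sketched like this (plain C++, row-major grayscale image; note that for a 384 × 384 image and block size 8 this yields 377 × 377 = 142129 patches of 64 pixels each, which matches the 64 × 142129 data matrix discussed in the comments below — the image size itself is my inference, not stated in the post):

```cpp
#include <cassert>
#include <vector>

// Extract every overlapping block x block patch from a row-major grayscale
// image; each patch becomes one flattened vector of the training dataset.
std::vector<std::vector<double>>
extract_patches(const std::vector<double>& img, int rows, int cols, int block) {
    std::vector<std::vector<double>> patches;
    for (int r = 0; r + block <= rows; ++r) {
        for (int c = 0; c + block <= cols; ++c) {
            std::vector<double> p;
            p.reserve(block * block);
            for (int i = 0; i < block; ++i)
                for (int j = 0; j < block; ++j)
                    p.push_back(img[(r + i) * cols + (c + j)]);
            patches.push_back(p);
        }
    }
    return patches;
}
```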

The following figures (GIFs) are dictionaries learned with 49, 121, 225, and 400 atoms.

49 atoms:



121 atoms:


225 atoms:


400 atoms:


I tried to implement denoising using sparse coding (see this paper). After training the dictionary, I averaged the reconstructions of all patches to get a reconstructed image (using the 49-atom result).
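The patch-averaging step can be sketched as follows (plain C++; patches are assumed to be in the same row-major scan order used during extraction, and every pixel is divided by the number of patches covering it):

```cpp
#include <cassert>
#include <vector>

// Re-assemble an image from reconstructed overlapping patches by averaging:
// each pixel accumulates the values of all patches that cover it, then is
// divided by its coverage count.
std::vector<double>
average_patches(const std::vector<std::vector<double>>& patches,
                int rows, int cols, int block) {
    std::vector<double> sum(rows * cols, 0.0), count(rows * cols, 0.0);
    int idx = 0;
    for (int r = 0; r + block <= rows; ++r)
        for (int c = 0; c + block <= cols; ++c, ++idx)
            for (int i = 0; i < block; ++i)
                for (int j = 0; j < block; ++j) {
                    int px = (r + i) * cols + (c + j);
                    sum[px] += patches[idx][i * block + j];
                    count[px] += 1.0;
                }
    for (std::size_t k = 0; k < sum.size(); ++k) sum[k] /= count[k];
    return sum;
}
```

Interior pixels are covered by up to block × block patches while corner pixels are covered only once, which is why per-pixel counts are needed rather than a constant divisor.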


Figure 1 is the original noisy image and Figure 2 is the reconstructed image. It is actually hard to tell visually whether the reconstructed image is denoised, so I used imstats() in Matlab.


It is denoised somewhat. I believe I can do better by tuning the parameters, such as the sparsity penalty weight. If you have any ideas, please tell me.


This entry was posted in Algorithm, Machine Learning.


  1. Sara
    Posted October 3, 2014 at 4:21 am | Permalink

I think you made a mistake in Method 2 (smoothed L1 norm). According to the sparse coding tutorial: 'To "smooth out" the L1 norm, we use sqrt(x^2 + epsilon) in place of |x|.' Also, it isn't right to use the L1 norm again on the right-hand side of the equation that is supposed to approximate the L1 norm.

    • Eric
      Posted October 3, 2014 at 1:08 pm | Permalink

      Hi Sara,
I re-read the website you gave me and my code as well; I think it was a typo. You are exactly right, the L1 norm on the right-hand side should be erased. Thank you for sharing this with me.

  2. ngapweitham
    Posted August 8, 2015 at 5:47 am | Permalink

    I have two questions

    1 : where is the “data.txt”?
    2 : About the codes “x = zeros(block_size * block_size, DATA_SIZE);”
There are 64 rows and 142129 cols; does that mean every row contains one image?


    • Eric
      Posted August 10, 2015 at 2:27 pm | Permalink


For the data, I randomly chose a single-channel image and extracted every 8 × 8 (block_size = 8) sub-image as an input, so the "x" in your second question is 64 × 142129 for that specific image; you can change DATA_SIZE to match your image. Because that "x" matrix is too large, I didn't upload it. I'm sure you can easily generate one using Matlab or other equivalent tools :p

      • ngapweitham
        Posted August 11, 2015 at 7:46 am | Permalink

Ah, I get it, every column corresponds to one block (8 × 8) of pixels, thanks

  3. Kei
    Posted December 4, 2015 at 4:31 am | Permalink


    I want to use your sc program, but I couldn’t understand a few points.
    1.About input data, what range are its values? 0 to 255? 0 to 1? or -1 to 1?
    2.What is the definition of J(D,h)?
    3.If I want learned dictionary images, should I convert every column of ‘sc’ program’s output into an image?

    I realize you are very busy, but would you be able to answer my question?
    Forgive my bad English.
    Thank you.

  4. Shi
    Posted January 10, 2016 at 1:59 am | Permalink

I read your program and think it is very good; I want to download it to learn from it.

  5. Joshua
    Posted August 13, 2016 at 3:30 pm | Permalink

    Could you elaborate on how you reconstructed the image by averaging the reconstructions of the patches? I’m guessing for patches that overlapped, you took the average of every overlapping pixel?

