### INTRODUCTION

Sparse coding has been one of the most popular unsupervised learning methods of the past decade. It is a dictionary learning process whose goal is to find a dictionary such that any training input vector can be represented as a linear combination of the dictionary's elements. To better capture the structures and patterns inherent in the input vectors, we use an over-complete dictionary; moreover, we want the linear combination to be sparse, meaning that only a small fraction of its values are non-zero. The sparse coding cost function is defined as

**J(D, h) = ||x - Dh||₂² + λ||h||₁**

in which **D** is our desired dictionary and **h** is the latent representation, i.e., the “linear combination”. We want to reconstruct the original input **x** as well as possible, so we use the L2-norm term to minimize the reconstruction error; we also want the latent representation to have many zeros, so we use the L1-norm term as a sparsity penalty.
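Spelled out, the cost is cheap to evaluate. Here is a minimal sketch in plain C++ (the code for this post uses Armadillo; `std::vector` keeps this self-contained), with `lambda` as the sparsity weight:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// J(D, h) = ||x - D h||_2^2 + lambda * ||h||_1
// D is stored row-major: D[i * k + j] is row i, column j.
double sparse_coding_cost(const std::vector<double>& x,
                          const std::vector<double>& D,
                          const std::vector<double>& h,
                          double lambda) {
    std::size_t n = x.size();  // input dimension
    std::size_t k = h.size();  // number of dictionary elements
    double cost = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double recon = 0.0;
        for (std::size_t j = 0; j < k; ++j)
            recon += D[i * k + j] * h[j];
        double diff = x[i] - recon;
        cost += diff * diff;               // squared reconstruction error
    }
    for (std::size_t j = 0; j < k; ++j)
        cost += lambda * std::fabs(h[j]);  // L1 sparsity penalty
    return cost;
}
```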

### METHODS

This seems simple, but a problem arises when implementing it: the L1-norm is not differentiable at zero, so how can we use gradient descent to solve sparse coding problems? I’ll introduce two solutions here.

#### 1. Iterative Shrinkage and Thresholding Algorithm (a.k.a. ISTA)

Because the L1-norm term depends only on **h**, during the **h** update we do the following: first we update **h** normally using the gradient of the reconstruction error term; then, if **h** would change sign because of the L1-norm gradient, we clamp **h** to **0**. This “if…else…” step is also known as the shrinkage function.
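The shrinkage step can be sketched as follows; `step` and `lambda` are the gradient step size and the sparsity weight (the names are mine, not from the post’s code):

```cpp
#include <cmath>

// Soft-thresholding (shrinkage) operator used by ISTA: after the
// gradient step on the reconstruction term, each coefficient is pulled
// toward zero by t, and clamped to 0 if it would cross zero.
double shrink(double v, double t) {
    if (v > t)  return v - t;
    if (v < -t) return v + t;
    return 0.0;  // |v| <= t: the L1 gradient would flip the sign, so clamp
}

// One ISTA update of a single coefficient h_j;
// grad is the gradient of the reconstruction term only.
double ista_step(double h_j, double grad, double step, double lambda) {
    return shrink(h_j - step * grad, step * lambda);
}
```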

#### 2. Smooth L1-Norm

Here we simply use a smooth version of the L1-norm, replacing **|x|** with **sqrt(x² + ε)**, which makes it differentiable at zero.

This method is from Prof. Andrew Ng’s **UFLDL** website, and it is also the one I use in the code of this post. Of course, there are other ways to deal with this problem; you can find a whole bunch of methods for improving sparse coding online.
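A sketch of the smoothed norm and its gradient, with `epsilon` as the smoothing constant from the UFLDL formulation:

```cpp
#include <cmath>

// Smooth approximation of |x|: sqrt(x^2 + epsilon).
// Differentiable everywhere, including at x = 0.
double smooth_l1(double x, double epsilon) {
    return std::sqrt(x * x + epsilon);
}

// Its derivative, x / sqrt(x^2 + epsilon), is well-defined at zero
// (it equals 0 there), so plain gradient descent applies.
double smooth_l1_grad(double x, double epsilon) {
    return x / std::sqrt(x * x + epsilon);
}
```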

### ALGORITHM

- Initialize dictionary **D** randomly.
- Repeat until convergence:
  - Select a random mini-batch.
  - Initialize **h**.
  - Find the **h** that minimizes **J(D, h)** for the **D** found in the previous step.
  - Solve for the **D** that minimizes **J(D, h)** for the **h** found in the previous step.
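The alternation above can be sketched as one round on a single input (batch size 1, smooth-L1 penalty, plain gradient steps). This is a toy sketch under those assumptions, not the post’s Armadillo implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One alternating-minimization round on input x. D is row-major n x k.
// Returns J(D, h) (with a plain L1 penalty) after the round.
double alternating_round(const std::vector<double>& x,
                         std::vector<double>& D,
                         std::vector<double>& h,
                         double lambda, double epsilon,
                         double step, int inner_iters) {
    std::size_t n = x.size(), k = h.size();
    std::vector<double> r(n);  // residual x - D h
    auto residual = [&]() {
        for (std::size_t i = 0; i < n; ++i) {
            r[i] = x[i];
            for (std::size_t j = 0; j < k; ++j) r[i] -= D[i * k + j] * h[j];
        }
    };
    // 1) minimize over h with D fixed (reconstruction + smooth-L1 gradient)
    for (int it = 0; it < inner_iters; ++it) {
        residual();
        for (std::size_t j = 0; j < k; ++j) {
            double g = lambda * h[j] / std::sqrt(h[j] * h[j] + epsilon);
            for (std::size_t i = 0; i < n; ++i) g += -2.0 * D[i * k + j] * r[i];
            h[j] -= step * g;
        }
    }
    // 2) minimize over D with h fixed (gradient of the reconstruction term)
    for (int it = 0; it < inner_iters; ++it) {
        residual();
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < k; ++j)
                D[i * k + j] -= step * (-2.0 * r[i] * h[j]);
    }
    // report the cost after the round
    residual();
    double cost = 0.0;
    for (std::size_t i = 0; i < n; ++i) cost += r[i] * r[i];
    for (std::size_t j = 0; j < k; ++j) cost += lambda * std::fabs(h[j]);
    return cost;
}
```

In the full algorithm this round runs once per mini-batch; a fixed step size is used here only to keep the sketch short.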

### CODE

My implementation of sparse coding using C++ and Armadillo.

**https://github.com/xingdi-eric-yuan/sparse-coding**

### TEST

I used this image as test image.

First I added random noise to this image; the standard deviation of the noise is equal to **1/3** of the standard deviation of the original image. Then I extracted all overlapping **8 * 8** sub-images from the image as my training dataset.
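The patch extraction can be sketched as follows, assuming a row-major grayscale image stored as doubles; each patch becomes one column of the output matrix (for an 8 * 8 block this yields the 64-row data matrix discussed in the comments below):

```cpp
#include <cstddef>
#include <vector>

// Extract every overlapping block_size x block_size patch from a
// row-major image (rows x cols). Each patch is one column of the
// returned row-major matrix, whose shape is
// (block_size * block_size) x ((rows - block_size + 1) * (cols - block_size + 1)).
std::vector<double> extract_patches(const std::vector<double>& img,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t block_size) {
    std::size_t pr = rows - block_size + 1;  // vertical patch positions
    std::size_t pc = cols - block_size + 1;  // horizontal patch positions
    std::size_t n_patches = pr * pc;
    std::size_t dim = block_size * block_size;
    std::vector<double> X(dim * n_patches);
    for (std::size_t y = 0; y < pr; ++y)
        for (std::size_t x = 0; x < pc; ++x) {
            std::size_t p = y * pc + x;      // patch (column) index
            for (std::size_t i = 0; i < block_size; ++i)
                for (std::size_t j = 0; j < block_size; ++j)
                    X[(i * block_size + j) * n_patches + p] =
                        img[(y + i) * cols + (x + j)];
        }
    return X;
}
```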

The following figures (GIFs) are dictionaries learned using 49, 121, 225, and 400 dictionary elements.

49 dictionaries:

121 dictionaries:

225 dictionaries:

400 dictionaries:

I tried to implement denoising using sparse coding (**SEE THIS PAPER**): after training the dictionary, I averaged the reconstructions of the patches to get a reconstructed image (using the 49-element dictionary result).
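The averaging step can be sketched as follows, assuming the reconstructed patches are stored one per column of a row-major matrix (the same one-patch-per-column layout as the 64 * 142129 training matrix): every pixel accumulates the values of all patches covering it, then is divided by how many patches covered it.

```cpp
#include <cstddef>
#include <vector>

// Rebuild a rows x cols image from overlapping reconstructed patches
// by averaging each pixel over all patches that cover it.
std::vector<double> average_patches(const std::vector<double>& P,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t block_size) {
    std::size_t pr = rows - block_size + 1, pc = cols - block_size + 1;
    std::size_t n_patches = pr * pc;
    std::vector<double> img(rows * cols, 0.0), count(rows * cols, 0.0);
    for (std::size_t y = 0; y < pr; ++y)
        for (std::size_t x = 0; x < pc; ++x) {
            std::size_t p = y * pc + x;  // patch (column) index
            for (std::size_t i = 0; i < block_size; ++i)
                for (std::size_t j = 0; j < block_size; ++j) {
                    img[(y + i) * cols + (x + j)] +=
                        P[(i * block_size + j) * n_patches + p];
                    count[(y + i) * cols + (x + j)] += 1.0;
                }
        }
    for (std::size_t i = 0; i < rows * cols; ++i) img[i] /= count[i];
    return img;
}
```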

Figure 1 is the original image with noise, and Figure 2 is the reconstructed image. Honestly, it’s hard to tell by eye whether the reconstructed image is denoised, so I used **imstats()** in Matlab.

It is SOMEWHAT denoised. I believe I can do better with better parameters, such as the penalty parameter and so on. If you have any ideas, please tell me.

🙂

## 8 Comments

I think you made a mistake in Method 2 (smooth L1 norm), according to sparse learning (http://ufldl.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation):

‘To “smooth out” the L1 norm, we use sqrt{x^2 + epsilon} in place of |x|.’ Also, it isn’t right to use the L1 norm again on the right-hand side of the equation that is supposed to approximate the L1 norm.

Hi Sara,

I re-read the website you gave me, and my code as well; I think it was a typo. You are exactly right, the right-hand-side L1 norm should be erased. Thank you for sharing this with me.

I have two questions

1 : where is the “data.txt”?

2 : About the codes “x = zeros(block_size * block_size, DATA_SIZE);”

There are 64 rows and 142129 columns; does that mean every row contains one image?

Thanks

Hi,

For data, I randomly chose a single-channel image and extracted every 8*8 (block_size = 8) sub-image as an input, so the “x” in your second question is 64 * 142129 for that specific image; you can change DATA_SIZE to match your image. Because that “x” matrix is too large, I didn’t upload it; I’m sure you can easily generate one using Matlab or other equivalent tools :p

Ah, I get it, every column corresponds to one block (8 * 8) of pixels, thanks

Hello.

I want to use your sc program, but I couldn’t understand a few points.

1.About input data, what range are its values? 0 to 255? 0 to 1? or -1 to 1?

2.What is the definition of J(D,h)?

3.If I want learned dictionary images, should I convert every column of ‘sc’ program’s output into an image?

I realize you are very busy, but would you be able to answer my question?

Forgive my bad English.

Thank you.

I read your program and it looks very good; I want to download it to learn from.

Could you elaborate on how you reconstructed the image by averaging the reconstructions of the patches? I’m guessing for patches that overlapped, you took the average of every overlapping pixel?

Thanks,