I’m learning Prof. Andrew Ng’s Unsupervised Feature Learning and Deep Learning tutorial, I finished the first exercise, the tutorial is very professional and easy to learn. I don’t think I need to go through the detail of what Sparse Autoencoder is, I’ll put my code of the exercise here, if you have any question about it, feel free to let me know.
Following is only the part that I wrote, for other part of code, you can get them HERE.
Something I want to say is, if you successfully implemented the gradient checking part, you can then disable the gradient checking code before training the Sparse Autoencoder, because since the first time we know our cost function and derivative are correct, we will not care about them; if we run the gradient checking code on every iteration, the whole process will be extremely slow.
CODE
% sparseAutoencoderCost.m % calculate cost function and derivatives function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ... lambda, sparsityParam, beta, data) % visibleSize: the number of input units (probably 64) % hiddenSize: the number of hidden units (probably 25) % lambda: weight decay parameter % sparsityParam: The desired average activation for the hidden units (denoted in the lecture % notes by the greek alphabet rho, which looks like a lower-case "p"). % beta: weight of sparsity penalty term % data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example. % The input theta is a vector (because minFunc expects the parameters to be a vector). % We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this % follows the notation convention of the lecture notes. W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize); W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize); b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize); b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end); % Cost and gradient variables (your code needs to compute these values). % Here, we initialize them to zeros. cost = 0; W1grad = zeros(size(W1)); W2grad = zeros(size(W2)); b1grad = zeros(size(b1)); b2grad = zeros(size(b2)); %% ---------- YOUR CODE HERE -------------------------------------- % Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder, % and the corresponding gradients W1grad, W2grad, b1grad, b2grad. % % W1grad, W2grad, b1grad and b2grad should be computed using backpropagation. % Note that W1grad has the same dimensions as W1, b1grad has the same dimensions % as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with % respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) % with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term % [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 % of the lecture notes (and similarly for W2grad, b1grad, b2grad). % % Stated differently, if we were using batch gradient descent to optimize the parameters, % the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. % [nFeatures, nSamples] = size(data); % first calculate the regular cost function J [a1, a2, a3] = getActivation(W1, W2, b1, b2, data); errtp = ((a3 - data) .^ 2) ./ 2; err = sum(sum(errtp)) ./ nSamples; % now calculate pj which is the average activation of hidden units pj = sum(a2, 2) ./ nSamples; % the second part is weight decay part err2 = sum(sum(W1 .^ 2)) + sum(sum(W2 .^ 2)); err2 = err2 * lambda / 2; % the third part of overall cost function is the sparsity part err3 = zeros(hiddenSize, 1); err3 = err3 + sparsityParam .* log(sparsityParam ./ pj) + (1 - sparsityParam) .* log((1 - sparsityParam) ./ (1 - pj)); cost = err + err2 + beta * sum(err3); % following are for calculating the grad of weights. delta3 = -(data - a3) .* dsigmoid(a3); delta2 = bsxfun(@plus, (W2' * delta3), beta .* (-sparsityParam ./ pj + (1 - sparsityParam) ./ (1 - pj))); delta2 = delta2 .* dsigmoid(a2); nablaW1 = delta2 * a1'; nablab1 = delta2; nablaW2 = delta3 * a2'; nablab2 = delta3; W1grad = nablaW1 ./ nSamples + lambda .* W1; W2grad = nablaW2 ./ nSamples + lambda .* W2; b1grad = sum(nablab1, 2) ./ nSamples; b2grad = sum(nablab2, 2) ./ nSamples; %------------------------------------------------------------------- % After computing the cost and gradient, we will convert the gradients back % to a vector format (suitable for minFunc). Specifically, we will unroll % your gradient matrices into a vector. grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)]; end %------------------------------------------------------------------- % Here's an implementation of the sigmoid function, which you may find useful % in your computation of the costs and the gradients. This inputs a (row or % column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). function sigm = sigmoid(x) sigm = 1 ./ (1 + exp(-x)); end %------------------------------------------------------------------- % This function calculate dSigmoid % function dsigm = dsigmoid(a) dsigm = a .* (1.0 - a); end %------------------------------------------------------------------- % This function return the activation of each layer % function [ainput, ahidden, aoutput] = getActivation(W1, W2, b1, b2, input) ainput = input; ahidden = bsxfun(@plus, W1 * ainput, b1); ahidden = sigmoid(ahidden); aoutput = bsxfun(@plus, W2 * ahidden, b2); aoutput = sigmoid(aoutput); end |
% computeNumericalGradient.m % for the use of gradient check function numgrad = computeNumericalGradient(J, theta) % numgrad = computeNumericalGradient(J, theta) % theta: a vector of parameters % J: a function that outputs a real-number. Calling y = J(theta) will return the % function value at theta. % Initialize numgrad with zeros numgrad = zeros(size(theta)); %% ---------- YOUR CODE HERE -------------------------------------- % Instructions: % Implement numerical gradient checking, and return the result in numgrad. % (See Section 2.3 of the lecture notes.) % You should write code so that numgrad(i) is (the numerical approximation to) the % partial derivative of J with respect to the i-th input argument, evaluated at theta. % I.e., numgrad(i) should be the (approximately) the partial derivative of J with % respect to theta(i). % % Hint: You will probably want to compute the elements of numgrad one at a time. size(theta) EPSILON = 1e-4; for i=1:size(theta) i memo = theta(i); theta(i) = memo + EPSILON; value1 = J(theta); theta(i) = memo - EPSILON; value2 = J(theta); theta(i) = memo; numgrad(i) = (value1 - value2) ./ (2 * EPSILON); end %% --------------------------------------------------------------- end |
% sampleIMAGES.m % sampling patches for learning function patches = sampleIMAGES() % sampleIMAGES % Returns 10000 patches for training load IMAGES; % load images from disk patchsize = 8; % we'll use 8x8 patches numpatches = 10000; % Initialize patches with zeros. Your code will fill in this matrix--one % column per patch, 10000 columns. patches = zeros(patchsize*patchsize, numpatches); %% ---------- YOUR CODE HERE -------------------------------------- % Instructions: Fill in the variable called "patches" using data % from IMAGES. % % IMAGES is a 3D array containing 10 images % For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image, % and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize % it. (The contrast on these images look a bit off because they have % been preprocessed using using "whitening." See the lecture notes for % more details.) As a second example, IMAGES(21:30,21:30,1) is an image % patch corresponding to the pixels in the block (21,21) to (30,30) of % Image 1 counter = 1; ranimg = ceil(rand(1, numpatches) * 10); ranpix = ceil(rand(2, numpatches) * (512 - patchsize)); ranpixm = ranpix + patchsize - 1; while(counter <= numpatches) whichimg = ranimg(1, counter); whichpix = ranpix(:, counter); whichpixm = ranpixm(:, counter); patch = IMAGES(whichpix(1):whichpixm(1), whichpix(2):whichpixm(2), whichimg); repatch = reshape(patch, patchsize * patchsize, 1); patches(:, counter) = repatch; counter = counter + 1; end %% --------------------------------------------------------------- % For the autoencoder to work well we need to normalize the data % Specifically, since the output of the network is bounded between [0,1] % (due to the sigmoid activation function), we have to make sure % the range of pixel values is also bounded between [0,1] patches = normalizeData(patches); end %% --------------------------------------------------------------- function patches = normalizeData(patches) % Squash data to [0.1, 0.9] since we use sigmoid as the activation % function in the output layer % Remove DC (mean of images). patches = bsxfun(@minus, patches, mean(patches)); % Truncate to +/-3 standard deviations and scale to -1 to 1 pstd = 3 * std(patches(:)); patches = max(min(patches, pstd), -pstd) / pstd; % Rescale from [-1,1] to [0.1,0.9] patches = (patches + 1) * 0.4 + 0.1; end |
RESULTS
Here’s the result image, which shows the weight of W1, that is exactly what the hidden layer learned during the training process, and we can see it is a set of edge detectors.
Enjoy it 🙂