WHAT IS RBM
A Restricted Boltzmann Machine is a special case of the Boltzmann Machine in which all visible-visible and hidden-hidden connections are removed: each hidden unit connects to every visible unit, and each visible unit connects to every hidden unit, forming a bipartite graph. The figure below shows the model of an RBM.
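To make the connectivity concrete, here is a minimal sketch (with names and initialization of my own choosing, not taken from the figure) of the parameters such a model needs when stored with Armadillo, the library used later in this post:

```cpp
#include <armadillo>

// Minimal parameter container for an RBM: one weight per visible-hidden
// pair, plus one bias per unit. A sketch only, not the code used below.
struct RBM {
    arma::mat W;   // n_visible x n_hidden weights: W(i, j) links visible unit i to hidden unit j
    arma::vec B;   // hidden-layer biases, length n_hidden
    arma::vec C;   // visible-layer biases, length n_visible

    RBM(arma::uword n_visible, arma::uword n_hidden)
        : W(0.01 * arma::randn<arma::mat>(n_visible, n_hidden)),
          B(arma::zeros<arma::vec>(n_hidden)),
          C(arma::zeros<arma::vec>(n_visible)) {}
};
```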
An RBM is an unsupervised learning method. It is often used for pre-training large deep networks, and it is also the building block of Deep Belief Networks (DBNs).
There are tons of tutorials and blog posts online that introduce the underlying math of RBMs (the energy-based model and so on), so I’m not going to repeat all of that here; instead, I’m going to talk about some details of implementing one.
In order to train an RBM, we want to minimize the average negative log-likelihood of the training data. When we do that with stochastic gradient descent, the gradient of each term contains an expectation under the model distribution (the negative phase), which is intractable to compute exactly.
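In the usual energy-based notation, with energy $E(v, h)$ and parameters $\theta \in \{W, B, C\}$, the objective over $T$ training examples and the gradient of one of its terms take the standard form:

$$
\frac{1}{T}\sum_{t=1}^{T} -\log p\!\left(x^{(t)}\right),
\qquad
\frac{\partial \left(-\log p(x^{(t)})\right)}{\partial \theta}
= \mathbb{E}_{h}\!\left[\left.\frac{\partial E(x^{(t)}, h)}{\partial \theta}\,\right|\, x^{(t)}\right]
- \mathbb{E}_{x,\, h}\!\left[\frac{\partial E(x, h)}{\partial \theta}\right].
$$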
The idea of the CD (contrastive divergence) method is to replace that expectation with a point estimate at v(t), which we can obtain by Gibbs sampling; v(t) is the sample at which the chain converges.
Here’s how this Gibbs sampling works (say the weight matrix between the visible and hidden layers is W, the bias vector of the hidden layer is B, and the bias vector of the visible layer is C):
- Convert the training data x into binary form by building a v(0) matrix: each element of v(0) is randomly set to 1 (versus 0) with probability given by the corresponding element of x (for this to work, x should first be normalized to [0, 1]).
- Now the positive phase: calculate a positive-probability matrix, Sigmoid(W' * v(0) + B), then normalize this matrix.
- Build the hidden layer matrix h(0): each element of h(0) is randomly set to 1 (versus 0) with the probability given by the positive-probability matrix from the last step.
- Now the negative phase: calculate a negative-probability matrix, Sigmoid(W * h(0) + C), then normalize this matrix.
- Build the new visible layer matrix v(1): each element of v(1) is randomly set to 1 (versus 0) with the probability given by the negative-probability matrix we just got.
- Repeat the above steps, alternating the positive phase and the negative phase, until the chain converges.
Say the chain converges at v(t); then sum((v(0) – v(t)) ^ 2) is the reconstruction error.
If we fix the CD method to run k iterations of Gibbs sampling, the method is called CD-k. In general, the bigger k is, the less biased the estimate of the gradient will be. In practice, however, CD works pretty well even with k = 1, especially when the RBM is only used for pre-training.
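To make the steps above concrete, here is a rough sketch of one CD-k chain for a single training vector using Armadillo. The function names are made up for illustration, I skip the extra normalization step and just use the sigmoid outputs as probabilities, and this is a simplified sketch rather than the exact code of my implementation:

```cpp
#include <armadillo>

// Element-wise logistic sigmoid.
arma::vec sigmoid(const arma::vec& x) {
    return 1.0 / (1.0 + arma::exp(-x));
}

// Sample a binary vector: each element is 1 with the given probability.
arma::vec sample_binary(const arma::vec& p) {
    arma::vec u = arma::randu<arma::vec>(p.n_elem);
    return arma::conv_to<arma::vec>::from(u < p);
}

// Run a CD-k Gibbs chain starting from one (already normalized) input x.
// W: n_visible x n_hidden, B: hidden bias, C: visible bias.
// Returns v(k) and reports the reconstruction error sum((v(0) - v(k))^2).
arma::vec cd_k_chain(const arma::vec& x, const arma::mat& W,
                     const arma::vec& B, const arma::vec& C,
                     int k, double& recon_error) {
    arma::vec v0 = sample_binary(x);             // binarize the input into v(0)
    arma::vec v = v0;
    for (int step = 0; step < k; ++step) {
        arma::vec ph = sigmoid(W.t() * v + B);   // positive phase: p(h = 1 | v)
        arma::vec h  = sample_binary(ph);
        arma::vec pv = sigmoid(W * h + C);       // negative phase: p(v = 1 | h)
        v = sample_binary(pv);
    }
    recon_error = arma::accu(arma::square(v0 - v));
    return v;
}
```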
After we get this v(t), written x-tilde in the formulae below, we can update the parameters W, B, and C:
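These are the standard CD updates written in this post’s notation (W of size visible × hidden, learning rate $\alpha$, and $h(v) = \mathrm{Sigmoid}(W^{\top} v + B)$ the vector of hidden probabilities); an implementation may differ in small details, e.g. using sampled versus probability values for the hidden units:

$$
\begin{aligned}
W &\leftarrow W + \alpha \left( v^{(0)}\, h\!\left(v^{(0)}\right)^{\top} - \tilde{x}\, h(\tilde{x})^{\top} \right) \\
B &\leftarrow B + \alpha \left( h\!\left(v^{(0)}\right) - h(\tilde{x}) \right) \\
C &\leftarrow C + \alpha \left( v^{(0)} - \tilde{x} \right)
\end{aligned}
$$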
The whole CD method is nothing more than repeating Gibbs sampling and parameter updates until a stopping criterion is met.
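Putting the sampling and the updates together, the training loop can be sketched roughly like this (again an illustration reusing the helpers from the previous sketch, with assumed names such as `train_epoch` and `alpha`, not the exact code I ran):

```cpp
#include <armadillo>

// Declarations of the helpers from the previous sketch.
arma::vec sigmoid(const arma::vec& x);
arma::vec cd_k_chain(const arma::vec& x, const arma::mat& W,
                     const arma::vec& B, const arma::vec& C,
                     int k, double& recon_error);

// One pass of CD-k training over a dataset X (each column is one normalized
// sample), updating W, B and C in place. Returns the average reconstruction
// error, which can serve as a stopping criterion.
double train_epoch(const arma::mat& X, arma::mat& W, arma::vec& B,
                   arma::vec& C, double alpha, int k) {
    double total_error = 0.0;
    for (arma::uword i = 0; i < X.n_cols; ++i) {
        arma::vec v0 = X.col(i);   // stands in for v(0) here; a binarized sample works too
        double err = 0.0;
        arma::vec vt = cd_k_chain(v0, W, B, C, k, err);   // x-tilde
        arma::vec h0 = sigmoid(W.t() * v0 + B);           // h(v(0))
        arma::vec ht = sigmoid(W.t() * vt + B);           // h(x-tilde)
        W += alpha * (v0 * h0.t() - vt * ht.t());
        B += alpha * (h0 - ht);
        C += alpha * (v0 - vt);
        total_error += err;
    }
    return total_error / X.n_cols;
}
```

Running such an epoch repeatedly until the reconstruction error stops improving is one simple way to realize the stopping criterion mentioned above.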
I implemented a single-layer RBM using C++ and Armadillo.
(All results are shown as GIF images; please be patient while they load…)
I used the MNIST dataset for testing.
First I tried training 50 different features on only the digit 7.
Then I checked whether I could see a difference between CD-1-trained weights and CD-5-trained weights (100 hidden units); here are the results:
CD-1 (left), CD-5 (right):
And I trained a net with 500 hidden units; here’s the result (this image is really big, about 20 MB).