### WHAT IS RBM

The Restricted Boltzmann Machine (RBM) is a special case of the Boltzmann Machine in which all visible-visible and hidden-hidden connections are removed, so that each hidden unit connects to every visible unit and each visible unit connects to every hidden unit. The following figure shows the model of an RBM.

The RBM is a kind of unsupervised learning method; it is often used to pre-train large deep networks, and it is also the building block of Deep Belief Networks (DBNs).

There are tons of tutorials and blog posts online that introduce the underlying math of RBMs (the energy-based model and so on), so I'm not going to explain those things here; instead, I'm going to talk about some details of implementing one.

### CONTRASTIVE DIVERGENCE

In order to train an RBM, we'd like to minimize the average negative log-likelihood over the training set:

**(1/T) * sum_t −log p(x(t))**

When we do that with stochastic gradient descent, the gradient for each example splits into a positive phase and a negative phase:

**∂(−log p(x(t)))/∂θ = E_h[∂E(x(t), h)/∂θ | x(t)] − E_{x,h}[∂E(x, h)/∂θ]**

The negative phase part (an expectation under the model's own distribution) is hard to compute, so we use a method called Contrastive Divergence (CD) to approximate it.

The idea of the CD method is to replace the expectation with a point estimate at **v(t)**, which we obtain by Gibbs sampling; **v(t)** is the state where the chain converges.

Here's how this Gibbs sampling works (say the weight matrix between the hidden and visible layers is **W**, the bias vector of the hidden layer is **B**, and the bias vector of the visible layer is **C**):

- Convert the training data **x** into binary form by building a matrix **v(0)**: each element of **v(0)** is randomly set to **1** (versus **0**) with probability equal to the corresponding element of **x** (for this to work correctly, **x** should be pre-processed by normalization).
- Positive phase: compute the positive-probability matrix **Sigmoid(W' * v(0) + B)**, then normalize this matrix.
- Build the hidden layer matrix **h(0)**: each element of **h(0)** is randomly set to **1** (versus **0**) with the probability given by the positive-probability matrix from the last step.
- Negative phase: compute the negative-probability matrix **Sigmoid(W * h(0) + C)**, then normalize this matrix.
- Build a new visible layer matrix **v(1)**: each element of **v(1)** is randomly set to **1** (versus **0**) with the probability given by the negative-probability matrix we just got.
- Repeat the steps above, alternating positive and negative phases, until the chain converges.
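The sampling steps above can be sketched in plain C++ (without Armadillo) for a single training vector. The names `nv`, `nh`, `W`, `B`, `C` and the row-major weight layout are my own assumptions, not taken from the linked repository, and for simplicity this sketch skips the extra matrix normalization step described above.

```cpp
#include <cmath>
#include <random>
#include <vector>

static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Sample a binary vector: each element is 1 with the given probability.
std::vector<int> sample(const std::vector<double>& p, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<int> s(p.size());
    for (size_t i = 0; i < p.size(); ++i) s[i] = (u(rng) < p[i]) ? 1 : 0;
    return s;
}

// Positive phase: p(h_j = 1 | v) = Sigmoid(W' * v + B)_j.
// W is nv x nh, stored row-major: W[i * nh + j].
std::vector<double> hidden_probs(const std::vector<int>& v,
                                 const std::vector<double>& W,
                                 const std::vector<double>& B,
                                 size_t nv, size_t nh) {
    std::vector<double> p(nh);
    for (size_t j = 0; j < nh; ++j) {
        double z = B[j];
        for (size_t i = 0; i < nv; ++i) z += W[i * nh + j] * v[i];
        p[j] = sigmoid(z);
    }
    return p;
}

// Negative phase: p(v_i = 1 | h) = Sigmoid(W * h + C)_i.
std::vector<double> visible_probs(const std::vector<int>& h,
                                  const std::vector<double>& W,
                                  const std::vector<double>& C,
                                  size_t nv, size_t nh) {
    std::vector<double> p(nv);
    for (size_t i = 0; i < nv; ++i) {
        double z = C[i];
        for (size_t j = 0; j < nh; ++j) z += W[i * nh + j] * h[j];
        p[i] = sigmoid(z);
    }
    return p;
}
```

One full Gibbs step is then `hidden_probs` → `sample` → `visible_probs` → `sample`, repeated until the chain converges.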

Say the chain converges at **v(t)**; then **sum((v(0) - v(t)) ^ 2)** is the reconstruction error.

If we manually stop the CD method after **k** iterations of Gibbs sampling, the method is called CD-k. In general, the bigger **k** is, the less biased the estimate of the gradient will be. However, in practice, the CD method works pretty well even when **k** equals **1**, especially when we train the RBM for pre-training.

After we get this **v(t)** (the **x-tilde** of the usual CD notation), we can update the parameters **W**, **B**, and **C**, with learning rate **α** and **h(v) = Sigmoid(W' * v + B)** denoting the hidden activation probabilities:

**W ← W + α * (v(0) * h(v(0))' − v(t) * h(v(t))')**

**B ← B + α * (h(v(0)) − h(v(t)))**

**C ← C + α * (v(0) − v(t))**

The whole CD method is nothing but alternating Gibbs sampling and parameter updates until a stopping criterion is met.
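As a self-contained illustration, here is one CD-1 update for a single binary training vector, in plain C++ without Armadillo. The function name `cd1_update` and the row-major `std::vector` layout (**W** is nv x nh, **B** the hidden bias, **C** the visible bias) are my own assumptions for this sketch, not the API of the linked repository.

```cpp
#include <cmath>
#include <random>
#include <vector>

namespace {
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }
}

// One CD-1 step, updating W, B, C in place.
// Returns sum((v0 - v1)^2), the reconstruction error.
double cd1_update(const std::vector<int>& v0,
                  std::vector<double>& W, std::vector<double>& B,
                  std::vector<double>& C, double alpha, std::mt19937& rng) {
    const size_t nv = C.size(), nh = B.size();
    std::uniform_real_distribution<double> u(0.0, 1.0);

    // Positive phase: h(v0) = Sigmoid(W' * v0 + B), then sample h0.
    std::vector<double> ph0(nh);
    std::vector<int> h0(nh);
    for (size_t j = 0; j < nh; ++j) {
        double z = B[j];
        for (size_t i = 0; i < nv; ++i) z += W[i * nh + j] * v0[i];
        ph0[j] = sigmoid(z);
        h0[j] = (u(rng) < ph0[j]) ? 1 : 0;
    }

    // Negative phase: sample v1 from Sigmoid(W * h0 + C).
    std::vector<int> v1(nv);
    for (size_t i = 0; i < nv; ++i) {
        double z = C[i];
        for (size_t j = 0; j < nh; ++j) z += W[i * nh + j] * h0[j];
        v1[i] = (u(rng) < sigmoid(z)) ? 1 : 0;
    }

    // h(v1), needed for the weight update.
    std::vector<double> ph1(nh);
    for (size_t j = 0; j < nh; ++j) {
        double z = B[j];
        for (size_t i = 0; i < nv; ++i) z += W[i * nh + j] * v1[i];
        ph1[j] = sigmoid(z);
    }

    // Parameter updates: W += a*(v0*h(v0)' - v1*h(v1)'), and the biases.
    for (size_t i = 0; i < nv; ++i)
        for (size_t j = 0; j < nh; ++j)
            W[i * nh + j] += alpha * (v0[i] * ph0[j] - v1[i] * ph1[j]);
    for (size_t j = 0; j < nh; ++j) B[j] += alpha * (ph0[j] - ph1[j]);

    double err = 0.0;
    for (size_t i = 0; i < nv; ++i) {
        C[i] += alpha * (v0[i] - v1[i]);
        double d = double(v0[i]) - double(v1[i]);
        err += d * d;
    }
    return err;
}
```

For CD-k, the negative phase would simply run k times before the final update; calling `cd1_update` over the training set for a number of epochs, while watching the returned reconstruction error, gives the whole training loop.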

### CODE

I implemented a single-layer RBM using C++ and Armadillo.

**https://github.com/xingdi-eric-yuan/single-layer-rbm**

### TEST RESULT

(All results are shown as GIF images, please be patient…)

I used the **MNIST** dataset for testing.

First, I trained 50 different features for the digit 7 only.

Then I checked whether I could see a difference between CD-1-trained weights and CD-5-trained weights (100 hidden units); here are the results:

CD-1 (left), CD-5 (right):

And I trained a net with 500 hidden units; here's the result (this image is really big, about 20 MB).

Enjoy it.

### References

Hugo Larochelle's Neural Networks class, Université de Sherbrooke (on YouTube)