In my previous post, I introduced the basic ideas of Recurrent Neural Networks. In this second post on RNNs, we’ll focus on the Long Short-Term Memory (LSTM) method.
LONG SHORT TERM MEMORY
One of the best-known problems of RNNs is the vanishing gradient: the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections (the blow-up case is usually called the exploding gradient).
The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity). The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs.
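To make the exponential behaviour concrete, here is a minimal sketch with a toy scalar RNN. The model h_t = tanh(w * h_{t-1} + x_t) and the function name `sensitivity_to_first_input` are my own assumptions for illustration, not part of the network described in this post. It tracks how strongly the hidden state at each step still depends on the very first input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy scalar RNN (assumed for illustration): h_t = tanh(w * h_{t-1} + x_t), h_0 = 0.
// Returns d h_t / d x_1 for every step t, using the chain rule:
//   d h_1 / d x_1 = 1 - h_1^2
//   d h_t / d x_1 = (1 - h_t^2) * w * d h_{t-1} / d x_1
std::vector<double> sensitivity_to_first_input(double w, const std::vector<double>& x) {
    assert(!x.empty());
    std::vector<double> sens;
    double h = 0.0;
    double dh_dx1 = 0.0;
    for (std::size_t t = 0; t < x.size(); ++t) {
        h = std::tanh(w * h + x[t]);
        const double local = 1.0 - h * h;  // derivative of tanh at this step
        dh_dx1 = (t == 0) ? local : local * w * dh_dx1;
        sens.push_back(dh_dx1);
    }
    return sens;
}
```

With all-zero inputs the unit stays in the linear regime of tanh, so after 20 steps the sensitivity is exactly w^19: about 1.9e-6 for w = 0.5 (the first input is effectively forgotten) and 524288 for w = 2 (it blows up).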
Long Short-Term Memory (LSTM) addresses this problem. An LSTM network is a recurrent network built from memory blocks. Each block contains one or more self-connected memory cells and three multiplicative units (the input, output and forget gates) that provide continuous analogues of write, read and reset operations for the cells.
The above figure shows what’s inside an LSTM block: black arrows represent full matrix multiplications, dashed arrows represent weighted peephole connections (using diagonal matrices), and f, g, h are non-linear functions. The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem. For example, as long as the input gate remains closed (has an activation near 0), the activation of the cell will not be overwritten by the new inputs arriving in the network, and can therefore be made available to the net much later in the sequence by opening the output gate.
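The write/hold/read behaviour can be sketched with a single scalar cell whose gate activations are set by hand instead of being learned. The names `GateSetting` and `run_cell`, and the choice g = h = tanh, are assumptions made only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hand-set gate activations for one time step (a real LSTM computes these).
struct GateSetting {
    double in;      // input gate: 1 = write, 0 = closed
    double forget;  // forget gate: 1 = keep the state, 0 = reset
    double out;     // output gate: 1 = read, 0 = closed
};

// One scalar memory cell: s_t = forget * s_{t-1} + in * tanh(x_t), output = out * tanh(s_t).
// Returns the cell output at every time step.
std::vector<double> run_cell(const std::vector<double>& x,
                             const std::vector<GateSetting>& gates) {
    assert(x.size() == gates.size());
    double s = 0.0;  // cell state
    std::vector<double> outputs;
    for (std::size_t t = 0; t < x.size(); ++t) {
        s = gates[t].forget * s + gates[t].in * std::tanh(x[t]);
        outputs.push_back(gates[t].out * std::tanh(s));
    }
    return outputs;
}
```

If the first step uses {1, 1, 0} (write), the middle steps use {0, 1, 0} (hold) and the last step uses {0, 1, 1} (read), the final output is tanh(tanh(x_1)) no matter what the later inputs were, because the closed input gate keeps them out of the cell.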
The structure of my LSTM network is similar to the structure we used in THE SIMPLE RNN, except that we’re using Jordan-type RNNs this time, which means we use not only the last output for calculating the gradient, but also all the earlier outputs.
W represents the weights between two time steps (horizontal), U represents the weights between hidden layers (vertical), and V represents the peephole weights (diagonal). i, f, o, c stand for the input gate, forget gate, output gate and cell.
a^t represents the input of a gate at time t, and i^t, f^t, o^t represent the gate activations at time t. delta represents a derivative, h^t represents the output, epsilon_h represents the output derivative, and epsilon_s represents the state derivative of the cell. More details can be found in the following equations.
2. Forward Pass
Here, prev means the output of the previous layer at time t, an asterisk represents element-wise multiplication, and sigma, g, h are non-linear functions.
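As a rough sketch of one forward step in this notation, the code below assumes the standard peephole formulation from Graves' book: sigma is the logistic function, g and h are both tanh, the input and forget gates peek at the previous cell state, and the output gate peeks at the current one. The type and function names (`GateWeights`, `LstmWeights`, `LstmState`, `lstm_step`) are made up for this example and biases are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: Mat[row][col]

struct GateWeights {
    Mat W;  // recurrent weights, applied to h^{t-1}
    Mat U;  // weights from the previous layer, applied to prev^t
    Vec V;  // peephole weights (diagonal stored as a vector); empty = no peephole
};
struct LstmWeights { GateWeights i, f, o, c; };
struct LstmState { Vec s, h; };  // cell state and block output

double sigmoid(double a) { return 1.0 / (1.0 + std::exp(-a)); }

// a = W * h_prev + U * prev + V * peep (element-wise peephole term, skipped if V is empty)
Vec pre_activation(const GateWeights& w, const Vec& h_prev, const Vec& prev, const Vec& peep) {
    assert(w.W.size() == w.U.size());
    Vec a(w.W.size(), 0.0);
    for (std::size_t r = 0; r < a.size(); ++r) {
        assert(w.W[r].size() == h_prev.size() && w.U[r].size() == prev.size());
        for (std::size_t k = 0; k < h_prev.size(); ++k) a[r] += w.W[r][k] * h_prev[k];
        for (std::size_t k = 0; k < prev.size(); ++k) a[r] += w.U[r][k] * prev[k];
        if (!w.V.empty()) a[r] += w.V[r] * peep[r];
    }
    return a;
}

LstmState lstm_step(const LstmWeights& w, const LstmState& last, const Vec& prev) {
    const std::size_t n = last.s.size();
    const Vec ai = pre_activation(w.i, last.h, prev, last.s);  // input gate
    const Vec af = pre_activation(w.f, last.h, prev, last.s);  // forget gate
    const Vec ac = pre_activation(w.c, last.h, prev, last.s);  // cell input (no peephole)
    LstmState now{Vec(n), Vec(n)};
    for (std::size_t r = 0; r < n; ++r) {
        // s^t = f^t * s^{t-1} + i^t * g(a_c^t)
        now.s[r] = sigmoid(af[r]) * last.s[r] + sigmoid(ai[r]) * std::tanh(ac[r]);
    }
    const Vec ao = pre_activation(w.o, last.h, prev, now.s);   // output gate sees s^t
    for (std::size_t r = 0; r < n; ++r) {
        // h^t = o^t * h(s^t)
        now.h[r] = sigmoid(ao[r]) * std::tanh(now.s[r]);
    }
    return now;
}
```

Calling `lstm_step` once per time step and feeding each returned state back in as `last` gives the forward pass over a whole sequence.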
3. Backward Pass
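As a sanity check on the backward pass, here is a scalar sketch for the last time step only (so there are no terms coming from time t+1). It again assumes the Graves-style peephole cell with logistic gates and g = h = tanh. The struct and function names are hypothetical, and the deltas are derivatives of the loss with respect to each gate's input a^t.

```cpp
#include <cassert>
#include <cmath>

// Scalar weights, named as in the text: W (time), U (layer), V (peephole).
struct Params {
    double Wi, Ui, Vi;  // input gate
    double Wf, Uf, Vf;  // forget gate
    double Wc, Uc;      // cell input
    double Wo, Uo, Vo;  // output gate
};
struct Cache { double i, f, o, g, s_prev, s, h; };  // values saved by the forward pass
struct Deltas { double i, f, o, c, eps_s; };

double sigmoid(double a) { return 1.0 / (1.0 + std::exp(-a)); }

Cache forward(const Params& p, double h_prev, double s_prev, double x) {
    Cache c;
    c.s_prev = s_prev;
    c.i = sigmoid(p.Wi * h_prev + p.Ui * x + p.Vi * s_prev);
    c.f = sigmoid(p.Wf * h_prev + p.Uf * x + p.Vf * s_prev);
    c.g = std::tanh(p.Wc * h_prev + p.Uc * x);
    c.s = c.f * s_prev + c.i * c.g;
    c.o = sigmoid(p.Wo * h_prev + p.Uo * x + p.Vo * c.s);
    c.h = c.o * std::tanh(c.s);
    return c;
}

// eps_h is the derivative of the loss with respect to the output h^t.
Deltas backward(const Params& p, const Cache& c, double eps_h) {
    assert(std::isfinite(eps_h));
    Deltas d;
    const double hs = std::tanh(c.s);
    d.o = c.o * (1.0 - c.o) * hs * eps_h;                    // output gate
    d.eps_s = c.o * (1.0 - hs * hs) * eps_h + p.Vo * d.o;    // cell state, incl. peephole path
    d.c = c.i * (1.0 - c.g * c.g) * d.eps_s;                 // cell input
    d.f = c.f * (1.0 - c.f) * c.s_prev * d.eps_s;            // forget gate
    d.i = c.i * (1.0 - c.i) * c.g * d.eps_s;                 // input gate
    return d;
}
```

With eps_h = 1, the gradient of h with respect to a weight such as Uc is `d.c * x`, which can be compared against a finite-difference estimate of `forward(...).h`. This is only a standard-library sketch for checking the formulas, separate from the full implementation.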
My implementation of LSTM RNNs is written in C++ and uses OpenCV as the linear algebra library.
It works, though it is still buggy: it reaches 98.5844% training accuracy and 95.3615% test accuracy after 25 epochs of training on the toy dataset. I’ll try to fix the bugs ASAP and upgrade this net to a bi-directional version. If you find any bugs, please let me know :p
Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks