Recurrent Neural Networks II — LSTM

In my previous post, I introduced the basic ideas behind Recurrent Neural Networks. In this second post of the series, we'll focus on the Long Short-Term Memory (LSTM) method.

LONG SHORT TERM MEMORY

One of the best-known problems with RNNs is the vanishing gradient: the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections.

[Figure: an unfolded recurrent network, with node shading showing each unit's sensitivity to the input at time one.]

The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity). The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs.
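To see the decay numerically, here is a tiny toy sketch (not from the post): if the backpropagated error is scaled by a factor with magnitude below one each time it passes through the recurrent connection, it shrinks exponentially; a factor above one makes it blow up instead.

```cpp
#include <cstdio>

// Toy illustration of the vanishing/exploding gradient. The factor 0.8 is an
// assumed per-step scaling of the backpropagated error (roughly |w * f'(a)|
// for a single recurrent unit), not a value taken from any real network.
int main() {
    double factor = 0.8;
    double grad = 1.0;              // error injected at the last time step
    for (int t = 1; t <= 50; ++t) {
        grad *= factor;             // one step back through the recurrence
        if (t % 10 == 0)
            std::printf("after %2d steps back: %.8f\n", t, grad);
    }
    return 0;                       // with factor = 1.25 the same loop explodes
}
```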

Long Short-Term Memory networks address this problem: they are recurrent networks built from memory blocks. Each block contains one or more self-connected memory cells and three multiplicative units, the input, output and forget gates, which provide continuous analogues of write, read and reset operations for the cells.

[Figure: an LSTM memory block with one cell and its input, forget and output gates.]

The figure above shows the inside of an LSTM block. Black arrows represent full matrix multiplications, dashed arrows represent weighted peephole connections (diagonal matrices), and f, g, h are non-linearity functions. The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem. For example, as long as the input gate remains closed (its activation is near 0), the activation of the cell will not be overwritten by new inputs arriving in the network, and can therefore be made available to the net much later in the sequence by opening the output gate.
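To make the gating mechanism concrete, here is a minimal single-cell sketch in plain C++ (scalar state, peepholes omitted; the names are illustrative and not taken from the code released below). With the input gate pushed towards 0 and the forget gate towards 1, the stored state survives each step almost untouched, and opening the output gate later reads it back out.

```cpp
#include <cmath>
#include <cstdio>

// One LSTM memory cell, scalar case, no peephole connections.
struct Cell {
    double s = 0.0;  // cell state
    double h = 0.0;  // cell output
};

static double sigm(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// a_i, a_f, a_o, a_c are the (already weighted) net inputs of the
// input gate, forget gate, output gate and cell at the current time step.
void step(Cell &c, double a_i, double a_f, double a_o, double a_c) {
    double i = sigm(a_i);          // input gate:  how much new content to write
    double f = sigm(a_f);          // forget gate: how much old state to keep
    double o = sigm(a_o);          // output gate: how much state to expose
    double g = std::tanh(a_c);     // candidate content
    c.s = f * c.s + i * g;         // state survives as long as f ~ 1 and i ~ 0
    c.h = o * std::tanh(c.s);      // read the cell through the output gate
}

int main() {
    Cell c;
    step(c, 5.0, -5.0, 5.0, 1.0);       // open input gate: write tanh(1) into the state
    std::printf("after write:    s = %.4f, h = %.4f\n", c.s, c.h);
    for (int t = 0; t < 10; ++t)
        step(c, -5.0, 5.0, -5.0, 0.0);  // input gate closed, forget gate open: state is kept
    std::printf("10 steps later: s = %.4f, h = %.4f\n", c.s, c.h);
    step(c, -5.0, 5.0, 5.0, 0.0);       // open the output gate to read the stored state
    std::printf("read out:       h = %.4f\n", c.h);
    return 0;
}
```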

METHODOLOGY

The structure of my LSTM network is similar to the one we used in THE SIMPLE RNN, except that this time we use a Jordan-type RNN, which means the gradient is calculated not only from the last output, but from all of the earlier outputs as well.
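Written out, and assuming a per-time-step loss ell (for example cross-entropy) with a target available at every step, the objective is accumulated over the whole sequence rather than taken only at the final output:

```latex
% Loss summed over every time step, rather than only over the final output
L \;=\; \sum_{t=1}^{T} \ell\!\left(\hat{y}^{\,t}, y^{\,t}\right)
\qquad\text{rather than}\qquad
L \;=\; \ell\!\left(\hat{y}^{\,T}, y^{\,T}\right)
```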

1. Notation


W represents the weights between two time slots (horizontal), U the weights between hidden layers (vertical), and V the peephole weights (diagonal). The subscripts i, f, o, c stand for the input gate, forget gate, output gate and cell, respectively.


a^t represents the net input of a gate (or cell) at time t, and i^t, f^t, o^t represent the gate activations at time t. delta represents the derivative with respect to a net input, h^t the block output, epsilon_h the output derivative, and epsilon_s the state derivative of the cell; more details can be found in the equations below.

2. Forward Pass

Input gates:
a_i^t = U_i prev^t + W_i h^{t-1} + V_i * s^{t-1}
i^t = sigma(a_i^t)

Forget gates:
a_f^t = U_f prev^t + W_f h^{t-1} + V_f * s^{t-1}
f^t = sigma(a_f^t)

Cells:
a_c^t = U_c prev^t + W_c h^{t-1}
s^t = f^t * s^{t-1} + i^t * g(a_c^t)

Output gates:
a_o^t = U_o prev^t + W_o h^{t-1} + V_o * s^t
o^t = sigma(a_o^t)

Cell outputs:
h^t = o^t * h(s^t)

Here, prev denotes the output of the previous layer at time t, the asterisk (*) denotes element-wise multiplication, and sigma, g, h are non-linearity functions.
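Since the released code uses OpenCV for linear algebra, here is a minimal sketch of the forward pass above with cv::Mat. The names and shapes (LstmLayer, forward_step, column-vector states) are my assumptions for illustration, not the actual interface of the repository; the diagonal peephole weights are stored as column vectors and applied element-wise.

```cpp
#include <opencv2/core/core.hpp>

// Hypothetical container for one LSTM layer's parameters (CV_64F matrices).
struct LstmLayer {
    cv::Mat U_i, U_f, U_c, U_o;   // hidden_size x input_size  (previous layer -> gates/cell)
    cv::Mat W_i, W_f, W_c, W_o;   // hidden_size x hidden_size (recurrent, t-1 -> t)
    cv::Mat V_i, V_f, V_o;        // hidden_size x 1           (diagonal peepholes)
};

static cv::Mat sigmoid(const cv::Mat &x) {
    cv::Mat nx = -x, e;
    cv::exp(nx, e);               // element-wise exp
    return 1.0 / (1.0 + e);
}

static cv::Mat tanh_m(const cv::Mat &x) {
    cv::Mat n2x = -2.0 * x, e;
    cv::exp(n2x, e);
    return (1.0 - e) / (1.0 + e); // tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))
}

// prev: previous layer's output at time t; h_prev, s_prev: this layer's output
// and cell state at time t-1; h, s: results for time t (all column vectors).
void forward_step(const LstmLayer &L, const cv::Mat &prev,
                  const cv::Mat &h_prev, const cv::Mat &s_prev,
                  cv::Mat &h, cv::Mat &s) {
    cv::Mat i = sigmoid(L.U_i * prev + L.W_i * h_prev + L.V_i.mul(s_prev)); // input gate
    cv::Mat f = sigmoid(L.U_f * prev + L.W_f * h_prev + L.V_f.mul(s_prev)); // forget gate
    cv::Mat g = tanh_m (L.U_c * prev + L.W_c * h_prev);                     // candidate cell input
    s = f.mul(s_prev) + i.mul(g);                                           // new cell state
    cv::Mat o = sigmoid(L.U_o * prev + L.W_o * h_prev + L.V_o.mul(s));      // output gate (peeks at s^t)
    h = o.mul(tanh_m(s));                                                   // cell output
}
```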

3. Backward Pass

Cell outputs:
epsilon_h^t = (error from the layer above at time t) + W_i^T delta_i^{t+1} + W_f^T delta_f^{t+1} + W_c^T delta_c^{t+1} + W_o^T delta_o^{t+1}

Output gates:
delta_o^t = epsilon_h^t * h(s^t) * sigma'(a_o^t)

States:
epsilon_s^t = epsilon_h^t * o^t * h'(s^t) + f^{t+1} * epsilon_s^{t+1} + V_i * delta_i^{t+1} + V_f * delta_f^{t+1} + V_o * delta_o^t

Cells:
delta_c^t = epsilon_s^t * i^t * g'(a_c^t)

Forget gates:
delta_f^t = epsilon_s^t * s^{t-1} * sigma'(a_f^t)

Input gates:
delta_i^t = epsilon_s^t * g(a_c^t) * sigma'(a_i^t)
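And here is a matching sketch of one backward step through these equations, again with cv::Mat and with my own (assumed) names rather than the repository's. It expects that eps_h already contains the error from the layer above at time t plus the W^T * delta^{t+1} terms, and that the recurrent quantities from time t+1 are passed in (zeros at the last time step).

```cpp
#include <opencv2/core/core.hpp>

// Derivatives of the non-linearities written in terms of the activations:
// sigma'(a) = y * (1 - y) for y = sigma(a), and h'(s) = 1 - h(s)^2 for h = tanh.
static cv::Mat dsigmoid(const cv::Mat &y) { cv::Mat m = 1.0 - y; return y.mul(m); }
static cv::Mat dtanh_a (const cv::Mat &y) { cv::Mat m = y.mul(y); return 1.0 - m; }

// i, f, o: gate activations at time t; g = g(a_c^t); tanh_s = h(s^t); s_prev = s^{t-1}.
// eps_h: cell-output error at time t; eps_s_next, f_next, d_i_next, d_f_next: terms from t+1.
// V_i, V_f, V_o: diagonal peephole weights stored as column vectors.
// Outputs: the gate/cell deltas at time t and eps_s, which is needed again at time t-1.
void backward_step(const cv::Mat &i, const cv::Mat &f, const cv::Mat &o,
                   const cv::Mat &g, const cv::Mat &s_prev, const cv::Mat &tanh_s,
                   const cv::Mat &eps_h, const cv::Mat &eps_s_next,
                   const cv::Mat &f_next, const cv::Mat &d_i_next, const cv::Mat &d_f_next,
                   const cv::Mat &V_i, const cv::Mat &V_f, const cv::Mat &V_o,
                   cv::Mat &d_i, cv::Mat &d_f, cv::Mat &d_c, cv::Mat &d_o, cv::Mat &eps_s) {
    d_o   = eps_h.mul(tanh_s).mul(dsigmoid(o));                     // output gates
    eps_s = eps_h.mul(o).mul(dtanh_a(tanh_s))                       // states
          + eps_s_next.mul(f_next)
          + V_i.mul(d_i_next) + V_f.mul(d_f_next) + V_o.mul(d_o);
    d_c   = eps_s.mul(i).mul(dtanh_a(g));                           // cells
    d_f   = eps_s.mul(s_prev).mul(dsigmoid(f));                     // forget gates
    d_i   = eps_s.mul(g).mul(dsigmoid(i));                          // input gates
}
```

The weight gradients themselves are then accumulated as outer products of these deltas with the corresponding inputs (prev^t, h^{t-1}, s^{t-1} or s^t), which is omitted from this sketch.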

SOURCE CODE

https://github.com/xingdi-eric-yuan/recurrent-net-lstm

It is a C++ implementation of LSTM RNNs, using OpenCV as the linear algebra library.

It works, though it is still buggy: it reaches 98.5844% training accuracy and 95.3615% test accuracy after 25 epochs of training on the toy dataset. I'll try to fix the bugs as soon as possible and upgrade this net to a bi-directional version. If you find any bug, please let me know :p

REFERENCES

Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.


