Long Short-Term Memory

What is Long Short-Term Memory?

Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks (RNNs) designed to learn sequential (temporal) data and its long-term dependencies more reliably than conventional RNNs. They mitigate the vanishing-gradient problem, which is what gives traditional RNNs only a short-term memory. LSTMs are used extensively in deep learning tasks such as stock market prediction, handwriting recognition, speech recognition, and natural language processing. Because these networks can hold on to short-term memory across long stretches of a sequence, they are called long short-term memory networks.

Long Short-Term Memory Architecture

The main components of a classic LSTM architecture are the cell state and its regulators, the gates. The cell state is the memory unit of the network: it carries information from one time step to the next, and gates that open or close control what is written to it, kept in it, or read from it. In this way, information from early steps can stay in the cell state and remain available throughout the processing of the sequence. The gates are analog rather than binary: each gate passes its inputs through a sigmoid function to produce values between 0 and 1 and applies them by element-wise multiplication, while hyperbolic tangent (tanh) functions scale the values being written to and read from the cell state. The gates are similar to neural network nodes in that they have their own weights, and during training the network learns which information is relevant to keep and which to forget.
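
The gating idea can be illustrated with a few numbers. The following is a minimal sketch, assuming NumPy; the values in `memory` and `gate_logits` are hypothetical and chosen only to show a gate that is almost open, almost closed, and half open.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only for illustration.
memory = np.array([2.0, -1.5, 0.5])         # information held in the cell state
gate_logits = np.array([10.0, -10.0, 0.0])  # what a gate computes before the sigmoid

gate = sigmoid(gate_logits)  # close to 1, close to 0, and exactly 0.5
gated = gate * memory        # element-wise: keep, block, or pass half

print(gate)
print(gated)
```

Because the gate values are continuous rather than 0 or 1, the whole operation stays differentiable, which is what allows the gate weights to be learned by gradient descent.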

A single classic LSTM unit consists of a cell state and its three gates: an input gate, an output gate and a forget gate, along with other nonlinear functions and pointwise operators. In short, the forget gate decides what is relevant to keep from the prior cell state. The input gate decides what information is relevant to update in the current cell state of the LSTM unit. The output gate determines the present hidden state that will be passed to the next LSTM unit. A pictorial representation of a single LSTM unit and how it works is given below.
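
The same unit is commonly summarized as a set of equations. The block below is a sketch of the standard formulation, not taken from the original text: the weight matrices W and U, the bias vectors b, the logistic sigmoid σ, and the element-wise product ⊙ are notation assumed here, and the candidate value written C’t in the text appears as \tilde{C}_t.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C x_t + U_C h_{t-1} + b_C) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state}
\end{aligned}
```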

Non-linear functions: The gates rely on non-linear activation functions, namely sigmoid and tanh. The sigmoid function outputs values between 0 and 1. This makes it suitable for updating or forgetting data: multiplying by a value near 1 retains the data, and multiplying by a value near 0 forgets it. The tanh function has a similar shape, but its output ranges from -1 to 1 and is centred at 0.
Forget gate: The first block in the LSTM architecture is the forget gate (ft). The current input (xt) and the previous hidden state (ht-1) are passed through the sigmoid activation function. An output close to 0 means forget the corresponding value of the previous cell state, and an output close to 1 means retain it.
Input gate: This gate controls what is written to the cell state. It has two parts. First, the previous hidden state (ht-1) and the current input (xt) are passed into a sigmoid function, whose output (it) decides which values will be updated. Then the same two inputs are passed into a tanh activation, which produces candidate values (C’t) between -1 and 1 and so regulates the network. Finally, the tanh output (C’t) is multiplied pointwise with the sigmoid output (it) to decide which information is important enough to update the cell state.
Cell state: The previous cell state (Ct-1) is multiplied pointwise by the forget gate output. Where the forget gate output is 0, the corresponding value of the previous cell state (Ct-1) is discarded. The result is then added pointwise to the input gate output to give the new cell state (Ct). The present cell state becomes an input to the next LSTM unit.
Output gate: The hidden state contains information on previous inputs and is used for prediction. The output gate regulates the present hidden state (ht). The previous hidden state (ht-1) and the current input (xt) are passed to the sigmoid function. This output is multiplied pointwise with the tanh of the new cell state (Ct) to obtain the present hidden state. The current cell state (Ct) and the present hidden state (ht) are the final outputs of a classic LSTM unit.
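
The components described above can be put together in a few lines. The following is a minimal sketch of one forward step of a classic LSTM unit, assuming NumPy. The function name `lstm_step`, the stacked weight matrix `W`, the bias `b`, the layer sizes, and the random weights are all assumptions made for illustration; a trained network would learn `W` and `b` from data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of a classic LSTM unit (illustrative sketch).

    W has shape (4 * hidden, hidden + inputs) and b has shape (4 * hidden,).
    Stacking the four weight blocks in one matrix is a layout chosen here
    for brevity, not something prescribed by the text.
    """
    hidden = h_prev.shape[0]
    # Every block sees the previous hidden state and the current input.
    z = W @ np.concatenate([h_prev, x_t]) + b

    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])  # candidate values (C't)
    o_t = sigmoid(z[3 * hidden:4 * hidden])      # output gate

    c_t = f_t * c_prev + i_t * c_tilde  # forget old content, add new content
    h_t = o_t * np.tanh(c_t)            # expose a filtered view of the cell state
    return h_t, c_t

# Hypothetical sizes and random, untrained weights.
rng = np.random.default_rng(0)
inputs, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)

# Run a sequence of 5 time steps, carrying h and c from step to step.
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state is updated only by an element-wise multiplication and an addition. This largely additive path is what lets information, and gradients during training, survive across many time steps.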