Long Short Term Memory

What is Long Short Term Memory?

Long Short Term Memory (LSTM) networks are an extension of recurrent neural networks (RNNs) designed to learn sequential (temporal) data and its long-term dependencies more reliably than conventional RNNs. They mitigate the vanishing gradient problem, and the resulting short-term memory, of traditional RNNs. LSTMs are widely used in deep learning tasks such as stock market prediction, handwriting recognition, speech recognition and natural language processing. The name reflects their ability to retain short-term memories over long spans of a sequence.
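The vanishing gradient problem can be illustrated with a little arithmetic. The sketch below is a toy model, not a real backpropagation: it assumes the gradient is scaled by one fixed scalar factor at every time step. The values 0.5 and 0.99 are made up for illustration. In an LSTM, the per-step factor along the cell-state path is roughly the forget gate value (ignoring the gates' own dependence on earlier states), so a forget gate close to 1 lets the gradient survive many steps.

```python
# Toy illustration: a gradient multiplied by the same factor at every time step.
steps = 50
rnn_factor = 0.5     # assumed per-step shrink factor in a plain RNN
forget_gate = 0.99   # assumed forget-gate value in an LSTM, close to 1

rnn_gradient = rnn_factor ** steps     # shrinks exponentially towards 0
lstm_gradient = forget_gate ** steps   # stays at a usable magnitude

print(f"plain RNN after {steps} steps: {rnn_gradient:.2e}")
print(f"LSTM cell-state path after {steps} steps: {lstm_gradient:.2f}")
```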

Long Short-Term Memory Architecture

The main components of a classic LSTM architecture are the cell state and its regulators, the gates. The cell state is the memory unit of the network: it carries information along the sequence, and gates that open or close control what is written to it, kept in it, or read from it. Because of this, information from early steps can remain in the cell state and stay available throughout the processing of the sequence. The gates are analog rather than binary: each gate is a sigmoid layer whose output, a value between 0 and 1, is applied by element-wise multiplication, while hyperbolic tangent (tanh) functions scale the candidate values that the gates act on. The gates are similar to neural network nodes that decide which information is allowed to enter the cell state. Each gate has its own weights, and during training the network learns which information is relevant to keep and which to forget.
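The gating idea can be shown in a few lines of NumPy. This is a minimal sketch: the names gate_scores and information and their values are hypothetical, chosen only to show how a sigmoid output near 0 blocks a value and an output near 1 lets it through.

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical pre-activation scores for a gate with four elements.
gate_scores = np.array([-10.0, -1.0, 1.0, 10.0])
gate = sigmoid(gate_scores)          # analog gate values between 0 and 1

# Hypothetical information that the gate regulates.
information = np.array([5.0, 5.0, 5.0, 5.0])
gated = gate * information           # element-wise multiplication

print(gate)
print(gated)
```

The first element is almost entirely blocked and the last passes almost unchanged, while the two middle elements are let through only partially.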

A single classic Long Short Term Memory unit consists of a cell state and three gates (an input gate, an output gate and a forget gate), along with other nonlinear functions and pointwise operators. In short, the forget gate decides what is relevant to keep from the prior cell state. The input gate decides what information is relevant to update in the current cell state of the LSTM unit. The output gate determines the present hidden state that will be passed on to the next time step. A pictorial representation of a single LSTM unit and its working is given below.

[Figure: a single LSTM unit]

  • Non-linear functions: The gates use non-linear activation functions, namely sigmoid and tanh. The sigmoid output varies smoothly between 0 and 1, which makes it suitable for updating or forgetting data: multiplying by 1 retains a value, and multiplying by 0 forgets it. The tanh function has a similar shape, but its output varies between -1 and 1 and is centred at 0.
  • Forget gate: The first block in the LSTM architecture is the forget gate (ft). The current input (Xt) and the previous hidden state (ht-1) are passed through the sigmoid activation function. An output closer to 0 means forget, and an output closer to 1 means retain.
  • Input gate: It controls what is written to the cell state, and it has two parts. First, the previous hidden state (ht-1) and the current input (Xt) are passed into a sigmoid function to decide which values will be updated (it). Then the same two inputs are passed into a tanh activation, which produces candidate values (C’t) squashed between -1 and 1. Finally, the tanh output (C’t) is multiplied by the sigmoid output (it) to decide which information is important enough to update the cell state.
  • Cell state: The previous cell state (Ct-1) is pointwise multiplied by the forget gate output. Where the forget output is 0, the corresponding part of the previous cell state (Ct-1) is discarded. The result is pointwise added to the input gate output to give the new cell state (Ct). The present cell state becomes an input to the LSTM unit at the next time step.
  • Output gate: The hidden state contains information on previous inputs and is used for prediction. The output gate regulates the present hidden state (ht). The previous hidden state (ht-1) and the current input (Xt) are passed to the sigmoid function. This output is multiplied by the tanh of the new cell state (Ct) to obtain the present hidden state. The current cell state (Ct) and the present hidden state (ht) are the final outputs of a classic LSTM unit.
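The steps in the list above can be sketched as one forward step of a classic LSTM unit in NumPy. This is a minimal illustration and not a production implementation. The function names lstm_step and init_params, the parameter names (W_f, b_f and so on), the random initialisation and the layer sizes are all assumptions made for the example. Here x_t stands for the current input Xt, h_prev for ht-1 and c_prev for Ct-1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a classic LSTM unit (no peephole connections)."""
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, X_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])     # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])     # input gate
    c_hat = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate values C'_t
    c_t = f_t * c_prev + i_t * c_hat                     # new cell state C_t
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)                             # new hidden state h_t
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        shape = (hidden_size, hidden_size + input_size)
        params[f"W_{name}"] = rng.normal(0.0, 0.1, shape)
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

# Run the unit over a made-up sequence of 5 inputs with 3 features each.
params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
sequence = np.random.default_rng(1).normal(size=(5, 3))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)

print("hidden state:", h)
print("cell state:", c)
```

The same weights are reused at every time step, and only h and c change as the sequence is processed. Setting the forget gate bias b_f to a large negative value drives f_t towards 0, which makes the unit discard the previous cell state.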

Long Short Term Memory Variants

LSTM variants change how the gates are used in order to improve on the performance of the basic LSTM. There are many variants of the LSTM network, and a few prominent ones are listed below.

  • Peephole connections: In a classic LSTM, the gates decide what to forget and what to add based only on the previous hidden state and the present input; they do not consider the contents of the cell state. Intuitively, it makes sense for the LSTM unit to know which memories it already holds before replacing them with new ones. Peephole connections address this by feeding the cell state contents to the gates as an additional input. This configuration has been shown to improve the ability to count and to time distances between rare events.
  • Gated recurrent unit: This variant combines the gating functions of the input gate and the forget gate into a single update gate. It also merges the cell state and the hidden output into a single hidden state, and uses a reset gate when computing an intermediate candidate hidden state. It is somewhat simpler than the LSTM, with fewer parameters, and therefore tends to train a little faster. The gated recurrent unit (GRU) is widely used for sequence-to-sequence learning tasks such as machine translation, and for music and text generation. However, GRUs have been reported to be less capable than classic LSTMs on tasks that require counting.
  • LSTM with attention: Attention is the ability to focus on specific elements in the data; here, the attention is over the hidden state outputs of the LSTM. Google used this variant in its neural machine translation system, which achieved state-of-the-art results and powered Google Translate.
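For comparison with the classic LSTM, a single step of a gated recurrent unit can be sketched as follows. This is one common formulation, written under assumptions: the names gru_step and init_gru_params, the parameter names and the sizes are hypothetical, and some references swap the roles of z_t and (1 - z_t) in the final blend.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One forward step of a gated recurrent unit (one common convention)."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ z_in + params["b_z"])        # update gate
    r_t = sigmoid(params["W_r"] @ z_in + params["b_r"])        # reset gate
    cand_in = np.concatenate([r_t * h_prev, x_t])
    h_hat = np.tanh(params["W_h"] @ cand_in + params["b_h"])   # candidate state
    # The update gate blends the old state with the candidate, doing the
    # work that the forget and input gates share in a classic LSTM.
    return (1.0 - z_t) * h_prev + z_t * h_hat

def init_gru_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("z", "r", "h"):
        shape = (hidden_size, hidden_size + input_size)
        params[f"W_{name}"] = rng.normal(0.0, 0.1, shape)
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

# Run the unit over a made-up sequence of 5 inputs with 3 features each.
gru_params = init_gru_params(input_size=3, hidden_size=4)
h = np.zeros(4)
for x_t in np.random.default_rng(1).normal(size=(5, 3)):
    h = gru_step(x_t, h, gru_params)

print("hidden state:", h)
```

The GRU carries a single state vector and uses three weight matrices where the LSTM sketch needs four, which is where its smaller parameter count comes from.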


Long Short Term Memory is a kind of artificial neural network, so it must be trained on a training dataset before it is used in real-world applications. Some prominent applications are listed below.

  • Language modelling or text generation
  • Speech and handwriting recognition
  • Image to text translation or image captioning
  • Language translation
  • Music and speech synthesis 
  • Image generation using attention models
  • Video-to-text conversion
  • Protein secondary structure prediction