Recurrent neural networks (RNNs) are widely used for sequence processing. However, when the standard RNN model is trained, the gradient vanishing and gradient exploding problems are often encountered.
Shortcomings of RNN
Gradient vanishing and exploding in an RNN differ from those in other networks. In fully connected and convolutional networks each layer has its own parameters, whereas every processing unit (Cell, the operation that handles a single sequence element) of an RNN shares the same weight matrix W. As the previous introduction to the RNN algorithm showed, the processing units are fully connected, and during forward propagation the sequence is repeatedly multiplied by the weight matrix W, producing a product of the form W^n. When the magnitude of W is less than 1 and the sequence is long, the result approaches 0; when it is greater than 1, the values grow rapidly after many steps. Back propagation suffers from the same problem.
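As a rough illustration of this point (my own sketch, not taken from the original text), repeatedly multiplying a value by a shared scalar weight shows the same decay or blow-up:

    # Toy illustration: repeatedly multiply a signal by a shared scalar "weight" w.
    # With |w| < 1 the value decays toward 0; with |w| > 1 it grows rapidly.
    def repeat_multiply(w, steps=50, x=1.0):
        for _ in range(steps):
            x = w * x
        return x

    print(repeat_multiply(0.9))   # about 0.005 -> the signal (and gradient) vanishes
    print(repeat_multiply(1.1))   # about 117   -> the signal (and gradient) explodes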
The gradient explosion problem is usually mitigated by "gradient clipping". Gradient vanishing, on the other hand, prevents data at the front of the sequence from playing its due role, causing the "long-term dependencies" (Long-Term Dependencies) problem; in other words, an RNN can only handle short-range dependencies.
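A minimal sketch of gradient clipping in PyTorch, assuming a toy nn.RNN and random data (the concrete sizes are my own choices): torch.nn.utils.clip_grad_norm_ rescales the gradients so that their overall norm does not exceed max_norm.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)

    x = torch.randn(4, 20, 8)          # (batch, seq_len, input_size)
    target = torch.randn(4, 20, 16)    # pretend targets for the hidden outputs

    output, hn = rnn(x)
    loss = criterion(output, target)
    loss.backward()
    # Clip the overall gradient norm before the update to curb gradient explosion.
    torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
    optimizer.step()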
This is similar to the situation in image processing, where simply deepening a convolutional network no longer improves the results. Although the effect can in theory be improved by tuning parameters, doing so is very difficult; image processing eventually solved the problem by modifying the network structure and introducing residual networks. Likewise, the RNN structure has been improved, yielding the LSTM and GRU networks. As variants of the RNN, they are now used more often than the standard version.
LSTM (Long Short-Term Memory Network)
LSTM is the abbreviation of Long Short-Term Memory Networks. The method was proposed in 1997, mainly to solve the "long-term dependencies" problem. Unlike the RNN, which uses a single hidden state to describe the pattern, the LSTM adds a new cell state (Cell state, abbreviated c) and uses several gates to control the read, write, and forget operations.
Gating (gate)
The idea of gating comes from biology: some cells in the spinal cord act like gates (a signal can only pass when the gate is open), cutting off or blocking some pain signals from reaching the brain. In neural networks, an activation function is usually used to control data transmission. The sigmoid activation function, for example, is often used to control whether a signal passes: its value ranges from 0 to 1, where 0 means blocked, 1 means fully passed, and values in between pass part of the data, thereby achieving selective input, selective output, and selective memory.
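As a tiny illustration (my own example, not from the original text), a gate is just a sigmoid output multiplied element-wise with a signal, which passes, blocks, or partially passes it:

    import torch

    signal = torch.tensor([2.0, -1.0, 0.5])
    gate_logits = torch.tensor([10.0, -10.0, 0.0])  # arbitrary pre-activation values

    gate = torch.sigmoid(gate_logits)   # roughly [1.0, 0.0, 0.5]
    gated = gate * signal               # roughly [2.0, 0.0, 0.25]: pass, block, half-pass
    print(gate, gated)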
algorithm
The figure above depicts the forward propagation of information as the LSTM network processes the input xt (each element of the sequence) and generates the output ht. The author divides it into six steps, circled and numbered in the figure; a code sketch of the full cell follows the six steps.
Step 1: Compute the forget gate. The forget gate (forget gate, abbreviated f) controls whether the previous cell state is forgotten. The gate's inputs are the previous hidden state h(t-1) and the current input xt, passed through a sigmoid (denoted σ) activation function to obtain the forget gate's value ft at the current time t; W and b are the gate's weights and bias. For example, when the input word is "however", the preceding memory may be judged no longer important, ft takes the value 0, and the earlier memory is cleared (just an illustration, not to be taken literally). The formula is ft = σ(Wf·[h(t-1), xt] + bf).
Step 2: Compute the input gate. The input gate (input gate, abbreviated i) controls how much new content is added to the cell state. Its inputs are again the previous hidden state h(t-1) and the current input xt, from which it is calculated. For example, when the input is ",", the input may be judged to carry no useful information, it takes the value 0, and the input is ignored.
Step 3: Compute the candidate input value. This step is similar to computing the hidden state in the RNN algorithm: the inputs are again the previous hidden state h(t-1) and the current input xt, from which gt is calculated. It represents the concrete contribution of the current input; the activation function used here is tanh.
Step 4: Compute the output gate. The output gate (output gate, abbreviated o) uses ot to control the output when the cell state is used to compute the output value ht. Its inputs are again the previous hidden state h(t-1) and the current input xt.
Step 5: Compute the current cell state. The forget gate f controls the previous state c(t-1), the input gate i controls the current input g, and the current state ct is obtained (part of the past information is forgotten, and some new information is added).
Step 6: Compute the hidden state h from the current cell state c and the output gate o. The last two steps organize the data through the gates.
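A minimal sketch of the six steps as a single LSTM cell forward pass; the function name, shapes, and parameter layout are my own assumptions for illustration, not the original author's code.

    import torch

    def lstm_cell_forward(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_g, b_g, W_o, b_o):
        z = torch.cat([h_prev, x_t], dim=-1)    # [h(t-1), xt]
        f_t = torch.sigmoid(z @ W_f + b_f)      # Step 1: forget gate
        i_t = torch.sigmoid(z @ W_i + b_i)      # Step 2: input gate
        g_t = torch.tanh(z @ W_g + b_g)         # Step 3: candidate input value
        o_t = torch.sigmoid(z @ W_o + b_o)      # Step 4: output gate
        c_t = f_t * c_prev + i_t * g_t          # Step 5: new cell state
        h_t = o_t * torch.tanh(c_t)             # Step 6: new hidden state
        return h_t, c_t

    # Toy sizes: input size 4, hidden size 3; weights random, biases zero.
    input_size, hidden_size = 4, 3
    shape = (hidden_size + input_size, hidden_size)
    params = [torch.randn(*shape) if k % 2 == 0 else torch.zeros(hidden_size)
              for k in range(8)]
    h, c = lstm_cell_forward(torch.randn(1, input_size),
                             torch.zeros(1, hidden_size),
                             torch.zeros(1, hidden_size), *params)

In practice nn.LSTM learns and stores these W and b matrices internally; the sketch only makes the six steps explicit.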
The standard RNN model is coarse: it adjusts only one set of parameters, whereas the LSTM refines the problem into several subproblems and iteratively computes several sets of W parameters, so its computation is considerably larger than that of an ordinary RNN. The LSTM's core principle is to keep information intact: it assumes each state is obtained by adding a change to the previous state (similar to a residual network). Because the two pieces of information are added rather than multiplied layer by layer as in the RNN, the gradient explosion/vanishing problem is alleviated, and the LSTM works better on longer sequences.
usage
The LSTM provided by PyTorch is called in almost the same way as the RNN: simply change "RNN" to "LSTM"; no other adjustments are required.
The difference from the RNN is that when calling the forward function, both the arguments passed in and the values returned can contain the two sets of values h and c; the format is shown in the sketch below. Here input is the input and output is the output; the second argument (h0, c0) is a tuple, where h0 and c0 are the initial values of the two hidden states; likewise the LSTM computes and returns the final hidden-state values (hn, cn). The dimensions of h and c are (num_layers, batch_size, hidden_size).
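A hedged usage sketch with PyTorch's nn.LSTM (the sizes below are illustrative choices of mine):

    import torch
    import torch.nn as nn

    input_size, hidden_size, num_layers = 8, 16, 2
    batch_size, seq_len = 4, 10

    lstm = nn.LSTM(input_size, hidden_size, num_layers)     # same call style as nn.RNN

    x = torch.randn(seq_len, batch_size, input_size)        # (seq_len, batch, input_size)
    h0 = torch.zeros(num_layers, batch_size, hidden_size)   # (num_layers, batch_size, hidden_size)
    c0 = torch.zeros(num_layers, batch_size, hidden_size)

    output, (hn, cn) = lstm(x, (h0, c0))    # h and c go in and come back as a tuple
    print(output.shape, hn.shape, cn.shape)
    # torch.Size([10, 4, 16]) torch.Size([2, 4, 16]) torch.Size([2, 4, 16])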
GRU (Gated Recurrent Unit)
GRU is the abbreviation of Gated Recurrent Unit. The method was proposed in 2014 as a variant of the LSTM network: its structure is simpler than the LSTM's, its logic is clearer, it runs faster, and its results are also very good. The GRU model has only two gates: the update gate and the reset gate. Its network structure is more similar to the RNN's: at each step it receives the current input from the sequence and the output of the previous hidden state, and outputs a hidden state.
The figure above depicts the forward propagation of information as the GRU processes the input xt and generates the output ht. The author divides it into four steps, circled and numbered in the figure; a code sketch of the full cell follows the four steps.
Step 1: Compute the update gate. The update gate (update gate, abbreviated z) plays a role similar to the LSTM's forget gate: it controls the proportions of previous information and of the new input in the current state. Its inputs are again the previous hidden state h(t-1) and the current input xt; the bias parameter b is omitted.
Step 2: Compute the reset gate. The reset gate (reset gate, usually abbreviated r) plays a role similar to the LSTM's input gate. Its inputs are again the previous hidden state h(t-1) and the current input xt.
Step 3: Compute the candidate input value. It is calculated from the previous hidden state h(t-1), the current input xt, and the reset gate rt, and can be regarded as the influence of the current input on the state.
Step 4: Compute the current state. The current state consists of two parts: the first part is the influence of the previous information, and the second part is the influence of the current input. The parameter zt is the value of the update gate; it has passed through the sigmoid activation function, so its value lies between 0 and 1, which means the weights of the two parts sum to 1, and the update gate balances their proportions.
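A minimal sketch of the four steps as one GRU cell forward pass, with biases omitted as in the text above. References differ on whether zt weights the previous state or the new candidate; here I assume it weights the previous state (the two weights still sum to 1), so treat that convention as an assumption of mine.

    import torch

    def gru_cell_forward(x_t, h_prev, W_z, W_r, W_h):
        zx = torch.cat([h_prev, x_t], dim=-1)     # [h(t-1), xt], no bias terms
        z_t = torch.sigmoid(zx @ W_z)             # Step 1: update gate
        r_t = torch.sigmoid(zx @ W_r)             # Step 2: reset gate
        h_tilde = torch.tanh(torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h)  # Step 3: candidate
        h_t = z_t * h_prev + (1.0 - z_t) * h_tilde  # Step 4: the two weights sum to 1
        return h_t

    input_size, hidden_size = 4, 3
    W_z = torch.randn(hidden_size + input_size, hidden_size)
    W_r = torch.randn(hidden_size + input_size, hidden_size)
    W_h = torch.randn(hidden_size + input_size, hidden_size)
    h = gru_cell_forward(torch.randn(1, input_size), torch.zeros(1, hidden_size),
                         W_z, W_r, W_h)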
Compared with the LSTM, the cell state is omitted and its function is taken over by the hidden state h, the output gate o is omitted, and the bias parameters b of each layer are removed. The computation is simplified in several places and consumes fewer resources.
The specific calling method in PyTorch is similar to that of the RNN and will not be repeated here.
Optimizing the RNN network
Deep learning tools generally provide APIs for calling the RNN model directly; a tool like Keras can create an LSTM layer with a single statement. Besides calling the API, what else does a programmer need to do?
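For instance, a Keras LSTM layer really is a single statement; a hedged sketch with arbitrary layer sizes of my own:

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.LSTM(64, input_shape=(None, 8)),  # one statement adds an LSTM layer
        keras.layers.Dense(1),
    ])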
Each processing unit of a recurrent neural network is connected to the next unit through one or more fully connected networks, similar to a multi-layer CNN, so the longer the sequence, the more complex the computation. When designing the network, the model's complexity should be considered and the training time estimated. The factors involved include: the number of iterations, the sequence length, how the sequence is segmented, the number of hidden layers, the number of units per hidden layer, the learning rate, whether the hidden state is passed to the next iteration, other hyperparameters, and the initial values of the parameters.
For example, the RNN's error often does not converge smoothly, especially when the sequence is long, so a fixed learning rate is hard to choose; the Adam optimizer, which adjusts the learning dynamics automatically, is recommended.
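A short sketch of switching to the Adam optimizer in PyTorch (the model and learning rate are placeholder choices of mine):

    import torch
    import torch.nn as nn

    model = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)
    # Adam adapts per-parameter step sizes, which helps when RNN losses converge unevenly.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)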