A neural network that can handle **time series data**.
Time series data: a series of data points observed at regular intervals in chronological order, with statistical dependencies between them.
- Voice data
- Text data
- Stock price data
In forward propagation, the output of the intermediate layer at the previous time step is multiplied by a weight and offset by a bias, and the result is used as part of the input to the intermediate layer at the current time step. To handle a time series, the model needs a recursive structure that holds the initial state and the state at the past time t-1 and uses them to compute the state at the next time t.
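A minimal numpy sketch of this recurrence, using the notation introduced below ($W_{(in)}$, $W$, $W_{(out)}$, $b$, $c$); the sizes and random weights are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# illustrative sizes: 2 input features, 16 hidden units, 1 output
rng = np.random.default_rng(0)
W_in  = rng.normal(0, 0.1, (2, 16))    # input -> middle layer
W     = rng.normal(0, 0.1, (16, 16))   # middle layer at t-1 -> middle layer at t
W_out = rng.normal(0, 0.1, (16, 1))    # middle layer -> output
b, c = np.zeros(16), np.zeros(1)

x = rng.normal(size=(5, 2))            # a sequence of 5 time steps
z = np.zeros(16)                       # initial state
for t in range(len(x)):
    u = x[t] @ W_in + z @ W + b        # previous state enters the current input
    z = sigmoid(u)                     # state z^t, carried over to the next step
    y = sigmoid(z @ W_out + c)         # output y^t at time t
```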
BPTT
Backpropagation Through Time: a form of the backpropagation method used to adjust the parameters of an RNN, in which the error is propagated backwards along the time axis as well. Because the amount of computation grows with the length of the sequence, it is not well suited to long time series.
$ \frac{\partial E}{\partial W_{(in)}} = \frac{\partial E}{\partial u^t} \left[ \frac{\partial u^t}{\partial W_{(in)}} \right]^T = \delta^t [x^t]^T $

np.dot(X.T, delta[:,t].reshape(1,-1))

$ \frac{\partial E}{\partial W_{(out)}} = \frac{\partial E}{\partial v^t} \left[ \frac{\partial v^t}{\partial W_{(out)}} \right]^T = \delta^{out,t} [z^t]^T $

np.dot(z[:,t+1].reshape(-1,1), delta_out[:,t].reshape(-1,1))

$ \frac{\partial E}{\partial W} = \frac{\partial E}{\partial u^t} \left[ \frac{\partial u^t}{\partial W} \right]^T = \delta^t [z^{t-1}]^T $

np.dot(z[:,t].reshape(-1,1), delta[:,t].reshape(1,-1))

$ \frac{\partial E}{\partial u^t} = \frac{\partial E}{\partial v^t} \frac{\partial v^t}{\partial u^t} = f'(u^t) W_{(out)}^T \delta^{out,t} = \delta^t $

delta[:,t] = (np.dot(delta[:,t+1].T, W.T) + np.dot(delta_out[:,t].T, W_out.T)) * functions.d_sigmoid(u[:,t+1])
$ W_{(in)}^{t+1} = W_{(in)}^t - \epsilon \frac{\partial E}{\partial W_{(in)}} = W_{(in)}^t - \epsilon \sum_{z=0}^{T_t} \delta^{t-z} \left[ x^{t-z} \right]^T $
W_in -= learning_rate * W_in_grad
$ W_{(out)}^{t+1} = W_{(out)}^t - \epsilon \frac{\partial E}{\partial W_{(out)}} = W_{(out)}^t - \epsilon \delta^{out,t} \left[ z^t \right]^T $
W_out -= learning_rate * W_out_grad
$ W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W} = W^t - \epsilon \sum_{z=0}^{T_t} \delta^{t-z} \left[ z^{t-z-1} \right]^T $
W -= learning_rate * W_grad
$$
\begin{aligned}
E^t &= loss(y^t, d^t) \\
&= loss(g(W_{(out)} z^t + c), d^t) \\
&= loss(g(W_{(out)} f(W_{(in)} x^t + W z^{t-1} + b) + c), d^t)
\end{aligned}
$$
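Putting the pieces above together, a minimal numpy sketch of one BPTT training step on a single short sequence. The sizes, the squared-error loss, and the sigmoid output are illustrative assumptions, and the bias gradients are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

in_dim, hid, out_dim, T = 2, 4, 1, 3
learning_rate = 0.1
rng = np.random.default_rng(0)
W_in  = rng.normal(0, 0.1, (in_dim, hid))
W     = rng.normal(0, 0.1, (hid, hid))
W_out = rng.normal(0, 0.1, (hid, out_dim))
b, c  = np.zeros(hid), np.zeros(out_dim)

x = rng.normal(size=(T, in_dim))        # input sequence
d = rng.normal(size=(T, out_dim))       # target sequence

# forward pass: keep u, z, v, y for every step (z has an extra column so z[:,0] is the initial state)
u = np.zeros((hid, T + 1))
z = np.zeros((hid, T + 1))
v = np.zeros((out_dim, T))
y = np.zeros((out_dim, T))
for t in range(T):
    u[:, t + 1] = x[t] @ W_in + z[:, t] @ W + b
    z[:, t + 1] = sigmoid(u[:, t + 1])
    v[:, t] = z[:, t + 1] @ W_out + c
    y[:, t] = sigmoid(v[:, t])

# backward pass: delta^t flows from t = T-1 back to 0, accumulating the three gradients above
delta = np.zeros((hid, T + 1))
W_in_grad, W_grad, W_out_grad = np.zeros_like(W_in), np.zeros_like(W), np.zeros_like(W_out)
for t in reversed(range(T)):
    delta_out_t = (y[:, t] - d[t]) * d_sigmoid(v[:, t])    # delta^{out,t} = dE/dv^t (squared error)
    delta[:, t] = (delta[:, t + 1] @ W.T + delta_out_t @ W_out.T) * d_sigmoid(u[:, t + 1])
    W_out_grad += np.outer(z[:, t + 1], delta_out_t)       # delta^{out,t} [z^t]^T
    W_in_grad  += np.outer(x[t], delta[:, t])              # delta^t [x^t]^T
    W_grad     += np.outer(z[:, t], delta[:, t])           # delta^t [z^{t-1}]^T

# parameter updates, matching the update equations above
W_in  -= learning_rate * W_in_grad
W_out -= learning_rate * W_out_grad
W     -= learning_rate * W_grad
```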
Section2 LSTM
Problems with RNNs: the further back you go in the time series, the more the gradient vanishes. → It is difficult to learn long time series.
Solution: the architecture that solves this problem by changing the network structure itself is the LSTM.
Big picture of LSTM
CEC (Constant Error Carousel)
Vanishing and exploding gradients can both be solved by keeping the gradient at exactly 1.
Problem: the input data is weighted uniformly regardless of its time dependence. → The learning characteristic of a neural network is lost in the first place.
Solution: input gate and output gate.
By adding an input gate and an output gate, the weight given to the input of each gate can be adjusted through the weight matrices W and U. → This solves the CEC's problem.
Input gate
Output gate
Current state: all past information is stored in the CEC.
Problem: past information cannot be deleted once it is no longer needed, so the output is always dragged along by past information.
Solution: forget gate
A gate that deletes information in the CEC that is no longer needed.
Problem: we want to be able to propagate the past information stored in the CEC to the other nodes at any time, or to forget it at any time. However, the value of the CEC itself does not affect the gate control.
Solution: peephole connection
A structure that allows the value of the CEC itself to be propagated to each gate via a weight matrix.
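A minimal numpy sketch of one LSTM step with the CEC (cell state `c`), the input/forget/output gates, and peephole connections `p_i`, `p_f`, `p_o` that let the gates see the value of the CEC itself. All names and sizes are illustrative assumptions (biases omitted for brevity), not the course's reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hid, in_dim = 4, 3                     # illustrative sizes
rng = np.random.default_rng(0)
# W* act on the input x, U* on the previous hidden state h (the weight matrices W and U above)
W_i, W_f, W_o, W_g = (rng.normal(0, 0.1, (in_dim, hid)) for _ in range(4))
U_i, U_f, U_o, U_g = (rng.normal(0, 0.1, (hid, hid)) for _ in range(4))
p_i, p_f, p_o = (rng.normal(0, 0.1, hid) for _ in range(3))   # peephole weights (element-wise)

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(x @ W_i + h_prev @ U_i + p_i * c_prev)   # input gate
    f = sigmoid(x @ W_f + h_prev @ U_f + p_f * c_prev)   # forget gate: erases stale CEC content
    g = np.tanh(x @ W_g + h_prev @ U_g)                  # candidate cell input
    c = f * c_prev + i * g                               # CEC: carries the error with gradient 1
    o = sigmoid(x @ W_o + h_prev @ U_o + p_o * c)        # output gate (peephole sees the new CEC)
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

h, c = np.zeros(hid), np.zeros(hid)
for x in rng.normal(size=(5, in_dim)):                   # a 5-step input sequence
    h, c = lstm_step(x, h, c)
```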
Section3 GRU
The LSTM has the problem that it has many parameters and a high computational cost. → GRU
The diagram shows that the GRU reduces the number of parameters compared with the LSTM.
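For comparison, a minimal numpy sketch of one GRU step (names and sizes are illustrative assumptions): the reset gate `r` and update gate `z` replace the LSTM's three gates and separate CEC, which is where the parameter reduction comes from.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hid, in_dim = 4, 3
rng = np.random.default_rng(0)
W_r, W_z, W_h = (rng.normal(0, 0.1, (in_dim, hid)) for _ in range(3))
U_r, U_z, U_h = (rng.normal(0, 0.1, (hid, hid)) for _ in range(3))

def gru_step(x, h_prev):
    r = sigmoid(x @ W_r + h_prev @ U_r)          # reset gate: how much past state to use
    z = sigmoid(x @ W_z + h_prev @ U_z)          # update gate: blend of old and new state
    h_tilde = np.tanh(x @ W_h + (r * h_prev) @ U_h)
    return (1 - z) * h_prev + z * h_tilde        # no separate CEC, so fewer parameters

h = np.zeros(hid)
for x in rng.normal(size=(5, in_dim)):
    h = gru_step(x, h)
```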
Section4 Bidirectional RNN
A model that uses not only past information but also future information in the time series data.
Section5 Seq2Seq
A type of Encoder-Decoder model. It is used as a model for machine translation and dialogue.
Encoder RNN
A structure that splits the text data entered by the user into tokens such as words and passes them to the network.
- Tokenizing: decompose the sentence into tokens such as words by morphological analysis and convert each token into an ID.
- Embedding: convert each ID into a distributed-representation vector.
- Encoder RNN: feed the vectors into the RNN in order. Input vec1 into the RNN and output the hidden state; feed this hidden state together with the next input vec2 into the RNN, and so on. The hidden state obtained when the last vector has been fed in is taken as the final state. This final state is a vector that represents the meaning of the input.
Decoder RNN
A structure in which the system generates the output data token by token, e.g. word by word.
1. Decoder RNN: output the generation probability of each token from the final state (thought vector) of the Encoder RNN. The final state is set as the initial state of the Decoder RNN, and an Embedding is given as input.
2. Sampling: randomly select a token based on the generation probabilities.
3. Embedding: embed the selected token and use it as the next input to the Decoder RNN.
4. Detokenize: repeat steps 1-3 and convert the tokens obtained in step 2 into a character string.
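A minimal numpy sketch of the Encoder RNN / Decoder RNN flow described above. The toy vocabulary of IDs, the single-layer tanh RNN, the untrained random weights, and the decoder reusing the encoder's weights are all simplifying assumptions to keep the sketch short:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab, emb, hid = 10, 8, 16                     # illustrative vocabulary / embedding / hidden sizes
E = rng.normal(0, 0.1, (vocab, emb))            # Embedding matrix (token ID -> distributed vector)
W_in = rng.normal(0, 0.1, (emb, hid))
W = rng.normal(0, 0.1, (hid, hid))
W_out = rng.normal(0, 0.1, (hid, vocab))        # hidden state -> token generation probabilities

def encode(token_ids):
    h = np.zeros(hid)
    for tid in token_ids:                       # feed vec1, vec2, ... into the RNN in order
        h = np.tanh(E[tid] @ W_in + h @ W)
    return h                                    # final state ("thought vector")

def decode(h, bos_id=0, max_len=5):
    out, tid = [], bos_id
    for _ in range(max_len):
        h = np.tanh(E[tid] @ W_in + h @ W)      # embed the previous token, update the state
        probs = softmax(h @ W_out)              # 1. generation probability of each token
        tid = int(rng.choice(vocab, p=probs))   # 2. sampling
        out.append(tid)                         # 3. the sampled token becomes the next input
    return out                                  # 4. detokenize: map the IDs back to strings

print(decode(encode([3, 1, 4, 1, 5])))          # decode a toy token-ID sequence
```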
HRED
seq2seq can only answer one question at a time → there is no context from the previous exchanges, only one response after another. The solution → HRED
What is HRED: it generates the next utterance from the past n-1 utterances, giving more human-like responses. HRED = Seq2Seq + Context RNN (Context RNN: a structure that encodes the conversation context so far into a vector).
Problems with HRED: there is no diversity in the flow of the conversation, and it tends to learn short, common answers.
VHRED
A solution to the HRED problems, obtained by adding the concept of the VAE's latent variable to HRED.
Autoencoder
A form of unsupervised learning, so no labeled training data is used. The network that converts the input data into the latent variable z is the encoder; the decoder is a neural network that restores the original image using the latent variable z as its input.
Merit: dimensionality reduction can be performed.
VAE
With a normal autoencoder, the data is packed into the latent variable z, but the structure of that latent space is unknown.
With a VAE, a probability distribution z ~ N(0, 1) is assumed for the latent variable z. This makes it possible to pack the data into a known structure, namely the probability distribution of the latent variable z.
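A minimal numpy sketch of this idea, assuming linear encoder/decoder layers and the usual reparameterization trick (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, z_dim = 6, 2
W_mu = rng.normal(0, 0.1, (in_dim, z_dim))       # encoder head for the mean
W_logvar = rng.normal(0, 0.1, (in_dim, z_dim))   # encoder head for the log-variance
W_dec = rng.normal(0, 0.1, (z_dim, in_dim))      # decoder

x = rng.normal(size=in_dim)
mu, logvar = x @ W_mu, x @ W_logvar              # parameters of the latent distribution
eps = rng.standard_normal(z_dim)                 # noise drawn from N(0, 1)
z = mu + np.exp(0.5 * logvar) * eps              # reparameterization trick: z ~ N(mu, sigma^2)
x_hat = z @ W_dec                                # decoder reconstructs the input from z

recon = np.mean((x - x_hat) ** 2)                            # reconstruction error
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))      # pushes z toward N(0, 1)
loss = recon + kl
```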
Section6 Word2Vec
A variable-length string such as a word cannot be fed into a NN as-is → a NN can only be trained on fixed-length inputs.
Word2vec
A method for converting variable-length character strings into a fixed-length format. A vocabulary is created from the training data, and the input data is encoded as a one-hot vector whose dimension is the vocabulary size → distributed representation.
Learning the distributed representation of large-scale data has become feasible with realistic calculation speed and memory capacity.
Until then, a weight matrix of size "vocabulary × vocabulary" had to be created. With Word2vec, the weight matrix is "vocabulary × (arbitrary word-vector dimension)".
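A minimal numpy sketch of this point: multiplying a one-hot vector (whose dimension is the vocabulary size) by a "vocabulary × embedding dimension" weight matrix simply selects one row, which is the word's distributed representation (the toy vocabulary and sizes are illustrative):

```python
import numpy as np

vocab = ["i", "like", "deep", "learning"]        # vocabulary created from the training data
emb_dim = 3                                      # arbitrary word-vector dimension (<< vocabulary size)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (len(vocab), emb_dim))    # "vocabulary x embedding dim" weight matrix

one_hot = np.zeros(len(vocab))
one_hot[vocab.index("deep")] = 1.0               # one-hot input with the vocabulary as its dimension

# the matrix product is just a row lookup: the distributed representation of "deep"
assert np.allclose(one_hot @ W, W[vocab.index("deep")])
print(one_hot @ W)
```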
Section7 Attention Mechanism
Problem: it is difficult to handle long sentences, because an input of any length must be packed into an internal representation of fixed size. A mechanism is needed in which the dimension of the internal representation grows as the sentence gets longer.
Learn "Which words in the input and output are related" →Attention Mechanism
3×3
The weight applied from the middle layer at one time step to the middle layer at the next time step.
$ z=t^2$
$ dz/dt = 2t$
Using $x$, $s_0$, $s_1$, $W_{(in)}$, $W$, $W_{(out)}$:
$ z_1 = sigmoid(s_0 W + x_1 W_{(in)} + b) $
$ y_1 = sigmoid(z_1 W_{(out)} + c) $
(1)0.15 (2)0.25 (3)0.35 (4)0.45
(2)
"The movie was interesting. By the way, I'm so hungry that something \ _ \ _."
Forget gate
LSTM
The number of parameters is large and the calculation load is high.
CEC
Because the gradient continues to be passed on as exactly 1, the weights lose their meaning and the very concept of learning is compromised.
LSTMs have more parameters than GRUs
(1) RNNs are constructed in both the forward and reverse directions with respect to time, and the two intermediate-layer representations are used together as features.
(2) A type of Encoder-Decoder model that uses RNNs; it is used for models such as machine translation.
(3) A neural network that recursively builds representation vectors from adjacent words (phrases) over a tree structure such as a syntax tree (with shared weights) to obtain a representation vector for the whole sentence.
(4) A type of RNN that solves the vanishing-gradient problem of the simple RNN by introducing the concepts of the CEC and gates.
(2)
(1) Bidirectional RNN (2) seq2seq (3) RNN (4) LSTM
seq2seq
Can only answer one question at a time.
HRED
Can answer in context
VHRED
Can answer according to the context, and with more varied responses.
Introducing ____ into the latent variable of the autoencoder.
Probability distribution
Compared with an RNN, word2vec can be computed with a realistic amount of resources, because it only needs a weight matrix of size "vocabulary × word-vector dimension" instead of "vocabulary × vocabulary".
seq2seq can only learn from a fixed-length representation; seq2seq + Attention can also translate long sentences.
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/3Day/3_1_simple_RNN.ipynb
simple_RNN
- Accuracy improves when the number of nodes in the middle layer is around 64.
- Increasing the learning rate causes a gradient explosion.
- Changing the sigmoid function to ReLU causes a gradient explosion, and learning does not proceed at all.
- Learning also does not proceed when changing to tanh.
- Increasing the number of hidden layers improves accuracy, but the error is large at the beginning of learning.
- Increasing the initial weights makes the error small at the beginning of learning, but the gradient then explodes.
- Each existing problem leads to new models being created to solve it, and the accuracy of learning keeps improving.