A neural network that can handle **time series data**.
Time series data: a series of data points observed at regular intervals in chronological order, with statistical dependencies between them.
- Voice data
- Text data
- Stock price data
In forward propagation, the output of the intermediate layer at the previous time step is multiplied by a weight and offset by a bias, and the result is used as part of the input to the intermediate layer at the current time step. To handle a time series, the model needs a recursive structure that holds the initial state and the state at the past time t-1 and uses them to compute the state at the next time t.
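A minimal numpy sketch of this recurrence, using the notation introduced below ($W_{(in)}$, $W$, $W_{(out)}$, $b$, $c$); the sizes and random weights are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# illustrative sizes: 2 input features, 16 hidden units, 1 output
rng = np.random.default_rng(0)
W_in  = rng.normal(0, 0.1, (2, 16))    # input -> middle layer
W     = rng.normal(0, 0.1, (16, 16))   # middle layer at t-1 -> middle layer at t
W_out = rng.normal(0, 0.1, (16, 1))    # middle layer -> output
b, c = np.zeros(16), np.zeros(1)

x = rng.normal(size=(5, 2))            # a sequence of 5 time steps
z = np.zeros(16)                       # initial state
for t in range(len(x)):
    u = x[t] @ W_in + z @ W + b        # previous state enters the current input
    z = sigmoid(u)                     # state z^t, carried over to the next step
    y = sigmoid(z @ W_out + c)         # output y^t at time t
```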
BPTT
Backpropagation Through Time: a form of the backpropagation method used to adjust the parameters of an RNN, in which the error is propagated backwards along the time axis as well. Because the amount of computation grows with the length of the sequence, it is not well suited to long time series.
$ \frac{\partial E}{\partial W_{(in)}} = \frac{\partial E}{\partial u^t} \left[ \frac{\partial u^t}{\partial W_{(in)}} \right]^T = \delta^t [x^t]^T $

np.dot(X.T, delta[:,t].reshape(1,-1))

$ \frac{\partial E}{\partial W_{(out)}} = \frac{\partial E}{\partial v^t} \left[ \frac{\partial v^t}{\partial W_{(out)}} \right]^T = \delta^{out,t} [z^t]^T $

np.dot(z[:,t+1].reshape(-1,1), delta_out[:,t].reshape(-1,1))

$ \frac{\partial E}{\partial W} = \frac{\partial E}{\partial u^t} \left[ \frac{\partial u^t}{\partial W} \right]^T = \delta^t [z^{t-1}]^T $

np.dot(z[:,t].reshape(-1,1), delta[:,t].reshape(1,-1))

$ \frac{\partial E}{\partial u^t} = \frac{\partial E}{\partial v^t} \frac{\partial v^t}{\partial u^t} = f'(u^t) W_{(out)}^T \delta^{out,t} = \delta^t $

delta[:,t] = (np.dot(delta[:,t+1].T, W.T) + np.dot(delta_out[:,t].T, W_out.T)) * functions.d_sigmoid(u[:,t+1])
$ W_{(in)}^{t+1} = W_{(in)}^t - \epsilon \frac{\partial E}{\partial W_{(in)}} = W_{(in)}^t - \epsilon \sum_{z=0}^{T_t} \delta^{t-z} \left[ x^{t-z} \right]^T $
W_in -= learning_rate * W_in_grad
$ W_{(out)}^{t+1} = W_{(out)}^t - \epsilon \frac{\partial E}{\partial W_{(out)}} = W_{(out)}^t - \epsilon \delta^{out,t} \left[ z^t \right]^T $
W_out -= learning_rate * W_out_grad
$ W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W} = W^t - \epsilon \sum_{z=0}^{T_t} \delta^{t-z} \left[ z^{t-z-1} \right]^T $
W -= learning_rate * W_grad
$$
\begin{aligned}
E^t &= loss(y^t, d^t) \\
&= loss(g(W_{(out)} z^t + c), d^t) \\
&= loss(g(W_{(out)} f(W_{(in)} x^t + W z^{t-1} + b) + c), d^t)
\end{aligned}
$$
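Putting the pieces above together, a minimal numpy sketch of one BPTT training step on a single short sequence. The sizes, the squared-error loss, and the sigmoid output are illustrative assumptions, and the bias gradients are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

in_dim, hid, out_dim, T = 2, 4, 1, 3
learning_rate = 0.1
rng = np.random.default_rng(0)
W_in  = rng.normal(0, 0.1, (in_dim, hid))
W     = rng.normal(0, 0.1, (hid, hid))
W_out = rng.normal(0, 0.1, (hid, out_dim))
b, c  = np.zeros(hid), np.zeros(out_dim)

x = rng.normal(size=(T, in_dim))        # input sequence
d = rng.normal(size=(T, out_dim))       # target sequence

# forward pass: keep u, z, v, y for every step (z has an extra column so z[:,0] is the initial state)
u = np.zeros((hid, T + 1))
z = np.zeros((hid, T + 1))
v = np.zeros((out_dim, T))
y = np.zeros((out_dim, T))
for t in range(T):
    u[:, t + 1] = x[t] @ W_in + z[:, t] @ W + b
    z[:, t + 1] = sigmoid(u[:, t + 1])
    v[:, t] = z[:, t + 1] @ W_out + c
    y[:, t] = sigmoid(v[:, t])

# backward pass: delta^t flows from t = T-1 back to 0, accumulating the three gradients above
delta = np.zeros((hid, T + 1))
W_in_grad, W_grad, W_out_grad = np.zeros_like(W_in), np.zeros_like(W), np.zeros_like(W_out)
for t in reversed(range(T)):
    delta_out_t = (y[:, t] - d[t]) * d_sigmoid(v[:, t])    # delta^{out,t} = dE/dv^t (squared error)
    delta[:, t] = (delta[:, t + 1] @ W.T + delta_out_t @ W_out.T) * d_sigmoid(u[:, t + 1])
    W_out_grad += np.outer(z[:, t + 1], delta_out_t)       # delta^{out,t} [z^t]^T
    W_in_grad  += np.outer(x[t], delta[:, t])              # delta^t [x^t]^T
    W_grad     += np.outer(z[:, t], delta[:, t])           # delta^t [z^{t-1}]^T

# parameter updates, matching the update equations above
W_in  -= learning_rate * W_in_grad
W_out -= learning_rate * W_out_grad
W     -= learning_rate * W_grad
```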
Section2 LSTM
Problems with RNNs: the further back you go in the time series, the more the gradient vanishes. → It is difficult to learn long time series.
Solution: the architecture that solves this problem by changing the network structure itself is the LSTM.
Big picture of LSTM
CEC (Constant Error Carousel)
Vanishing and exploding gradients can both be solved by keeping the gradient at exactly 1.
Problem: the input data is weighted uniformly regardless of its time dependence. → The learning characteristic of a neural network is lost in the first place.
Solution: input gate and output gate.
By adding an input gate and an output gate, the weight given to the input of each gate can be adjusted through the weight matrices W and U. → This solves the CEC's problem.
Input gate
Output gate
Current state: all past information is stored in the CEC.
Problem: past information cannot be deleted once it is no longer needed, so the output is always dragged along by past information.
Solution: forget gate
A gate that deletes information in the CEC that is no longer needed.
Problem: we want to be able to propagate the past information stored in the CEC to the other nodes at any time, or to forget it at any time. However, the value of the CEC itself does not affect the gate control.
Solution: peephole connection
A structure that allows the value of the CEC itself to be propagated to each gate via a weight matrix.
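A minimal numpy sketch of one LSTM step with the CEC (cell state `c`), the input/forget/output gates, and peephole connections `p_i`, `p_f`, `p_o` that let the gates see the value of the CEC itself. All names and sizes are illustrative assumptions (biases omitted for brevity), not the course's reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hid, in_dim = 4, 3                     # illustrative sizes
rng = np.random.default_rng(0)
# W* act on the input x, U* on the previous hidden state h (the weight matrices W and U above)
W_i, W_f, W_o, W_g = (rng.normal(0, 0.1, (in_dim, hid)) for _ in range(4))
U_i, U_f, U_o, U_g = (rng.normal(0, 0.1, (hid, hid)) for _ in range(4))
p_i, p_f, p_o = (rng.normal(0, 0.1, hid) for _ in range(3))   # peephole weights (element-wise)

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(x @ W_i + h_prev @ U_i + p_i * c_prev)   # input gate
    f = sigmoid(x @ W_f + h_prev @ U_f + p_f * c_prev)   # forget gate: erases stale CEC content
    g = np.tanh(x @ W_g + h_prev @ U_g)                  # candidate cell input
    c = f * c_prev + i * g                               # CEC: carries the error with gradient 1
    o = sigmoid(x @ W_o + h_prev @ U_o + p_o * c)        # output gate (peephole sees the new CEC)
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

h, c = np.zeros(hid), np.zeros(hid)
for x in rng.normal(size=(5, in_dim)):                   # a 5-step input sequence
    h, c = lstm_step(x, h, c)
```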
Section3 GRU
The LSTM has the problem that it has many parameters and a high computational cost. → GRU
The diagram shows that the GRU reduces the number of parameters compared with the LSTM.
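For comparison, a minimal numpy sketch of one GRU step (names and sizes are illustrative assumptions): the reset gate `r` and update gate `z` replace the LSTM's three gates and separate CEC, which is where the parameter reduction comes from.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hid, in_dim = 4, 3
rng = np.random.default_rng(0)
W_r, W_z, W_h = (rng.normal(0, 0.1, (in_dim, hid)) for _ in range(3))
U_r, U_z, U_h = (rng.normal(0, 0.1, (hid, hid)) for _ in range(3))

def gru_step(x, h_prev):
    r = sigmoid(x @ W_r + h_prev @ U_r)          # reset gate: how much past state to use
    z = sigmoid(x @ W_z + h_prev @ U_z)          # update gate: blend of old and new state
    h_tilde = np.tanh(x @ W_h + (r * h_prev) @ U_h)
    return (1 - z) * h_prev + z * h_tilde        # no separate CEC, so fewer parameters

h = np.zeros(hid)
for x in rng.normal(size=(5, in_dim)):
    h = gru_step(x, h)
```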
Section4 Bidirectional RNN
A model that uses not only past information but also future information in the time series data.
Section5 Seq2Seq
A type of Encoder-Decoder model. It is used as a model for machine translation and dialogue.
Encoder RNN
A structure that splits the text data entered by the user into tokens such as words and passes them to the network.
- Tokenizing: decompose the sentence into tokens such as words by morphological analysis and convert each token into an ID.
- Embedding: convert each ID into a distributed-representation vector.
- Encoder RNN: feed the vectors into the RNN in order. Input vec1 into the RNN and output the hidden state; feed this hidden state together with the next input vec2 into the RNN, and so on. The hidden state obtained when the last vector has been fed in is taken as the final state. This final state is a vector that represents the meaning of the input.
Decoder RNN
A structure in which the system generates the output data token by token, e.g. word by word.
1. Decoder RNN: output the generation probability of each token from the final state (thought vector) of the Encoder RNN. The final state is set as the initial state of the Decoder RNN, and an Embedding is given as input.
2. Sampling: randomly select a token based on the generation probabilities.
3. Embedding: embed the selected token and use it as the next input to the Decoder RNN.
4. Detokenize: repeat steps 1-3 and convert the tokens obtained in step 2 into a character string.
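A minimal numpy sketch of the Encoder RNN / Decoder RNN flow described above. The toy vocabulary of IDs, the single-layer tanh RNN, the untrained random weights, and the decoder reusing the encoder's weights are all simplifying assumptions to keep the sketch short:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab, emb, hid = 10, 8, 16                     # illustrative vocabulary / embedding / hidden sizes
E = rng.normal(0, 0.1, (vocab, emb))            # Embedding matrix (token ID -> distributed vector)
W_in = rng.normal(0, 0.1, (emb, hid))
W = rng.normal(0, 0.1, (hid, hid))
W_out = rng.normal(0, 0.1, (hid, vocab))        # hidden state -> token generation probabilities

def encode(token_ids):
    h = np.zeros(hid)
    for tid in token_ids:                       # feed vec1, vec2, ... into the RNN in order
        h = np.tanh(E[tid] @ W_in + h @ W)
    return h                                    # final state ("thought vector")

def decode(h, bos_id=0, max_len=5):
    out, tid = [], bos_id
    for _ in range(max_len):
        h = np.tanh(E[tid] @ W_in + h @ W)      # embed the previous token, update the state
        probs = softmax(h @ W_out)              # 1. generation probability of each token
        tid = int(rng.choice(vocab, p=probs))   # 2. sampling
        out.append(tid)                         # 3. the sampled token becomes the next input
    return out                                  # 4. detokenize: map the IDs back to strings

print(decode(encode([3, 1, 4, 1, 5])))          # decode a toy token-ID sequence
```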
HRED
seq2seq can only answer one question at a time → there is no context from the previous exchanges, only one response after another. The solution → HRED
What is HRED: it generates the next utterance from the past n-1 utterances, giving more human-like responses. HRED = Seq2Seq + Context RNN (Context RNN: a structure that encodes the conversation context so far into a vector).
Problems with HRED: there is no diversity in the flow of the conversation, and it tends to learn short, common answers.
VHRED
A solution to the HRED problems, obtained by adding the concept of the VAE's latent variable to HRED.
Autoencoder
A form of unsupervised learning, so no labeled training data is used. The network that converts the input data into the latent variable z is the encoder; the decoder is a neural network that restores the original image using the latent variable z as its input.
Merit: dimensionality reduction can be performed.
VAE
With a normal autoencoder, the data is packed into the latent variable z, but the structure of that latent space is unknown.
With a VAE, a probability distribution z ~ N(0, 1) is assumed for the latent variable z. This makes it possible to pack the data into a known structure, namely the probability distribution of the latent variable z.
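A minimal numpy sketch of this idea, assuming linear encoder/decoder layers and the usual reparameterization trick (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, z_dim = 6, 2
W_mu = rng.normal(0, 0.1, (in_dim, z_dim))       # encoder head for the mean
W_logvar = rng.normal(0, 0.1, (in_dim, z_dim))   # encoder head for the log-variance
W_dec = rng.normal(0, 0.1, (z_dim, in_dim))      # decoder

x = rng.normal(size=in_dim)
mu, logvar = x @ W_mu, x @ W_logvar              # parameters of the latent distribution
eps = rng.standard_normal(z_dim)                 # noise drawn from N(0, 1)
z = mu + np.exp(0.5 * logvar) * eps              # reparameterization trick: z ~ N(mu, sigma^2)
x_hat = z @ W_dec                                # decoder reconstructs the input from z

recon = np.mean((x - x_hat) ** 2)                            # reconstruction error
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))      # pushes z toward N(0, 1)
loss = recon + kl
```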
Section6 Word2Vec
A variable-length string such as a word cannot be fed into a NN as-is → a NN can only be trained on fixed-length inputs.
Word2vec
A method for converting variable-length character strings into a fixed-length format. A vocabulary is created from the training data, and the input data is encoded as a one-hot vector whose dimension is the vocabulary size → distributed representation.
Learning the distributed representation of large-scale data has become feasible with realistic calculation speed and memory capacity.
Until then, a weight matrix of size "vocabulary × vocabulary" had to be created. With Word2vec, the weight matrix is "vocabulary × (arbitrary word-vector dimension)".
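A minimal numpy sketch of this point: multiplying a one-hot vector (whose dimension is the vocabulary size) by a "vocabulary × embedding dimension" weight matrix simply selects one row, which is the word's distributed representation (the toy vocabulary and sizes are illustrative):

```python
import numpy as np

vocab = ["i", "like", "deep", "learning"]        # vocabulary created from the training data
emb_dim = 3                                      # arbitrary word-vector dimension (<< vocabulary size)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (len(vocab), emb_dim))    # "vocabulary x embedding dim" weight matrix

one_hot = np.zeros(len(vocab))
one_hot[vocab.index("deep")] = 1.0               # one-hot input with the vocabulary as its dimension

# the matrix product is just a row lookup: the distributed representation of "deep"
assert np.allclose(one_hot @ W, W[vocab.index("deep")])
print(one_hot @ W)
```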
Section7 Attention Mechanism
Problem: it is difficult to handle long sentences, because an input of any length must be packed into an internal representation of fixed size. A mechanism is needed in which the dimension of the internal representation grows as the sentence gets longer.
Learn "Which words in the input and output are related" →Attention Mechanism
3×3
The weight applied from the middle layer at one time step to the middle layer at the next time step.
$ z=t^2$
$ dz/dt = 2t$
Using $x$, $s_0$, $s_1$, $W_{(in)}$, $W$, $W_{(out)}$:
$ z_1 = sigmoid(s_0 W + x_1 W_{(in)} + b) $
$ y_1 = sigmoid(z_1 W_{(out)} + c) $
(1)0.15 (2)0.25 (3)0.35 (4)0.45
(2)
"The movie was interesting. By the way, I'm so hungry that something \ _ \ _."
Forget gate
LSTM
The number of parameters is large and the calculation load is high.
CEC
Because the gradient continues to be passed on as exactly 1, the weights lose their meaning and the very concept of learning is compromised.
LSTMs have more parameters than GRUs
(1) RNNs are constructed in both the forward and reverse directions with respect to time, and the two intermediate-layer representations are used together as features.
(2) A type of Encoder-Decoder model that uses RNNs; it is used for models such as machine translation.
(3) A neural network that recursively builds representation vectors from adjacent words (phrases) over a tree structure such as a syntax tree (with shared weights) to obtain a representation vector for the whole sentence.
(4) A type of RNN that solves the vanishing-gradient problem of the simple RNN by introducing the concepts of the CEC and gates.
(2)
(1) Bidirectional RNN (2) seq2seq (3) RNN (4) LSTM
seq2seq
Can only answer one question at a time.
HRED
Can answer in context
VHRED
Can answer according to the context, and with more varied responses.
Introducing ____ into the latent variable of the autoencoder.
Probability distribution
Compared with an RNN, word2vec can be computed with a realistic amount of resources, because it only needs a weight matrix of size "vocabulary × word-vector dimension" instead of "vocabulary × vocabulary".
seq2seq can only learn from a fixed-length representation; seq2seq + Attention can also translate long sentences.
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/3Day/3_1_simple_RNN.ipynb
simple_RNN
- Accuracy improves when the number of nodes in the middle layer is around 64.
- Increasing the learning rate causes a gradient explosion.
- Changing the sigmoid function to ReLU causes a gradient explosion, and learning does not proceed at all.
- Learning also does not proceed when changing to tanh.
- Increasing the number of hidden layers improves accuracy, but the error is large at the beginning of learning.
- Increasing the initial weights makes the error small at the beginning of learning, but the gradient then explodes.
- Each existing problem leads to new models being created to solve it, and the accuracy of learning keeps improving.