Section1 Reinforcement learning

What is reinforcement learning?

A field of machine learning that aims to create agents that choose actions in an environment so as to maximize reward in the long run. → A mechanism that improves the rule for deciding actions, based on the rewards received as a result of those actions.

Agent: the protagonist, the one that acts

The agent acts according to a policy and receives from the environment a reward that depends on the action. Training can be pictured as adjusting the policy so that the reward is maximized.

Application example of reinforcement learning

For marketing

--Agent: software that decides which customers to send campaign emails to, based on their profile and purchase history.
--Action: for each customer, choose between two actions, send or do not send.
--Reward: a negative reward equal to the campaign cost, and a positive reward equal to the sales the campaign is estimated to generate.

Trade-off between exploration and exploitation

Insufficient exploitation and insufficient exploration pull in opposite directions, so there is a trade-off between them. Reinforcement learning has to balance the two.

Insufficient exploration

If you always take only the action that looks best in the historical data, you can never find other actions that are actually better.

Insufficient exploitation

If you always take only unknown actions, you cannot make use of your past experience.
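As a rough illustration of the trade-off, the sketch below runs an epsilon-greedy agent on a three-armed bandit. Epsilon-greedy is not named in these notes; it is brought in here only as one simple way to mix the two behaviors, and the reward probabilities, step count, and the function name `run_bandit` are made up for the example. With `epsilon = 0` the agent only exploits and never leaves the first arm it tried; with a small `epsilon` it occasionally explores and can discover a better arm.

```python
import random


def run_bandit(epsilon, true_means, steps=5000, seed=0):
    """Epsilon-greedy on a Bernoulli bandit. Returns (value estimates, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running average reward per arm
    counts = [0] * n_arms       # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            # exploit: take the arm with the best estimate so far
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


TRUE_MEANS = [0.2, 0.5, 0.8]  # hypothetical reward probabilities
greedy_estimates, greedy_counts = run_bandit(0.0, TRUE_MEANS)
eps_estimates, eps_counts = run_bandit(0.1, TRUE_MEANS)
```

Comparing `greedy_counts` with `eps_counts` shows the difference: the purely exploiting agent stays on arm 0, while the agent that explores a little spreads its early pulls and then concentrates on the arm with the highest reward probability.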

Differences from other kinds of learning

Reinforcement learning differs from supervised and unsupervised learning in its goal.

--Supervised and unsupervised learning: the goal is to find patterns contained in the data and to predict from the data
--Reinforcement learning: the goal is to find a good policy

Value function

There are two types of value functions, the state value function and the action value function.

State value function

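Used when the value is judged from the state of the environment alone. The value is high when the environment is in a good state. No particular action appears as an argument: it is the return expected from state $s$ when the agent keeps following the policy $\pi$.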

V^{\pi}(s)

Action value function

Used when the value is judged from the state of the environment combined with an action. It is the value of the agent taking a certain action $a$ in a certain state $s$.

Q^{\pi}(s,a)

Policy function

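A function that decides which action to take in a given state of the environment. A deterministic policy returns the action itself, as written below; a stochastic policy gives a probability for each action and is written $\pi(a|s)$.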

\pi(s)=a

Policy gradient method

A method that models the policy with parameters $\theta$ and optimizes those parameters directly.

\theta^{(t+1)}=\theta^{(t)}+\epsilon\nabla J(\theta)

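\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q^{\pi}(s,a)]

$t$: time step, $\theta$: weights, $\epsilon$: learning rate, $J$: objective function (the expected return; it plays the role that the error function plays in ordinary NN training, but it is maximized, hence the plus sign in the update)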
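The update rule above can be tried on a very small problem. The following is a minimal sketch, assuming a single-state task with two actions and a softmax policy whose logits are the weights $\theta$; the reward received for the sampled action stands in for $Q^{\pi}(s,a)$ (the REINFORCE estimator). The task, the reward values, and the function names are assumptions for illustration and are not taken from the lecture.

```python
import math
import random


def softmax(theta):
    """Turn logits into action probabilities."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, epsilon=0.1, steps=2000, seed=0):
    """Gradient ascent on J(theta) using sampled actions."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        q = rewards[action]  # sampled return used in place of Q(s, a)
        for i in range(len(theta)):
            # d/d theta_i of log pi(action) for a softmax policy
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += epsilon * grad_log_pi * q  # theta <- theta + eps * grad J
    return softmax(theta)


# Hypothetical task: action 0 pays 0, action 1 pays 1.
final_probs = reinforce_bandit([0.0, 1.0])
```

After training, `final_probs` should put most of its probability on the action with the higher reward, which is what maximizing $J$ means in this setting.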

Section2 AlphaGo

There are two versions, AlphaGo Lee and AlphaGo Zero.

AlphaGo Lee

Uses two CNNs, ValueNet and PolicyNet.

PolicyNet (policy function)

--Input is 19x19 2D data with 48 channels
--Output is the move probability for each of the 19x19 points

ValueNet (value function)

--Input is 19x19 2D data with 49 channels (one more than PolicyNet, for whose turn it is)
--Output is a win/loss estimate in the range -1 to 1
--A Flatten layer is inserted because the output is a single win/loss value rather than a 19x19 map

AlphaGo Lee learning steps

1. Learning RollOutPolicy and PolicyNet with supervised learning
2. Learning PolicyNet with reinforcement learning
3. Learning ValueNet with reinforcement learning

RollOutPolicy

A linear policy function rather than a NN. It is used to obtain move probabilities quickly during the search.

Monte Carlo tree search

Currently the most effective search method for computer Go software

AlphaGo Zero

Differences between AlphaGo Lee and AlphaGo Zero

1. Created only by reinforcement learning, without any supervised learning
2. Heuristic elements removed from the input features; only the stone placement is used
3. PolicyNet and ValueNet integrated into one network
4. Residual Net introduced (described later)
5. RollOut simulation removed from Monte Carlo tree search

PolicyValueNet

PolicyNet and ValueNet are integrated into one network. Since the outputs of the policy function and the value function are both needed, the NN branches partway through into two heads.

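Residual Block

--Creates a shortcut (skip connection) in the network
--The shortcut gives a shorter path through the network, so the effective depth is reduced and the vanishing gradient problem is less likely to occur
--The basic structure is Convolution → Batch Norm → ReLU → Convolution → Batch Norm → Add → ReLU
--Ensemble effect: the shortcuts make the network behave like a combination of networks of different depths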
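A minimal sketch of the shortcut idea, assuming NumPy and using plain matrix products in place of Convolution and Batch Norm so that the example stays short. The function name `residual_block` and the weight shapes are hypothetical. It shows that when the inner layers contribute nothing (all-zero weights), the input still passes through the shortcut unchanged.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Return ReLU(x + F(x)), where F(x) = ReLU(x @ w1) @ w2."""
    h = relu(x @ w1)    # stands in for Convolution -> Batch Norm -> ReLU
    f = h @ w2          # stands in for Convolution -> Batch Norm
    return relu(x + f)  # Add (the shortcut) -> ReLU


x = np.array([[1.0, 2.0, 3.0]])  # one sample with non-negative features
zero_w = np.zeros((3, 3))
passthrough = residual_block(x, zero_w, zero_w)
```

Because the block only has to learn the difference from its input, a layer that is not useful can fall back to the identity instead of degrading the signal.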

PreActivation

The Residual Block is rearranged as Batch Norm → ReLU → Convolution → Batch Norm → ReLU → Convolution → Add to improve performance.

WideResNet

A ResNet whose Convolution layers have k times as many filters. By increasing the number of filters, even a shallow network performs as well as or better than a deep one.

PyramidNet

A ResNet in which the number of filters increases at every layer.

Section3 Lightweight and high-speed technology

--How to train a model fast
--How to run a model on a computer that is not high-performance

Distributed deep learning

--Deep learning uses a lot of data and spends a lot of time on parameter adjustment, so high-speed calculation is required.
--Training is made efficient by building the neural network in parallel on multiple computational resources (workers).
--Data parallelization, model parallelization, and high-speed GPU technology are indispensable.

Data parallel

--Copy the parent model to each worker (computer, etc.) as a child model
--Split the data and let each worker calculate on its own part

Increasing the number of computers, GPUs, TPUs, etc. distributes the calculation and speeds up learning. Data parallelism is classified as synchronous or asynchronous according to how the parameters of the child models are brought together.

Synchronous type

Synchronous parameter update flow: wait for every worker to finish its calculation, average the gradients once all workers have produced theirs, and update the parameters of the parent model with the average.
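The synchronous flow can be sketched in a few lines. This is a toy example, assuming NumPy, a linear model with mean squared error, and workers simulated by a loop in one process; the names `mse_gradient` and `synchronous_step` are made up. It shows that, when the data is split into equal parts, averaging the workers' gradients gives the same update as computing the gradient on all the data at once.

```python
import numpy as np


def mse_gradient(w, x, y):
    """Gradient of the mean squared error of the linear model x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(x)


def synchronous_step(w, x, y, n_workers, lr):
    """One synchronous update: every worker computes, then gradients are averaged."""
    shards_x = np.array_split(x, n_workers)  # split the data across workers
    shards_y = np.array_split(y, n_workers)
    grads = [mse_gradient(w, sx, sy) for sx, sy in zip(shards_x, shards_y)]
    avg_grad = np.mean(grads, axis=0)        # available only after all workers finish
    return w - lr * avg_grad, avg_grad       # update the parent model


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

new_w, avg_grad = synchronous_step(w, x, y, n_workers=4, lr=0.1)
full_grad = mse_gradient(w, x, y)  # gradient on the undivided data, for comparison
```

In a real system the list comprehension would be replaced by workers running in parallel and the averaging by a collective communication step; the arithmetic stays the same.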

Asynchronous type

Workers do not wait for each other's calculations; each child model is updated on its own. A trained child model is pushed to the parameter server, and a worker starting a new round of learning pops a model from the parameter server and learns from it.

Synchronous / asynchronous comparison

--The asynchronous type is faster because workers do not wait for each other's calculations.
--The asynchronous type cannot always use the parameters of the latest model, so learning tends to be unstable. → Stale Gradient Problem
--Currently the synchronous type is the mainstream because it is often more accurate.

Model parallel

--Split the parent model across the workers and train each part. After training with all the data, restore the parts into one model
--Model parallelization is recommended when the model is large, and data parallelization when the data is large

The more parameters the model has, the larger the speed-up from model parallelization.

Acceleration by GPU

GPGPU (General-purpose computing on GPU)

A general term for using GPUs for purposes other than graphics, their original purpose.

--CPU: a small number of high-performance cores. Good at complicated, sequential processing
--GPU: many cores with relatively low performance. Good at simple parallel processing

Neural network learning consists largely of simple matrix operations, so it can be sped up on a GPU.

GPGPU development environment

--CUDA: a platform for parallel computing on GPUs. Only available on GPUs developed by NVIDIA. Easy to use because it is provided with deep learning in mind
--OpenCL: an open parallel computing platform. Can be used on GPUs from companies other than NVIDIA (Intel, AMD, ARM, etc.). Not specialized in calculations for deep learning

Weight saving

--Quantization
--Distillation
--Pruning

Quantization is often used

Quantization

Larger networks have a large number of parameters and require a lot of memory and arithmetic for learning and inference. → Reduce memory and arithmetic by lowering the precision of the parameters, for example from the usual 64-bit floating point to 32 bits.

--Billions of parameters require a lot of memory just to store the weights
--Lowering the precision of each parameter reduces the amount of information to be stored
--Double-precision arithmetic (64 bit) and single-precision arithmetic (32 bit) differ greatly in throughput, so lowering the precision by quantization allows more calculations in the same time
--16 bit is a safe choice
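The memory effect can be checked directly. The sketch below, assuming NumPy, stores the same 1,000 made-up weight values at 64, 32, and 16 bits and measures both the storage size and the rounding error that the lower precision introduces. It only illustrates the trade-off between precision and memory; it is not a complete quantization scheme.

```python
import numpy as np

# Hypothetical weights, evenly spread over [-1, 1].
weights64 = np.linspace(-1.0, 1.0, 1000, dtype=np.float64)
weights32 = weights64.astype(np.float32)
weights16 = weights64.astype(np.float16)

# Storage needed for the same number of parameters at each precision.
bytes_per_dtype = {
    arr.dtype.name: arr.nbytes for arr in (weights64, weights32, weights16)
}

# Largest rounding error introduced by the lower precision.
max_err32 = float(np.max(np.abs(weights64 - weights32.astype(np.float64))))
max_err16 = float(np.max(np.abs(weights64 - weights16.astype(np.float64))))
```

Each halving of the bit width halves the memory, while the error grows: `max_err16` is larger than `max_err32` but still small relative to the weights themselves, which is the sense in which 16 bit is a safe choice.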

Merit

--Speeds up calculations
--Saves memory

Demerit

Decreased accuracy

Distillation

A highly accurate model tends to have a large number of neurons, so it requires a lot of memory and arithmetic for inference. → Create a lightweight model using the knowledge of the large-scale model

Model simplification

Passing on the knowledge of learned high-precision models to lightweight models

Merit

Distillation makes it possible to create a more accurate lightweight model with less training.

Pruning

As the network grows, the number of parameters becomes large, but not every neuron affects the accuracy of the result. → Make the model lighter and faster by deleting neurons that contribute little to its accuracy.

Number of neurons and accuracy

Decide a threshold on how much a neuron must contribute to accuracy, and remove the neurons that fall below it. A higher threshold removes more neurons, which reduces the number of neurons and also the accuracy.
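A minimal sketch of threshold-based pruning, assuming NumPy. The notes speak of a neuron's contribution to accuracy; here the absolute value of a weight is used as a simple stand-in for that contribution, which is an assumption of this example. The weight values and the name `prune_by_magnitude` are made up.

```python
import numpy as np


def prune_by_magnitude(weights, threshold):
    """Zero out weights whose absolute value is below the threshold.

    Returns the pruned weights and the number of weights that were kept.
    """
    keep = np.abs(weights) >= threshold
    return weights * keep, int(keep.sum())


weights = np.array([0.05, -0.40, 0.90, -0.01, 0.30])  # hypothetical weights
pruned_low, kept_low = prune_by_magnitude(weights, 0.1)
pruned_high, kept_high = prune_by_magnitude(weights, 0.5)
```

Raising the threshold from 0.1 to 0.5 leaves fewer non-zero weights, which is the mechanism behind the trade-off between model size and accuracy described above.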

Section4 Applied Technology

Introducing models that are actually used ※ Height: $H$ Width: $W$ Channel: $C$ Number of filters: $M$

General convolution layer

--Input map (with its number of channels): $H \times W \times C$
--Convolution kernel size: $K \times K \times C$
--Number of output channels: $M$

Computational complexity of the total output: $H \times W \times K \times K \times C \times M$

A general convolution layer requires a lot of calculation.

MobileNet

A lightweight image recognition model. It achieves weight reduction by combining Depthwise Convolution and Pointwise Convolution.

Depthwise Convolution

--Convolution is performed separately for each channel of the input map, and the resulting output maps are stacked
--A normal convolution kernel spans all channels, so compared with that the amount of calculation is greatly reduced
--Because each channel is convolved on its own, the relationship between channels is not considered at all. This is usually solved by using it together with Pointwise Convolution
--The amount of calculation is reduced by a factor of the number of filters ($M$)

Computational complexity of the total output: $H \times W \times C \times K \times K$ (general convolution: $H \times W \times K \times K \times C \times M$)

Pointwise Convolution

--Also known as 1 x 1 conv
--Convolution is performed for each point of the input map, across its channels
--The output map can be given as many channels as there are filters (any number can be specified)
--The amount of calculation is reduced by a factor of $K \times K$

Computational complexity of the total output: $H \times W \times C \times M$ (general convolution: $H \times W \times K \times K \times C \times M$)

Summary

A general convolution is split into a Depthwise Convolution followed by a Pointwise Convolution applied to its output, which reduces the amount of calculation.
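To get a feel for the size of the reduction, the cost formulas above can be evaluated for concrete numbers. The layer sizes below ($H = W = 56$, $K = 3$, $C = 64$, $M = 128$) are arbitrary values chosen for illustration, and the function names are made up.

```python
def general_conv_cost(h, w, k, c, m):
    """Multiply-accumulate count of a general convolution layer."""
    return h * w * k * k * c * m


def depthwise_cost(h, w, k, c):
    """Cost of convolving each input channel separately."""
    return h * w * c * k * k


def pointwise_cost(h, w, c, m):
    """Cost of a 1 x 1 convolution that mixes channels."""
    return h * w * c * m


H, W, K, C, M = 56, 56, 3, 64, 128  # hypothetical layer sizes
general = general_conv_cost(H, W, K, C, M)
separable = depthwise_cost(H, W, K, C) + pointwise_cost(H, W, C, M)
ratio = separable / general  # algebraically equal to 1/M + 1/(K*K)
```

Dividing the two formulas shows why: the combined cost relative to a general convolution is $\frac{1}{M} + \frac{1}{K \times K}$, so with a 3 x 3 kernel the saving approaches a factor of 9 as the number of filters grows.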

DenseNet

An image recognition network. In NNs there was a problem that learning becomes more difficult as the layers get deeper. CNN architectures such as ResNet addressed the problem by creating a path from earlier layers to later layers via an identity connection. DenseNet, which uses a module called DenseBlock, is one such architecture.

Initial convolution → Dense block → Transition layer → Classification layer

DenseBlock

--The outputs of the preceding layers are concatenated to the output of each layer, so the number of channels gradually increases
--Specifically, each layer applies Batch Normalization → ReLU → convolution
--The input feature map is concatenated to the output calculated in the previous step
--If the input feature map has $l \times k$ channels, the output has $(l + 1) \times k$ channels

The output of layer $l$ is written as

x_l = H_l([x_0,x_1,x_2, \dots ,x_{l-1}])

Since $k$ channels are added each time a layer is passed, $k$ is called the "growth rate" of the network.

Transition Layer

--Changes the channel size in the middle of the CNN (for example, returns it to the size before the DenseBlock)
--To change the size of the feature map and perform downsampling, DenseBlocks are connected by a layer called the Transition Layer

Difference between DenseNet and ResNet

In a DenseBlock, the outputs of all preceding layers are used as input to a later layer. In a Residual Block, only the output of the immediately preceding layer is input to the later layer.

BatchNorm

Normalizes the distribution of the data flowing between layers, in mini-batch units, so that the mean is 0 and the variance is 1. When there are N samples of size $H \times W \times C$, the unit of normalization is the same channel across all N samples. Batch Normalization has the effects of shortening learning time, reducing dependence on initial values, and suppressing overfitting in neural networks.

Problems with Batch Norm

The effect depends on the Batch Size: if the mini-batch cannot be made large enough, the effect is diminished. The mini-batch size that can be used changes with the hardware, which makes the effect difficult to verify experimentally. Under conditions where the Batch Size is small, learning may not converge, and other normalization techniques such as Layer Normalization are often used instead.

Layer Norm

Looks at one of the N samples at a time: all the pixels of its $H \times W \times C$ are the unit of normalization (within one image). Because it does not depend on the mini-batch size, it solves the problem of Batch Norm.

Instance Norm

Normalizes each channel of each sample separately. It contributes to contrast normalization and is used in image style transfer and texture synthesis tasks.
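The three normalizations differ only in which axes the mean and variance are taken over. The sketch below, assuming NumPy and a batch laid out as (N, H, W, C), makes that explicit. It leaves out the learnable scale and shift parameters that real layers usually have, and the helper name `normalize` and the array sizes are made up.

```python
import numpy as np


def normalize(x, axes, eps=1e-5):
    """Shift and scale x to mean 0 and variance 1 over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 5, 5, 3))  # (N, H, W, C)

batch_norm = normalize(x, axes=(0, 1, 2))     # same channel across all N samples
layer_norm = normalize(x, axes=(1, 2, 3))     # all of H x W x C within one sample
instance_norm = normalize(x, axes=(1, 2))     # each channel of each sample
```

Only `batch_norm` takes statistics over axis 0, which is why it alone is affected by the mini-batch size.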

WaveNet

--A speech generation model → handles time series data
--A deep learning model that produces raw speech waveforms
--Pixel CNN applied to audio
--Applies convolution to time series data

Dilated convolution

--Widen the spacing between the inputs being convolved as the layers get deeper
--The range of information covered (the receptive field) can be increased easily

Seq2seq

A system that takes a sequence as input and outputs a sequence. The input sequence is encoded (converted to an internal state), and the internal state is decoded (converted to a sequence).

--Translation (English → Japanese)
--Speech recognition (waveform → text)
--Chatbot (text → text)

Language model

A model that gives the probability of a word sequence. Mathematically, the joint probability can be decomposed into a product of conditional probabilities.

Example) You say good bye → 0.092 (natural) You say good die → 0.0000032 (unnatural)

RNN x language model

The joint probability of the words in a sentence can be decomposed into the conditional probability of each word given the words before it. By learning these conditional probabilities, an RNN can predict the next word at any point in the sentence.
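Written out, this decomposition is the chain rule of probability. The block below is a sketch in standard notation; the symbols $w_t$ (the $t$-th word), $T$ (the sentence length), and $h_{t-1}$ (the RNN hidden state after reading the first $t-1$ words) are introduced here and do not appear in the notes.

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\qquad
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid h_{t-1})
```

The first equation is exact. The second expresses the modeling choice of an RNN language model: the words seen so far are summarized in the hidden state, and the distribution over the next word is computed from that state.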

Implementation

https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/4Day/lecture_chap1_exercise_public.ipynb

Transformer

Does not use an RNN (Attention Is All You Need). Training on 36 million English-French sentence pairs completed in 3.5 days with 8 GPUs (far less computation than other models at the time).

--Add position information to the word vectors
--Calculate Attention with multiple heads
--Fully connected layer that processes each word position independently
--Regularize, gather the dimensions, and pass the result to the decoder
--Mask the decoder input so that future information cannot be seen
--Make predictions from the decoder input and the encoder information

Encoder-Decoder

The Encoder-Decoder model is weak against long sentences.

--The content of the source sentence is expressed with a single vector
--When the sentence becomes long, that vector does not have enough capacity

Attention

--Solves the issue of the Encoder-Decoder model
--Uses the hidden states of the words in the source sentence when selecting each target word
--Distributes a weight to each hidden state so that the weights sum to 1
--Works like a dictionary object: a query is compared with keys to look up values

Source Target Attention

Decides what to attend to by how close the received information (source) is to the information being aimed at (target).

Self Attention

Decides what to attend to using only its own input.

Transformer Encoder

Encodes each word with its context taken into account, by Self Attention.

Position-Wise Feed-Forward Networks

--Determines the output from the output of each Attention layer
--Shapes the output while retaining position information
--A layer that applies a linear transformation

Scaled dot product attention

Calculates attention for all words at once.
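A minimal sketch of scaled dot product attention, assuming NumPy and no masking. Queries, keys, and values are given as matrices with one row per word, so a single matrix product scores every query against every key at once. The array sizes are arbitrary and the function names are made up for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Return (attended values, attention weights) for queries q, keys k, values v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 query words, dimension 8
k = rng.normal(size=(6, 8))  # 6 key words, dimension 8
v = rng.normal(size=(6, 5))  # one value vector per key
output, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` is the distribution of attention of one query over all keys, and `output` is the corresponding weighted sum of the value vectors. The division by $\sqrt{d_k}$ keeps the scores from growing with the vector dimension.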

Multi-Head Attention

--Combines the outputs of 8 Scaled dot product attention heads
--Linearly transforms the combined output
--Each head collects different information (like ensemble learning)

Add

--Learns the difference between input and output
--In the implementation, the input is added to the output
--Reduces training and test errors

Norm (Layer Norm)

Accelerates learning.

Position Encoding

Encodes the position information of each word. Because the model is not an RNN, this is needed to add information about word order to the word sequence.
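One common way to encode positions is the sinusoidal scheme from the original Transformer paper, sketched below with NumPy. The notes do not say which scheme the exercise uses, so treating it as sinusoidal is an assumption; the function name and the sizes are made up, and `d_model` is assumed to be even.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal position encoding of shape (max_len, d_model); d_model must be even."""
    pos = np.arange(max_len)[:, None]        # word positions 0 .. max_len - 1
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions use cosine
    return pe


pe = positional_encoding(max_len=10, d_model=16)
```

The resulting matrix is added to the word vectors row by row, so two identical words at different positions receive different inputs.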

Implementation

https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/4Day/lecture_chap2_exercise_public.ipynb

Consideration

It was found that Transformer learns much faster and is more accurate than Seq2seq.

Object recognition

The input data is an image. Object recognition tasks in the broad sense are classified into four categories.

name	output	position	instance distinction
Classification	Single or multiple class labels for the image	Not interested	Not interested
Object detection	Bounding Box	Interested	Not interested
Semantic segmentation	Single class label for each pixel	Interested	Not interested
Instance segmentation	Single class label for each pixel	Interested	Interested

The tasks become more difficult in the order classification → object detection → semantic segmentation → instance segmentation.

Object detection

Predicts where each object is, what it is, and with how much confidence (Bounding Box).

Dataset

name	class	Train+Val	Box/image
VOC12	20	11,540	2.4
ILSVRC17	200	476,668	1.1
MS COCO18	80	123,287	7.3
OICOD18	500	1,743,042	7.0

If Box/image is small, each image looks like an icon of a single object, which is far from everyday scenes. If Box/image is large, objects partially overlap, which is close to the context of daily life.

Evaluation index

The difference from classification is that the number of BBoxes changes depending on the confidence threshold.

IoU

In object detection, we want to evaluate not only the class label but also the accuracy of the predicted object position.

Area of overlap = TP, Area of union = TP + FP + FN

IoU = \frac{\text{Area of overlap}}{\text{Area of union}} = \frac{TP}{TP + FP + FN}
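For axis-aligned boxes the ratio of overlap to union can be computed directly from the corner coordinates. The sketch below assumes boxes given as `(x_min, y_min, x_max, y_max)`; that representation and the function name `iou` are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 when the boxes do not overlap.
    overlap = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union


partial = iou((0, 0, 2, 2), (1, 1, 3, 3))    # two 2x2 boxes sharing a 1x1 corner
identical = iou((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))
```

The value ranges from 0 for boxes that do not touch to 1 for boxes that coincide, so a threshold such as 0.5 can be used to decide whether a predicted box counts as having found the object.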

Precision/Recall

Set thresholds for Confidence and IoU. Conf. threshold: 0.5, IoU threshold: 0.5

	conf.	pred.	IoU
P1	0.92	person	> 0.5
P2	0.85	car	< 0.5
P3	0.81	car	> 0.5
P4	0.70	dog	> 0.5
P5	0.69	person	> 0.5
P6	0.54	car	< 0.5

--P1: TP because IoU > 0.5 (person detected)
--P2: FP because IoU < 0.5
--P3: TP because IoU > 0.5 (car detected)
--P4: TP because IoU > 0.5 (dog detected)
--P5: IoU > 0.5, but the person has already been detected, so FP
--P6: FP because IoU < 0.5

Precision: \frac{TP}{TP+FP}=\frac{3}{3+3}=0.50

Recall: \frac{TP}{TP+FN}=\frac{3}{3+0}=1.00
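The counting procedure in this example can be sketched as a short function. The IoU values themselves are not given in the notes, so each prediction only carries a flag saying whether its IoU exceeded the threshold; it is also assumed that the image contains exactly one person, one car, and one dog, which is what the recall calculation implies. The data layout and the name `evaluate` are made up.

```python
CONF_THRESHOLD = 0.5

# (name, confidence, predicted class, IoU above the 0.5 threshold?)
predictions = [
    ("P1", 0.92, "person", True),
    ("P2", 0.85, "car", False),
    ("P3", 0.81, "car", True),
    ("P4", 0.70, "dog", True),
    ("P5", 0.69, "person", True),
    ("P6", 0.54, "car", False),
]
ground_truth_classes = ["person", "car", "dog"]  # assumed: one object per class


def evaluate(predictions, ground_truth_classes, conf_threshold):
    """Count TP/FP/FN, visiting predictions from highest to lowest confidence."""
    detected = set()
    tp = fp = 0
    for _, conf, cls, iou_ok in sorted(predictions, key=lambda p: -p[1]):
        if conf < conf_threshold:
            continue  # below the confidence threshold: not output at all
        if iou_ok and cls not in detected:
            detected.add(cls)  # first good match for this object
            tp += 1
        else:
            fp += 1            # poor localization or a duplicate detection
    fn = len(ground_truth_classes) - len(detected)
    return tp, fp, fn, tp / (tp + fp), tp / (tp + fn)


tp, fp, fn, precision, recall = evaluate(
    predictions, ground_truth_classes, CONF_THRESHOLD
)
```

Raising `CONF_THRESHOLD` removes low-confidence predictions from the count, which is why the number of boxes, and with it precision and recall, changes with the threshold.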

Average Precision

Conf. threshold: $\beta$, Precision: $P(\beta)$, Recall: $R(\beta)$, Precision-Recall curve: $P=f(R)$

Average Precision (area under the PR curve): $AP = \int_0^1 P(R)dR$

FPS: Frames per Second

Because of the demands of object detection applications, detection speed matters as well as detection accuracy.

Segmentation

Problem

--Image resolution drops because of convolution and pooling
--The output must be the same size as the input and carry a single class label for each pixel
--So the feature map must be restored to the original size → the barrier of up-sampling

The solution is an up-sampling layer, known by either of the following names.

--Deconvolution
--Transposed convolution

Deconvolution/Transposed

The figure shows how a 3 × 3 feature map is up-sampled to 5 × 5 by a Deconv. layer with kernel size = 3, padding = 1, stride = 2.

--Specify kernel size, padding, and stride as in a normal Conv. layer
--Spread the pixels of the feature map apart by the stride
--Add a margin of (kernel size - 1) - padding around the feature map
--Perform a convolution operation

Note that although it is often called deconvolution, it is not the inverse operation of convolution → naturally, the information lost by pooling is not restored.
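The output size of this kind of layer follows from the procedure: spread the pixels by the stride, add a margin of (kernel size - 1) - padding on each side, then apply an ordinary convolution with stride 1. The sketch below computes the size both by that procedure and by the equivalent closed formula; the function names are made up. For a 3 x 3 input with kernel size 3 and padding 1, a stride of 2 gives 5 x 5, while a stride of 1 leaves the size at 3 x 3.

```python
def transposed_conv_output_size(in_size, kernel, padding, stride):
    """Closed-form output size of a transposed convolution (no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel


def output_size_by_procedure(in_size, kernel, padding, stride):
    """The same size, obtained by following the steps one at a time."""
    spread = (in_size - 1) * stride + 1  # pixels spread apart by the stride
    margin = kernel - 1 - padding        # margin added on each side
    padded = spread + 2 * margin
    return padded - kernel + 1           # ordinary convolution with stride 1


size_stride2 = transposed_conv_output_size(3, kernel=3, padding=1, stride=2)
size_stride1 = transposed_conv_output_size(3, kernel=3, padding=1, stride=1)
size_check = output_size_by_procedure(3, kernel=3, padding=1, stride=2)
```

The two functions agree for any valid setting, which is a quick way to check a layer configuration before building the network.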

Dilated Convolution

A device for expanding the receptive field at the Convolution stage without using pooling.

--Putting gaps between the taps of a 3x3 kernel (rate = 2) expands the receptive field to 5x5
--The computational complexity is equivalent to that of a 3x3 kernel
--Eventually, over successive layers, the receptive field can be expanded to 15x15
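The numbers above can be reproduced with a little arithmetic. In the sketch below, the sequence of rates 1, 2, 4 for three stacked 3x3 layers is an assumption (the notes only state the final 15x15), and the function names are made up.

```python
def effective_kernel_size(kernel, rate):
    """Span covered by one dilated kernel: gaps of (rate - 1) between taps."""
    return kernel + (kernel - 1) * (rate - 1)


def receptive_field(kernel, rates):
    """Receptive field of stacked stride-1 dilated convolutions."""
    field = 1
    for rate in rates:
        field += (kernel - 1) * rate  # each layer widens the field by (kernel - 1) * rate
    return field


single_layer = effective_kernel_size(3, rate=2)  # one 3x3 kernel with rate 2
stacked = receptive_field(3, rates=[1, 2, 4])    # three layers, assumed rates
```

Every layer still multiplies only 3 x 3 weights per output position, so the receptive field grows without increasing the cost per layer.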

Rabbit Challenge 4Day
