Reinforcement learning is a field of machine learning that aims to build agents that choose actions in an environment so as to maximize reward in the long run. → It is a mechanism for improving the policy that decides actions, based on the rewards received as a result of those actions.
Agent: Protagonist
An agent acts according to its policy and receives a corresponding reward from the environment. The policy is trained so as to maximize the reward.

Marketing example. Agent: software that decides, from each customer's profile and purchase history, which customers to send campaign emails to. Action: for each customer, choose between two actions, send or do not send. Reward: a negative reward equal to the campaign cost and a positive reward equal to the sales the campaign is estimated to generate.
There is a trade-off between exploiting existing knowledge and exploring unknown actions, and reinforcement learning has to balance the two.

If you always take only the action that looks best in the historical data, you cannot discover other actions that may be better (insufficient exploration).

If you always take only unknown actions, you cannot make use of past experience (insufficient exploitation).
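One common way to balance the two is an epsilon-greedy rule, sketched below on a toy multi-armed bandit. This is only an illustration: the function name `run_bandit`, the Gaussian reward model, and the arm means are assumptions and do not come from the notes above.

```python
import random


def run_bandit(epsilon, true_means, steps=2000, seed=0):
    """Play a toy multi-armed bandit with an epsilon-greedy agent.

    Hypothetical setup: each arm pays a Gaussian reward around its true mean.
    Returns (average reward per step, estimated value per arm, pulls per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random action
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit: best so far
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # incremental update of the running mean reward of the chosen arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps, estimates, counts


avg_reward, estimates, counts = run_bandit(epsilon=0.1, true_means=[0.2, 0.5, 0.8])
print(avg_reward, counts)
```

With `epsilon=0` the agent only exploits, and with `epsilon=1` it only explores. Values in between trade one for the other.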
Differences between reinforcement learning and supervised / unsupervised learning: the goals differ. In supervised and unsupervised learning, the goal is to find patterns contained in the data and to make predictions from the data. Reinforcement learning aims to find a good policy.
There are two types of value functions, the state value function and the action value function.
State value function: the value is determined by the state of the environment alone. The value is high when the environment is in a good state; the agent's action is not taken into account.
V^{\pi}(s)
Action value function: the value is determined by the combination of the state of the environment and the action, that is, the value of the agent taking a certain action in a certain state.
Q^{\pi}(s,a)
Policy function: a function that gives the probability of taking each action in a given state of the environment.
\pi(s)=a
Policy gradient method: a method that models the policy with parameters and optimizes those parameters directly.
\theta^{(t+1)}=\theta^{(t)}+\epsilon\nabla_{\theta}J(\theta)
\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q^{\pi}(s,a)]
$t$: time step, $\theta$: weights, $\epsilon$: learning rate, $J$: objective function (the expected reward to be maximized, hence the $+$ sign in the update)
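The update rule can be tried out on a very small problem. The sketch below assumes a single state, a softmax policy with one weight per action, and fixed, known action values `q_values`; all of these are simplifying assumptions made for illustration. Because the action values are known, the expectation is computed exactly instead of being estimated from samples.

```python
import math


def softmax(theta):
    """Softmax policy: probability of each action given one weight per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(theta, q_values):
    """J(theta) for the single-state problem: sum over a of pi(a) * Q(a)."""
    return sum(p * q for p, q in zip(softmax(theta), q_values))


def policy_gradient(theta, q_values):
    """Exact E_pi[ grad_theta log pi(a) * Q(a) ] for the softmax policy."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, q_a) in enumerate(zip(pi, q_values)):
        for k in range(len(theta)):
            # d/d theta_k of log pi(a) for softmax: 1[k == a] - pi(k)
            grad_log = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += p_a * grad_log * q_a
    return grad


def train(q_values, epsilon=0.5, steps=200):
    """Gradient ascent: theta <- theta + epsilon * grad J(theta)."""
    theta = [0.0] * len(q_values)
    for _ in range(steps):
        grad = policy_gradient(theta, q_values)
        theta = [t + epsilon * g for t, g in zip(theta, grad)]
    return theta


theta = train([1.0, 0.0])
print(softmax(theta))
```

The probability of the action with the larger value grows with each update, which is the behavior the formula describes. A real agent would replace the known `q_values` with estimates obtained from sampled episodes.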
Section 2: AlphaGo. There are two versions, AlphaGo Lee and AlphaGo Zero.

AlphaGo Lee uses two CNNs, ValueNet and PolicyNet.

PolicyNet: takes 19x19 2D data with 48 channels as input and outputs a move probability for each of the 19x19 points.

ValueNet: takes 19x19 2D data with 49 channels (one additional channel for the turn) as input and outputs a win rate in the range -1 to 1. A Flatten layer is inserted because the output is a single win/lose value.

RollOutPolicy: a linear policy function rather than a NN. It is used to obtain move probabilities quickly during the search.
Monte Carlo tree search: currently regarded as the most effective search method for computer Go software.
AlphaGo Zero
PolicyValueNet
PolicyNet and ValueNet are integrated. Since both the policy function output and the value function output are needed, the NN branches into two heads partway through.

Residual Block: creates a shortcut connection in the network. This reduces the effective depth of the network, so the vanishing gradient problem is less likely to occur. The basic structure is Convolution → Batch Norm → ReLU → Convolution → Batch Norm → Add → ReLU. It also has an ensemble-like effect.

PreActivation: the Residual Block is rearranged as Batch Norm → ReLU → Convolution → Batch Norm → ReLU → Convolution → Add to improve performance.

WideResNet: a ResNet whose Convolution layers have k times as many filters. By increasing the number of filters, a shallower network achieves performance equal to or better than a deep one.

PyramidNet: a ResNet in which the number of filters increases gradually at every layer.
How to train a model fast, and how to run a model on a computer that is not high-performance.
- Deep learning needs a lot of data and a lot of time for parameter tuning, so high-speed computation is required.
- Training is made efficient by building the neural network in parallel across multiple computational resources (workers).
- Data parallelism, model parallelism, and GPU acceleration are indispensable.

Data parallelism:

- Copy the parent model to each worker (computer, etc.) as a child model.
- Split the data and let each worker compute on its own part.

Increase the number of computers, GPUs, TPUs, etc. to distribute the computation and speed up training. Data parallelism is either synchronous or asynchronous, depending on how the parameters of the child models are reconciled.

Synchronous: each worker waits for the others to finish. Once all workers have computed their gradients, the gradients are averaged and the parameters of the parent model are updated.

Asynchronous: workers do not wait for each other; each child model is updated independently. A trained child model is pushed to the parameter server, and when a worker starts a new round of training it pops a model from the parameter server and trains from it.

Comparison: the asynchronous type is faster because workers do not wait for each other. However, it cannot always use the parameters of the latest model, so training tends to be unstable (→ stale gradient problem). The synchronous type is often more accurate, so it is currently the mainstream.
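The synchronous flow can be sketched without any framework. The example below assumes a one-parameter model `y = w * x`, a mean squared error loss, and two equally sized shards standing in for two workers; these choices are illustrative assumptions only.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, learning_rate):
    """One synchronous update: every worker uses the same parameters,
    then the averaged gradient updates the parent model."""
    grads = [gradient(w, shard) for shard in shards]  # one gradient per worker
    avg_grad = sum(grads) / len(grads)                # wait for all, then average
    return w - learning_rate * avg_grad


# toy data generated from y = 3x, split evenly across two hypothetical workers
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]

w = 0.0
for _ in range(100):
    w = synchronous_step(w, shards, learning_rate=0.01)
print(w)
```

With equally sized shards, the averaged gradient equals the gradient computed on the whole data set, so the result matches single-machine training while the per-step work is shared.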
Model parallelism:

- Split the parent model across the workers and train each part. After training on all the data, recombine the parts into one model.
- Model parallelism is recommended when the model is large, and data parallelism when the data is large.

The more parameters the model has, the greater the speed-up from model parallelism.
GPGPU (General-Purpose computing on GPU): a general term for using GPUs for purposes other than graphics, their original purpose.

CPU: a small number of high-performance cores. Good at complicated, sequential processing.

GPU: many cores of relatively low performance. Good at simple parallel processing. Neural network training can be accelerated because it consists largely of simple matrix operations.

CUDA: a platform for parallel computing on GPUs. Only available on GPUs developed by NVIDIA. Easy to use because it is provided with deep learning in mind.

OpenCL: an open parallel computing platform. Can be used on GPUs from companies other than NVIDIA (Intel, AMD, ARM, etc.). Not specialized for deep learning computations.
- Quantization
- Distillation
- Pruning

Quantization is often used.
Larger networks have a large number of parameters and require a lot of memory and computation for training and inference. → Reduce memory use and computation by lowering the precision of the parameters, for example from the usual 64-bit floating point to 32 bits.

With billions of parameters, a lot of memory is needed just to store the weights. Lowering the precision of each parameter reduces the amount of information to be stored. Double-precision (64-bit) and single-precision (32-bit) arithmetic differ greatly in throughput, so lowering the precision by quantization allows more computation in the same time. 16-bit is a safe choice.
Advantages: faster computation and lower memory use.

Disadvantage: reduced accuracy.
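The trade-off between storage size and precision can be seen with the standard library alone. The sketch below round-trips a few made-up weights through 64-bit, 32-bit, and 16-bit floating point using `struct`; the helper name `to_precision` and the weight values are illustrative assumptions.

```python
import struct


def to_precision(value, fmt):
    """Round-trip a Python float through a struct floating-point format.

    fmt: 'd' = 64-bit, 'f' = 32-bit, 'e' = 16-bit (IEEE 754 half precision).
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weights = [0.1234567890123, -1.5, 3.14159265358979]  # made-up example weights

for name, fmt in [("64-bit", "d"), ("32-bit", "f"), ("16-bit", "e")]:
    stored = [to_precision(w, fmt) for w in weights]
    max_error = max(abs(w - s) for w, s in zip(weights, stored))
    print(name, struct.calcsize(fmt), "bytes per weight, max error", max_error)
```

Each halving of the bit width halves the memory per weight, while the rounding error grows. Whether that error is acceptable depends on the model and has to be checked by measuring accuracy.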
A highly accurate model has a large number of neurons, so it requires a lot of memory and computation for inference. → Create a lightweight model using the knowledge of the large model.

Distillation passes on the knowledge of a trained high-accuracy model (the teacher) to a lightweight model (the student).

Distillation makes it possible to build a more accurate model with less training.
As a network grows it has a very large number of parameters, but not all neurons contribute to the accuracy of the output. → Make the model lighter and faster by deleting neurons that contribute little to accuracy.

Set a threshold on the contribution to accuracy and remove the neurons that fall below it. A higher threshold removes more neurons, which makes the model lighter but also lowers accuracy.
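A minimal sketch of threshold-based pruning follows. It assumes that the absolute value of a weight is used as a stand-in for its contribution to accuracy, which is one simple criterion among several; the function names and example weights are hypothetical.

```python
def prune(weights, threshold):
    """Zero out weights whose magnitude is below the threshold.

    Assumption: |weight| approximates the contribution to accuracy.
    """
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def sparsity(weights):
    """Fraction of weights that have been removed (set to zero)."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.5, -0.05, 0.2, -0.8, 0.01]  # made-up example weights
for threshold in (0.1, 0.3):
    pruned = prune(weights, threshold)
    print(threshold, pruned, sparsity(pruned))
```

Raising the threshold removes more weights. In practice the accuracy of the pruned model is measured at each threshold to choose an acceptable trade-off.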
This section introduces models that are actually used. ※ Height: $H$, Width: $W$, Channels: $C$, Number of filters: $M$

- Input map: $H \times W \times C$
- Convolution kernel size: $K \times K \times C$
- Number of output channels: $M$

Computational cost of the whole output: $H \times W \times K \times K \times C \times M$. A general convolution layer requires a lot of computation.
MobileNet: a lightweight image recognition model. It achieves the weight reduction by combining Depthwise Convolution and Pointwise Convolution.

Depthwise Convolution: convolution is performed separately on each channel of the input map, and the resulting output maps are stacked. A normal convolution kernel spans all channels, so the amount of computation is greatly reduced by comparison. Because each channel is convolved independently, relationships between channels are not considered at all; this is usually solved by pairing it with Pointwise Convolution. The amount of computation is reduced by a factor of the number of filters ($M$).

Computational cost of the whole output: $H \times W \times C \times K \times K$ (general convolution: $H \times W \times K \times K \times C \times M$)

Pointwise Convolution: also known as 1 x 1 conv. Convolution is performed at each point of the input map across channels. An output map with as many channels as there are filters can be produced (any number can be specified). The amount of computation is reduced by a factor of $K \times K$.

Computational cost of the whole output: $H \times W \times C \times M$ (general convolution: $H \times W \times K \times K \times C \times M$)

A general convolution is split into a Depthwise Convolution followed by a Pointwise Convolution applied to its output, which reduces the total amount of computation.
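The cost formulas above can be compared numerically. The sketch below simply evaluates them; the layer sizes (a 32 x 32 map, 3 x 3 kernel, 64 input channels, 128 filters) are arbitrary example values.

```python
def standard_conv_cost(h, w, k, c, m):
    """Multiplications for a general convolution: H * W * K * K * C * M."""
    return h * w * k * k * c * m


def depthwise_cost(h, w, k, c):
    """Multiplications for a depthwise convolution: H * W * C * K * K."""
    return h * w * c * k * k


def pointwise_cost(h, w, c, m):
    """Multiplications for a pointwise (1 x 1) convolution: H * W * C * M."""
    return h * w * c * m


def separable_cost(h, w, k, c, m):
    """Depthwise followed by pointwise, as combined in MobileNet."""
    return depthwise_cost(h, w, k, c) + pointwise_cost(h, w, c, m)


# arbitrary example layer
H, W, K, C, M = 32, 32, 3, 64, 128
ratio = separable_cost(H, W, K, C, M) / standard_conv_cost(H, W, K, C, M)
print(standard_conv_cost(H, W, K, C, M), separable_cost(H, W, K, C, M), ratio)
```

Dividing the two expressions shows that the ratio is always $\frac{1}{M} + \frac{1}{K^2}$, so for a 3 x 3 kernel and many filters the combined cost is a little over one ninth of the general convolution.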
DenseNet: an image recognition network. In NNs, training becomes more difficult as the layers get deeper. CNN architectures such as ResNet address this by creating a path from earlier layers to later layers via identity connections. DenseNet, which uses a module called DenseBlock, is one such architecture.

Initial convolution → Dense block → Transition layer → Classification layer

DenseBlock: the inputs from the preceding layers are concatenated with each layer's output, so the number of channels gradually increases. Specifically, each layer applies Batch normalization → ReLU → convolution, and the input feature map is then concatenated with the output computed in the previous step. If the input feature map has $l \times k$ channels, the output has $(l + 1) \times k$ channels. Writing $x_l$ for the output of layer $l$, the layer takes the feature maps of all preceding layers as its input.

Since the number of channels grows by $k$ with each layer, $k$ is called the "growth rate" of the network.

Transition Layer: in a CNN, the channel size is changed in the intermediate layers (for example, returned to the size before the DenseBlock). DenseBlocks are connected by a Transition Layer, which changes the size of the feature map and performs down-sampling.

In a DenseBlock, the outputs of all preceding layers are used as input to a later layer. In a Residual Block, only the input of the immediately preceding layer is passed to the later layer.
BatchNorm: normalizes the distribution of the data flowing between layers so that the mean is 0 and the variance is 1 within each mini-batch. Given N samples of size $H \times W \times C$, the **same channel across the N samples** is the unit of normalization. Batch Normalization shortens training time, reduces dependence on initial values, and suppresses overfitting.

Its effect diminishes when the mini-batch size cannot be made large enough. Because the mini-batch size has to be changed depending on the hardware, the effect is hard to evaluate experimentally. When the batch size is small, training may not converge, and normalization techniques such as Layer Normalization are often used instead.

Layer Norm: looks at **one of the N samples** at a time; **all pixels** of its $H \times W \times C$ are the unit of normalization (within one image). Because it does not depend on the mini-batch size, it solves this problem of Batch Norm.

Instance Norm: normalizes each channel of each sample separately. It contributes to contrast normalization and is used for tasks such as image style transfer and texture synthesis.
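The three methods differ only in which values are pooled when the mean and variance are computed. The sketch below makes that explicit with nested lists, where `x[n][c]` holds the flattened $H \times W$ pixels of channel `c` in sample `n`. The layout, the function names, and the omission of the learnable scale and shift parameters are simplifying assumptions.

```python
def normalize(values, eps=1e-5):
    """Shift and scale a flat list of values to mean 0 and variance close to 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def batch_norm(x):
    """Unit of normalization: the same channel across all N samples."""
    n_samples, n_channels = len(x), len(x[0])
    out = [[None] * n_channels for _ in range(n_samples)]
    for c in range(n_channels):
        size = len(x[0][c])
        flat = [v for n in range(n_samples) for v in x[n][c]]
        normed = normalize(flat)
        for n in range(n_samples):
            out[n][c] = normed[n * size:(n + 1) * size]
    return out


def layer_norm(x):
    """Unit of normalization: all channels and pixels of one sample."""
    out = []
    for sample in x:
        size = len(sample[0])
        normed = normalize([v for channel in sample for v in channel])
        out.append([normed[c * size:(c + 1) * size] for c in range(len(sample))])
    return out


def instance_norm(x):
    """Unit of normalization: one channel of one sample."""
    return [[normalize(channel) for channel in sample] for sample in x]


# two samples, two channels, four pixels each (made-up values)
x = [
    [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]],
    [[5.0, 6.0, 7.0, 8.0], [50.0, 60.0, 70.0, 80.0]],
]
print(batch_norm(x)[0][0])
print(layer_norm(x)[0][0])
print(instance_norm(x)[0][0])
```

Only `batch_norm` mixes values from different samples, which is why it is the only one of the three whose result depends on the mini-batch size.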
WaveNet: a speech generation model (→ time series data). A deep learning model that generates raw speech waveforms; Pixel CNN applied to audio. It applies convolution to time series data. Dilated convolution: the gaps between the inputs being convolved are widened as the layers get deeper, so a wider range of information can easily be covered.

Seq2seq: a model that takes a sequence as input and outputs a sequence. The input sequence is encoded (converted to an internal state), and the internal state is decoded (converted to a sequence). Examples: translation (English → Japanese), speech recognition (waveform → text), chatbot (text → text).
A language model gives the probability of a word sequence. Mathematically, the joint probability can be decomposed into a product of conditional probabilities.

Example) You say good bye → 0.092 (natural) You say good die → 0.0000032 (unnatural)

The joint probability of the words in a sentence can be decomposed into conditional probabilities, and by learning these with an RNN the next word can be predicted at each point in time.
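The decomposition $P(w_1, \dots, w_T) = \prod_{t} P(w_t \mid w_1, \dots, w_{t-1})$ can be sketched with a lookup table in place of a trained RNN. The table `COND` and every probability in it are invented for illustration; they are not the values quoted in the example above.

```python
# Hypothetical conditional probabilities P(word | previous words).
# The numbers are invented for illustration only.
COND = {
    (): {"you": 0.5},
    ("you",): {"say": 0.4},
    ("you", "say"): {"good": 0.5},
    ("you", "say", "good"): {"bye": 0.9, "die": 0.0001},
}


def sequence_probability(words, table, unseen=1e-8):
    """Joint probability as a product of conditional probabilities.

    Word/context pairs missing from the table get the small value `unseen`.
    """
    prob = 1.0
    for t, word in enumerate(words):
        context = tuple(words[:t])  # all words before position t
        prob *= table.get(context, {}).get(word, unseen)
    return prob


print(sequence_probability(["you", "say", "good", "bye"], COND))
print(sequence_probability(["you", "say", "good", "die"], COND))
```

An RNN language model plays the role of the table: given the words so far, it outputs a probability distribution over the next word.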
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/4Day/lecture_chap1_exercise_public.ipynb
Transformer: does not use an RNN ("Attention Is All You Need"). Training on 36 million English-French sentence pairs finished in 3.5 days on 8 GPUs (far less computation than other models at the time).

- Add position information to the word vectors
- Compute Attention with multiple heads
- Fully connected layers that process each word position independently
- Normalize, combine the dimensions, and pass the result to the decoder
- The decoder input is masked so that future information cannot be seen
- Make predictions from the decoder input and the encoder output

Encoder-Decoder: the Encoder-Decoder model is weak on long sentences.

- The content of the source sentence is expressed with a single vector
- When the sentence becomes long, that vector does not have enough capacity

Attention: solves the problem of the Encoder-Decoder model. When choosing each target word, it uses the hidden states of the words in the source sentence. A weight is assigned to each hidden state so that the weights sum to 1. It works in a way similar to a dictionary object (query, key, value).

Source-Target Attention: decides where to attend by how close the received (source) information is to the target information.

Self Attention: decides which information to attend to using only its own input.

Transformer Encoder: encodes each word with its context taken into account by Self Attention.

Position-wise Feed-Forward Networks: determine the output of each Attention layer. A layer that applies linear transformations and shapes the output while retaining position information.

Scaled dot-product attention: computes attention for all words at once.

Multi-Head attention: concatenates the outputs of 8 scaled dot-product attention heads and applies a linear transformation to the result. Each head collects different information (similar to ensemble learning).
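A single scaled dot-product attention head can be sketched in plain Python, following the usual formulation $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})V$. Representing the matrices as lists of row vectors and leaving out masking and the learned projection matrices are simplifying assumptions.

```python
import math


def softmax(scores):
    """Turn scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# made-up vectors: two queries attending over three key/value pairs
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
print(weights)
print(outputs)
```

Each row of `weights` sums to 1, matching the description of distributing weight over the hidden states. Multi-head attention runs several such heads on differently projected inputs and concatenates their outputs.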
Add (residual connection): learns the difference between input and output; in the implementation the input is simply added to the output. Reduces training and test error.

Norm (Layer Norm): speeds up training.

Position Encoding: encodes the position of each word. Because the model is not an RNN, information about word order has to be added explicitly.
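The notes do not say which encoding is used. The sketch below assumes the sinusoidal scheme from the original Transformer paper, in which even dimensions use a sine and odd dimensions a cosine of the position at different wavelengths. The function name and the sizes are illustrative.

```python
import math


def positional_encoding(max_len, d_model):
    """Sinusoidal position encoding (assumed scheme):
    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(max_len=5, d_model=4)
for row in pe:
    print(row)
```

Each row is added to the word vector at that position, so identical words at different positions receive different inputs.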
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/4Day/lecture_chap2_exercise_public.ipynb
The Transformer was found to train much faster and to be more accurate than Seq2seq.
The input data is an image. Object recognition tasks in the broad sense are classified into four categories.

Name | Output | Position | Instance distinction |
---|---|---|---|
Classification | Single or multiple class labels for the image | Not of interest | Not of interest |
Object detection | Bounding Box | Of interest | Not of interest |
Semantic segmentation | Single class label for each pixel | Of interest | Not of interest |
Instance segmentation | Single class label for each pixel | Of interest | Of interest |

The difficulty increases in the order classification → object detection → semantic segmentation → instance segmentation.

Object detection predicts where an object is, what it is, and with what confidence (Bounding Box).
name | class | Train+Val | Number of Boxes/image |
---|---|---|---|
VOC12 | 20 | 11,540 | 2.4 |
ILSVRC17 | 200 | 476,668 | 1.1 |
MS COCO18 | 80 | 123,287 | 7.3 |
OICOD18 | 500 | 1,743,042 | 7.0 |
When Boxes/image is small, the images are icon-like and far removed from everyday scenes. When Boxes/image is large, objects partially overlap, which is closer to everyday context.
The difference from classification is that the number of BBoxes changes depending on the confidence threshold.
IoU: in object detection, we want to evaluate not only the class label but also the accuracy of the predicted object position.

$IoU = \frac{\text{Area of overlap}}{\text{Area of union}}$
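IoU is straightforward to compute for axis-aligned boxes. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples, which is one common convention; the example coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # corners of the overlapping rectangle, if there is one
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    overlap = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partially overlapping boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

The value is 1 only when the predicted box coincides with the ground truth and 0 when they do not overlap at all.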
Precision/Recall

Set thresholds for confidence and IoU. Conf. threshold: 0.5, IoU threshold: 0.5
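Prediction | conf. | pred. | IoU |
---|---|---|---|
P1 | 0.92 | Person | > 0.5 |
P2 | 0.85 | Car | < 0.5 |
P3 | 0.81 | Car | > 0.5 |
P4 | 0.70 | Dog | > 0.5 |
P5 | 0.69 | Person | > 0.5 |
P6 | 0.54 | Car | < 0.5 |

- P1: TP because IoU > 0.5 (person detected)
- P2: FP because IoU < 0.5
- P3: TP because IoU > 0.5 (car detected)
- P4: TP because IoU > 0.5 (dog detected)
- P5: IoU > 0.5, but the person has already been detected, so FP
- P6: FP because IoU < 0.5

Precision: $\frac{TP}{TP + FP} = \frac{3}{3 + 3} = 0.50$

Recall: $\frac{TP}{TP + FN}$

Average Precision

Conf. threshold: $\beta$

Precision and recall both change with the threshold: $P(\beta)$, $R(\beta)$

Average Precision (area under the PR curve): $AP = \int_{0}^{1} P(R) \, dR$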
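The sketch below computes precision, recall, and a simple non-interpolated Average Precision from a ranked list of detections. The confidences and TP/FP flags follow the six predictions P1 to P6 discussed above. The number of ground-truth objects (`N_GROUND_TRUTH = 3`) is an assumption, since it is not stated, and benchmarks differ in how exactly they interpolate the PR curve.

```python
def precision_recall(flags, n_ground_truth):
    """flags: True for TP, False for FP, for every prediction above the conf. threshold."""
    tp = sum(1 for f in flags if f)
    fp = len(flags) - tp
    precision = tp / (tp + fp) if flags else 0.0
    recall = tp / n_ground_truth  # TP / (TP + FN)
    return precision, recall


def average_precision(detections, n_ground_truth):
    """Area under the PR curve, summed at each true positive (no interpolation).

    detections: list of (confidence, is_true_positive).
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ordered, start=1):
        if is_tp:
            tp += 1
            # precision at this rank, weighted by the recall step 1 / n_ground_truth
            ap += (tp / rank) / n_ground_truth
    return ap


# P1..P6: (confidence, TP?)
DETECTIONS = [(0.92, True), (0.85, False), (0.81, True),
              (0.70, True), (0.69, False), (0.54, False)]
N_GROUND_TRUTH = 3  # assumption: one person, one car, one dog in the image

precision, recall = precision_recall([tp for _, tp in DETECTIONS], N_GROUND_TRUTH)
ap = average_precision(DETECTIONS, N_GROUND_TRUTH)
print(precision, recall, ap)
```

Lowering the confidence threshold admits more predictions, which can only raise recall but usually lowers precision. AP summarizes that whole curve in one number.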
FPS: Frames per Second. Depending on the application, detection speed matters in addition to detection accuracy.
Convolution and pooling lower the resolution of the image. In semantic segmentation, the output must have the same size as the input, with a single class label for each pixel, so the feature map has to be restored to its original size. → The up-sampling barrier. There are the following two solutions.

Deconvolution / Transposed convolution

In the example figure, a 3 × 3 feature map is up-sampled to 5 × 5 by Deconv. with kernel size = 3, padding = 1, stride = 2.

- Specify kernel size, padding, and stride as in a normal Conv. layer
- Spread the pixels of the feature map apart according to the stride
- Add a margin of (kernel size - 1) - padding around the feature map
- Perform a convolution operation

Note that although it is often called deconvolution, it is not the inverse operation of convolution. → Naturally, the information lost by pooling is not restored.
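The up-sampling steps can be followed in one dimension. The sketch below is a 1D simplification written for illustration: it spreads the input apart by the stride, adds a margin of (kernel size - 1) - padding, and then runs an ordinary stride-1 convolution. It assumes padding is at most kernel size - 1.

```python
def transposed_conv1d(x, kernel, stride, padding):
    """1D transposed convolution built from the up-sampling steps."""
    # 1. open the interval between input values by inserting (stride - 1) zeros
    spread = []
    for i, value in enumerate(x):
        spread.append(value)
        if i < len(x) - 1:
            spread.extend([0.0] * (stride - 1))
    # 2. add a margin of (kernel size - 1) - padding zeros on both sides
    margin = len(kernel) - 1 - padding
    padded = [0.0] * margin + spread + [0.0] * margin
    # 3. ordinary stride-1 convolution (sliding dot product with the flipped kernel)
    flipped = kernel[::-1]
    return [
        sum(p * w for p, w in zip(padded[i:i + len(kernel)], flipped))
        for i in range(len(padded) - len(kernel) + 1)
    ]


# 3 input values, kernel size 3, padding 1, stride 2
out = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], stride=2, padding=1)
print(len(out), out)
```

The output length is (input length - 1) * stride - 2 * padding + kernel size, so three inputs with kernel size 3, padding 1, and stride 2 give five outputs. The 2D case applies the same steps along both axes.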
Dilated Convolution: a technique for enlarging the receptive field at the convolution stage without using pooling.

Putting gaps between the taps of a 3x3 kernel (rate = 2) enlarges the receptive field to 5x5 while the computational cost stays equivalent to 3x3. By stacking such layers the receptive field can eventually be enlarged to 15x15.
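The receptive field sizes can be checked with a short calculation. The sketch below assumes stride-1 layers, and the stack of rates 1, 2, 4 is an assumed example that happens to reach 15; the notes do not say which rates are used.

```python
def effective_kernel(kernel, rate):
    """Span covered by one dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel + (kernel - 1) * (rate - 1)


def receptive_field(kernel, rates):
    """Receptive field (per axis) of stacked stride-1 dilated convolutions."""
    field = 1
    for rate in rates:
        # each layer widens the field by (effective kernel size - 1)
        field += effective_kernel(kernel, rate) - 1
    return field


print(effective_kernel(3, 2))         # one 3x3 kernel with rate = 2
print(receptive_field(3, [1, 2, 4]))  # assumed stack of rates 1, 2, 4
```

Each layer still uses only 3 x 3 weights, so the receptive field grows without increasing the number of multiplications per layer.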