Weight parameter initialization methods in neural networks, summarized with implementation (Xavier initialization, He initialization, etc.)

Introduction

This article summarizes the basics of how to set the initial values of the weight parameters in a neural network. A weight parameter is a coefficient used when combining neurons: in the figure below, it is the value multiplied by the input when connecting one layer to the next.

(Figure: weight parameters multiplying the inputs between layers)

References

This article was written with reference to the following two books published by O'Reilly.

  1. Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python
  2. Hands-On Machine Learning with Scikit-Learn and TensorFlow

The outline is below.

Setting weight parameters

In a multi-layer perceptron, **how the weight parameters are updated is important**, because it determines the learning accuracy and speed of the model. Updating the weight parameters requires the gradient of the loss function, and in a multi-layer perceptron this gradient is computed by the **backpropagation (error back-propagation) method**.

Let us consider how the weight parameters should be set under this backpropagation scheme. As a sample program, consider a multi-layer perceptron **with 5 layers, each layer having 100 neurons**. As input, 1,000 data points are randomly generated from a Gaussian distribution and fed to this multi-layer perceptron. The sigmoid function is used as the activation function.

Weight.py


import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

input_data = np.random.randn(1000, 100)  # 1000 data points
node_num = 100  # number of nodes (neurons) in each hidden layer
hidden_layer_size = 5  # 5 hidden layers
activations = {}  # store the activation results here

x = input_data

for i in range(hidden_layer_size):
    if i != 0:
        x = activations[i - 1]

    # Standard deviation 10 here; change this value to try other initializations
    w = np.random.normal(0, 10, (node_num, node_num))
    a = np.dot(x, w)
    z = sigmoid(a)
    activations[i] = z
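
As a minimal sketch (assuming matplotlib is available), the activation distributions shown in the figures below can be visualized by plotting a histogram of each layer's activations stored in `activations`:

```python
import matplotlib.pyplot as plt

# Plot a histogram of the activations of each hidden layer
for i, z in activations.items():
    plt.subplot(1, len(activations), i + 1)
    plt.title(str(i + 1) + "-layer")
    plt.hist(z.flatten(), 30, range=(0, 1))
plt.show()
```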

When the weight parameter value is large

First, consider the case **where the initial values of the weight parameters are extremely large**. Let's look at the activation results of each layer (i.e., the values after the activation function) in that case, here with a **standard deviation of 10** for the weight parameters.

(Figure: histograms of the activations of each layer, standard deviation 10)

You can see that the activation values of each layer are biased toward 0 and 1. **If the output values of the activation function are biased toward 0 or 1, the gradient values become smaller when these outputs are used to update the weight parameters.** This leads to the weight parameters not being updated. This is known as the **vanishing gradient problem**.
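
This can be seen from the derivative of the sigmoid function. In the notation of the code above, with $z = \textrm{sigmoid}(a)$, the local gradient used in backpropagation is

\frac{\partial z}{\partial a} = z(1 - z)

which approaches 0 when $z$ is close to 0 or 1, so the gradients passed backward through each saturated layer shrink toward zero.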

When the weight parameter is 0

Next, consider the case where the standard deviation of the weight parameters is 0.0.

(Figure: histograms of the activations of each layer, standard deviation 0.0)

The result is that the activation values are concentrated at 0.5. **This is again a biased activation situation.** The vanishing gradient problem mentioned earlier does not occur here. **However, the bias means that the model has a problem with expressiveness.** If multiple neurons output the same value, there is no point in having more than one of them; in other words, the same result could be obtained with fewer neurons or layers.

Relationship with the development process of deep learning

The vanishing gradient phenomenon seen above was an empirically observed problem. It is characteristic of multi-layer perceptrons and was one of the reasons why research did not progress much (i.e., the field was not popular).

Although the idea of the multi-layer perceptron (essentially, the neural network) itself has existed since the 1980s, it did not develop much because of problems such as **enormous computational cost, vanishing gradients, and local optima**. In the 2010s, **high-performance hardware such as GPUs and algorithms that address these problems** were developed, which seems to have led to today's popularity.

Solving the vanishing gradient problem (Xavier initialization)

Regarding this vanishing gradient problem, an initialization method called **pre-training** was proposed by Hinton et al. of the University of Toronto in 2006.

Reducing the Dimensionality of Data with Neural Networks http://www.cs.toronto.edu/~hinton/science.pdf

In this paper, a method is proposed in which the weight parameters of the multi-layer perceptron are **first trained as an autoencoder** so that they take appropriate values. On the other hand, this leaves the problem of how to initialize the weight parameters of the autoencoder itself.

Later, in 2010, progress was made in understanding the behavior of this vanishing gradient. The following paper was published by the research group of Xavier Glorot et al.

Understanding the difficulty of training deep feedforward neural networks http://proceedings.mlr.press/v9/glorot10a.html

They questioned the sigmoid activation function and the weight initialization technique most commonly used at the time: random initialization with a normal distribution of mean 0 and standard deviation 1. In this case, the variance of each layer's output becomes larger than the variance of its input. As the signal propagates forward through the network, this large variance pushes the sigmoid outputs toward 0 or 1.
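
As a rough check (a small sketch, not taken from the paper), the growth of the variance with $\mathcal{N}(0, 1)$ weights can be seen directly: for unit-variance inputs and node_num inputs per neuron, the pre-activation variance is roughly node_num times larger.

```python
import numpy as np

x = np.random.randn(1000, 100)          # unit-variance inputs
w = np.random.normal(0, 1, (100, 100))  # N(0, 1) weights
print(np.var(x), np.var(np.dot(x, w)))  # roughly 1 vs roughly 100
```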

Therefore, the authors propose to initialize the weight parameters of each layer as follows. With $n_{\textrm{in}}$ the number of inputs of a layer and $n_{\textrm{out}}$ the number of its outputs, the weights are drawn from the distribution below.

\theta \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\textrm{in}}+n_{\textrm{out}}}}, \sqrt{\frac{6}{n_{\textrm{in}}+n_{\textrm{out}}}}\right) \quad \textrm{or} \quad \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\textrm{in}}+n_{\textrm{out}}}}\right)

This initialization method is called Xavier initialization or Glorot initialization.

Confirm by implementation

Let's try Xavier initialization.
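
As a minimal sketch, only the weight line in Weight.py needs to change. Using the Gaussian form of the formula above, with both the number of inputs and outputs of a layer equal to node_num:

```python
# Xavier (Glorot) initialization, Gaussian form:
# standard deviation = sqrt(2 / (n_in + n_out)), here n_in = n_out = node_num
w = np.random.normal(0, np.sqrt(2.0 / (node_num + node_num)), (node_num, node_num))
```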

(Figure: histograms of the activations of each layer with Xavier initialization)

This result shows a wider distribution than the previous ones. Because the activations are moderately spread out, learning should proceed efficiently without limiting the expressiveness of the sigmoid function.

About He initialization

Similarly, the research group of He et al. proposed a method that avoids vanishing gradients using the formula below. The Xavier initialization described above is suited to activation functions that are symmetric, such as the sigmoid and tanh functions. In contrast, He initialization is suited to the ReLU function.

Let the number of inputs of each layer be $n_{\textrm{in}}$ and initialize as follows:

\theta \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\textrm{in}}}}, \sqrt{\frac{6}{n_{\textrm{in}}}}\right) \quad \textrm{or} \quad \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\textrm{in}}}}\right)

Note that $\mathcal{U}$ denotes a uniform distribution and $\mathcal{N}$ a normal distribution.
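
As a minimal sketch, following the Gaussian form above with $n_{\textrm{in}} =$ node_num and swapping the activation to ReLU inside the Weight.py loop:

```python
def relu(x):
    return np.maximum(0, x)

# He initialization, Gaussian form: standard deviation = sqrt(2 / n_in)
w = np.random.normal(0, np.sqrt(2.0 / node_num), (node_num, node_num))
a = np.dot(x, w)
z = relu(a)
```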

Compare using MNIST data

Now, let's train a 5-layer multi-layer perceptron (100 neurons per layer) on the MNIST data. The initial weight parameters are given as follows:

  1. Standard deviation 0.01
  2. Xavier initialization (Gaussian), activation function: sigmoid
  3. He initialization (Gaussian), activation function: ReLU

SGD is used as the optimizer in all cases.
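
A minimal sketch of how these three weight scales can be set up (the actual network and training loop are in the repository linked at the end of this article; the dictionary below is only an illustration):

```python
import numpy as np

node_num = 100  # neurons per hidden layer

# Standard deviation of the Gaussian used to initialize the weights in each case
weight_init_stds = {
    "std=0.01": 0.01,
    "Xavier": np.sqrt(2.0 / (node_num + node_num)),  # used with the sigmoid activation
    "He": np.sqrt(2.0 / node_num),                   # used with the ReLU activation
}
```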

(Figure: training loss on MNIST for the three initializations)

The loss function decreased fastest with He initialization, followed by Xavier initialization, and then the standard deviation 0.01 case.

In conclusion

In this article, we summarized the idea of parameter initialization in neural networks. We confirmed that the initial values of the weights are a very important factor in training a neural network; they often determine whether training succeeds or fails. It is worth checking whether the initial values and activation functions are appropriate for the model you are building.

The full program is here. https://github.com/Fumio-eisan/Weight_20200502

The comparison of the weight initializations (Xavier, He, etc.) is in weight_init_activation_histogram.py, and the comparison on the MNIST data is in weight_init_compare.py.
