I tried to implement ADALINE in Python

This is a continuation of the previous post.

2.4 ADALINE and learning convergence

The difference from the previous Perceptron is that a linear activation function, rather than a unit step function, is used to update the weights.
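As a minimal illustration of that difference (toy values I made up, not code from the book), the Perceptron drives its weight updates from the thresholded output, while ADALINE drives them from the raw, continuous net input:

import numpy as np

z = np.array([-0.3, 0.1, 0.8])    #example net inputs w^T x

#Perceptron: the unit step output drives the weight update
print(np.where(z >= 0.0, 1, -1))  # -> [-1  1  1]

#ADALINE: the identity (linear) activation, i.e. the net input itself, drives the update
print(z)                          # -> [-0.3  0.1  0.8]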

fig1

2.5 Minimization of cost function by gradient descent

J(\mathbf{w}) = \frac{1}{2} \sum_{i} (y^{(i)} - \phi (z^{(i)}))^2

By driving the weights toward the minimum of the cost function, learning can be made to converge.

Because ADALINE uses a differentiable activation function and the sum of squared errors as its cost function, which is convex, the weights that minimize the cost function can be found by gradient descent.
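As a quick check of the convexity claim (my own note): since $\phi$ is the identity, $J(\mathbf{w})$ is quadratic in the weights and its Hessian is

\nabla^2 J(\mathbf{w}) = \sum_i \mathbf{x}^{(i)} {\mathbf{x}^{(i)}}^{\mathrm{T}}

which is positive semidefinite, so gradient descent cannot get trapped in a local minimum.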

fig2

A diagram of the principle of gradient descent

To update the weights using gradient descent, take a step in the direction opposite to the gradient $\nabla J(\mathbf{w})$ of the cost function $J(\mathbf{w})$.

\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}
\Delta \mathbf{w} = - \eta \nabla J(\mathbf{w})
\frac{\partial J}{\partial w_j} = - \sum_i (y^{(i)} - \phi (z^{(i)})) x_j^{(i)}
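To fill in the intermediate step, this gradient comes from applying the chain rule and using $\phi(z^{(i)}) = z^{(i)} = \mathbf{w}^{\mathrm{T}} \mathbf{x}^{(i)}$:

\frac{\partial J}{\partial w_j} = \frac{1}{2} \sum_i 2 \left( y^{(i)} - \phi (z^{(i)}) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \sum_k w_k x_k^{(i)} \right) = - \sum_i (y^{(i)} - \phi (z^{(i)})) x_j^{(i)}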

So

\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_i (y^{(i)} - \phi (z^{(i)})) x_j^{(i)}

2.5.1 Implement ADALINE in Python

ADALINE's learning rule is very similar to the Perceptron's. It is implemented below.

import numpy as np
class AdalineGD(object):
    """ADAptive LInear NEuron classifier
    
Parameters
    -----------
    eta : float
Learning rate(0.Greater than 0 1.Value less than or equal to 0)
    n_iter : int
Number of trainings in training data
    random_state : int
Random seed for weight initialization
attribute
    -----------
    w_ :One-dimensional array
Weight after conforming
    cost_ :list<- error_Is cost_Change to
Cost function of sum of squares of error at each epoch
    """
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state
        
    def fit(self, X, y):
        """Fits to training data
        
Parameters
        ------------
        X : {Array-like data structure}, shape = [n_samples, n_features]
Training data
            n_samples is the number of samples, n_features is the number of features
        y :Array-like data structure, shape = [n_samples]
Objective variable
            
Return value
        ------------
        self : object
        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []

        for _ in range(self.n_iter): #Repeat over the training data for n_iter epochs
            net_input = self.net_input(X) #Unlike the Perceptron implementation, there is no need to loop over each sample (the Perceptron's rule updates the weights one sample at a time, while batch gradient descent uses all samples at once)
            #Because the activation method is just the identity function, this call has no real effect.
            #It is kept only to make the concept of an activation function explicit;
            #for a logistic regression implementation it would simply be replaced by a sigmoid function.
            output = self.activation(net_input)
            #Calculate the errors y^(i) - φ(z^(i))
            errors = y - output
            #Update the weights w_1, ..., w_m
            self.w_[1:] += self.eta * X.T.dot(errors)
            #Update the weight w_0
            self.w_[0] += self.eta * errors.sum()
            #Compute the sum-of-squares cost
            cost = (errors ** 2).sum() / 2.0
            #Store the cost for this epoch
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate total input"""
        return np.dot(X, self.w_[1:]) + self.w_[0] #np.dot handles all samples at once; no per-sample loop is needed

    def activation(self, X):
        """Calculate the output of the linear activation function"""
        return X
    
    def predict(self, X):
        """Returns the class label after one step"""
        return np.where(self.activation(self.net_input(X)) >= 0.0, 1, -1)

Try it on the data that the Perceptron classified well last time.

import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df2 = df.query("target != 1").copy() #Exclude label 1
df2["target"] -= 1 #Label 1-Align to 1
X = df2[['petal width (cm)', 'sepal width (cm)']].values
Y = df2['target'].values

import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, Y)
ax[0].plot(range(1, len(ada1.cost_)+1), np.log10(ada1.cost_), marker='o')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('log(Sum-squared-error)')
ax[0].set_title('Adaline - Learning rate 0.01')

ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, Y)
ax[1].plot(range(1, len(ada2.cost_)+1), ada2.cost_, marker='o')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Sum-squared-error')
ax[1].set_title('Adaline - Learning rate 0.0001')

output_5_1.png

The left plot shows the log of the sum of squared errors against the number of epochs at learning rate 0.01; the right plot shows the sum of squared errors against the number of epochs at learning rate 0.0001.

This shows that learning does not converge well unless η is chosen appropriately.

fig3

If the learning rate is too high, instead of descending toward the minimum as in the figure, we jump over it and climb up the opposite slope.
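To see this overshoot numerically, here is a minimal one-dimensional sketch with toy data I made up (not from the book), minimizing $J(w) = \frac{1}{2} \sum_i (y^{(i)} - w x^{(i)})^2$:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                      #the optimal weight is exactly w = 2

for eta in (0.01, 0.2):          #reasonable vs. too-large learning rate
    w = 0.0
    for _ in range(5):
        grad = -np.sum((y - w * x) * x)  #dJ/dw
        w -= eta * grad
    print(eta, w)

With η = 0.01, w creeps toward 2; with η = 0.2, each step multiplies the distance from the optimum by $|1 - \eta \sum_i (x^{(i)})^2| = 1.8 > 1$, so w oscillates around 2 with growing amplitude.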

2.5.2 Improve gradient descent through feature scaling

Scaling the features helps machine learning algorithms perform at their best.

This time, the features are standardized: $\mathbf{x}'_j = \frac{\mathbf{x}_j - \mu_j}{\sigma_j}$

This gives the data the characteristics of a standard normal distribution and allows the learning to converge quickly.

#Standardization
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

from sklearn import preprocessing
ss = preprocessing.StandardScaler()
X_std2 = ss.fit_transform(X)

print(X_std[:3])
print(X_std2[:3]) #the same
[[-1.02461719  0.71907625]
 [-1.02461719 -0.4833924 ]
 [-1.02461719 -0.00240494]]
[[-1.02461719  0.71907625]
 [-1.02461719 -0.4833924 ]
 [-1.02461719 -0.00240494]]
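One practical difference (my own note): StandardScaler remembers the mean and standard deviation it computed from the training data, so the same scaling can be applied later to new samples, which the manual version cannot do unless you store those values yourself. A small sketch with a made-up sample:

print(ss.mean_, ss.scale_)                   #statistics learned from the training data
print(ss.transform(np.array([[0.3, 3.0]])))  #scale a hypothetical [petal width, sepal width] sample the same way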
#Boundary plot function implemented earlier

from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    
    #Marker and color map preparation
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    
    #Plot of decision area
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    #Grid point generation
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    #Predict by converting each feature into a one-dimensional array
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    #Convert prediction results to original gridpoint data size
    Z = Z.reshape(xx1.shape)
    #Plot of grid point contours
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    #Axis range setting
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
    #Plot samples by class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')
ada = AdalineGD(n_iter=15, eta=0.01)
ada.fit(X_std, Y)

plot_decision_regions(X_std, Y, classifier=ada)
plt.title('Adaline - Gradient Descent')
plt.xlabel('petal width [standardized]')
plt.ylabel('sepal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')
plt.tight_layout()
plt.show()

output_9_0.png output_9_1.png

It can be seen that learning has converged and the classification boundary is drawn cleanly.

Since learning did not converge well at the same learning rate without standardization (shown below), we can confirm that standardization is effective.

ada1 = AdalineGD(n_iter=15, eta=0.01)
ada1.fit(X, Y)

plot_decision_regions(X, Y, classifier=ada1)
plt.title('Adaline - Gradient Descent')
plt.xlabel('petal width [cm]')
plt.ylabel('sepal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

plt.plot(range(1, len(ada1.cost_) + 1), ada1.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')
plt.tight_layout()
plt.show()

output_11_0.png

output_11_1.png

↑ As the plots above show, learning did not converge with the unstandardized data.

Next, I will try the data that could not be classified last time.

import numpy as np
import matplotlib.pyplot as plt
df3 = df.query("target != 0").copy() #Exclude label 0
y = df3.iloc[:, 4].values
y = np.where(y == 1, -1, 1) #Set label 1 to -1 and the other label (2) to 1

# plt.scatter(df3.iloc[:50, 1], df3.iloc[:50, 0], color='orange', marker='o', label='versicolor')
# plt.scatter(df3.iloc[50:, 1], df3.iloc[50:, 0], color='green', marker='o', label='virginica')
# plt.xlabel('sepal width [cm]')
# plt.ylabel('sepal length [cm]')
# plt.legend(loc='upper left')
# plt.show()

X2 = df3[['sepal width (cm)', 'sepal length (cm)']].values
from sklearn import preprocessing
sc = preprocessing.StandardScaler()
X2_std = sc.fit_transform(X2)

ada2 = AdalineGD(n_iter=15, eta=0.01)
ada2.fit(X2_std, y)

plot_decision_regions(X2_std, y, classifier=ada2)
plt.title('Adaline - Gradient Descent')
plt.xlabel('sepal width [standardized]')
plt.ylabel('sepal length [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

plt.plot(range(1, len(ada2.cost_) + 1), ada2.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')
plt.tight_layout()
plt.show()

output_13_0.png output_13_1.png

Of course, the data are not completely linearly separable, but since the boundary is drawn in a reasonable position, the cost has probably converged close to its minimum.


2.6 Large-scale machine learning and stochastic gradient descent

When there is a lot of data, the computational cost of the method used so far (batch gradient descent) becomes high.

→ Use stochastic gradient descent

\Delta \mathbf{w} = \eta \sum_i (y^{(i)} - \phi (z^{(i)})) \mathbf{x}^{(i)}

Instead of the batch update above, the weights are updated incrementally for each sample:

\eta (y^{(i)} - \phi (z^{(i)})) \mathbf{x}^{(i)}

It is important to present the samples in random order. The advantages of this method are that it converges faster, that shallow local minima are easier to escape (because there is more noise on the error surface), and that it can be used for online learning.
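In NumPy, presenting the samples in random order each epoch can be done with a permutation of the indices, which is essentially what the _shuffle method in the implementation below does:

r = np.random.permutation(len(y))    #a random ordering of the sample indices
X_shuffled, y_shuffled = X[r], y[r]  #fancy indexing reorders both arrays consistently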

Online learning… when new training data arrives, the model is trained and updated on the spot, so it can adapt to changes quickly.

Below, ADALINE is implemented using stochastic gradient descent.

import numpy as np
from numpy.random import seed

class AdalineSGD(object):
    """ADAptive LInear NEuron classifier
    
Parameters
    -----------
    eta : float
Learning rate(0.Greater than 0 1.Value less than or equal to 0)
    n_iter : int
Number of trainings in training data
    shuffle : bool (Default True)
If True, shuffle training data per epoch to avoid circulation
    random_state : int
Random seed for weight initialization
attribute
    -----------
    w_ :One-dimensional array
Weight after conforming
    cost_ :list
Error sum of squares cost function to average all training samples at each epoch
    """
    def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False #Weight initialization flag
        self.shuffle = shuffle
        self.random_state = random_state
        
    def fit(self, X, y):
        """Fits to training data
        
Parameters
        ------------
        X : {Array-like data structure}, shape = [n_samples, n_features]
Training data
            n_samples is the number of samples, n_features is the number of features
        y :Array-like data structure, shape = [n_samples]
Objective variable
            
Return value
        ------------
        self : object
        """
        #Generation of weight vector
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        #Repeat for n_iter epochs
        for i in range(self.n_iter):
            #Shuffle training data if specified
            if self.shuffle:
                X, y = self._shuffle(X, y)
            #Generate a list to store the cost of each sample
            cost = []
            #Process each sample in turn
            for xi, target in zip(X, y):
                #Update the weights and compute the cost using features xi and target
                cost.append(self._update_weights(xi, target))
            #Compute the average cost over the training samples
            avg_cost = sum(cost)/len(y)
            #Store the average cost
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing weights"""
        #Execute initialization if it has not been initialized
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        #If the target y has two or more elements,
        #update the weights using each sample's features xi and target
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        #If the target y has only a single element,
        #update the weights using that one sample's features X and target y
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = self.rgen.permutation(len(y))
        return X[r], y[r] #passing the permuted index array reorders (shuffles) both X and y consistently

    def _initialize_weights(self, m):
        """Initialize weights to small random numbers"""
        self.rgen = np.random.RandomState(self.random_state)
        self.w_ = self.rgen.normal(loc=0.0, scale=0.01, size=1 + m)
        self.w_initialized = True
        
    def _update_weights(self, xi, target):
        """Update weights using ADALINE learning rules"""
        #Calculation of the output of the activation function
        output = self.activation(self.net_input(xi))
        #Error calculation
        error = target - output
        #Weight update
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        #Cost calculation
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate total input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Calculate the output of the linear activation function"""
        return X
    
    def predict(self, X):
        """Returns the class label after one step"""
        return np.where(self.activation(self.net_input(X)) >= 0.0, 1, -1)

ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
ada.fit(X_std, Y)
plot_decision_regions(X_std, Y, classifier=ada)
plt.title('Adaline - Stochastic Gradient Descent')
plt.xlabel('petal width [standardized]')
plt.ylabel('sepal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')
plt.show()

output_17_0.png output_17_1.png

The average cost decreases quickly. The decision boundary is the same as with batch gradient descent.

For online learning, you can update the model by calling partial_fit.
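For example, a single streaming update might look like the following sketch (the feature values and label here are made up):

#A new standardized sample arrives: [petal width, sepal width] and its class label
new_x = np.array([0.5, -0.2])
new_label = np.array(1)
#Update the already-trained model on just this sample, without reinitializing the weights
ada.partial_fit(new_x, new_label)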

Studying the Perceptron and ADALINE has deepened my understanding considerably, and I think it will make other machine learning algorithms easier to understand from now on, so I'm glad I studied them.

Until now I had used other algorithms as black boxes, so it was good to learn what is inside them.

I have a notebook on Gist.

(As before, the conceptual diagrams, etc. are quoted from Chapter 2 of "Python Machine Learning Programming: Theory and Practice by Expert Data Scientists".)
