Revisited scikit-learn

A note on the useful scikit-learn APIs I have learned. I had written a memo in $HOME/Desktop/memo.md, but I accidentally rm'ed it, so I am posting it to Qiita instead. I will keep adding new things as I learn them.

(I would appreciate it if you could point out any mistakes. Thank you.)

Data preprocessing

God dwells in this library's preprocessing module; give thanks to the committers whenever you import it.

Dimensionality reduction of features

Selects the top k features. The first argument of SelectKBest is a scoring function; besides chi2, functions such as f_classif can be used. There are other feature-selection tools such as RFECV, but I am not sure yet how they differ, so I would like to look into that later.

from sklearn.feature_selection import SelectKBest, chi2

# keep the 2000 features with the highest chi-squared scores
fselect = SelectKBest(chi2, k=2000)
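
A rough usage sketch follows; X_train, y_train, and X_test are hypothetical placeholders for your own data:

X_train_sel = fselect.fit_transform(X_train, y_train)  # score features on the training data
X_test_sel = fselect.transform(X_test)                 # apply the same selection to the test data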

Reference: [How do I properly use SelectKBest, GridSearchCV, and cross-validation in the sklearn package together?](https://www.quora.com/How-do-I-properly-use-SelectKBest-GridSearchCV-and-cross-validation-in-the-sklearn-package-together)

Deal with data that has missing values

A modern weapon called pandas has made data preprocessing much easier. It pairs well with Jupyter Notebook and can drop rows containing NaN in a single line. Values such as NaN (and, depending on the data, 0) are called missing values.
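
For example, with a hypothetical DataFrame df, dropping the rows that contain NaN is a one-liner:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
df_clean = df.dropna()  # keep only the rows without NaN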

Missing values can also be handled with scikit-learn. Below is an excerpt of a sample from the reference (the official documentation). Calling imp.fit() computes the statistic specified by strategy (the mean here; the median is another option), and imp.transform() replaces the missing values with it.

import numpy as np
from sklearn.preprocessing import Imputer

# learn the per-column mean in fit(), then use it to fill NaN in transform()
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X)) 
# [[ 4.          2.        ]
#  [ 6.          3.666...]
#  [ 7.          6.        ]]
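
For what it's worth, in newer scikit-learn releases (0.20 and later) Imputer has been replaced by SimpleImputer in sklearn.impute; a minimal equivalent sketch:

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])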

Reference: scikit-learn official documentation

Normalize the data

I had always written normalization myself (although it only amounts to calling a few numpy methods), but scikit-learn does it for me. Both the L1 norm and the L2 norm are available. Excerpt from the sample code in the documentation.

from sklearn import preprocessing

X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
# scale each row (sample) to unit L2 norm
X_normalized = preprocessing.normalize(X, norm='l2')

print(X_normalized)
# array([[ 0.40..., -0.40...,  0.81...],
#       [ 1.  ...,  0.  ...,  0.  ...],
#       [ 0.  ...,  0.70..., -0.70...]])
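
Switching to the L1 norm is just a matter of changing the norm argument; a small sketch reusing the same X:

X_l1 = preprocessing.normalize(X, norm='l1')  # each row's absolute values now sum to 1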

Reference: scikit-learn official documentation

Scaling data

I feel the Japanese term is awkward here; what should "scaling" be called in Japanese? It transforms the data so that each feature has mean 0 and variance 1.

The sample is, as before, an excerpt from the documentation.

As stated in the documentation, it can be combined with Pipeline.

from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
# standardize each column to zero mean and unit variance
X_scaled = preprocessing.scale(X)

print(X_scaled)
# array([[ 0.  ..., -1.22...,  1.33...],
#       [ 1.22...,  0.  ..., -0.26...],
#       [-1.22...,  1.22..., -1.06...]])

There are various scaler classes such as StandardScaler and MinMaxScaler. Please read the documentation below (they all expose the usual fit/transform API).
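
As a small sketch of that fit/transform style (reusing the X above; with the default settings this should match preprocessing.scale):

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)      # same result as preprocessing.scale(X)
scaler.transform([[2., -1., 1.]])   # the learned mean/std can be applied to new data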

Reference: scikit-learn official documentation

Model validation

K-fold while keeping the label ratios balanced

When calling split() on KFold, it is enough to pass just the array of samples, but StratifiedKFold also needs the array of labels. Why? Because it keeps the label ratios of the training data balanced: each fold preserves roughly the same class proportions as the full data set.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)

print(skf)  

for train_index, test_index in skf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Reference: scikit-learn official documentation

From preprocessing to learning, seamlessly

When dealing with data that requires preprocessing, the code stays clean if preprocessing and feature extraction can happen inside model.fit(). A mechanism called Pipeline makes this possible. The name says it all.

Before

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# build the feature matrices
ngram_counter = CountVectorizer(ngram_range=(1, 4), analyzer='char')
X_train = ngram_counter.fit_transform(data_train)
X_test  = ngram_counter.transform(data_test)

# train the classifier
classifier = LinearSVC()
model = classifier.fit(X_train, y_train)

# test the classifier
y_pred = model.predict(X_test)

After

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# build the pipeline
ppl = Pipeline([
              ('ngram', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
              ('clf',   LinearSVC())
      ])

# train the classifier
model = ppl.fit(data_train, y_train)

# test the classifier
y_pred = model.predict(data_test)

The code is taken from the blog post below; the bottom of the article also covers FeatureUnion.

Reference: Using Pipelines and FeatureUnions in scikit-learn

Evaluation metrics

Logistic loss
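
Log loss (logistic loss, cross-entropy) is available as sklearn.metrics.log_loss. A minimal sketch with made-up labels and predicted class probabilities:

from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
y_prob = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.01, 0.99]]
print(log_loss(y_true, y_prob))  # roughly 0.17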

I want to ride the scikit-learn ecosystem with my own estimator, too

If you need a model that scikit-learn does not provide, you have to implement it yourself. After implementing the model, you still need cross-validation and grid search to evaluate it. If you can get onto the rails of scikit-learn's model_selection, you can skip implementing those parts yourself. This is achieved by inheriting from BaseEstimator when implementing the model.

The code below is quoted from the reference article.

from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2

    def fit(self, x, y):
        # training would go here; return self so calls can be chained
        return self

    def predict(self, x):
        # dummy prediction: return 1.0 for every sample
        return [1.0] * len(x)

    def score(self, x, y):
        # dummy evaluation score
        return 1

    def get_params(self, deep=True):
        return {'param1': self.param1, 'param2': self.param2}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
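
Once this is in place, the estimator can ride the usual model_selection rails; a hedged sketch where X and y are placeholders for your own data:

from sklearn.model_selection import GridSearchCV, cross_val_score

est = MyEstimator(param1=1, param2=2)
scores = cross_val_score(est, X, y, cv=3)              # cross-validation uses MyEstimator.score
search = GridSearchCV(est, {'param1': [1, 10]}, cv=3)  # grid search works the same way
search.fit(X, y)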

Incidentally, if you inherit from ClassifierMixin or RegressorMixin, you can use the model.score implemented on the scikit-learn side. I want to ride these rails as aggressively as possible.

Reference: Implement a minimum self-made estimator (Estimator) with scikit-learn
