About this article

--Talk about how to make Python faster --Find CPU-bound processing bottlenecks by profiling --Improve the bottleneck you find with Cython

Introduction

The other day, I implemented a recommendation algorithm called Bayesian Personalized Ranking (BPR). I wrote the code with reference to the formula in this paper, but when I executed it, it was a little too slow to use, so I improved the processing speed. I worked on it. I will summarize what I tried at that time as a memorandum.

Techniques and code used in this article

BPR gives the matrix factorization of the user x item matrix. Decomposes the user x item matrix $ X $ into the user x factor matrix $ U $ and the item x factor matrix $ V $.

X = U \cdot V^T

See BPR's Original Paper for how to solve this problem.

I implemented this technique as follows. $ X $ is defined in scipy.sparse.lil_matrix. The decomposed $ U $ and $ V $ are numpy.array. The flow is to bootstrap sample the sample used for learning and then update $ U and V $.

For this code, I would like to improve buildModel (), which is the learning part of the model.

`MFbpr.py`


class MFbpr(Recommender):
    '''
Constructor and other processing
    '''
    
    def buildModel(self):
        loss_pre = sys.float_info.max
        nonzeros = self.trainMatrix.nnz
        hr_prev = 0.0
        sys.stderr.write("Run for BPR. \n")
        for itr in xrange(self.maxIter):
            start = time.time()
            
            # Each training epoch
            for s in xrange(nonzeros):
                # sample a user
                u = np.random.randint(self.userCount)
                itemList = self.trainMatrix.getrowview(u).rows[0]
                if len(itemList) == 0:
                    continue
                # sample a positive item
                i = random.choice(itemList)
                
                # One SGD step update
                self.update_ui(u, i)
                
    def update_ui(self, u, i):
        # sample a negative item(uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)
            
        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)
        
        for f in xrange(self.factors):
            grad_u = self.V[i, f] - self.V[j, f]
            self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])
                
            grad = self.U[u, f]
            self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
            self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])

`exec_bpr.py`


from MFbpr import MFbpr

def load_data(ratingFile, testRatio=0.1):
    '''
The process of loading data
    '''
    return trainMatrix, testRatings

if __name__ == "__main__":
    # data
    trainMatrix, testRatings = load_data('yelp.rating')

    # settings
    topK = 100
    factors = 64
    maxIter = 10
    maxIterOnline = 1
    lr = 0.01
    adaptive = False
    init_mean = 0.0
    init_stdev = 0.1
    reg = 0.01
    showProgress = False
    showLoss = True

    bpr = MFbpr(trainMatrix, testRatings, topK, factors, maxIter, lr, adaptive, reg, init_mean, init_stdev, showProgress, showLoss)
    bpr.buildModel()

The entire amount of code is available on github.

Try to run

I will try it for the time being.

Setting

data

--Number of users: 25,677 --Number of items: 25,815 --Number of ratings (number of samples with ratings): 627,775

Hyperparameters

--Number of factors: 64

Execution environment

It is ubuntu with 2G memory and 2 processors built on VirtualBox.

Execution result

The first bracket is the time [s] it took for one iteration, and the last bracket is the time [s] it took to calculate loss.

> python exex_bpr.py

Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [128.6936481]	 [-]loss: 624484.357765 [8.16665792465]
Iter=1 [137.202406883]	 [-]loss: 616970.769244 [7.11149096489]
Iter=2 [131.134891987]	 [-]loss: 609361.307524 [7.12593793869]
Iter=3 [134.665620804]	 [-]loss: 601240.628869 [8.45840716362]
Iter=4 [134.722868919]	 [-]loss: 592053.155587 [7.6300880909]

Even though I'm calculating about 600,000 samples, I feel that 130 seconds per iteration is too long.

Profiling

Overall profiling

Let's start by identifying the parts that are taking a long time. Use the Python profiler cProfile to measure processing speed. python -m cProfile -s cumulative ***.py If you execute, the time-consuming processing will be displayed in descending order.

> python -m cProfile -s cumulative exex_bpr.py

Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [244.87996006]	 [-]loss: 624394.802988 [53.4806399345]
Iter=1 [248.624686956]	 [-]loss: 616876.50976 [48.6073460579]
Iter=2 [253.822627068]	 [-]loss: 609269.820414 [52.5446169376]
Iter=3 [261.039247036]	 [-]loss: 601207.904104 [53.8221797943]
Iter=4 [260.285779953]	 [-]loss: 592212.148141 [54.9556028843]
         369374621 function calls (368041918 primitive calls) in 1808.492 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.171    0.171 1808.493 1808.493 exex_bpr.py:3(<module>)
        1   34.307   34.307 1532.124 1532.124 MFbpr.py:40(buildModel)
  3067088  535.383    0.000  830.046    0.000 MFbpr.py:92(update_ui)
  6209078   48.829    0.000  433.564    0.000 lil.py:228(__getitem__)
  3292937   26.140    0.000  376.631    0.000 lil.py:191(getrowview)
  3292939   66.488    0.000  346.337    0.000 lil.py:85(__init__)
        5    0.000    0.000  263.411   52.682 MFbpr.py:117(_showLoss)
        5   22.098    4.420  263.410   52.682 MFbpr.py:128(loss)
        1    9.443    9.443  242.550  242.550 exex_bpr.py:9(load_data)

From the bottom of the line ʻOrdered by: cumulative time, the function names and the time required for processing are listed. If you look at this, you can see that the function ʻupdate_ui is called 3,067,088 times and takes a total of 535.383 seconds. (Well, it's only natural that we call only ʻupdate_ui in buildModel` ...)

It is the overhead of cProfile that the execution time of one iteration is longer than before.

Line-by-line profiling

You can use line_profiler to measure the function of interest line by line. You can install it with pip.

> pip install line_profiler

In order to profile with line_profiler, you need to add a @ profile decorator to the function you are looking at.

`MFbpr.py`


    @profile
    def update_ui(self, u, i):
        # sample a negative item(uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)

        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)

        for f in xrange(self.factors):
            grad_u = self.V[i, f] - self.V[j, f]
            self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])

            grad = self.U[u, f]
            self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
            self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])

All you have to do is run it with the kernprof command.

> kernprof -l -v exex_bpr.py

Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [953.993126154]	 [-]loss: 624406.940531 [7.50253987312]
Iter=1 [962.82383585]	 [-]loss: 616855.373858 [7.96375918388]
Wrote profile results to exex_bpr.py.lprof
Timer unit: 1e-06 s

Total time: 1082.55 s
File: MFbpr.py
Function: update_ui at line 92

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    92                                               @profile
    93                                               def update_ui(self, u, i):
    94                                                   # sample a negative item(uniformly random)
    95   1226961      8241361      6.7      0.8          j = np.random.randint(self.itemCount)
    96   1228124     39557350     32.2      3.7          while self.trainMatrix[u, j] != 0:
    97      1163         6123      5.3      0.0              j = np.random.randint(self.itemCount)
    98                                                       
    99                                                   # BPR update rules
   100   1226961      9495381      7.7      0.9          y_pos = self.predict(u, i)  # target value of positive instance
   101   1226961      4331378      3.5      0.4          y_neg = self.predict(u, j)  # target value of negative instance
   102   1226961     10002355      8.2      0.9          mult = -self.partial_loss(y_pos - y_neg)
   103                                                   
   104  79752465    126523856      1.6     11.7          for f in xrange(self.factors):
   105  78525504    161882071      2.1     15.0              grad_u = self.V[i, f] - self.V[j, f]
   106  78525504    191293505      2.4     17.7              self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])
   107                                                           
   108  78525504    137871145      1.8     12.7              grad = self.U[u, f]
   109  78525504    194033291      2.5     17.9              self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
   110  78525504    199315454      2.5     18.4              self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])

Looking at the % Time column, I found that ** 90% or more of the processing time of ʻupdate_ui` is spent under the for statement **.

One iteration is less than 1,000 seconds, which means that the overhead is heavy.

Accelerate with Cython

Implementation

Rewrite the for loop with Cython. One of the reasons Python is slow is that it is dynamically typed. Since the type is checked every time a variable is referenced, a program with many variable references cannot ignore the time required for this operation.

If you write using Cython, you can define the type of the variable like in C language. Since the type is not confirmed one by one, a considerable speedup can be expected.

Variables are declared with cdef. Define the types of all variables used in the calculation.

`fastupdfn.pyx`


import numpy as np
cimport numpy as np
cimport cython

DOUBLE = np.float64
ctypedef np.float64_t DOUBLE_t

cpdef c_upd(np.ndarray[DOUBLE_t, ndim=2] U, 
          np.ndarray[DOUBLE_t, ndim=2] V,
          double mult,
          double lr,
          double reg,
          unsigned int u,
          unsigned int i,
          unsigned int j,
          unsigned int factors):
    cdef unsigned int f
    cdef double grad_u, grad
    for f in xrange(factors):
        grad_u = V[i, f] - V[j, f]
        U[u, f] -= lr * (mult * grad_u + reg * U[u, f])
        
        grad = U[u, f]
        V[i, f] -= lr * (mult * grad + reg * V[i, f])
        V[j, f] -= lr * (-mult * grad + reg * V[j, f])
        
    return U, V

The caller is rewritten as follows.

`MFbpr.py`


import fastupd    #add to

class MFbpr(Recommender):
    """
Omission
    """
    def update_ui(self, u, i):
        # sample a negative item(uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)
            
        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)
       
        #Call the Cython implementation
        self.U, self.V = fastupd.c_upd(self.U, self.V, mult, self.lr, self.reg, u, i, j, self.factors)

You also need to have a setup.py to compile your Cython implementation.

`setup.py`


from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("fastupd", ["fastupdfn.pyx"])]
)

Compile when you're ready. Compile with the following command.

> python setup.py build_ext --inplace

Run

I will do it.

> python exec_bpr.py

Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [38.7308020592]	 [-]loss: 624282.459815 [8.64863014221]
Iter=1 [36.7822458744]	 [-]loss: 616740.942017 [8.62252616882]
Iter=2 [35.8996829987]	 [-]loss: 609119.520253 [7.9108710289]
Iter=3 [35.1236720085]	 [-]loss: 600992.740207 [8.14179396629]
Iter=4 [34.9632918835]	 [-]loss: 591877.909123 [8.81325411797]

It takes less than 40 seconds to execute one iteration. It is ** 3-4 times faster than the first **.

The result of line_profiler is also posted.

> kernprof -l -v exex_bpr.py

Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [66.7851500511]	 [-]loss: 624400.680806 [7.92221903801]
Iter=1 [62.5339269638]	 [-]loss: 616866.244211 [7.85720801353]
Iter=2 [65.6408250332]	 [-]loss: 609235.226048 [8.48338794708]
Iter=3 [66.0613160133]	 [-]loss: 601140.035318 [8.52119803429]
Iter=4 [66.5882000923]	 [-]loss: 592026.927719 [8.32318806648]
Wrote profile results to exex_bpr.py.lprof
Timer unit: 1e-06 s

Total time: 164.139 s
File: MFbpr.py
Function: update_ui at line 93

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                               @profile
    94                                               def update_ui(self, u, i):
    95                                                   # sample a negative item(uniformly random)
    96   3066856     17642519      5.8     10.7          j = np.random.randint(self.itemCount)
    97   3069840     79530375     25.9     48.5          while self.trainMatrix[u, j] != 0:
    98      2984        15027      5.0      0.0              j = np.random.randint(self.itemCount)
    99                                                       
   100                                                   # BPR update rules
   101   3066856     17651846      5.8     10.8          y_pos = self.predict(u, i)  # target value of positive instance
   102   3066856     10985781      3.6      6.7          y_neg = self.predict(u, j)  # target value of negative instance
   103   3066856     18993291      6.2     11.6          mult = -self.partial_loss(y_pos - y_neg)
   104                                                  
   105   3066856     19320147      6.3     11.8          self.U, self.V = fastupd.c_upd(self.U, self.V, mult, self.lr, self.reg, u, i, j, self.factors)

The portion of the matrix update that originally spent more than 90% has been reduced to 11.8%. You can see the effect of Cython.

Summary

Profiling with cPrifile and line_profiler found a processing bottleneck and improved it with Cython. I just rewrote one place, but it's a little less than four times faster.

By the way, I put the code with Cython applied in the w_cython branch of github. It may be merged into master soon.

bonus?

The for statement part updates the matrix elements independently, so it can be executed completely in parallel. Cython has a function called prange () that executes the processing of the for statement in multiple threads, so it may be possible to improve it a little by applying this.

Talk about improving machine learning algorithm bottlenecks with Cython

About this article

Introduction

Techniques and code used in this article

MFbpr.py

exec_bpr.py

Try to run

Setting

data

Hyperparameters

Execution environment

Execution result

Profiling

Overall profiling

Line-by-line profiling

MFbpr.py

Accelerate with Cython

Implementation

fastupdfn.pyx

MFbpr.py

setup.py

Run

Summary

bonus?

`MFbpr.py`

`exec_bpr.py`

`MFbpr.py`

`fastupdfn.pyx`

`MFbpr.py`

`setup.py`