--Talk about how to make Python faster --Find CPU-bound processing bottlenecks by profiling --Improve the bottleneck you find with Cython
The other day, I implemented a recommendation algorithm called Bayesian Personalized Ranking (BPR). I wrote the code with reference to the formula in this paper, but when I executed it, it was a little too slow to use, so I improved the processing speed. I worked on it. I will summarize what I tried at that time as a memorandum.
BPR gives the matrix factorization of the user x item matrix. Decomposes the user x item matrix $ X $ into the user x factor matrix $ U $ and the item x factor matrix $ V $.
X = U \cdot V^T
See BPR's Original Paper for how to solve this problem.
I implemented this technique as follows. $ X $ is defined in scipy.sparse.lil_matrix. The decomposed $ U $ and $ V $ are numpy.array.
The flow is to bootstrap sample the sample used for learning and then update $ U and V $.
For this code, I would like to improve buildModel (), which is the learning part of the model.
MFbpr.py
class MFbpr(Recommender):
    '''
Constructor and other processing
    '''
    
    def buildModel(self):
        loss_pre = sys.float_info.max
        nonzeros = self.trainMatrix.nnz
        hr_prev = 0.0
        sys.stderr.write("Run for BPR. \n")
        for itr in xrange(self.maxIter):
            start = time.time()
            
            # Each training epoch
            for s in xrange(nonzeros):
                # sample a user
                u = np.random.randint(self.userCount)
                itemList = self.trainMatrix.getrowview(u).rows[0]
                if len(itemList) == 0:
                    continue
                # sample a positive item
                i = random.choice(itemList)
                
                # One SGD step update
                self.update_ui(u, i)
                
    def update_ui(self, u, i):
        # sample a negative item(uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)
            
        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)
        
        for f in xrange(self.factors):
            grad_u = self.V[i, f] - self.V[j, f]
            self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])
                
            grad = self.U[u, f]
            self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
            self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])
exec_bpr.py
from MFbpr import MFbpr
def load_data(ratingFile, testRatio=0.1):
    '''
The process of loading data
    '''
    return trainMatrix, testRatings
if __name__ == "__main__":
    # data
    trainMatrix, testRatings = load_data('yelp.rating')
    # settings
    topK = 100
    factors = 64
    maxIter = 10
    maxIterOnline = 1
    lr = 0.01
    adaptive = False
    init_mean = 0.0
    init_stdev = 0.1
    reg = 0.01
    showProgress = False
    showLoss = True
    bpr = MFbpr(trainMatrix, testRatings, topK, factors, maxIter, lr, adaptive, reg, init_mean, init_stdev, showProgress, showLoss)
    bpr.buildModel()
The entire amount of code is available on github.
I will try it for the time being.
--Number of users: 25,677 --Number of items: 25,815 --Number of ratings (number of samples with ratings): 627,775
--Number of factors: 64
It is ubuntu with 2G memory and 2 processors built on VirtualBox.
The first bracket is the time [s] it took for one iteration, and the last bracket is the time [s] it took to calculate loss.
> python exex_bpr.py
Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [128.6936481]	 [-]loss: 624484.357765 [8.16665792465]
Iter=1 [137.202406883]	 [-]loss: 616970.769244 [7.11149096489]
Iter=2 [131.134891987]	 [-]loss: 609361.307524 [7.12593793869]
Iter=3 [134.665620804]	 [-]loss: 601240.628869 [8.45840716362]
Iter=4 [134.722868919]	 [-]loss: 592053.155587 [7.6300880909]
Even though I'm calculating about 600,000 samples, I feel that 130 seconds per iteration is too long.
Let's start by identifying the parts that are taking a long time.
Use the Python profiler cProfile to measure processing speed.
python -m cProfile -s cumulative ***.py
If you execute, the time-consuming processing will be displayed in descending order.
> python -m cProfile -s cumulative exex_bpr.py
Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [244.87996006]	 [-]loss: 624394.802988 [53.4806399345]
Iter=1 [248.624686956]	 [-]loss: 616876.50976 [48.6073460579]
Iter=2 [253.822627068]	 [-]loss: 609269.820414 [52.5446169376]
Iter=3 [261.039247036]	 [-]loss: 601207.904104 [53.8221797943]
Iter=4 [260.285779953]	 [-]loss: 592212.148141 [54.9556028843]
         369374621 function calls (368041918 primitive calls) in 1808.492 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.171    0.171 1808.493 1808.493 exex_bpr.py:3(<module>)
        1   34.307   34.307 1532.124 1532.124 MFbpr.py:40(buildModel)
  3067088  535.383    0.000  830.046    0.000 MFbpr.py:92(update_ui)
  6209078   48.829    0.000  433.564    0.000 lil.py:228(__getitem__)
  3292937   26.140    0.000  376.631    0.000 lil.py:191(getrowview)
  3292939   66.488    0.000  346.337    0.000 lil.py:85(__init__)
        5    0.000    0.000  263.411   52.682 MFbpr.py:117(_showLoss)
        5   22.098    4.420  263.410   52.682 MFbpr.py:128(loss)
        1    9.443    9.443  242.550  242.550 exex_bpr.py:9(load_data)
From the bottom of the line ʻOrdered by: cumulative time, the function names and the time required for processing are listed. If you look at this, you can see that the function ʻupdate_ui is called 3,067,088 times and takes a total of 535.383 seconds.
(Well, it's only natural that we call only ʻupdate_ui in buildModel` ...)
It is the overhead of cProfile that the execution time of one iteration is longer than before.
You can use line_profiler to measure the function of interest line by line. You can install it with pip.
> pip install line_profiler
In order to profile with line_profiler, you need to add a @ profile decorator to the function you are looking at.
MFbpr.py
    @profile
    def update_ui(self, u, i):
        # sample a negative item(uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)
        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)
        for f in xrange(self.factors):
            grad_u = self.V[i, f] - self.V[j, f]
            self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])
            grad = self.U[u, f]
            self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
            self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])
All you have to do is run it with the kernprof command.
> kernprof -l -v exex_bpr.py
Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [953.993126154]	 [-]loss: 624406.940531 [7.50253987312]
Iter=1 [962.82383585]	 [-]loss: 616855.373858 [7.96375918388]
Wrote profile results to exex_bpr.py.lprof
Timer unit: 1e-06 s
Total time: 1082.55 s
File: MFbpr.py
Function: update_ui at line 92
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    92                                               @profile
    93                                               def update_ui(self, u, i):
    94                                                   # sample a negative item(uniformly random)
    95   1226961      8241361      6.7      0.8          j = np.random.randint(self.itemCount)
    96   1228124     39557350     32.2      3.7          while self.trainMatrix[u, j] != 0:
    97      1163         6123      5.3      0.0              j = np.random.randint(self.itemCount)
    98                                                       
    99                                                   # BPR update rules
   100   1226961      9495381      7.7      0.9          y_pos = self.predict(u, i)  # target value of positive instance
   101   1226961      4331378      3.5      0.4          y_neg = self.predict(u, j)  # target value of negative instance
   102   1226961     10002355      8.2      0.9          mult = -self.partial_loss(y_pos - y_neg)
   103                                                   
   104  79752465    126523856      1.6     11.7          for f in xrange(self.factors):
   105  78525504    161882071      2.1     15.0              grad_u = self.V[i, f] - self.V[j, f]
   106  78525504    191293505      2.4     17.7              self.U[u, f] -= self.lr * (mult * grad_u + self.reg * self.U[u, f])
   107                                                           
   108  78525504    137871145      1.8     12.7              grad = self.U[u, f]
   109  78525504    194033291      2.5     17.9              self.V[i, f] -= self.lr * (mult * grad + self.reg * self.V[i, f])
   110  78525504    199315454      2.5     18.4              self.V[j, f] -= self.lr * (-mult * grad + self.reg * self.V[j, f])
Looking at the % Time column, I found that ** 90% or more of the processing time of ʻupdate_ui` is spent under the for statement **.
One iteration is less than 1,000 seconds, which means that the overhead is heavy.
Rewrite the for loop with Cython. One of the reasons Python is slow is that it is dynamically typed. Since the type is checked every time a variable is referenced, a program with many variable references cannot ignore the time required for this operation.
If you write using Cython, you can define the type of the variable like in C language. Since the type is not confirmed one by one, a considerable speedup can be expected.
Variables are declared with cdef. Define the types of all variables used in the calculation.
fastupdfn.pyx
import numpy as np
cimport numpy as np
cimport cython
DOUBLE = np.float64
ctypedef np.float64_t DOUBLE_t
cpdef c_upd(np.ndarray[DOUBLE_t, ndim=2] U, 
          np.ndarray[DOUBLE_t, ndim=2] V,
          double mult,
          double lr,
          double reg,
          unsigned int u,
          unsigned int i,
          unsigned int j,
          unsigned int factors):
    cdef unsigned int f
    cdef double grad_u, grad
    for f in xrange(factors):
        grad_u = V[i, f] - V[j, f]
        U[u, f] -= lr * (mult * grad_u + reg * U[u, f])
        
        grad = U[u, f]
        V[i, f] -= lr * (mult * grad + reg * V[i, f])
        V[j, f] -= lr * (-mult * grad + reg * V[j, f])
        
    return U, V
The caller is rewritten as follows.
MFbpr.py
import fastupd    #add to
class MFbpr(Recommender):
    """
Omission
    """
    def update_ui(self, u, i):
        # sample a negative item(uniformly random)
        j = np.random.randint(self.itemCount)
        while self.trainMatrix[u, j] != 0:
            j = np.random.randint(self.itemCount)
            
        # BPR update rules
        y_pos = self.predict(u, i)  # target value of positive instance
        y_neg = self.predict(u, j)  # target value of negative instance
        mult = -self.partial_loss(y_pos - y_neg)
       
        #Call the Cython implementation
        self.U, self.V = fastupd.c_upd(self.U, self.V, mult, self.lr, self.reg, u, i, j, self.factors) 
You also need to have a setup.py to compile your Cython implementation.
setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("fastupd", ["fastupdfn.pyx"])]
)
Compile when you're ready. Compile with the following command.
> python setup.py build_ext --inplace
I will do it.
> python exec_bpr.py
Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [38.7308020592]	 [-]loss: 624282.459815 [8.64863014221]
Iter=1 [36.7822458744]	 [-]loss: 616740.942017 [8.62252616882]
Iter=2 [35.8996829987]	 [-]loss: 609119.520253 [7.9108710289]
Iter=3 [35.1236720085]	 [-]loss: 600992.740207 [8.14179396629]
Iter=4 [34.9632918835]	 [-]loss: 591877.909123 [8.81325411797]
It takes less than 40 seconds to execute one iteration. It is ** 3-4 times faster than the first **.
The result of line_profiler is also posted.
> kernprof -l -v exex_bpr.py
Data	yelp.rating
#Users	25677, #newUser: 588
#Items	25815
#Ratings	 627775.0 (train), 73167(test), 10056(#newTestRatings)
Run for BPR. 
Iter=0 [66.7851500511]	 [-]loss: 624400.680806 [7.92221903801]
Iter=1 [62.5339269638]	 [-]loss: 616866.244211 [7.85720801353]
Iter=2 [65.6408250332]	 [-]loss: 609235.226048 [8.48338794708]
Iter=3 [66.0613160133]	 [-]loss: 601140.035318 [8.52119803429]
Iter=4 [66.5882000923]	 [-]loss: 592026.927719 [8.32318806648]
Wrote profile results to exex_bpr.py.lprof
Timer unit: 1e-06 s
Total time: 164.139 s
File: MFbpr.py
Function: update_ui at line 93
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                               @profile
    94                                               def update_ui(self, u, i):
    95                                                   # sample a negative item(uniformly random)
    96   3066856     17642519      5.8     10.7          j = np.random.randint(self.itemCount)
    97   3069840     79530375     25.9     48.5          while self.trainMatrix[u, j] != 0:
    98      2984        15027      5.0      0.0              j = np.random.randint(self.itemCount)
    99                                                       
   100                                                   # BPR update rules
   101   3066856     17651846      5.8     10.8          y_pos = self.predict(u, i)  # target value of positive instance
   102   3066856     10985781      3.6      6.7          y_neg = self.predict(u, j)  # target value of negative instance
   103   3066856     18993291      6.2     11.6          mult = -self.partial_loss(y_pos - y_neg)
   104                                                  
   105   3066856     19320147      6.3     11.8          self.U, self.V = fastupd.c_upd(self.U, self.V, mult, self.lr, self.reg, u, i, j, self.factors) 
The portion of the matrix update that originally spent more than 90% has been reduced to 11.8%. You can see the effect of Cython.
Profiling with cPrifile and line_profiler found a processing bottleneck and improved it with Cython.
I just rewrote one place, but it's a little less than four times faster.
By the way, I put the code with Cython applied in the w_cython branch of github. It may be merged into master soon.
The for statement part updates the matrix elements independently, so it can be executed completely in parallel. Cython has a function called prange () that executes the processing of the for statement in multiple threads, so it may be possible to improve it a little by applying this.
Recommended Posts