Performance tuning by sprinkling in print statements that output execution times is painful, so this article is about putting a stop to that practice.
Improvements come easily once you can identify which logic is slow. A profiler makes it easy to pin down the cause, so I will show you how to use one. The first half covers identifying slow logic with line_profiler; the second half covers speed-up techniques in Python.
Use a profiler in your local environment to identify which lines are heavy. There are various profilers for Python, but I personally use line_profiler because it has necessary and sufficient features. What we want to identify is: "this line was executed N times and accounts for M% of the total execution time."
I wrote sample code that takes about 10 seconds to execute. Please read the time.sleep() calls as stand-ins for DB access. The program returns, as JSON, the data of a user who has 1000 cards, each with 3 skills.
■ Sample code (the profiler will tell us everything, so feel free to skim it)
sample1.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import random
import time
import simplejson
class UserCardSkill(object):
    def __init__(self, user_id, card_id):
        self.id = random.randint(1, 1000)  # skill ID assumed in the range 1-999
        self.user_id = user_id
        self.card_id = card_id
    @property
    def name(self):
        return "skill:{}".format(str(self.id))
    @classmethod
    def get_by_card(cls, user_id, card_id):
        time.sleep(0.01)
        return [cls(user_id, card_id) for x in xrange(3)]  # a card has 3 skills
    def to_dict(self):
        return {
            "name": self.name,
            "skill_id": self.id,
            "card_id": self.card_id,
        }
class UserCard(object):
    def __init__(self, user_id):
        self.id = random.randint(1, 300)  # card ID assumed in the range 1-299
        self.user_id = user_id
    @property
    def name(self):
        return "CARD:{}".format(str(self.id))
    @property
    def skills(self):
        return UserCardSkill.get_by_card(self.user_id, self.id)
    @classmethod
    def get_by_user(cls, user_id):
        time.sleep(0.03)
        return [cls(user_id) for x in range(1000)]  #Assuming the user has 1000 Cards
    def to_dict(self):
        """
        Convert card information to a dict and return it
        """
        return {
            "name": self.name,
            "skills": [skill.to_dict() for skill in self.skills],
        }
def main(user_id):
    """
    Return the card information the user possesses, as JSON
    """
    cards = UserCard.get_by_user(user_id)
    result = {
        "cards": [card.to_dict() for card in cards]
    }
    json = simplejson.dumps(result)
    return json
user_id = "A0001"
main(user_id)
Now let's install the profiler tool and identify the heavy spots.
install
pip install line_profiler 
sample1_profiler.py
~~abridgement~~
# Instantiate the profiler and register the profiling targets
from line_profiler import LineProfiler
profiler = LineProfiler()
profiler.add_module(UserCard)
profiler.add_module(UserCardSkill)
profiler.add_function(main)
# Run the registered main function
user_id = "A0001"
profiler.runcall(main, user_id)
# Print the results
profiler.print_stats()
Execution result
>>>python ./sample1_profiler.py 
Timer unit: 1e-06 s
Total time: 0.102145 s
File: ./sample1_profiler.py
Function: __init__ at line 9
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                               def __init__(self, user_id, card_id):
    10      3000        92247     30.7     90.3          self.id = random.randint(1, 1000)  # skill ID assumed in the range 1-999
    11      3000         5806      1.9      5.7          self.user_id = user_id
    12      3000         4092      1.4      4.0          self.card_id = card_id
Total time: 0.085992 s
File: ./sample1_profiler.py
Function: to_dict at line 23
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    23                                               def to_dict(self):
    24      3000        10026      3.3     11.7          return {
    25      3000        66067     22.0     76.8              "name": self.name,
    26      3000         6091      2.0      7.1              "skill_id": self.id,
    27      3000         3808      1.3      4.4              "card_id": self.card_id,
    28                                                   }
Total time: 0.007384 s
File: ./sample1_profiler.py
Function: __init__ at line 32
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                               def __init__(self, user_id):
    33      1000         6719      6.7     91.0          self.id = random.randint(1, 300)  # card ID assumed in the range 1-299
    34      1000          665      0.7      9.0          self.user_id = user_id
Total time: 11.0361 s
File: ./sample1_profiler.py
Function: to_dict at line 49
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    49                                               def to_dict(self):
    50                                                   """
    51                                                   Convert card information to dict and return
    52                                                   """
    53      1000         1367      1.4      0.0          return {
    54      1000        10362     10.4      0.1              "name": self.name,
    55      4000     11024403   2756.1     99.9              "skills": [skill.to_dict() for skill in self.skills],
    56                                                   }
Total time: 11.1061 s
File: ./sample1_profiler.py
Function: main at line 59
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    59                                           def main(user_id):
    60                                               """
    61                                               Respond with json the card information possessed by user
    62                                               """
    63         1        41318  41318.0      0.4      cards = UserCard.get_by_user(user_id)
    64         1            1      1.0      0.0      result = {
    65      1001     11049561  11038.5     99.5          "cards": [card.to_dict() for card in cards]
    66                                               }
    67         1        15258  15258.0      0.1      json = simplejson.dumps(result)
    68         1            2      2.0      0.0      return json
■ The heavy lines were identified with the profiler. The line_profiler output shows that lines 65 and 55 are heavy: the user has 1000 cards, and querying UserCardSkill once per card (1000 queries in total) pushed the execution time over 10 seconds.
Now for concrete techniques to improve execution speed. We will tune the program we profiled using memoization (caching) and hash lookups, changing the code structure as little as possible. Since I want to talk about Python, I will not cover SQL-side acceleration.

First, reduce the number of queries to UserCardSkill without changing the structure of the code much. The code below fetches all of a user's UserCardSkill records in one batch, stores them in memory, and from the second call on returns values from the in-memory data.
sample1_memoize.py
class UserCardSkill(object):
    _USER_CACHE = {}
    @classmethod
    def get_by_card(cls, user_id, card_id):
        # before the improvement: hits the DB on every call
        time.sleep(0.01)
        return [cls(user_id, card_id) for x in xrange(3)]
    @classmethod
    def get_by_card_from_cache(cls, user_id, card_id):
        # after the improvement: hits the DB only on the first call
        if user_id not in cls._USER_CACHE:
            # nothing cached yet: fetch all of the user's skills from the DB
            cls._USER_CACHE[user_id] = cls.get_all_by_user(user_id)
        r = []
        for skill in cls._USER_CACHE[user_id]:
            if skill.card_id == card_id:
                r.append(skill)
        return r
    @classmethod
    def get_all_by_user(cls, user_id):
        # fetch all of the user's skills from the DB in one batch
        return list(cls.objects.filter(user_id=user_id))
@timeit  # decorator that prints the execution time (defined later in this article)
def main(user_id):
Execution result
>>>sample1_memoize.py
func:'main' args:[(u'A0001',), {}] took: 0.6718 sec
Execution went from 11.1061 sec before the improvement to 0.6718 sec: more than 15 times faster. The speedup comes from reducing the number of queries to UserCardSkill from 1000 to 1.
In the memoized code, get_by_card_from_cache linearly searches (full-scans) the 3 × 1000-element list cls._USER_CACHE[user_id] on every call to collect the skills for each card. Since a full scan on every call is inefficient, we instead build a dict keyed by card_id in advance and rewrite the lookup as a hash search. Linear search is O(n); hash search is O(1).
sample1_hash.py
~~abridgement~~
from collections import defaultdict
class UserCardSkill(object):
    _USER_CACHE = {}
    @classmethod
    def get_by_card_from_cache(cls, user_id, card_id):
        if user_id not in cls._USER_CACHE:
            # nothing cached yet: fetch all of the user's skills from the DB
            users_skill = cls.get_all_by_user(user_id)
            # convert to a dict keyed by card_id
            cardskill_dict = defaultdict(list)
            for skill in users_skill:
                cardskill_dict[skill.card_id].append(skill)
            #Save to cache
            cls._USER_CACHE[user_id] = cardskill_dict
        #Rewritten from linear search to hash search
        return cls._USER_CACHE[user_id].get(card_id)
    @classmethod
    def get_all_by_user(cls, user_id):
        #Acquire all skills related to User from DB
        return list(cls.objects.filter(user_id=user_id))
Execution result
>>>sample1_hash.py
func:'main' args:[(u'A0001',), {}] took: 0.3840 sec
Before the improvement, the 3000-element list was fully scanned for each of the 1000 cards, so `if skill.card_id == card_id:` was evaluated 3 million times. Replacing that with a hash search eliminates it, so even after deducting the cost of building the dict, execution speed improves.
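To see the difference in isolation, here is a small sketch (all names are invented for illustration) comparing a linear scan over a 3000-element skill list with a dict lookup built once in advance:

```python
from collections import defaultdict
import timeit

# 3000 (card_id, skill_name) pairs: 3 skills for each of 1000 cards
skills = [(card_id, "skill") for card_id in range(1000) for _ in range(3)]

def linear_search(card_id):
    # O(n): full-scans the list on every call
    return [name for cid, name in skills if cid == card_id]

# build the index once; every subsequent lookup is O(1) on average
index = defaultdict(list)
for cid, name in skills:
    index[cid].append(name)

def hash_search(card_id):
    return index[card_id]

print(timeit.timeit(lambda: linear_search(999), number=1000))
print(timeit.timeit(lambda: hash_search(999), number=1000))
```

The one-time cost of building the dict is quickly amortized when the same list would otherwise be scanned a thousand times.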
For easy memoization, cached_property is the usual answer. It saves the return value in the instance cache, keyed by self.func.__name__ ("skills" in the sample implementation). From the second access on, the value is returned from the cache, which improves execution speed. The implementation is only a few lines, so reading the code may be quicker: cached_property.py#L12
cached_property.py
from cached_property import cached_property
class Card(object):
    @cached_property
    def skills(self):
        return UserCardSkill.get_by_card(self.user_id, self.id)
@timeit
def main(user_id):
    cards = Card.get_by_user(user_id)
    for x in xrange(10):
        cards[0].skills
Execution result
# before applying cached_property
>>>python ./cached_property.py 
func:'main' args:[(u'A0001',), {}] took: 0.1443 sec
# after applying cached_property
>>> python ./sample1_cp.py 
func:'main' args:[(u'A0001',), {}] took: 0.0451 sec
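To make the mechanism concrete, here is a minimal sketch of a cached_property-style descriptor (not the actual library code, and the Card class here is a toy): the first access computes the value and stores it in the instance's __dict__ under the function name, so later accesses find the instance attribute before the descriptor is consulted.

```python
class my_cached_property(object):
    # minimal sketch of a cached_property-style decorator
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            return self
        # compute once, then cache under the function's name;
        # subsequent lookups hit obj.__dict__ directly
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value

class Card(object):
    calls = 0

    @my_cached_property
    def skills(self):
        Card.calls += 1  # count how often the body actually runs
        return ["skill:1", "skill:2", "skill:3"]

card = Card()
card.skills
card.skills  # served from the instance cache; the body does not run again
print(Card.calls)  # -> 1
```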
This section assumes the web server is running on Apache with WSGI.
Thread-local storage (TLS) is a means of giving each thread in a multithreaded process its own place to store data. If you run a web server on Apache with WSGI and set MaxRequestsPerChild to a value of 1 or more in the config, each child process terminates after serving MaxRequestsPerChild requests. If your program uses thread-local storage, you can keep a cache per child process for that lifetime. Storing data common to all users, such as master data, in TLS can yield a significant speedup.
I wrote a program that computes the prime numbers below an integer in the range 500000-500010. By recording the results in TLS, the second and subsequent computations for the same input are skipped.
tls.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import random
import threading
import time
threadLocal = threading.local()
def timeit(f):
    def timed(*args, **kw):
        # http://stackoverflow.com/questions/1622943/timeit-versus-timing-decorator
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print 'func:%r args:[%r, %r] took: %2.4f sec' % (f.__name__, args, kw, te-ts)
        return result
    return timed
@timeit
def worker():
    initialized = getattr(threadLocal, 'initialized', None)
    if initialized is None:
        print "init start"
        #TLS initialization
        threadLocal.initialized = True
        threadLocal.count = 0
        threadLocal.prime = {}
        return []
    else:
        print "loop:{}".format(threadLocal.count)
        threadLocal.count += 1
        return get_prime(random.randint(500000, 500010))
def get_prime(N):
    """
    Return a list of the prime numbers below N
    :param N: int
    :rtype : list of int
    """
    #If there is data in TLS, return it from the cache
    if N in threadLocal.prime:
        return threadLocal.prime[N]
    #Calculate prime numbers
    table = list(range(N))
    for i in range(2, int(N ** 0.5) + 1):
        if table[i]:
            for mult in range(i ** 2, N, i):
                table[mult] = False
    result = [p for p in table if p][1:]
    #Record results in TLS
    threadLocal.prime[N] = result
    return result
for x in xrange(100):
    worker()
Execution result
>>>python tls.py 
init start
func:'worker' args:[(), {}] took: 0.0000 sec
loop:0
func:'worker' args:[(), {}] took: 0.1715 sec
loop:1
func:'worker' args:[(), {}] took: 0.1862 sec
loop:2
func:'worker' args:[(), {}] took: 0.0000 sec
loop:3
func:'worker' args:[(), {}] took: 0.2403 sec
loop:4
func:'worker' args:[(), {}] took: 0.2669 sec
loop:5
func:'worker' args:[(), {}] took: 0.0001 sec
loop:6
func:'worker' args:[(), {}] took: 0.3130 sec
loop:7
func:'worker' args:[(), {}] took: 0.3456 sec
loop:8
func:'worker' args:[(), {}] took: 0.3224 sec
loop:9
func:'worker' args:[(), {}] took: 0.3208 sec
loop:10
func:'worker' args:[(), {}] took: 0.3196 sec
loop:11
func:'worker' args:[(), {}] took: 0.3282 sec
loop:12
func:'worker' args:[(), {}] took: 0.3257 sec
loop:13
func:'worker' args:[(), {}] took: 0.0000 sec
loop:14
func:'worker' args:[(), {}] took: 0.0000 sec
loop:15
func:'worker' args:[(), {}] took: 0.0000 sec
...
The cache stored in thread-local storage (TLS) is kept per Apache child process and lives until the child process terminates.
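The per-thread isolation itself can be sketched like this (a toy example, unrelated to Apache): each thread sees only its own attributes on the threading.local object, so one thread's cache never leaks into another's.

```python
import threading

local = threading.local()
results = {}

def worker(name):
    # each thread writes to its own, independent 'cache' attribute
    local.cache = {"owner": name}
    results[name] = local.cache["owner"]

threads = [threading.Thread(target=worker, args=("t%d" % i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # every thread saw only the data it stored itself
```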
Proper use of caching improves a program's execution speed. However, be aware that caches come with their own class of bugs, known as side effects. Here are some I have seen, or caused, in the past.
■ Display bug where a new value is not reflected even after an update This bug occurs when you use a cache without being aware of its life-cycle design.
■ Bug where data disappears This one is deadly. In a program that 1) reads a value and then 2) adds to it and writes it back, step 1 keeps returning the stale cached value instead of the updated one. The updates run as 1234 + 100, then 1234 + 200, then 1234 + 50, and the earlier additions vanish.
■ How to prevent side effects
Wrap cache handling in a well-tested package, as the cached_property decorator does, and anyone can use it safely. That lets you work with the cache without knowing the internals, but if possible it is better to understand the reasoning behind cache life cycles.
Note: line_profiler was first released in 2008.