Introduction

Horse racing has a game called POG (Paper Owner Game). Players act as fictitious owners and compete on how well the horses they picked perform. In general, the prize money a horse earns from its debut through the Derby is used as the measure of performance.

I have been playing POG for about 5 years now. Until now, I have managed to pick reasonable-looking horses by relying on Aomoto (the "blue book"), which is released around April every year.

However, the results have been poor: as of the end of this year, only 2 of my 10 horses have won a race.

To get out of this rut, I decided to take data analysis seriously. However, I do not have the skills to freely use the machine learning that is popular these days. The immediate goal here is therefore to find a causal relationship between basic horse information (stable, producer, pedigree) and the prize money earned during the POG period.

The key quantity here is the "prize money won during the POG period". As far as I know, no website publishes this figure. Some organizations may offer it for a fee, but I do not want to pay while it is unclear whether it is worth the cost.

Therefore, I decided to fetch the basic horse information and the race history from netkeiba, and to calculate the prize money for the POG period from the race history.
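As a rough illustration of that calculation, the sketch below sums prize money up to two cut-off dates. It is written in Python 3 syntax and mirrors the cut-offs used by the script in this article (1 January and 1 July of the horse's three-year-old year). The function name `pog_prize` and the race history are made up for the example.

```python
from datetime import date


def pog_prize(races, birth_year):
    """Sum prize money earned before the half-period and full-period deadlines.

    races: iterable of (race_date, prize) pairs, one per start.
    birth_year: the horse's year of birth.
    """
    races = list(races)  # allow generators, since the data is scanned twice

    # Cut-offs: 1 January and 1 July of the three-year-old year.
    deadline_half = date(birth_year + 3, 1, 1)
    deadline_full = date(birth_year + 3, 7, 1)

    prize_half = sum(p for d, p in races if d < deadline_half)
    prize_full = sum(p for d, p in races if d < deadline_full)
    return prize_half, prize_full


# Hypothetical race history for a horse born in 2012 (figures are invented).
history = [
    (date(2014, 11, 2), 700.0),
    (date(2015, 4, 19), 9700.0),
    (date(2015, 10, 25), 11200.0),  # after the full-period deadline
]
print(pog_prize(history, 2012))
```

Races on or after a deadline are excluded because the comparison is strict, which matches the `<` comparisons in the script.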

The language used is Python, simply because it is what I am used to.

Source code

The script collects data on horses born in the four years from 2010 to 2013. The four years are fetched at the same time, using one process per year on four cores. ~~It depends on the machine specs, but my MacBook Air finished collecting the data in about an hour.~~

Edited on 2015/12/30: added a 10-second wait before each request to the server. This keeps the access rate down to roughly what a person browsing the site by hand would produce, so it should not amount to excessive access.
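The idea of a fixed pause before each request can be sketched on its own as follows (Python 3 syntax). The names `fetch_all`, `fetch`, and `sleep` are hypothetical and not part of the script; passing the fetch and sleep functions in as arguments is only a convenience so the loop can be tried without touching the network.

```python
import time


def fetch_all(urls, fetch, wait_seconds=10, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing wait_seconds before every request."""
    pages = {}
    for url in urls:
        sleep(wait_seconds)  # fixed pause so requests are spaced out
        pages[url] = fetch(url)
    return pages


# Dry run with stand-ins: no network access and no real waiting.
waits = []
pages = fetch_all(
    ['page-a', 'page-b'],
    fetch=lambda url: '<html>%s</html>' % url,
    sleep=waits.append,
)
print(waits)
```

In real use, `fetch` would be whatever function downloads a page and `sleep` would be left at its default.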

`MakeUmaDB_151229.py`


#!/usr/bin/env python
# encoding: utf-8

import urllib2 as ul
import pandas as pd
import os
import time
import datetime
from lxml import html
import multiprocessing as mp

__PROC__ = 4

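def MakeDir(dname):
    # Create the output directory. The four worker processes call this at
    # almost the same time, so tolerate another worker creating it first.
    try:
        os.mkdir(dname)
        print 'Make directory:%s' % dname
    except OSError:
        if not os.path.isdir(dname):
            raise
        print '%s already exists' % dname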

def subMakeHorseDB(year):
    # Set Directory
    o_dname = 'horse_db'
    MakeDir(o_dname)

    # horse_prof
    prof_keys = [
        u'Horse name',
        u'Birthday',
        u'Trainer',
        u'Horse owner',
        u'Producer',
        u'Origin',
        u'Auction transaction price',
        u'father',
        u'mother',
        u'Mother father',
        u'POG period prize_half period',
        u'POG period prize_Year-round'
    ]

    # get Uma data from web site
    base_url = 'http://db.netkeiba.com/horse/'
    idx_from = 100000
    idx_to = 111000

    masta_d = {}
    for idx in range(idx_from, idx_to + 1):
        try:
            # get html from web
            time.sleep(10)
            s_idx = str(year)+str(idx).zfill(6)
            url = base_url + s_idx
            src_html = ul.urlopen(url).read()# get html from url
            root = html.fromstring(src_html)

            # show progress
            print 'idx: %s, (%d, %d/%d)' % (s_idx, year, idx, idx_to)

            # not found db
            if root.xpath('//title')[0].text.startswith(u'｜'):
                #print 'DB not found'
                continue

            # html parse
            masta_d[s_idx] = {}

            for prof in prof_keys:
                if prof == u'Horse name':
                    horse_name = root.xpath('//div[@class="horse_title"]')[0].text_content().split('\n')[1]
                    masta_d[s_idx][prof] = horse_name

                elif prof == u'father':
                    masta_d[s_idx][prof] = root.xpath('//td[@rowspan="2"][@class="b_ml"]')[0].text_content().split('\n')[1]

                elif prof == u'mother':
                    masta_d[s_idx][prof] = root.xpath('//td[@rowspan="2"][@class="b_fml"]')[0].text_content().split('\n')[1]
                elif prof == u'Mother father':
                    masta_d[s_idx][prof] = root.xpath('//td[@class="b_ml"]')[2].text_content().split('\n')[1]
                elif prof == u'POG period prize_half period' or prof == u'POG period prize_Year-round':
                    continue
                elif prof == u'Birthday':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[0].text_content()
                elif prof == u'Trainer':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[1].text_content()
                elif prof == u'Horse owner':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[2].text_content()
                elif prof == u'Producer':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[3].text_content()
                elif prof == u'Origin':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[4].text_content()
                elif prof == u'Auction transaction price':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[5].text_content()

            # calc POG prize: the full period ends on 1 July and the half
            # period on 1 January of the horse's three-year-old year
            prize_all = 0.0
            prize_half = 0.0
            deadline_all = datetime.datetime(year + 3, 7, 1)
            deadline_half = datetime.datetime(year + 3, 1, 1)

            r_hist = root.xpath('//table[@class="db_h_race_results nk_tb_common"]')
            if len(r_hist) == 0:
                masta_d[s_idx][u'POG period prize_half period'] = '%d' % prize_half
                masta_d[s_idx][u'POG period prize_Year-round'] = '%d' % prize_all
            else:
                r_hist_l = root.xpath('//table[@class="db_h_race_results nk_tb_common"]/tbody/tr')
                for race in r_hist_l:
                    r_date = datetime.datetime.strptime(race.text_content().split('\n')[1],'%Y/%m/%d')
                    try:
                        # prize cell, e.g. '1,234.5'; blank when nothing was earned
                        prize = float(race.text_content().split('\n')[-2].replace(',',''))
                    except (ValueError, IndexError):
                        prize = 0.0

                    if r_date < deadline_all:
                        prize_all += prize
                    if r_date < deadline_half:
                        prize_half += prize

                masta_d[s_idx][u'POG period prize_half period'] = '%.2f' % prize_half
                masta_d[s_idx][u'POG period prize_Year-round'] = '%.2f' % prize_all
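        except Exception as e:
            # skip pages that fail to download or parse, but report them
            print 'skip idx %d (%d): %s' % (idx, year, e)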

    # make data frame (one row per horse) with columns in prof_keys order;
    # reindex also tolerates a column that no record filled in
    df = pd.DataFrame(masta_d).T
    o_df = df.reindex(columns=prof_keys)
    o_df.index.name = 'Index'

    o_fname = 'horse_prof_%d.csv' % year
    o_fpath = os.path.join(o_dname, o_fname)
    o_df.to_csv(o_fpath, encoding='utf-8')

def main():
    # one worker process per birth year
    year_l = [2010, 2011, 2012, 2013]
    pool = mp.Pool(__PROC__)
    pool.map(subMakeHorseDB, year_l)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
    raw_input('Press Enter to Exit\n')
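
Two small steps of the loop in the script are easy to try in isolation: building the page URL from the birth year and a zero-padded serial number, and turning a prize cell into a number. The helper names `horse_url` and `parse_prize` below are hypothetical (the script does both inline), and the sketch uses Python 3 syntax.

```python
BASE_URL = 'http://db.netkeiba.com/horse/'


def horse_url(year, idx):
    """Build a horse page URL from the birth year and a serial number."""
    # The ID is the year followed by the serial padded to six digits.
    return BASE_URL + str(year) + str(idx).zfill(6)


def parse_prize(cell):
    """Convert a prize cell such as '1,234.5' to a float; blank cells count as 0."""
    try:
        return float(cell.replace(',', ''))
    except ValueError:
        return 0.0


print(horse_url(2012, 100000))
print(parse_prize('1,234.5'))
print(parse_prize(''))
```

Treating an unparsable cell as zero follows the script's behaviour, where a race without prize money leaves the cell blank.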

Next steps

Next, I want to work through the generated CSV files and look for the patterns behind winning at POG.
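One possible first pass, shown here only as a sketch, is to average the full-period prize money per sire. The column names `father` and `POG period prize_Year-round` come from the script; the function name `mean_prize_by` and the sample rows are invented, and the example uses Python 3 with the standard `csv` module rather than pandas so that it runs without extra packages.

```python
import csv
import io
from collections import defaultdict


def mean_prize_by(rows, key, prize_col='POG period prize_Year-round'):
    """Average the full-period prize money per value of the column `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[prize_col])
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


# Invented sample rows in the same column layout as the generated CSV.
sample = io.StringIO(
    "Index,father,POG period prize_Year-round\n"
    "2012100001,Sire A,1000.00\n"
    "2012100002,Sire A,0\n"
    "2012100003,Sire B,800.00\n"
)
means = mean_prize_by(csv.DictReader(sample), 'father')
for sire, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(sire, mean)
```

The same grouping could be applied to the trainer or producer columns by changing the `key` argument.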

Data analysis for improving POG 1 ~ Web scraping with Python ~
