Data analysis for improving POG 1 ~ Web scraping with Python ~

Introduction

Horse racing has a game called POG (Paper Owner Game), in which players act as fictitious owners and compete over how well their chosen horses perform. Success is usually measured by the prize money a horse earns between its debut and the Derby.

I have been playing POG for about five years. Until now I have managed to pick reasonably promising horses by relying on Aomoto, which is released around April every year.

However, the results have been poor: as of the end of this year, only 2 of my 10 horses have won a race.

To get out of this rut, I decided to take data analysis seriously. That said, I do not have the skills to make free use of the machine learning that is popular these days. The immediate goal here is therefore to find a causal relationship between basic horse information (stable, producer, pedigree) and the prize money earned during the POG period.

The important quantity here is the "prize money won during the POG period". As far as I know, no website publishes this figure. Some institutions may offer it for a fee, but I do not want to pay while it is unclear whether it is worth the cost.

So I decided to fetch the basic horse information and the race history from netkeiba, and to calculate the prize money for the POG period from the race history myself.

The language used is Python, simply because I am used to it.
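
The calculation itself is small enough to sketch on its own. The following is a minimal illustration, assuming the same deadlines as the script later in this article (January 1 and July 1 of the birth year plus three). The helper names `parse_prize` and `pog_prizes` and the sample run history are made up for illustration and do not appear in the original script.

```python
import datetime


def parse_prize(text):
    """Turn a prize cell such as '1,234.5' into a float; blank cells count as 0."""
    try:
        return float(text.replace(',', ''))
    except ValueError:
        return 0.0


def pog_prizes(birth_year, races):
    """Sum prize money up to the half-period and full-period deadlines.

    races: iterable of (date string 'YYYY/MM/DD', prize string) pairs.
    Deadlines are January 1 and July 1 of birth_year + 3.
    """
    deadline_half = datetime.datetime(birth_year + 3, 1, 1)
    deadline_all = datetime.datetime(birth_year + 3, 7, 1)

    prize_half = 0.0
    prize_all = 0.0
    for date_text, prize_text in races:
        r_date = datetime.datetime.strptime(date_text, '%Y/%m/%d')
        prize = parse_prize(prize_text)
        if r_date < deadline_all:
            prize_all += prize
        if r_date < deadline_half:
            prize_half += prize
    return prize_half, prize_all


# Made-up run history for a horse born in 2010 (illustrative values only).
sample_races = [
    ('2012/10/06', '700.0'),    # before both deadlines: counts for both totals
    ('2013/04/14', '1,500.0'),  # after January 1, before July 1: full period only
    ('2013/09/22', '5,200.0'),  # after both deadlines: ignored
    ('2013/03/02', ''),         # empty prize cell: treated as 0
]
half, full = pog_prizes(2010, sample_races)
```

Here `half` covers only the first race and `full` covers the first two races with a prize.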

Source code

The script collects data on horses born in the four years from 2010 to 2013. The four years are fetched at the same time by running four worker processes in parallel, one per year. ~~It depends on the machine specs, but my MBA finished collecting the data in about an hour.~~
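
The script addresses each horse page by an ID made of the birth year followed by a six-digit index, so the candidate URLs for one year can be enumerated directly. A minimal sketch of that scheme follows; `horse_id` and `horse_urls` are hypothetical helper names, and whether a page actually exists for a given ID is not guaranteed.

```python
BASE_URL = 'http://db.netkeiba.com/horse/'  # same base URL as the script


def horse_id(year, idx):
    """Build a page ID: birth year followed by a six-digit, zero-padded index."""
    return str(year) + str(idx).zfill(6)


def horse_urls(year, idx_from, idx_to):
    """Yield candidate profile URLs for one birth year (idx_to is inclusive)."""
    for idx in range(idx_from, idx_to + 1):
        yield BASE_URL + horse_id(year, idx)


first_three = list(horse_urls(2010, 100000, 100002))
```

Because every ID starts with the birth year, splitting the work by year gives each worker process a disjoint set of pages.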

MakeUmaDB_151229.py


#!/usr/bin/env python
# encoding: utf-8

import urllib2 as ul
import pandas as pd
import os
import time
import datetime
from lxml import html
import multiprocessing as mp

__PROC__ = 4

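def MakeDir(dname):
    # Create the output directory. Several worker processes call this at
    # the same time, so tolerate another worker creating it first.
    try:
        os.mkdir(dname)
        print 'Make directory:%s' % dname
    except OSError:
        if not os.path.isdir(dname):
            raise
        print '%s already exists' % dname

    return 0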

def subMakeHorseDB(year):
    # Set Directory
    o_dname = 'horse_db'
    MakeDir(o_dname)

    # horse_prof
    prof_keys = [
        u'Horse name',
        u'Birthday',
        u'Trainer',
        u'Horse owner',
        u'Producer',
        u'Origin',
        u'Auction transaction price',
        u'father',
        u'mother',
        u'Mother father',
        u'POG period prize_half period',
        u'POG period prize_Year-round'
    ]

    # get Uma data from web site
    base_url = 'http://db.netkeiba.com/horse/'
    idx_from = 100000
    idx_to = 111000

    masta_d = {}
    for idx in range(idx_from, idx_to + 1):
        try:
            # get html from web
            time.sleep(10)
            s_idx = str(year)+str(idx).zfill(6)
            url = base_url + s_idx
            src_html = ul.urlopen(url).read()# get html from url
            root = html.fromstring(src_html)

            # show progress
            print 'idx: %s, (%d, %d/%d)' % (s_idx, year, idx, idx_to)

            # not found db
            if root.xpath('//title')[0].text.startswith(u'|'):
                #print 'DB not found'
                continue

            # html parse
            masta_d[s_idx] = {}

            for prof in prof_keys:
                if prof == u'Horse name':
                    horse_name = root.xpath('//div[@class="horse_title"]')[0].text_content().split('\n')[1]
                    masta_d[s_idx][prof] = horse_name

                elif prof == u'father':
                    masta_d[s_idx][prof] = root.xpath('//td[@rowspan="2"][@class="b_ml"]')[0].text_content().split('\n')[1]

                elif prof == u'mother':
                    masta_d[s_idx][prof] = root.xpath('//td[@rowspan="2"][@class="b_fml"]')[0].text_content().split('\n')[1]
                elif prof == u'Mother father':
                    masta_d[s_idx][prof] = root.xpath('//td[@class="b_ml"]')[2].text_content().split('\n')[1]
                elif prof == u'POG period prize_half period' or prof == u'POG period prize_Year-round':
                    continue
                elif prof == u'Birthday':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[0].text_content()
                elif prof == u'Trainer':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[1].text_content()
                elif prof == u'Horse owner':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[2].text_content()
                elif prof == u'Producer':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[3].text_content()
                elif prof == u'Origin':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[4].text_content()
                elif prof == u'Auction transaction price':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[5].text_content()

            # calc POG prize
            prize_all = 0.0
            prize_half = 0.0
            deadline_all = datetime.datetime.strptime('%d-07-01'%(year+3), '%Y-%m-%d')
            deadline_half = datetime.datetime.strptime('%d-01-01'%(year+3), '%Y-%m-%d')

            r_hist = root.xpath('//table[@class="db_h_race_results nk_tb_common"]')
            if len(r_hist) == 0:
                masta_d[s_idx][u'POG period prize_half period'] = '%d' % prize_half
                masta_d[s_idx][u'POG period prize_Year-round'] = '%d' % prize_all
            else:
                r_hist_l = root.xpath('//table[@class="db_h_race_results nk_tb_common"]/tbody/tr')
                for race in r_hist_l:
                    r_date = datetime.datetime.strptime(race.text_content().split('\n')[1],'%Y/%m/%d')
                    try:
                        prize = float(race.text_content().split('\n')[-2].replace(',',''))
                    except (ValueError, IndexError):
                        prize = 0.0

                    if r_date < deadline_all:
                        prize_all += prize
                    if r_date < deadline_half:
                        prize_half += prize

                masta_d[s_idx][u'POG period prize_half period'] = '%.2f' % prize_half
                masta_d[s_idx][u'POG period prize_Year-round'] = '%.2f' % prize_all
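        except Exception as e:
            # skip this horse but report why, instead of failing silently
            print 'skip idx %d (%d): %s' % (idx, year, e)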

    # make data frame: one row per horse, columns in prof_keys order
    df = pd.DataFrame(masta_d).T
    o_df = df.reindex(columns=prof_keys)
    o_df.index.name = 'Index'

    o_fname = 'horse_prof_%d.csv' % year
    o_fpath = os.path.join(o_dname, o_fname)
    o_df.to_csv(o_fpath, encoding='utf-8')

def main():

    year_l = [2010, 2011, 2012, 2013]
    pool = mp.Pool(__PROC__)
    pool.map(subMakeHorseDB, year_l)  # one worker process per birth year
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
    raw_input('Press Enter to Exit\n')
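
The script sleeps 10 seconds before every request and tries indices 100000 through 111000 for each birth year, so a lower bound on the run time follows from simple arithmetic. The sketch below only counts the sleep time; the function name `min_runtime_hours` is made up for illustration, and network and parsing time would come on top.

```python
SLEEP_SECONDS = 10                 # time.sleep(10) before each request
IDX_FROM, IDX_TO = 100000, 111000  # index range tried per birth year


def min_runtime_hours(idx_from, idx_to, sleep_seconds):
    """Lower bound on run time per year: sleep time only, network time excluded."""
    n_ids = idx_to - idx_from + 1  # the range is inclusive on both ends
    return n_ids * sleep_seconds / 3600.0


hours_per_year = min_runtime_hours(IDX_FROM, IDX_TO, SLEEP_SECONDS)
```

With these values each worker spends at least about 30.6 hours sleeping. Since the four years run in parallel, the whole collection takes roughly that long rather than four times as long.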

Next steps

Next, I want to dig into the generated CSV files and look for patterns that lead to winning at POG.
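
As a first look at such a file, one could average the full-period prize money per trainer, producer, or sire. The sketch below uses only the standard library and a few made-up rows that carry a subset of the CSV columns; the function name `mean_prize_by` and all sample values are hypothetical, and this is only one possible starting point, not the analysis itself.

```python
import csv
import io
from collections import defaultdict


def mean_prize_by(rows, key, prize_col='POG period prize_Year-round'):
    """Average the full-period prize for each distinct value of `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[prize_col])
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


# Made-up rows with a subset of the columns written by the script (not real data).
sample_csv = u"""Index,Horse name,Trainer,POG period prize_Year-round
2010100001,Horse A,Trainer X,2200.00
2010100002,Horse B,Trainer X,0.00
2010100003,Horse C,Trainer Y,700.00
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
by_trainer = mean_prize_by(rows, 'Trainer')
```

For a real file, `rows` would come from opening one of the `horse_prof_<year>.csv` files in the `horse_db` directory instead of the in-memory sample.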
