Match the distribution of each group in Python

Background and purpose

When building a machine learning model on tabular data, I wanted to remove the difference in one variable's distribution between classes, so I implemented a process to do that.

Method

Undersampling is performed for each attribute value within each specified section (bin). Here, using the seaborn titanic dataset, the number of people in each 10-year age band is matched between the 'yes' and 'no' groups of the alive column.
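As a rough illustration of the idea, independent of pandas, the sketch below buckets (value, group) pairs into sections and keeps the same number of records per group in each section. The helper name undersample_by_section and the toy records are hypothetical and are not the function implemented later in this article.

```python
import random
from collections import defaultdict

def undersample_by_section(records, width, seed=42):
    """records: list of (value, group) pairs; width: section width.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    all_groups = {group for _, group in records}

    # Bucket the records by section index, then by group.
    sections = defaultdict(lambda: defaultdict(list))
    for value, group in records:
        sections[int(value // width)][group].append((value, group))

    balanced = []
    for index in sorted(sections):
        by_group = sections[index]
        # A section that is missing any group cannot be balanced, so it is dropped.
        if set(by_group) != all_groups:
            continue
        # Keep as many records per group as the smallest group has.
        n = min(len(rows) for rows in by_group.values())
        for group in sorted(by_group):
            balanced.extend(rng.sample(by_group[group], n))
    return balanced

records = [(3, "yes"), (5, "no"), (8, "no"),      # section 0: 1 yes, 2 no
           (12, "yes"), (15, "yes"), (18, "no"),  # section 1: 2 yes, 1 no
           (25, "yes")]                           # section 2: yes only
result = undersample_by_section(records, width=10)
print(len(result))  # 4: one 'yes' and one 'no' from each of sections 0 and 1
```

The section that contains only one group is discarded entirely, because balancing it would leave zero records for every group anyway.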

Environment

OS: Windows 10, conda: 4.8.3, Python: 3.8.3, pandas: 0.25.3, matplotlib: 3.3.1, seaborn: 0.11.0

Implementation

Data acquisition

Get the data as follows.

load_dataset.py


import seaborn as sns
data = sns.load_dataset("titanic")

Data confirmation before correction

For the loaded data, the age distribution in 10-year bands by alive (survival status) is as follows.

display_graph.py


import matplotlib.pyplot as plt
sns.set_style(style='darkgrid')
fig, ax = plt.subplots(1,1, figsize=(4,4))
ax.set_xticks(range(0,100,10))
ax.set_ylim(0,150)
ax.set_ylabel('the number of people')
sns.distplot(data['age'][data['alive']=='yes'], kde=False, rug=False, bins=range(0,100,10),
             label='alive', ax=ax)
sns.distplot(data['age'][data['alive']=='no'], kde=False, rug=False, bins=range(0,100,10),
             label='dead', ax=ax)
ax.legend()
plt.show()

[Figure: age histogram by survival status, before correction]
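The counts behind a histogram like this can also be read as a table. The sketch below is one possible way to do that with pd.cut and pd.crosstab; the toy DataFrame is hypothetical data standing in for the titanic 'age' and 'alive' columns, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic columns 'age' and 'alive'.
toy = pd.DataFrame({
    "age":   [4, 9.5, 14.5, 17, 22, 25, 28, 31],
    "alive": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
})

# right=False makes each band half-open: [0, 10), [10, 20), [20, 30), [30, 40).
bands = pd.cut(toy["age"], bins=list(range(0, 50, 10)), right=False)

# Rows are age bands, columns are the 'alive' values, cells are row counts.
table = pd.crosstab(bands, toy["alive"])
print(table)
```

Bands where the 'yes' and 'no' columns differ are the ones that undersampling will shrink.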

Age distribution matching

I created the following function for data matching.

import pandas as pd

def adjust_number(data, target_column, attribute, period):
    '''
    Undersample so that every attribute value has the same number of rows in each section.

    data         : DataFrame to be adjusted
    target_column: Name of the numeric column whose distribution is matched
    attribute    : Name of the column holding the groups to balance
    period       : Section width
    '''

    ## Initial section setting
    # The lower limit starts from 0 if the minimum value is 0 or more,
    # and from the minimum value if it is less than 0.
    lower = 0 if data[target_column].min() >= 0 else data[target_column].min()

    # Each section is the half-open interval [lower, upper), so non-integer
    # values such as 14.5 always fall into exactly one section.
    upper = lower + period

    samples = []  # Adjusted data of each section
    maximum = data[target_column].max()  # Maximum value of the column to be adjusted
    groups = set(data[attribute])  # Attribute values that must all be present in a section

    # Repeat until the lower limit exceeds the maximum value
    while lower <= maximum:
        # Extract data for the target section
        data_in_range = data[(lower <= data[target_column]) & (data[target_column] < upper)]

        # Skip the section if any attribute value has no data in it
        # (with undersampling, a count of 0 for one attribute would make all of them 0)
        if len(data_in_range) > 0 and groups == set(data_in_range[attribute]):
            # Number of rows for each attribute value in the target section
            counts = data_in_range[attribute].value_counts()

            # Undersample each attribute value down to the smallest count
            for att in counts.index:
                sample = data_in_range[data_in_range[attribute] == att].sample(n=counts.min(), random_state=42)
                samples.append(sample)

        # To the next section
        lower += period
        upper += period

    # Concatenate the adjusted data of all sections
    return pd.concat(samples, axis=0, ignore_index=True) if samples else pd.DataFrame()

The age distribution in 10-year bands after processing with this function (after matching the ages) is as follows. The chart from before the correction is shown again for comparison.

data_adjusted = adjust_number(data, target_column='age', attribute='alive', period=10)

fig, ax = plt.subplots(1,1, figsize=(4,4))
ax.set_xticks(range(0,100,10))
ax.set_ylim(0,150)
ax.set_ylabel('the number of people')
sns.distplot(data_adjusted['age'][data_adjusted['alive']=='yes'], kde=False, rug=False, bins=range(0,100,10),
             label='alive', ax=ax)
sns.distplot(data_adjusted['age'][data_adjusted['alive']=='no'], kde=False, rug=False, bins=range(0,100,10),
             label='dead', ax=ax)
ax.legend()
plt.show()

[Figures: age histograms by survival status, after correction and before correction]

The correction worked as intended.
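Besides comparing the charts by eye, the result can be checked numerically. The sketch below is a hypothetical helper, counts_match, that reports whether every section holds the same number of rows for each attribute value. It assumes sections of the form [k*period, (k+1)*period), which lines up with the sections used here when the minimum value is 0 or more; the toy DataFrames are made up for the example.

```python
import pandas as pd

def counts_match(df, target_column, attribute, period):
    """Return True if every section holds the same number of rows per attribute value.

    Hypothetical helper; sections are taken as [k*period, (k+1)*period).
    """
    rows = df.dropna(subset=[target_column, attribute])
    # Integer index of the section each row falls into.
    section = (rows[target_column] // period).astype(int).rename("section")
    # One row per section, one column per attribute value, cells are row counts.
    counts = rows.groupby([section, rows[attribute]]).size().unstack(fill_value=0)
    # A section is balanced when all of its counts are identical.
    return bool((counts.nunique(axis=1) == 1).all())

balanced = pd.DataFrame({
    "age":   [3, 7, 12, 15, 16, 18],
    "alive": ["yes", "no", "yes", "yes", "no", "no"],
})
unbalanced = pd.DataFrame({
    "age":   [3, 7, 8],
    "alive": ["yes", "no", "no"],
})
print(counts_match(balanced, "age", "alive", 10))    # True
print(counts_match(unbalanced, "age", "alive", 10))  # False
```

A section that contains only one of the groups also counts as unbalanced, since the missing group is filled in with a count of 0.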

Application 1 (change of section width)

Changing the section width from 10 to 5 (matching the number of people in every 5-year band) also worked without problems.

data_adjusted = adjust_number(data, target_column='age', attribute='alive', period=5)

[Figures: age histograms with 5-year sections, after correction and before correction]

Application 2 (change of adjustment target)

It also works when the adjustment target is changed to fare.

data_adjusted = adjust_number(data, target_column='fare', attribute='alive', period=30)

[Figures: fare histograms by survival status, after correction and before correction]
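With non-integer values such as fares, how the section edges are defined decides whether every row lands in some section. The sketch below compares closed sections of the form [lower, lower+period-1] with half-open sections [lower, lower+period) on a few hypothetical fares; the helper covered is made up for this comparison.

```python
import pandas as pd

# Hypothetical fares, including values that fall between whole numbers.
fares = pd.Series([7.25, 29.5, 29.7, 30.0, 59.9, 60.0])
period = 30

def covered(values, period, half_open):
    """Count the values that land in some section, with sections starting from 0."""
    total = 0
    lower = 0
    while lower <= values.max():
        if half_open:
            in_section = (lower <= values) & (values < lower + period)
        else:
            in_section = (lower <= values) & (values <= lower + period - 1)
        total += int(in_section.sum())
        lower += period
    return total

print(covered(fares, period, half_open=False))  # 3: 29.5, 29.7 and 59.9 fall in the gaps
print(covered(fares, period, half_open=True))   # 6: every fare lands in one section
```

Closed sections with a "-1" upper edge are safe for integer-valued columns, while half-open sections also cover values such as 29.5 that sit between one section's upper edge and the next section's lower edge.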

Application 3 (change target attribute)

It also works when the target attribute is changed to sex.

data_adjusted = adjust_number(data, target_column='age', attribute='sex', period=10)

[Figures: age histograms by sex, after correction and before correction]

That's it. Thank you for visiting.
