[Patent analysis] I tried to make a patent map with Python without spending money

Introduction

I had been meaning to try patent analysis, so I decided to give it a go. A typical output of patent analysis is a plot like the one below.

patentmap_search_bubblechart.gif (The figure is taken from the J-STORE manual.)

The vertical axis shows the patent applicant, the horizontal axis shows the year the patent was issued, and the size of each circle shows the number of patents. This is called a bubble chart. There are paid services that produce plots like this.

This time, I search for patent documents in the free patent database J-PlatPat and process the results in Python to make a plot like this myself.

What is J-PlatPat?

According to Wikipedia,

The Japan Platform for Patent Information (J-PlatPat) is a database operated by the National Center for Industrial Property Information and Training (INPIT) that lets you search and view gazettes and other documents on industrial property rights such as patents, utility models, designs and trademarks, free of charge.

... apparently. The important point here is that it costs nothing to use. In exchange, there are many restrictions, and you risk getting told off if you poke at it carelessly.

So this article is half about how to work with this platform and half about processing the data obtained from it.

Data collection

I'm interested in batteries, especially so-called rechargeable batteries, so I'd like to search for them.

However, a plain search for "battery" also returns solar cells and fuel cells, for example. That sort of thing happens with any ordinary search, too.

While cursing whoever translated "solar cell" and "fuel cell" with the word for "battery", enter the following under "Patent / Utility Model Search" => "Logical formula input" in J-PlatPat.

pat1.png

Here /TI stands for the title, so this searches for patents with 'secondary battery' or 'storage battery' in the title. Logical formulas let you specify many other conditions as well, which opens up a lot of possibilities. For details, see the J-PlatPat help.

This query matches 60459 patents. However, J-PlatPat displays at most 3000 search results, so we are stuck. The way out is to restrict the issue date and search again.

pat2.png

The issue date is specified in the search options, not in the logical formula. In this case, the patents issued during the one year from January 1, 2019 to January 1, 2020 are displayed.

This will display 1408 patents as shown below (as of 2/4/2020).

pat3.png

I would like to scrape the content of each patent from here, but that is prohibited (see Notes). Never do it. Let's settle for the "CSV output" button in the list instead. Press it.

"To output CSV, narrow down to 100 or less."

And so I died. This is the end of the article. Just kidding. Don't panic; press "Print List" instead.

pat4.png

Then a window titled "Print list | J-PlatPat [JPP]" pops up as shown above. This page contains all 1408 patents. Instead of printing, right-click and save it as an HTML file. That completes the acquisition of the 2019 patent data.

After that, shift the issue year and manually save the HTML of the patent list for 2018, 2017, and so on. This time I collected 15 years of data, from 2005 to 2019.

It feels like mindless work, but it is not much of a hassle and only takes about 10 minutes.
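To keep track of which date ranges still need to be searched by hand, the yearly windows can be listed with a few lines of Python. This is only a sketch: the function name `yearly_windows` is made up for illustration, and the January 1 to January 1 convention simply mirrors the 2019 example above. Nothing here touches J-PlatPat; it only prints a checklist.

```python
from datetime import date

def yearly_windows(first_year, last_year):
    """Return (start, end) issue-date pairs, one per year, newest first."""
    windows = []
    for year in range(last_year, first_year - 1, -1):
        # Same convention as the 2019 example: January 1 to January 1 of the next year
        windows.append((date(year, 1, 1), date(year + 1, 1, 1)))
    return windows

# Print a checklist of the 15 ranges to enter in the search options
for start, end in yearly_windows(2005, 2019):
    print(start.strftime("%Y/%m/%d"), "-", end.strftime("%Y/%m/%d"))
```

If a single year ever exceeded the 3000-result display limit, the same idea could be applied with shorter windows.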

HTML to CSV conversion

With Python's BeautifulSoup, converting the contents of an HTML table to CSV is easy.

Something like the following does it: read the HTML files from ./html and write the CSV files to ./csv.


from pathlib import Path
import re

import pandas as pd
from bs4 import BeautifulSoup

Path('./csv').mkdir(exist_ok=True)

for p in Path('./html').glob('*.html'):
    # utf-8-sig drops a leading BOM if the saved file has one
    with open(p, 'r', encoding='utf-8-sig') as f:
        soup = BeautifulSoup(f, 'html.parser')

    tr = soup.find_all('tr')
    # Header names; the first header cell is skipped so the count matches the <td> cells per row
    columns = [th.text.replace('\n', '') for th in tr[0].find_all('th')][1:]

    rows = []
    for row in tr[1:]:
        cells = [td.text for td in row.find_all('td')]
        # Cell 6 (FI) holds several codes on separate lines; join them with commas
        cells = [re.sub(r'\n+', ',', c) if n == 6 else c.replace('\n', '')
                 for n, c in enumerate(cells)]
        rows.append(cells)

    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    df = pd.DataFrame(rows, columns=columns)
    # p.stem is the file name without directory and extension, e.g. 'patent2005'
    df.to_csv(Path('./csv') / (p.stem + '.csv'), encoding='utf_8_sig', index=False)

This only processes HTML files that were saved locally by hand, so it never accesses J-PlatPat. It's safe.

Part of the resulting data frame is shown below (patent2005.csv). Fifteen CSVs like this are saved, one per year from 2005 to 2019. (None of this would be needed if J-PlatPat supported CSV output for the whole result list.)

| Reference number | application number | Filing date | Known date | Title of invention | applicant/Right holder | FI |
|---|---|---|---|---|---|---|
| Re-table 2005/124920 | Japanese Patent Application No. 2006-514740 | 2005/06/14 | 2005/12/29 | Lead-acid battery | Panasonic Corporation | ,C22C11/00,C22C11/02,C22C11/06,other, |
| Re-table 2005/124899 | Japanese Patent Application No. 2006-514659 | 2005/03/09 | 2005/12/29 | Secondary battery and its manufacturing method | Panasonic Corporation | ,H01M2/16@M,H01M4/02@B,H01M4/02@Z,other, |
| Re-table 2005/124898 | Japanese Patent Application No. 2006-514732 | 2005/06/13 | 2005/12/29 | Positive electrode active material powder for lithium secondary batteries | AGC Seimi Chemical Co., Ltd. | ,H01M4/02@C,H01M4/36@E,H01M4/36,other, |
| Re-table 2005/122318 | Japanese Patent Application No. 2006-514434 | 2005/05/18 | 2005/12/22 | Non-aqueous electrolyte and lithium secondary battery using it | Ube Industries, Ltd. | ,H01M4/02@C,H01M4/02@D,H01M4/36,other, |
| JP 2005-353584 | Japanese Patent Application No. 2005-140521 | 2005/05/13 | 2005/12/22 | Lithium-ion secondary battery and its manufacturing method | Panasonic Corporation, etc. | ,H01M2/16@P,H01M4/02,101,H01M4/02,108,other, |
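One thing to watch in the FI column: the last row contains entries such as H01M4/02,101, which themselves include a comma, so joining the lines of a cell with commas makes the individual codes hard to separate again later. A minimal sketch of an alternative is to split the raw cell text on line breaks and join with a character that does not occur in the codes. The raw cell text below is an assumption, reconstructed from the last row of the table, and `fi_codes` is a hypothetical helper name.

```python
def fi_codes(raw_cell):
    """Split the raw text of an FI cell into individual entries, one per line."""
    return [line.strip() for line in raw_cell.splitlines() if line.strip()]

# Assumed raw <td> text: one entry per line, as implied by the comma-joined output
raw_cell = "\nH01M2/16@P\nH01M4/02,101\nH01M4/02,108\nother\n"

codes = fi_codes(raw_cell)
print(codes)            # ['H01M2/16@P', 'H01M4/02,101', 'H01M4/02,108', 'other']
print(";".join(codes))  # H01M2/16@P;H01M4/02,101;H01M4/02,108;other
```

The bubble chart in this article does not use the FI column, so this only matters if the classification codes are analyzed later.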

Creating a bubble chart

When creating a bubble chart, you have to decide which applicants (companies) to include. Taking the top 10 applicants by number of patent applications in each year and merging the lists gives exactly 30 companies, so I use those.

Organize your data as follows:

import collections
import glob
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sort so that the years (taken from the file names) come out in order
path = sorted(glob.glob('./csv/*.csv'))

app_top10_dic = {} #TOP10 companies and number of applications for each year
app_total_dic = {} #All companies and number of applications in each year
app_all = [] #List of companies subject to bubble chart
for p in path:
    # 'patent2005.csv' -> '2005' (works with both / and \ path separators)
    name = Path(p).stem.replace('patent','')
    df = pd.read_csv(p)
    app_list = df['applicant/Right holder']
    # removesuffix (Python 3.9+) drops only a literal trailing 'other';
    # rstrip('other') would strip any trailing run of the letters o, t, h, e, r
    app_list = [i.removesuffix('other').replace(' ','').replace('\u3000',' ') for i in app_list]
    app_set = collections.Counter(app_list)
    app_top10 = app_set.most_common(10)
    app_all.extend([i[0] for i in app_top10])
    app_top10_dic[name] = app_top10
    
    app_total = app_set.most_common()
    app_total_dic[name] = app_total
    
# Union of the yearly top-10 lists, without duplicates
app_all = list(set(app_all))
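A note on stripping the trailing 'other' marker from applicant names: `str.rstrip` takes a set of characters, not a suffix, so it can eat into a name that happens to end in those letters. The sketch below shows the difference with made-up strings; `strip_marker` is a hypothetical helper, and the marker text is whatever actually appears in your CSV.

```python
def strip_marker(name, marker="other"):
    """Remove `marker` only when it is the literal ending of `name`."""
    if name.endswith(marker):
        name = name[: -len(marker)]
    return name.rstrip()  # drop whitespace left in front of the marker

# rstrip("other") removes any trailing run of the characters o, t, h, e, r
print("Toyota Motor".rstrip("other"))   # Toyota M
print(strip_marker("Toyota Motor"))     # Toyota Motor
print(strip_marker("Panasonic other"))  # Panasonic
```

With a single-character marker the two approaches behave the same, so this only bites when the marker is a multi-letter word.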

Create a matrix with the applicants (companies) as rows and the years as columns. First, create a zero-filled matrix (df):

years = list(app_top10_dic.keys())
# Start from integer zeros so that the counts stay numeric
df = pd.DataFrame(0, index=app_all, columns=years)

Add the number of applications to this as follows.

for year, totals in app_total_dic.items():
    for applicant, count in totals:
        if applicant in app_all:
            # .loc[row, column] avoids chained indexing such as df[i][d[0]]
            df.loc[applicant, year] += count

Finally, you get a matrix like this:

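| | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sumitomo Metal Mining Co., Ltd. | 6 | 14 | 7 | 13 | 2 | 7 | 15 | 18 | 16 | 25 | 35 | 54 | 56 | 60 | 61 |
| Furukawa Battery Co., Ltd. | 46 | 40 | 34 | 41 | 33 | 31 | 23 | 9 | 12 | 13 | 11 | 7 | 11 | 11 | 13 |
| Dai Nippon Printing Co., Ltd. | 4 | 14 | 12 | 0 | 5 | 9 | 29 | 59 | 5 | 9 | 2 | 1 | 5 | 0 | 0 |
| Sanyo Electric Co., Ltd. | 181 | 137 | 155 | 120 | 96 | 102 | 120 | 110 | 104 | 180 | 73 | 65 | 26 | 59 | 45 |
| Samsung SDI Co., Ltd. | 83 | 138 | 24 | 32 | 38 | 52 | 108 | 57 | 41 | 49 | 69 | 40 | 16 | 14 | 22 |
| : | : | : | : | : | : | : | : | : | : | : | : | : | : | : | : |
| Semiconductor Energy Laboratory Co., Ltd. | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 7 | 15 | 16 | 22 | 54 | 24 | 12 | 12 |
| GS Yuasa Corporation | 57 | 20 | 11 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GS Yuasa Co., Ltd. | 43 | 16 | 38 | 40 | 39 | 44 | 56 | 76 | 79 | 75 | 99 | 72 | 83 | 41 | 31 |
| Hitachi Maxell Co., Ltd. | 29 | 20 | 8 | 28 | 26 | 17 | 39 | 49 | 48 | 46 | 63 | 35 | 39 | 14 | 0 |
| Zeon Corporation | 2 | 5 | 5 | 5 | 6 | 16 | 28 | 29 | 43 | 67 | 62 | 45 | 55 | 22 | 2 |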

This is obtained as a pandas dataframe, so you can easily create a heatmap with seaborn.

fig = plt.figure(figsize=(6,10),dpi=150)
sns.heatmap(df,square=True,cmap='Reds')

res1.png

I think this is already good enough, but let's make the bubble chart as well. There is (probably) no one-liner for this, so I build it by hand. I paid some attention to the colors and added a legend for the bubble sizes.

fig = plt.figure(figsize=(6,10),dpi=150)

# One scatter call per cell; the marker area s is the number of patents
for y in years:
    for x in app_all[::-1]:
        plt.scatter(y,x,s=int(df.loc[x,y]))

# Empty scatters draw nothing; they only create legend entries for the reference sizes
size = [100,200,300]
for s in size:
    plt.scatter([],[],s=s,facecolor='white',edgecolor='black',label=str(s))
plt.legend(title='Patents', bbox_to_anchor=(1.03, 1), 
           loc='upper left',labelspacing=1,borderpad=1)

plt.xlim(-1,15)
plt.ylim(-1,30)
plt.xticks(rotation=45)
plt.show()

res2.png

It's done. Certainly this is easier to understand than the heat map.
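As a variation, the same chart can be drawn with a single scatter call by first flattening the matrix into parallel lists of x positions, y positions and sizes. The sketch below is not the code used for the figure above: `to_bubbles` and `plot_bubbles` are hypothetical names, the nested-dict input is an assumption made to keep the example small, and every bubble gets the same color.

```python
def to_bubbles(counts, scale=1.0):
    """Flatten {applicant: {year: count}} into parallel x, y, size lists."""
    xs, ys, sizes = [], [], []
    for applicant, by_year in counts.items():
        for year, n in by_year.items():
            if n > 0:  # zero-size bubbles add nothing to the plot
                xs.append(year)
                ys.append(applicant)
                sizes.append(n * scale)
    return xs, ys, sizes

def plot_bubbles(counts, scale=1.0):
    """Draw all bubbles with one scatter call and return the figure."""
    import matplotlib.pyplot as plt  # imported here so to_bubbles works without matplotlib
    xs, ys, sizes = to_bubbles(counts, scale)
    fig, ax = plt.subplots(figsize=(6, 10), dpi=150)
    ax.scatter(xs, ys, s=sizes)  # string values are placed on categorical axes
    ax.tick_params(axis="x", labelrotation=45)
    return fig

# Two rows of the matrix above, years 2005 to 2007 only
sample = {
    "Sanyo Electric Co., Ltd.": {"2005": 181, "2006": 137, "2007": 155},
    "Samsung SDI Co., Ltd.": {"2005": 83, "2006": 138, "2007": 24},
}
xs, ys, sizes = to_bubbles(sample)
print(len(xs), max(sizes))  # 6 181.0
```

The `scale` parameter is there because the marker area is given in points squared, so small counts may need enlarging to be visible.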

A quick look at the whole chart shows that Toyota has been strong since 2013. Surprisingly, Toshiba has kept filing patents steadily even after 2017, when it went through a lot. Impressive. Panasonic's count is decreasing, but perhaps not once you include the applications of Panasonic IP Management and Sanyo Electric (a subsidiary). Hitachi has too many affiliated companies, and I cannot tell the difference between GS Yuasa Co., Ltd. and GS Yuasa Corporation, so group companies apparently need to be merged.
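Merging group companies could be sketched with an alias table that maps each applicant name to a group name before counting. The alias table below is purely hypothetical: whether two names really belong together is a judgment call that has to be made per company. The counts are the 2005 values from the matrix above.

```python
from collections import Counter

# Hypothetical alias table; which names to fold together must be decided by hand
ALIASES = {
    "GS Yuasa Corporation": "GS Yuasa",
    "GS Yuasa Co., Ltd.": "GS Yuasa",
}

def merge_groups(counts, aliases=ALIASES):
    """Sum the counts of applicants that map to the same group name."""
    merged = Counter()
    for applicant, n in counts.items():
        # Names without an alias are kept as they are
        merged[aliases.get(applicant, applicant)] += n
    return merged

counts_2005 = {
    "GS Yuasa Corporation": 57,
    "GS Yuasa Co., Ltd.": 43,
    "Sanyo Electric Co., Ltd.": 181,
}
print(merge_groups(counts_2005)["GS Yuasa"])  # 100
```

Applying the mapping before picking the yearly top 10 would also change which companies make it into the chart.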

Summary

So, I used Python to dip a toe into patent analysis. I got pushed around by the quirks of J-PlatPat and had forgotten how to use pandas, but I did manage to make a bubble chart, so the goal is achieved.

However, there is a possibility that some patents have been missed due to insufficient specification of search items. This part seems to require a lot of trial and error.

If you can get the text of the patents, natural language analysis should allow all sorts of interesting things. For working with the full text, is it better to use [Google Patents Public Datasets](https://cloud.google.com/blog/products/gcp/google-patents-public-datasets-connecting-public-paid-and-private-patent-data)?

I was satisfied for the time being, so I ended here!

Notes

"Restrictions on mass access, robot access, etc." in the J-PlatPat usage guide states as follows.

J-PlatPat is provided for public use of industrial property rights information. Therefore, we prohibit actions that may interfere with general use, such as downloading a large amount of data for the purpose of simple data collection and robot access (regular automatic data collection by a program).

So automatic collection by a program is not allowed. Don't do it. Let's be careful.

It is arguable whether the data collection in this article counts as "downloading a large amount of data for the purpose of simple data collection", but since I actually downloaded only 15 files, I think it is probably fine.
