I've only been learning Python for two weeks, but I wanted to collect Google search results for my seminar research, so I implemented it by following the article "[Get Google search results using Custom Search API](https://qiita.com/zak_y/items/42ca0f1ea14f7046108c#1-api%E3%82%AD%E3%83%BC%E3%81%AE%E5%8F%96%E5%BE%97)".
Although this overlaps with the reference article, I would like to document how I built it.
**Environment** Windows 10, Python 3.7, Anaconda Navigator
**Target** Collect previous research on the seminar research theme "What are the determinants that influence the increase and decrease in the number of foreign visitors to Japan?" → Create a file that lists the titles and URLs of the retrieved articles
Open the navigation menu of Google Cloud Platform and click "APIs and Services" → "Credentials".

Create an API key from "Create Credentials".

I will use the obtained API key later, so copy it and paste it somewhere.
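One possible way to keep the key out of the source file is to store it in an environment variable and read it at runtime. This is only a sketch and not part of the reference article; the variable name `GOOGLE_API_KEY` and the helper `load_api_key` are assumptions made for this example.

```python
import os


def load_api_key(env=None, name="GOOGLE_API_KEY"):
    """Return the API key stored in an environment variable.

    `name` is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage with a fake environment so the example runs without a real key.
print(load_api_key({"GOOGLE_API_KEY": "dummy-key"}))
```

With this approach the key never appears in a file that might be shared or committed by accident.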
Open the navigation menu of Google Cloud Platform and click "APIs and Services" → "Library".

Select "Custom Search API" from "Other" at the bottom of the page to open the details page.
Click "Enable".

① Go to the Custom Search Engine page and click "Add".

②
・ Enter the URL of some site under "Site to search" (anything is fine)
・ Language is set to "Japanese"
・ Enter the name of the search engine
・ Click "Create"

③ Select the name of the search engine you created earlier from the options under "Edit search engine" and edit it.
On this page:
・ Copy the "search engine ID" and save it somewhere
・ Select Japanese for "Language"
・ Delete the site displayed in "Sites to search"
・ Turn on "Search the entire web"
・ Click "Update"
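To see how the API key and the search engine ID fit together, here is a rough sketch that builds the request URL for the Custom Search JSON API by hand. The script later in this post uses the client library instead, so this is only for illustration; `build_search_url` is a hypothetical helper and the key and ID values are placeholders.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, keyword, start=1, num=10):
    """Build the GET URL for one page of search results."""
    params = {
        "key": api_key,    # the API key created under "Credentials"
        "cx": engine_id,   # the search engine ID copied from the edit page
        "q": keyword,
        "lr": "lang_ja",   # restrict results to Japanese
        "num": num,
        "start": start,
    }
    return ENDPOINT + "?" + urlencode(params)


# Placeholder values; no request is sent here.
print(build_search_url("dummy-key", "dummy-engine-id", "訪日 外国人"))
```

`urlencode` percent-encodes the Japanese keyword, so the resulting string can be used as a URL as is.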
Install "Google API Python Client" by referring to "Google API Client Library for Python".
I have created a virtual environment with virtualenv and then installed the library.
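The library is published on PyPI as `google-api-python-client` and is imported as `googleapiclient`. As a quick check that the installation landed in the active virtual environment, something like the following could be run; `is_installed` is a hypothetical helper written for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None


print("googleapiclient available:", is_installed("googleapiclient"))
```

If this prints `False`, the library was probably installed into a different environment than the one running the script.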
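I wrote the code and ran it ... and an error occurred!

**Cause** On Japanese Windows, the `open` function uses the cp932 (Shift-JIS) encoding by default. The response contains characters that cp932 cannot represent, so writing the file raises a `UnicodeEncodeError`.
Reference article: Causes and workarounds of UnicodeEncodeError (cp932, Shift-JIS encoding) when using Python3 on Windows
**Workaround** Pass `encoding='utf-8'` as an argument to the `open` function.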
scrape.py
with open(os.path.join(save_response_dir, 'response_' + today + '.json'), mode='w', encoding='utf-8') as response_file:
        response_file.write(jsonstr)
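The problem can be reproduced on any OS without calling the API. The sketch below uses a made-up JSON string containing an emoji, which cp932 cannot encode, and then writes and reads the same string with an explicit `utf-8` encoding. The sample text and file name are assumptions for this example.

```python
import json
import os
import tempfile

# Made-up data: the kanji fit in cp932, the emoji does not.
text = json.dumps({"title": "訪日外国人 \U0001F600"}, ensure_ascii=False)

# Encoding to cp932 fails in the same way as writing to a file
# that was opened with cp932 as its encoding.
try:
    text.encode("cp932")
    cp932_ok = True
except UnicodeEncodeError:
    cp932_ok = False

# Writing and reading back with an explicit utf-8 encoding works.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "response.json")
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        restored = f.read()

print("cp932 ok:", cp932_ok)
print("utf-8 round trip ok:", restored == text)
```

Specifying the encoding explicitly also makes the script behave the same way on Windows, macOS, and Linux.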
With a little tinkering, the final code looks like this:
scrape.py
import os
import datetime
import json
from time import sleep
from googleapiclient.discovery import build
                  
GOOGLE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CUSTOM_SEARCH_ENGINE_ID = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
DATA_DIR = 'data'
def makeDir(path):
    if not os.path.isdir(path):
        os.mkdir(path)
def getSearchResponse(keyword):
    today = datetime.datetime.today().strftime("%Y%m%d")
    timestamp = datetime.datetime.today().strftime("%Y/%m/%d %H:%M:%S")
    makeDir(DATA_DIR)
    service = build("customsearch", "v1", developerKey=GOOGLE_API_KEY)
    page_limit = 10
    start_index = 1
    response = []
    for n_page in range(0, page_limit):
        try:
            sleep(1)
            response.append(service.cse().list(
                q=keyword,
                cx=CUSTOM_SEARCH_ENGINE_ID,
                lr='lang_ja',
                num=10,
                start=start_index
            ).execute())
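            # Stop cleanly when the API reports no further page
            next_page = response[n_page].get("queries", {}).get("nextPage")
            if not next_page:
                break
            start_index = next_page[0].get("startIndex")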
        except Exception as e:
            print(e)
            break
    #Save the response in json format
    save_response_dir = os.path.join(DATA_DIR, 'response')
    makeDir(save_response_dir)
    out = {'snapshot_ymd': today, 'snapshot_timestamp': timestamp, 'response': []}
    out['response'] = response
    jsonstr = json.dumps(out, ensure_ascii=False)
    with open(os.path.join(save_response_dir, 'response_' + today + '.json'), mode='w', encoding='utf-8') as response_file:
        response_file.write(jsonstr)
if __name__ == '__main__':
    target_keyword = 'Foreign Visitors in Japan Factor Research'
    getSearchResponse(target_keyword)
When I run it this time, a "response" folder is created under the "data" folder, and a json file is created under that!
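scrape.py reads the start index of the next page from `queries.nextPage` in each response. Below is a minimal sketch of that paging step as standalone functions, so it can be tried without an API key. `next_start_index`, `collect_pages`, and `fake_fetch` are hypothetical names for this example, and `fetch_page` stands in for the `service.cse().list(...).execute()` call.

```python
def next_start_index(response):
    """Return the start index of the next page, or None if there is none."""
    next_pages = response.get("queries", {}).get("nextPage") or []
    if not next_pages:
        return None
    return next_pages[0].get("startIndex")


def collect_pages(fetch_page, page_limit=10):
    """Call `fetch_page(start)` until no next page is reported or the limit is hit."""
    responses = []
    start = 1
    for _ in range(page_limit):
        page = fetch_page(start)
        responses.append(page)
        start = next_start_index(page)
        if start is None:
            break
    return responses


def fake_fetch(start):
    """Stand-in for the API call: pretends there are only two pages."""
    if start == 1:
        return {"queries": {"nextPage": [{"startIndex": 11}]}, "items": []}
    return {"queries": {}, "items": []}


print(len(collect_pages(fake_fetch)))
```

As far as I know, the API returns at most 100 results per query, which matches 10 pages of 10 results in scrape.py.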

Next, I format the saved JSON into a tab-separated file that is easier to read. The code is below.
prettier.py
import os
import datetime
import json
import pandas as pd
DATA_DIR = 'data'
def makeDir(path):
    if not os.path.isdir(path):
        os.mkdir(path)
def makeSearchResults():
    today = datetime.datetime.today().strftime("%Y%m%d")
    response_filename = os.path.join(
        DATA_DIR, 'response', 'response_' + today + '.json')
    # Use a context manager so the file is closed after reading
    with open(response_filename, 'r', encoding='utf-8') as response_file:
        response_tmp = json.load(response_file)
    ymd = response_tmp['snapshot_ymd']
    response = response_tmp['response']
    results = []
    cnt = 0
    for one_res in range(len(response)):
        if 'items' in response[one_res] and len(response[one_res]['items']) > 0:
            for i in range(len(response[one_res]['items'])):
                cnt += 1
                display_link = response[one_res]['items'][i]['displayLink']
                title = response[one_res]['items'][i]['title']
                link = response[one_res]['items'][i]['link']
                snippet = response[one_res]['items'][i]['snippet'].replace(
                    '\n', '')
                results.append({'ymd': ymd, 'no': cnt, 'display_link': display_link,
                                'title': title, 'link': link, 'snippet': snippet})
    save_results_dir = os.path.join(DATA_DIR, 'results')
    makeDir(save_results_dir)
    df_results = pd.DataFrame(results)
    df_results.to_csv(os.path.join(save_results_dir, 'results_' + ymd + '.tsv'), sep='\t',
                      index=False, columns=['ymd', 'no', 'display_link', 'title', 'link', 'snippet'])
if __name__ == '__main__':
    makeSearchResults()
When executed, it was organized in the order of date, number, site URL, title, article URL, and details!
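The flattening step in prettier.py can also be written by iterating over the items directly. The sketch below is one possible variant, not the reference article's code: `flatten_items` is a hypothetical helper, the sample data is made up, and missing fields fall back to an empty string (my assumption is that some items may lack a `snippet`).

```python
def flatten_items(responses, ymd):
    """Turn a list of API responses into one flat list of row dicts."""
    rows = []
    for page in responses:
        for item in page.get("items", []):
            rows.append({
                "ymd": ymd,
                "no": len(rows) + 1,  # running number across all pages
                "display_link": item.get("displayLink", ""),
                "title": item.get("title", ""),
                "link": item.get("link", ""),
                "snippet": item.get("snippet", "").replace("\n", ""),
            })
    return rows


# Made-up responses: the second item has no snippet, the second page has no items.
sample = [
    {"items": [
        {"displayLink": "example.com", "title": "A",
         "link": "https://example.com/a", "snippet": "line1\nline2"},
        {"displayLink": "example.org", "title": "B",
         "link": "https://example.org/b"},
    ]},
    {},
]

for row in flatten_items(sample, "20200101"):
    print(row["no"], row["title"], row["snippet"])
```

The returned list has the same keys as the `results` list in prettier.py, so it could be passed to `pd.DataFrame` in the same way.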

If you open it in Excel, it looks like this ↓

The article I referred to this time ([Get Google search results using Custom Search API](https://qiita.com/zak_y/items/42ca0f1ea14f7046108c#1-api%E3%82%AD%E3%83%BC%E3%81%AE%E5%8F%96%E5%BE%97)) was so clear and easy to follow that even a beginner like me could implement it! I still need to properly understand what the code does, but for now I'm happy to have built a program I can use in everyday life :satisfied: However, the free tier of the Custom Search API (Google Custom Search JSON API) seems to have various restrictions, so I will need to be careful when I use it again.