I recently learned about web scraping and put it into practice. This time I searched a keyword on "CiNii Articles - Search for Japanese Articles - National Institute of Informatics", collected the title, authors, and publication media of every paper that matched the keyword, and saved them to CSV. It was a good exercise for learning scraping, so I wrote it up as an article. I hope it is useful to others who are learning scraping!
Below is the code I wrote; I saved it as "search.py". The explanation is written as comments in the code, so please read it alongside them. I also recommend actually visiting the CiNii Articles site and following along while inspecting the HTML structure with Chrome's developer tools; that will deepen your understanding.
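As a quick warm-up for that kind of inspection, here is a small standalone snippet (a minimal sketch, not part of search.py) that fetches one results page and prints the heading the script later parses for the result count. It reuses the URL format and the "heading" class from the script below; the keyword test is only a placeholder, and the markup may of course change on the site's side.

import requests
from bs4 import BeautifulSoup

#Fetch one search results page (the keyword "test" is just a placeholder)
res = requests.get('https://ci.nii.ac.jp/search?q=test&count=200')
soup = BeautifulSoup(res.content, "html.parser")

#Print the heading that holds the result count, to confirm the class name seen in DevTools
print(soup.find_all("h1", {"class": "heading"})[0].text)

With that confirmed, here is the full script.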
import sys
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
def main():
    url = 'https://ci.nii.ac.jp/search?q={}&count=200'.format(sys.argv[1])
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "html.parser")
    #Check the number of search results.
    #The heading text looks like '\n Search results\n\t\n\t0\n\t', so the count is buried in whitespace.
    search_count_result = soup.find_all("h1", {"class":"heading"})[0].text
    #Extract the number of results with a regular expression
    pattern = '[0-9]+'
    result = re.search(pattern, search_count_result)
   
    #If there are no search results, the function ends here
    search_count = int(result.group())
    if search_count == 0:
        print('There are no search results.')
        return
    print('The number of search results is ' + str(search_count) + '.')
    #Creating a directory to store data.
    try:
        os.makedirs(sys.argv[1])
        print("A new directory has been created.")
    except FileExistsError:
        print("It will be a directory that already exists.")
    #Work out how many requests are needed to fetch all the search results.
    #Results are displayed 200 at a time, so this is based on dividing by 200.
    if search_count // 200 == 0:
        times = 1
    elif search_count % 200 == 0:
        times = search_count // 200
    else:
        times = search_count // 200 + 1
    
    #Lists to collect the titles, authors, and publication media
    title_list = []
    author_list = []
    media_list = []
    #Translation table for stripping newline and tab characters
    escape = str.maketrans({"\n":'', "\t":''})
    for time in range(times):
        
        #Build the URL for this page of results
        count = 1 + 200 * time
        #search?q={} takes the keyword to search for.
        #count=200&start={} fetches 200 results per page, starting from the given position.
        url = 'https://ci.nii.ac.jp/search?q={}&count=200&start={}'.format(sys.argv[1], count)
        print(url)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, "html.parser")
        for paper in soup.find_all("dl", {"class":"paper_class"}):  #Loop over each paper

            #Get the title
            title_list.append(paper.a.text.translate(escape))
            #Get the authors
            author_list.append(paper.find('p', {'class':"item_subData item_authordata"}).text.translate(escape))
            #Get the publication media
            media_list.append(paper.find('p', {'class':"item_extraData item_journaldata"}).text.translate(escape))
    
    #Build a DataFrame and save it as CSV.
    journal = pd.DataFrame({"Title":title_list, "Author":author_list, "Media":media_list})

    #Specify the encoding to prevent garbled characters.
    journal.to_csv(sys.argv[1] + '/' + sys.argv[1] + '.csv', encoding='utf_8_sig')
    print('The CSV file has been created.')
    print(journal.head())
if __name__ == '__main__':
    #Run main() only when this script is executed directly
    main()
Now let's actually run it. First, type the following into the terminal. This time I used machine learning as the search keyword; replace it with whatever keyword you want to search for. The keyword is read from sys.argv[1], so if it contains spaces it must be quoted so that it arrives as a single argument.
python search.py "machine learning"
If all goes well, the terminal shows the number of results, the URL of each page being fetched, a message that the file was created, and finally the first few rows of the DataFrame.

The saved CSV has one row per paper, with Title, Author, and Media columns.
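If you want to check the result without opening the file in a spreadsheet, the CSV can also be read back with pandas. This is a minimal sketch: the path assumes the script was run with the keyword "machine learning", so adjust the directory and file name to whatever keyword you used.

import pandas as pd

#Read the saved CSV back; the path depends on the keyword you searched for
df = pd.read_csv('machine learning/machine learning.csv', index_col=0, encoding='utf_8_sig')
print(df.columns.tolist())  #['Title', 'Author', 'Media']
print(df.head())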

How was that? I only learned scraping about three days ago, so the code is still rough, but it was relatively easy to implement. There is plenty more for me to study, so I will keep at it.