Let's do image scraping with Python

What is web scraping?

Scraping is a technique for **searching and extracting arbitrary information from websites**. Beyond retrieving data from the Web, it also lets you analyze a page's structure.

Before doing web scraping

Before you start scraping, here are some things to check beforehand and to keep in mind while working.

  1. Does an API exist? If the service provides an API, use it to get the data. If the API is still insufficient (for example, the data it returns is incomplete), then consider scraping.

  2. How you use the acquired data. Be careful how you use what you collect: scraped data is someone else's copyrighted work, so you must take care not to conflict with copyright law.

Reproduction for private use (Article 30) http://www.houko.com/00/01/S45/048.HTM#030

Reproduction for information analysis, etc. (Article 47-7) http://www.houko.com/00/01/S45/048.HTM#047-6

In particular, the following three rights deserve attention.

  1. Reproduction right:

The reproduction right is one of the rights included in copyright and is stipulated in Article 21 of the Copyright Act (Article 21: "The author shall have the exclusive right to reproduce his/her work."). Reproduction means copying in a broad sense: recording audio or video, printing, photographing, photocopying, scanning electronically, and storing. Reference: https://www.jrrc.or.jp/guide/outline.html

  2. Right of adaptation:

Translation rights and adaptation rights are economic rights stipulated in Article 27 of the Copyright Act. Article 27 states that "the author shall have the exclusive right to translate, arrange musically, transform, dramatize, cinematize, or otherwise adapt his/her work" (from the Copyright Research and Information Center, http://www.cric.or.jp/db/article/a1.html#021). Conversely, doing any of these without the author's permission is copyright infringement. Quote: http://www.iprchitekizaisan.com/chosakuken/zaisan/honyaku_honan.html

  3. Public transmission right:

The public transmission right is an economic right stipulated in Article 23 of the Copyright Act. Article 23 states that "the author shall have the exclusive right to publicly transmit his/her work (including making it transmittable in the case of automatic public transmission)" and that the author likewise holds the right to publicly communicate, using a receiving device, a work that is publicly transmitted. Quote: http://www.iprchitekizaisan.com/chosakuken/zaisan/kousyusoushin.html

Beyond the legal points above, make sure the code you write does not overwhelm the target server when you actually run your scraper. Excessive access strains the server and can be treated as an attack; in the worst case the service may become unavailable for some time. There has even been a case in Japan where a user was arrested after his crawler triggered failures in a library's system, so keep your access within the bounds of common sense. https://ja.wikipedia.org/wiki/岡崎市立中央図書館事件
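
As a concrete precaution, here is a minimal sketch of pacing requests with time.sleep; the URLs are placeholders, not pages from this article:

polite_get.py

import time

import requests

# hypothetical list of pages to fetch
urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    # ... process response.text here ...
    time.sleep(1)  # pause between requests so the server is not overloaded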

With the above in mind, let's move on.

HTML basics

It is useful to know the basics of HTML when practicing web scraping, because **you acquire data by specifying the tags used in the HTML (such as `<title>`, `<p>`, and `<a>`)**.

Let me give you an example.

sample.html


<html>
<head>
  <title>neet-AI</title>
</head>
<body>
<div id="main">
  <p>Click here for the neet-AI link</p>
  <a href="https://neet-ai.com">neet-AI</a>
</div>
</body>
</html>

If you open the above code in a browser, you will see a simple page containing the paragraph text and the neet-AI link.

Let's explain the HTML tags used on this page.

HTML tag list

| Tag name | Description |
| --- | --- |
| `<html></html>` | Declares that the document is HTML code. |
| `<head></head>` | Contains basic page information (character encoding, page title). |
| `<title></title>` | The page title. |
| `<body></body>` | The body of the page. |
| `<div></div>` | Has no meaning by itself, but is often used to group related content into one block. |
| `<p></p>` | Marks the enclosed text as one paragraph. |
| `<a></a>` | A link to another page. |

There are many more tags besides the ones described above. Look them up as you encounter them, whenever you need a particular kind of tag.

Web scraping basics

Now that you understand HTML tags, let's actually try scraping.

Basic steps of web scraping

  1. Get a web page
  2. Programmatically search and extract the specified tag (scraping)
  3. Format and save or display the data obtained by scraping

These are the basic steps of web scraping.

Libraries to use

When web scraping with Python, we will use various libraries.

- Requests: used to fetch web pages.
- BeautifulSoup4: parses the fetched page, searches for tags, and extracts and formats the data.

We will do web scraping using these two libraries.
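
If the libraries are not installed yet, they can typically be installed with pip (lxml is the HTML parser used throughout the examples below):

pip install requests beautifulsoup4 lxml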

Ready to scrape with Python

Before scraping, you need to fetch the HTML of the web page in Python.

get_html.py


import requests

response = requests.get('http://test.neet-ai.com')
print(response.text)

Let's explain each line.

response = requests.get('http://test.neet-ai.com')

This line fetches the HTML from http://test.neet-ai.com. The fetched HTML is stored in the variable response.

print(response.text)

The response variable is a Response object and cannot be passed to Beautiful Soup directly; its text attribute holds the HTML as a string, which is what Beautiful Soup needs.
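
As a small defensive addition of my own (the article's examples skip this), you can check the HTTP status code before using the body:

import requests

response = requests.get('http://test.neet-ai.com')
if response.status_code == 200:  # 200 means the request succeeded
    print(response.text)         # the HTML body as a string
else:
    print('Request failed with status', response.status_code)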

Scraping page title

title_scraping.py


import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com')
soup = BeautifulSoup(response.text,'lxml')
title = soup.title.string
print(title)

Seeing is believing, so let's look at the program. The first four lines are the same as in "Ready to scrape with Python". The scraping itself starts on line 5, so let's explain it line by line.

soup = BeautifulSoup(response.text,'lxml')

Here we prepare a variable called soup so that the fetched HTML data can be scraped. The 'lxml' in parentheses means **"parse response.text with a parser called lxml"**.
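
As an aside, if lxml is not installed, BeautifulSoup can also use Python's built-in parser as a drop-in replacement for this line (its handling of malformed HTML differs slightly):

soup = BeautifulSoup(response.text, 'html.parser')  # standard-library parser, no extra install needed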

title = soup.title.string

Once the fetched HTML has been converted, you can extract a specific piece of data by requesting it through BeautifulSoup's fixed access patterns.

Let's read this line as a sentence: it **searches the soup variable for the title tag and outputs the character string inside that tag**. If that is hard to grasp from the code alone, it may help to picture soup as a tree of tags that you drill into.


There is not room here to introduce the query formats in more detail, so please refer to the documentation below.

Beautiful Soup Document

If running the program produces the following result, it worked.

neet-AI

Scraping links

First of all, in HTML a link is represented with the `<a>` tag. This time we want the URL inside the a tag (its href attribute), so the string accessor we used for the title won't do.

get_link.py


import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com')
soup = BeautifulSoup(response.text,'lxml')
link = soup.a.get('href')
print(link)

**You can get the link's href by using the get() method.** get() is handy and will come up frequently from here on, so keep it in mind.
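
One behavior worth knowing: get() returns None when the attribute is missing, whereas square-bracket access raises KeyError. A small self-contained sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://neet-ai.com">neet-AI</a>', 'lxml')
a = soup.a
print(a.get('href'))   # https://neet-ai.com
print(a.get('title'))  # None, because the a tag has no title attribute
# a['title'] would raise KeyError instead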

Scraping multiple links

The page we have used so far contains only one a tag. So how do you scrape a page with several a tags? First, let's run the previous program against such a page; only the URL on the line that fetches the page changes.

link_get.py


import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index2.html')
soup = BeautifulSoup(response.text,'lxml')
link = soup.a.get('href')
print(link)

When you run it, only the neet-AI link is printed, because soup.a.get('href') extracts just the first a tag found. To extract every a tag, write the following.

link_all_get.py


import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index2.html')
soup = BeautifulSoup(response.text,'lxml')
links = soup.find_all('a')
for link in links:
   print(link.get('href'))

Let's explain each line.

links = soup.find_all('a')

Here, **all the a tags are extracted and stored in a list named links**.

for link in links:
   print(link.get('href'))

Since find_all returns a list, you can process the tags one at a time by looping over it with for, calling the get() method on each link variable to obtain its URL. **Remember this pattern of collecting all matching tags and then looping over them**; you will use it often.
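
The same loop can also be condensed into a list comprehension when you want to keep all the URLs for later processing; a sketch against the same page:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index2.html')
soup = BeautifulSoup(response.text, 'lxml')
# collect every href into a plain Python list in one expression
hrefs = [link.get('href') for link in soup.find_all('a')]
print(hrefs)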

Scraping with id or class

Until now, the tags carried no id or class. On a typical site, however, tags are given an id or class to make the web design easier or the markup more readable. ids and classes do not make scraping much harder; **on the contrary, they can make it easier when you want to scrape one specific piece of content**.

index5.html


<html>
<head>
<title>neet-AI</title>
</head>
<body>
<div id="main">
  <a id="neet-ai" href="https://neet-ai.com">neet-AI</a>
  <a id="twitter" href="https://twitter.com/neetAI_official">Twitter</a>
  <a id="facebook" href="https://www.facebook.com/Neet-AI-1116273381774200/">Facebook</a>
</div>
</body>
</html>

Suppose you have a site like the one above. As the a tags show, every one of them has an id. To get the Twitter URL in this case, you can write the following.

twitter_scra.py


import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index5.html')
soup = BeautifulSoup(response.text,'lxml')
twitter = soup.find('a',id='twitter').get('href')

print(twitter)

You can fetch it easily by passing the id as the second argument of find().
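
As an alternative, BeautifulSoup also supports CSS selectors via select_one(); which style you use is a matter of taste:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index5.html')
soup = BeautifulSoup(response.text, 'lxml')
# 'a#twitter' is the CSS selector for an a tag whose id is "twitter"
twitter = soup.select_one('a#twitter').get('href')
print(twitter)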

Next, let's try the same thing with a class.

index6.html


<html>
<head>
<title>neet-AI</title>
</head>
<body>
<div id="main">
  <a class="neet-ai" href="https://neet-ai.com">neet-AI</a>
  <a class="twitter" href="https://twitter.com/neetAI_official">Twitter</a>
  <a class="facebook" href="https://www.facebook.com/Neet-AI-1116273381774200/">Facebook</a>
</div>
</body>
</html>

twitter_scra_clas.py


import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index6.html')
soup = BeautifulSoup(response.text,'lxml')
twitter = soup.find('a',class_='twitter').get('href')

print(twitter)

Note that the keyword is **class_**, not **class**. This is because class is registered as a reserved word (a word with a special meaning in the language specification) in Python. To avoid the collision, the author of BeautifulSoup presumably appended the underscore.
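
If the trailing underscore bothers you, an equivalent form passes an attrs dictionary, in which class can appear as an ordinary string key:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index6.html')
soup = BeautifulSoup(response.text, 'lxml')
# the attrs dict sidesteps the class_ keyword entirely
twitter = soup.find('a', attrs={'class': 'twitter'}).get('href')
print(twitter)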

Applied web scraping

The basics above all used HTML pages designed to be easy to scrape. However, **ordinary websites are not built with scraping in mind and can have very complex structures**.

Because of that complexity, scraping real sites also calls for knowledge beyond scraping itself, such as how web pages are served.

In this applied section you will pick up the know-how that, with a little practice, lets you scrape even fairly complicated sites.

Using the URL patterns of web pages

This technique comes in handy often. Let's take Nifty News as an example.

For example, the IT category pages are paginated. Press "2" in the pager at the bottom to move to the second page.

Looking at the URL again, it is now https://news.nifty.com/technology/2.

Next, let's move to the third page.

The URL of the third page looks like this: https://news.nifty.com/technology/3

As anyone who has done server-side development knows, paginated pages are usually built by **embedding the page number at the end of the URL or in a parameter and rendering the page from it**.

Using this mechanism, you can turn pages **simply by replacing the number in the URL**.

Try changing the trailing number to whatever you like; you should jump to that page (within limits).

Now let's write a program that fetches pages 1 through 10 of these results.

paging_scraping.py


import requests
from bs4 import BeautifulSoup

for page in range(1, 11):
    r = requests.get("https://news.nifty.com/technology/" + str(page))
    r.encoding = r.apparent_encoding
    print(r.text)

That's all there is to it. This technique is useful whenever you scrape search results or other serially numbered URLs.

Web scraping practice

This time we will scrape the search results of Nifty Image Search (https://search.nifty.com/imagesearch/search?) and download a large number of images. In this chapter you will learn how to download images and how to extract image URLs from the search results.

Prior knowledge

Before scraping images, you need to know what an image actually is, because that is the basis of the download mechanism. Have you ever opened an image in a text editor? As anyone who has will know, **image data is entirely composed of numbers**.

An image viewer interprets those numbers and renders them as a picture.

This is a slight digression, but if an image is made up of numbers, then **copying all of those numbers amounts to downloading the photo**. The concrete program is explained in the image download section.

The rough flow of the program is: (1) scrape the search result page for a given keyword to obtain the image URLs; (2) request each image URL, whose response body is the image data; (3) write that data to a file on your computer.

Image URL scraping

As explained above, you first need an image's URL in order to download it. Search for "cat" on Nifty Image Search and look at the result page. In its URL you will find "q=cat": "q=" is the parameter that carries the search keyword.

The search result page shows 20 photos; let's grab those 20 image URLs.

Let's take a look at the page source.

The photos are embedded with img tags, so searching the source for "img" gives roughly 140 hits. Only 20 of those 140 are the image URLs we want, so we will add conditions one by one to exclude the rest.

Next, let's search for "img src".

That alone gets pretty close, but some noise still remains.

Look at the src values of the 20 images we want: they all start with "https://msp.c.yimg.jp/".

Let's search for "https://msp.c.yimg.jp/". スクリーンショット 2017-08-11 19.15.14.png

Exactly 20 hits!

Now that the condition narrows things down to exactly the 20 matches we want, let's build the program around it.

pic_url_scra.py


import requests
import re
from bs4 import BeautifulSoup

url = "https://search.nifty.com/imagesearch/search?select=1&q=%s&ss=up"
keyword = "Cat"
r = requests.get(url%(keyword))
soup = BeautifulSoup(r.text,'lxml')
imgs = soup.find_all('img',src=re.compile('^https://msp.c.yimg.jp/yjimage'))
for img in imgs:
    print(img['src'])

The heart of this program is this line:

imgs = soup.find_all('img',src=re.compile('^https://msp.c.yimg.jp/yjimage'))

Here, find_all accepts a regular expression object as the value of an attribute filter (the src keyword argument); this pattern matches src values that start with "https://msp.c.yimg.jp/yjimage".
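
One caveat of my own: in a regular expression a dot matches any character, so the pattern above is slightly looser than the literal URL prefix. For an exact prefix match, re.escape() builds a strictly literal pattern:

import re

prefix = 'https://msp.c.yimg.jp/yjimage'
pattern = re.compile('^' + re.escape(prefix))  # the dots are escaped and now match only literal dots
print(pattern.match('https://msp.c.yimg.jp/yjimage123'))  # a match object
print(pattern.match('https://mspXc.yimg.jp/yjimage123'))  # None under the escaped pattern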

Image download

To download an image, you copy the data of the image at the linked URL. First, you can obtain that data as bytes via **.content** on the requests response.

Then you can write it to a file on your computer with **open()** and **.write()**. Putting this together gives the following program.

pic_download.py


import requests
import re
import uuid
from bs4 import BeautifulSoup

url = "https://search.nifty.com/imagesearch/search?select=1&q=%s&ss=up"
keyword = "Cat"
r = requests.get(url%(keyword))
soup = BeautifulSoup(r.text,'lxml')
imgs = soup.find_all('img',src=re.compile('^https://msp.c.yimg.jp/yjimage'))
for img in imgs:
    print(img['src'])
    r = requests.get(img['src'])
    with open('./picture/' + str(uuid.uuid4()) + '.jpeg', 'wb') as file:
        file.write(r.content)

By the way, the line that creates (opens) the file is:

with open('./picture/' + str(uuid.uuid4()) + '.jpeg', 'wb') as file:

Here we use the uuid module, which generates an ID that is unique in the world, so that file names never collide.

After creating the file, write the image data.

file.write(r.content)
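
As a defensive variant (my addition, not part of the original flow), you can verify that the request succeeded and that the response really is an image before writing it out:

import uuid

import requests

def save_image(src, directory='./picture'):
    """Download src and save it under a unique name; assumes directory already exists."""
    r = requests.get(src)
    content_type = r.headers.get('Content-Type', '')
    # write the file only for successful responses whose body is an image
    if r.status_code == 200 and content_type.startswith('image/'):
        path = directory + '/' + str(uuid.uuid4()) + '.jpeg'
        with open(path, 'wb') as f:
            f.write(r.content)
        return path
    return None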

Putting it all together

Utilizing the knowledge so far, let's scrape 100 images per keyword. Nifty Image Search shows 20 photos per result page, so we will fetch 5 pages.

But how do you turn pages in Nifty Image Search? In Nifty News you changed the number at the end of the path; in Nifty Image Search you move between pages by changing the number in the "start=" parameter.

Let's move one more page: this time the value is 40. **So you can move between pages by changing the value in steps of 20.**

Let's create a program based on this information.

**Sample program**

picture_scraping.py


import requests
import re
import uuid
from bs4 import BeautifulSoup

url = "https://search.nifty.com/imagesearch/search?select=1&chartype=&q=%s&xargs=2&img.fmt=all&img.imtype=color&img.filteradult=no&img.type=all&img.dimensions=large&start=%s&num=20"
keyword = "Cat"
pages = [1, 20, 40, 60, 80]  # five pages of 20 images each = 100 images

for p in pages:
    r = requests.get(url % (keyword, p))
    soup = BeautifulSoup(r.text, 'lxml')
    imgs = soup.find_all('img', src=re.compile('^https://msp.c.yimg.jp/yjimage'))
    for img in imgs:
        r = requests.get(img['src'])
        with open('./picture/' + str(uuid.uuid4()) + '.jpeg', 'wb') as file:
            file.write(r.content)
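
One practical note: the program writes into ./picture, which must already exist. A one-time setup sketch (and, as discussed at the start of this article, adding a short time.sleep() inside the page loop is also advisable):

import os

os.makedirs('./picture', exist_ok=True)  # create the output directory if it is missing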
