Website scraping with Python's Beautiful Soup

About this article

I previously wrote on Qiita about code for scraping websites in Java. Looking back now, that code meets the requirements, but it is hard to call it clean. I was embarrassed to look at it, so I decided to rewrite it in Python, and this article is my note on that.

There are many similar articles on Qiita; this one is a memorandum for myself.

About Beautiful Soup

I used a library called jsoup when scraping with Java. This time I will use Beautiful Soup.

Beautiful Soup is a Python library for scraping. Because it can extract elements from a page using CSS selectors, it is convenient for pulling out only the data you want. Official: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Since it is a Python library, install it with pip.
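As a small illustration of the CSS selector support mentioned above, the sketch below uses `select()` and `select_one()` on a made-up HTML fragment. The fragment and the selector strings are assumptions chosen for this example, not taken from a real site.

```python
from bs4 import BeautifulSoup

# A tiny, made-up fragment used only for this illustration.
html = """
<div class="block">
  <dl>
    <dt>2019.08.04</dt>
    <dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
links = soup.select("div.block dd > a")
titles = [a.get_text(strip=True) for a in links]
urls = [a["href"] for a in links]

# select_one() returns only the first match (or None if nothing matches).
first_date = soup.select_one("div.block dt").get_text(strip=True)

print(first_date, titles, urls)
```

`find()` and `find_all()`, which are used later in this article, search by tag name or class instead; either style can reach the same elements.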

pip install beautifulsoup4

Example of use

As in the article I wrote before, I want to extract the date, title, and URL of each "Notice" from the following page.

<body> 
 <div class="section"> 
  <div class="block"> 
   <dl>
    <dt>2019.08.04</dt> 
    <dd>
     <a href="http://www.example.com/notice/0003.html">Notice 3</a>
    </dd> 
    <dt>2019.08.03</dt> 
    <dd>
     <a href="http://www.example.com/notice/0002.html">Notice 2</a>
    </dd> 
    <dt>2019.08.02</dt> 
    <dd>
     <a href="http://www.example.com/notice/0001.html">Notice 1</a>
    </dd> 
   </dl>
  </div>
 </div>
</body>

Extract the notices with the following code and print them.

scraping.py


# -*- coding: utf-8 -*-
import sys

import requests
from bs4 import BeautifulSoup


def main():
    print("Scraping Program Start")

    # Send a GET request to the specified URL to get the contents of the page
    res = requests.get('http://www.example.com/news.html', timeout=10)
    # Stop here if the server answered with an HTTP error status
    res.raise_for_status()

    # Parse the retrieved HTML page into a BeautifulSoup object
    soup = BeautifulSoup(res.text, "html.parser")

    # Extract the first element with class "block" in the page
    block = soup.find(class_="block")
    if block is None:
        print("ERROR! Couldn't find the block element.")
        print("Scraping Program Abend")
        sys.exit(1)

    # Extract the dt elements (dates) and dd elements (links) inside the block
    dt = block.find_all("dt")
    dd = block.find_all("dd")

    if len(dt) != len(dd):
        print("ERROR! The number of DTs and DDs didn't match up.")
        print("Scraping Program Abend")
        sys.exit(1)

    # Walk through the date/link pairs in document order
    for date_tag, link_tag in zip(dt, dd):
        try:
            date = date_tag.text
            link = link_tag.find("a")
            title = link.text
            url = link.attrs['href']

            print("Got a news. Date:" + date + ", title:" + title + ", url:" + url)

        except (AttributeError, KeyError):
            # AttributeError: the dd has no <a>; KeyError: the <a> has no href
            print("ERROR! Couldn't get a news.")

    print("Scraping Program End")


if __name__ == "__main__":
    main()

The expected result when executing the above code is as follows.

Scraping Program Start
Got a news. Date:2019.08.04, title:Notice 3, url:http://www.example.com/notice/0003.html
Got a news. Date:2019.08.03, title:Notice 2, url:http://www.example.com/notice/0002.html
Got a news. Date:2019.08.02, title:Notice 1, url:http://www.example.com/notice/0001.html
Scraping Program End
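For trying the extraction without a network request, here is a minimal sketch that runs against the sample HTML embedded as a string. It is a variation on the approach above, not a replacement: the function name `extract_notices`, the use of `find_next_sibling("dd")` to pair each date with its link, and the conversion of the date text with `datetime.strptime` are all assumptions made for this example. It also assumes every `dt` is directly followed by its own `dd`.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# The sample page from this article, embedded so no HTTP request is needed.
SAMPLE_HTML = """
<div class="block">
 <dl>
  <dt>2019.08.04</dt>
  <dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
  <dt>2019.08.03</dt>
  <dd><a href="http://www.example.com/notice/0002.html">Notice 2</a></dd>
  <dt>2019.08.02</dt>
  <dd><a href="http://www.example.com/notice/0001.html">Notice 1</a></dd>
 </dl>
</div>
"""


def extract_notices(html):
    """Return a list of dicts with the date, title, and URL of each notice."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(class_="block")
    notices = []
    if block is None:
        return notices

    for dt in block.find_all("dt"):
        # Pair each dt with the next dd sibling instead of relying on indexes.
        dd = dt.find_next_sibling("dd")
        link = dd.find("a") if dd is not None else None
        if link is None or not link.has_attr("href"):
            continue
        try:
            # Assumed date format: YYYY.MM.DD, as in the sample page.
            date = datetime.strptime(dt.get_text(strip=True), "%Y.%m.%d").date()
        except ValueError:
            continue
        notices.append({
            "date": date,
            "title": link.get_text(strip=True),
            "url": link["href"],
        })
    return notices


notices = extract_notices(SAMPLE_HTML)
for n in notices:
    print(n["date"].isoformat(), n["title"], n["url"])
```

Returning a list of dictionaries rather than printing inside the loop makes it easier to reuse the result, for example to sort by date or write it to a file.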

In conclusion

Compared to the last time, when I wrote this with Java's Spring Boot, it is nice that Python needs far less code. Please point out any mistakes in the content.
