Web scraping notes in Python 3

Introduction

I've been looking into various things in order to try web scraping with Python, and this memo summarizes what I found.

Library to use

Use Beautiful Soup and Requests

--Reference article: [Python] Get the text from the URL with Beautiful Soup and Requests

How to check Python execution results in Atom

Use a package called "script"

--Reference article: [Explanation] How to execute Python on Atom

The shortcut to run the code is Ctrl + Shift + B (on Windows).

Notes on Beautiful Soup

--There seem to be several kinds of parsers
  --Reference article: Introduction to beautifulsoup4 parse and scrape html
  --Honestly, I don't really understand the differences (they apparently differ in speed, in how leniently they handle broken HTML, and in whether an extra install is needed)
  --I think html.parser is fine, since most sample code uses it (lol)

--Use get_text() to get the text
  --Reference article: Beautiful Soup in 10 minutes
  --You can also strip line breaks and whitespace (strip=True), among other options
  --You can also get the text of a specific element by specifying it
    --For example, .i.get_text() returns only the text enclosed in the first <i> element
    --Reference: Official Reference
--Use get("href") to get the href attribute
  --Reference article: [Python] Get href value with Beautiful Soup [Scraping]
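
As a rough sketch of the get_text() and get("href") points above, the following parses a small made-up HTML string. The markup and the variable names are assumptions for illustration, not taken from the referenced articles.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<div class="entry">
  <p>Hello, <i>Beautiful</i> Soup!</p>
  <a href="https://example.com/page">link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the whole document; strip=True trims the whitespace around each
# piece of text, and " " is the separator used to join the pieces.
all_text = soup.get_text(" ", strip=True)

# soup.i is the first <i> element, so this returns only its text.
italic_text = soup.i.get_text()

# get() reads an attribute and returns None when the attribute is missing.
link_url = soup.a.get("href")
missing = soup.a.get("title")

print(all_text)
print(italic_text)
print(link_url)
```

Note that soup.i and soup.a are None when the page has no such element, so real pages usually need a check before calling get_text() or get().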

Notes on requests

Reference article: How to use Requests (Python Library)

--There is a function for each HTTP request method (requests.get(), requests.post(), and so on)
--The response body can be read either as text (.text) or as bytes (.content)
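
A minimal sketch of the two ways to read a response, assuming a GET request is enough. The helper name fetch_page and the 10-second timeout are my own choices for illustration, not something from the referenced article.

```python
import requests


def fetch_page(url, timeout=10.0):
    """Fetch a URL with GET and return (text, raw_bytes). Hypothetical helper."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # .text is a str decoded with the encoding Requests determined;
    # .content is the undecoded bytes of the response body.
    return response.text, response.content
```

Calling fetch_page("https://example.com/") performs a real network request, so it is left out here. Beautiful Soup accepts either form; passing the bytes from .content lets the parser work out the encoding itself.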

About UnicodeEncodeError

I got an error when I ran the following sample code on Windows.

from bs4 import BeautifulSoup
import requests as req

url = 'https://www.y-shinno.com/vgg16-finetuning-uecfood100/'
html = req.get(url).content
soup = BeautifulSoup(html, 'html.parser')
text = soup.find(class_='entry-content').get_text()
print(text)

The error returned

UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 1710: illegal multibyte sequence

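When I looked into it, the cause seems to be a mismatch between the way Python handles strings internally (Unicode) and the default encoding of the Windows console (cp932). print() has to encode the text as cp932, and cp932 cannot represent characters such as '\xa0' (a non-breaking space).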

--Reference article: (Windows) Causes and workarounds for UnicodeEncodeError in Python 3

I solved it by adding the code from the following article alongside the imports. (It took a long time to figure out ...)

--Reference article: Countermeasures when a Japanese encoding error occurs in Python

import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Should I always include this when handling Japanese text in Python on Windows? (Please point out anything I've gotten wrong.)
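
To make the cause a little more concrete, here is a small sketch that reproduces the encoding failure without the console and tries a few workarounds. The sample string is made up, and io.BytesIO stands in for sys.stdout.buffer so the example behaves the same on any OS. These are options I'm aware of, not necessarily what the referenced articles recommend.

```python
import io
import sys

text = "price:\xa0100"  # made-up sample containing U+00A0 (non-breaking space)

# cp932 has no mapping for U+00A0, so encoding fails the same way print() did.
try:
    text.encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Option 1: replace characters that the target encoding cannot represent.
replaced = text.encode("cp932", errors="replace").decode("cp932")

# Option 2: write through a UTF-8 text wrapper, as the fix above does.
# BytesIO is used here in place of the real sys.stdout.buffer.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()
utf8_bytes = buffer.getvalue()

# Option 3 (Python 3.7+): change the encoding of the existing stdout in place.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

print(encodable, repr(replaced), utf8_bytes)
```

Whether UTF-8 output then displays correctly still depends on the terminal or editor pane that shows it.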

How to get and scrape elements

You can use find() or select()

--Reference articles
  --Basics of CSS Selector for Web Scraping
  --Differences in how to use find_all() and select() in Beautiful Soup

Note: when you look up a class attribute in the browser's developer tools, copying and pasting the value from the source as is does not always work.

Example

<div class="l-mt10 text-ellipsis__body--2lines contents-list__item-name">AIUEO</div>

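If you paste this class attribute into the selector as is,

soup.select(".l-mt10 text-ellipsis__body--2lines contents-list__item-name")

it doesn't work: a space in a CSS selector means "descendant of", so the second and third names are treated as tag names. Each class name needs its own leading dot, with no spaces in between:

soup.select(".l-mt10.text-ellipsis__body--2lines.contents-list__item-name")

(This is hard to notice ...)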
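
The difference between the two selectors can be checked on the example element itself. In this sketch, classes_to_selector is a hypothetical helper of my own that converts a copied class attribute value into a selector; it is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = '<div class="l-mt10 text-ellipsis__body--2lines contents-list__item-name">AIUEO</div>'
soup = BeautifulSoup(html, "html.parser")

# The value as copied from the developer tools: class names separated by spaces.
class_attr = "l-mt10 text-ellipsis__body--2lines contents-list__item-name"


def classes_to_selector(class_value):
    """Hypothetical helper: 'a b c' -> '.a.b.c' (one element with all classes)."""
    return "".join("." + name for name in class_value.split())


selector = classes_to_selector(class_attr)
matched = soup.select(selector)

# With the spaces left in, the later names are read as descendant tag names,
# so nothing matches.
unmatched = soup.select("." + class_attr)

print(selector)
print([element.get_text() for element in matched])
print(len(unmatched))
```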

How to extract only the text from the list returned by select

Combine get_text() with for ... in ... (so that's how the for statement can be used!)

Example

print([t.get_text() for t in text])

Written this way, the for clause applies get_text() to each element of the list in turn, and the results are output as a list.
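
For comparison, here is the same idea on a made-up list, next to the equivalent explicit for loop. The HTML and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<ul>
  <li class="name"> apple </li>
  <li class="name">banana</li>
  <li class="name">cherry</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.select("li.name")  # a list of matching elements

# List comprehension: call get_text() on every element and collect the results.
names = [item.get_text(strip=True) for item in items]

# The same thing written as an ordinary for loop.
names_loop = []
for item in items:
    names_loop.append(item.get_text(strip=True))

print(names)
```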

Handling JSON in Python

If you want to output the scraped elements as a JSON array, use the json module together with itemgetter from the operator module, and loop over the elements with a for statement to arrange them in order.

About json module

Reference article:

--Explanation of how to handle JSON in Python
--Let's master json dumps in Python! encoding, foramt, datetime

The distinction between functions is confusing ...

--json.load(): reads JSON from a file and returns it as a Python object (a dictionary for a JSON object)
--json.loads(): converts JSON held as a string in the program into a Python object
--json.dump(): converts a Python value such as a dictionary to JSON and writes it to a file
--json.dumps(): converts a Python value such as a dictionary to a JSON string and returns it
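
One way to keep the four functions apart: the ones ending in "s" work with strings, the others with files. The sketch below round-trips a made-up dictionary both ways, using io.StringIO as a stand-in for a file opened with open().

```python
import io
import json

record = {"title": "AIUEO", "price": 100}  # made-up data

# dumps: Python object -> JSON string.
# ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable.
as_string = json.dumps(record, ensure_ascii=False)

# loads: JSON string -> Python object.
from_string = json.loads(as_string)

# dump / load: the same conversions, but writing to and reading from a
# file-like object instead of a string.
fake_file = io.StringIO()
json.dump(record, fake_file, ensure_ascii=False, indent=2)
fake_file.seek(0)  # rewind so load() reads from the beginning
from_file = json.load(fake_file)

print(as_string)
print(from_string == record, from_file == record)
```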

About itemgetter of the operator module

I only skimmed this, so I'm just listing the reference material.

--Reference: Active engineers explain how to use itemgetter () in Python [for beginners]
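
As a small sketch of how itemgetter and json can be combined to output a JSON array, assuming the scraped results have already been collected as a list of dictionaries. The rows and key names below are made up.

```python
import json
from operator import itemgetter

# Made-up scraped records.
rows = [
    {"name": "cherry", "price": 300},
    {"name": "apple", "price": 100},
    {"name": "banana", "price": 200},
]

# itemgetter("name") builds a function equivalent to: lambda row: row["name"]
get_name = itemgetter("name")
# With several keys, the function returns a tuple of the values.
get_pair = itemgetter("name", "price")

# Sort by price, then pull out values with a for clause.
sorted_rows = sorted(rows, key=itemgetter("price"))
names = [get_name(row) for row in sorted_rows]
pairs = [get_pair(row) for row in sorted_rows]

# A list of dictionaries serializes to a JSON array of objects.
json_array = json.dumps(sorted_rows, ensure_ascii=False)

print(names)
print(json_array)
```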

How to handle arrays (lists) in Python

I only skimmed this, so I'm just listing the reference material.

--Reference: python array basics are now perfect! Introducing many useful methods
