[EC2] Introduction to scraping with selenium

This article summarizes the steps for using Python and Selenium on EC2, from setup through extracting an element from a specified URL.

Things to do

- Install ChromeDriver
- Install Chrome
- Install Selenium
- Install Japanese fonts
- Extract the text of the specified URL (text.py)
- Get a screen capture of the specified URL (capture.py)

Prerequisites

- You are connected to an EC2 instance over SSH.
- Python 3 is already installed.

See also: How to connect to an EC2 instance using SSH; How to build a Python 3 environment on EC2.

## 1. ChromeDriver installation

① From the official ChromeDriver page, go to the download page for the version you want.

② Copy the link address for linux64.

③ Download and unzip it.

# Move to the tmp directory
$ cd /tmp/

# Download chromedriver (paste the URL you copied)
$ wget https://chromedriver.storage.googleapis.com/83.0.4103.39/chromedriver_linux64.zip

# Unzip
$ unzip chromedriver_linux64.zip

# Move the unzipped file under /usr/bin
$ sudo mv chromedriver /usr/bin/chromedriver
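
The download URL above follows a fixed pattern in which only the version number changes. The following is a minimal sketch, assuming the URL layout seen in the `wget` command; `chromedriver_url` and `find_chromedriver` are hypothetical helper names, not part of any library. It rebuilds the URL from a version string and checks whether `chromedriver` is reachable on `PATH` after the move.

```python
import shutil

# Base of the download URL used in the wget command above.
BASE_URL = "https://chromedriver.storage.googleapis.com"


def chromedriver_url(version, platform="linux64"):
    """Build the zip URL for a ChromeDriver version (assumed URL layout)."""
    return f"{BASE_URL}/{version}/chromedriver_{platform}.zip"


def find_chromedriver():
    """Return the path of chromedriver on PATH, or None if it is not installed."""
    return shutil.which("chromedriver")


print(chromedriver_url("83.0.4103.39"))
print(find_chromedriver() or "chromedriver is not on PATH yet")
```

If the second line reports that chromedriver is missing, the `sudo mv` step probably did not complete.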

## 2. Chrome installation

# Install Chrome with a single command
$ curl https://intoli.com/install-google-chrome.sh | bash

Complete!   <- installation succeeded
Successfully installed Google Chrome!

# Rename the binary
$ sudo mv /usr/bin/google-chrome-stable /usr/bin/google-chrome

# Check the version
$ google-chrome --version && which google-chrome

Google Chrome 83.0.4103.61   <- output of --version
/usr/bin/google-chrome   <- output of which
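
ChromeDriver is released for specific Chrome versions, so it is worth comparing the version printed here with the ChromeDriver version downloaded in step 1. The sketch below is a loose check that compares only the major version number; `major_version` and `versions_compatible` are hypothetical helpers, and the driver string is written by hand from the download URL rather than read from the binary.

```python
import re


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 83.0.4103.61'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1))


def versions_compatible(chrome_output, driver_output):
    """Return True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Chrome string: output of the check above. Driver string: version downloaded in step 1.
print(versions_compatible("Google Chrome 83.0.4103.61", "ChromeDriver 83.0.4103.39"))
```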


## 3. Selenium installation

$ pip3 install selenium
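
To confirm that the package landed in the same Python 3 environment that will run the scripts, one option is to query the installed version from Python. This is a minimal sketch that assumes Python 3.8 or later (for `importlib.metadata`); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("selenium")
print(f"selenium {version}" if version else "selenium is not installed")
```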

## 4. Japanese font installation

```
$ sudo yum install ipa-gothic-fonts ipa-mincho-fonts ipa-pgothic-fonts ipa-pmincho-fonts
```

If you skip this step, Japanese characters will appear garbled in screen captures.

(Image: example of garbled characters)
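
One way to check that Japanese-capable fonts are visible to the system is to ask fontconfig. The sketch below assumes the `fc-list` command from fontconfig is available on the instance and uses its `:lang=ja` pattern; `japanese_fonts` and `list_japanese_fonts` are hypothetical helper names.

```python
import shutil
import subprocess


def japanese_fonts(fc_list_output):
    """Return the non-empty lines of fc-list output, one per matching font file."""
    return [line.strip() for line in fc_list_output.splitlines() if line.strip()]


def list_japanese_fonts():
    """Run `fc-list :lang=ja`; return [] when fc-list is not available."""
    if shutil.which("fc-list") is None:
        return []
    result = subprocess.run(
        ["fc-list", ":lang=ja"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return japanese_fonts(result.stdout)


fonts = list_japanese_fonts()
print(f"{len(fonts)} Japanese-capable font file(s) found")
```

A count of zero after the `yum install` step suggests the fonts were not installed or fontconfig is missing.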

## 5. Extract the text of the specified URL (text.py)

① Create a text.py file in your home directory

$ cd ~
$ touch text.py
$ vi text.py

② The vim editor starts. Paste in the following code.
└ Press the "i" key to enter insert mode.
└ To paste, press Shift + Insert (or right-click and select Paste).

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
# Run Chrome without a display (the options.headless attribute is gone in newer Selenium)
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

# Specify the URL
driver.get("https://www.google.co.jp/")

# Specify the element to scrape (find_element_by_id was removed in Selenium 4)
element_text = driver.find_element(By.ID, "hptl").text

print(element_text)

driver.quit()

③ After pasting, save and exit vim: press Esc, type :wq, then press Enter.

④ Run the file you created

$ python3 text.py

# Success if the following is displayed
About Google Store

This completes scraping the text at the top right of the Google home page.
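
If loading the page or finding the element raises an exception, a script that calls `driver.quit()` only on its last line never reaches it, which may leave a headless Chrome process running on the instance. The following is a minimal sketch of one way to guard against that with `try`/`finally`, using the same URL and element id as text.py; `with_driver` and `main` are hypothetical names, not Selenium APIs.

```python
def with_driver(make_driver, action):
    """Create a driver, run action(driver), and always quit the driver afterwards."""
    driver = make_driver()
    try:
        return action(driver)
    finally:
        # Runs even when action raises, so the browser is always shut down.
        driver.quit()


def main():
    # Imported here so the helper above can be reused without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    def make_driver():
        options = Options()
        options.add_argument("--headless")
        return webdriver.Chrome(options=options)

    def read_text(driver):
        driver.get("https://www.google.co.jp/")
        return driver.find_element(By.ID, "hptl").text

    print(with_driver(make_driver, read_text))
```

On the instance, call `main()` at the bottom of the file to run it.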


## 6. Get a screen capture of the specified URL (capture.py)

① Create a capture.py file in your home directory

$ cd ~
$ touch capture.py
$ vi capture.py

② The vim editor starts. Paste in the following code.
└ Press the "i" key to enter insert mode.
└ To paste, press Shift + Insert (or right-click and select Paste).

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Run Chrome without a display (the options.headless attribute is gone in newer Selenium)
options.add_argument("--headless")

# Specify the window size to capture (width,height in pixels)
options.add_argument("--window-size=1280,1024")

driver = webdriver.Chrome(options=options)

# Specify the URL
driver.get("https://www.google.co.jp/")

# Specify the capture file name and extension
driver.save_screenshot("googletop.png")

driver.quit()

③ After pasting, save and exit vim: press Esc, type :wq, then press Enter.

④ Run the file you created

$ python3 capture.py

# Success if the following file is created in the same directory
$ ls
googletop.png
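
Listing the directory confirms that a file exists, but not that it holds a valid image. As a light sanity check, the sketch below tests whether the file starts with the standard 8-byte PNG signature; `is_png` is a hypothetical helper name, and the file name is the one used in capture.py.

```python
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path):
    """Return True if the file exists and begins with the PNG signature."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


print(is_png("googletop.png"))
```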

As shown, scraping takes relatively little code. From here, change the URL and the elements to extract to suit your own needs.
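
One possible way to make the URL and the target element easy to change is to read them from the command line instead of editing the script. The following is only a sketch: the option names (`--element-id`, `--screenshot`, `--window-size`) and the helpers `build_parser` and `chrome_arguments` are made up for illustration.

```python
import argparse


def build_parser():
    """Command-line options for choosing what to scrape (names are illustrative)."""
    parser = argparse.ArgumentParser(description="Scrape text or capture a page.")
    parser.add_argument("url", help="page to open")
    parser.add_argument("--element-id", help="id of the element whose text to print")
    parser.add_argument("--screenshot", help="file name for a screen capture")
    parser.add_argument(
        "--window-size", default="1280,1024", help="WIDTH,HEIGHT of the browser window"
    )
    return parser


def chrome_arguments(args):
    """Translate parsed options into Chrome command-line flags."""
    return ["--headless", f"--window-size={args.window_size}"]


# Example invocation with the URL and element id used in text.py.
args = build_parser().parse_args(["https://www.google.co.jp/", "--element-id", "hptl"])
print(args.url, args.element_id, chrome_arguments(args))
```

Each flag returned by `chrome_arguments` can be passed to `options.add_argument` before creating the driver.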
