Last time, I created a Git repository on GCP and ran the scraping program (PGM) manually in Cloud Shell. This time, we finally set up automatic scraping on GCE.
(1) Get scraping of the target working locally, for now.
(2) Write the results of local scraping to a Google Spreadsheet.
(3) Run it automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(4)-1 Put the test PGM on the cloud and run it normally in Cloud Shell.
(4)-2 Add the scraping PGM to the repository and run it normally in Cloud Shell.
(4)-3 Create a Compute Engine VM instance and have it run the scraping automatically. ← We are here
(5) Try free, serverless automatic execution on the cloud (probably Cloud Functions + Cloud Scheduler).
(1) Create a GCE instance
(2) Tighten security for SSH connections to the GCE instance
(3) Install git, anyenv, pyenv, and Python 3.8.5 on the GCE instance
(4) Clone the repository to the GCE instance
(5) Crontab settings (test PGM)
(6) Crontab settings (scraping PGM)
Create a Compute Engine instance. The Compute Engine free tier is as follows, so keep the instance within these limits.
Compute Engine
- 1 f1-micro instance (per month, US regions only, excluding Northern Virginia [us-east4])
- 30 GB-months of HDD
- 5 GB-months of snapshot storage (some regions)
- 1 GB of outbound network traffic from North America to all regions (per month, excluding China and Australia)
Quote: Google Cloud Platform Free Tier
The f1-micro free tier limit is based on hours, not on the number of instances. Since 720 hours are free every month, I take that to mean 30 days for an instance that is always running. Should I watch out for a small charge on the 31st day of a month...?
The free tier f1-micro instance limit is based on time, not number of instances. All f1-micro instances will be free to use each month until you have used up the equivalent number of hours in the month. Usage is aggregated for all supported regions.
The Google Cloud free tier is also provided for external IP addresses used by VM instances. The external IP address in use can be used at no additional charge until the total number of hours in the month is used up. Usage is the sum of all in-use external IP addresses in all regions. The Google Cloud free tier for your external IP address applies to all instance types, not just f1-micro instances.
Quote: [Google Cloud Free Program](https://cloud.google.com/free/docs/gcp-free-tier?hl=ja)
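The 720-hour figure corresponds to a 30-day month. The following is a minimal sketch of that arithmetic, assuming a single f1-micro instance that is never stopped; the helper name `hours_in_month` is made up for illustration.

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Hours an always-on instance accumulates in the given month."""
    days = calendar.monthrange(year, month)[1]  # number of days in the month
    return days * 24


# September 2020 has 30 days, October 2020 has 31.
for month in (9, 10):
    print(2020, month, hours_in_month(2020, month))
```

September 2020 comes out at 720 hours and October 2020 at 744. If the quoted wording ("the equivalent number of hours in the month") is taken literally, one always-on instance should stay within the allowance in a 31-day month as well, but that reading is an interpretation rather than something confirmed here.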
Limit SSH connections to your instance and change the default SSH port.
There are two ways to restrict SSH connections to an instance: registering SSH keys in the metadata, or using a feature called OS Login. This time I use OS Login. I found this article an easy-to-understand explanation of the differences: "Use the convenient function 'OS Login' to restrict SSH connection to an instance with IAM".
OS Login also supports two-step verification, so I set that up as well for extra security: [Setting up OS Login using 2-step verification](https://cloud.google.com/compute/docs/oslogin/setup-two-factor-authentication?hl=ja)
I also disable the default SSH port 22. Logging in then requires an extra parameter, but if the port stays at 22 it will be attacked indiscriminately, so it can't be helped. The change is made in the network details, reached from the VM instance details.
After these settings, logging in to the instance looks like the following. To begin with, every account other than those granted access in IAM (in my case, the Google account that is the project owner by default) should be rejected. Then (on the first login) you enter the passphrase of the automatically created "google_compute_engine" SSH key, after which you are shown the two-step verification options; in my case, I log in with a one-time password from the authenticator app on my smartphone.
```bash
hoge@cloudshell:~ (my-hoge-app)$ gcloud compute --project "my-hoge-app" ssh --zone "us-central1-a" "instance-7" --ssh-flag="-p 50050"
Enter passphrase for key '/home/hoge/.ssh/google_compute_engine':
Please choose from the available authentication methods:
1: Google phone prompt
2: Security code from Google Authenticator application
3: Voice or text message verification code
Enter the number for the authentication method to use: 2
Enter your one-time password: xxxxxx
Linux instance-7 4.19.0-10-cloud-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Sep 29 11:38:31 2020 from 35.189.187.53
hoge_gmail_com@instance-7:~$
```
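Because the non-default port has to be passed on every login, it can help to assemble the command in one place. The following is only a sketch: the function name `build_ssh_command` is hypothetical, and the flags are exactly the ones that appear in the login session shown earlier in this section (project, zone, instance name, and `--ssh-flag="-p 50050"`).

```python
import shlex


def build_ssh_command(project: str, zone: str, instance: str, port: int) -> list:
    """Return the argv list for the gcloud SSH command used in this article."""
    return [
        "gcloud", "compute", "--project", project,
        "ssh", "--zone", zone, instance,
        f"--ssh-flag=-p {port}",  # forwarded to ssh so it connects to the custom port
    ]


argv = build_ssh_command("my-hoge-app", "us-central1-a", "instance-7", 50050)
# Print a copy-pasteable command line; actually running it needs gcloud and a real project.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the arguments as a list would also allow passing them to `subprocess.run` later without any shell quoting concerns.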
From here I install Python, but it is not straightforward. If you simply try to install Python, there is a high chance of an error caused by insufficient swap space.
So I gratefully followed the method described on this site: "Building an environment where Python works in GCE's f1-micro environment".
I don't fully understand what every command means, but since this is an environment that is perfectly fine to break, I entered the commands exactly as given. The overall flow is git, anyenv, pyenv, and finally Python 3.8.5.
```bash
hoge_gmail_com@instance-7:~$ sudo dd if=/dev/zero of=/var/swapfile bs=1M count=1200
1200+0 records in
1200+0 records out
1258291200 bytes (1.3 GB, 1.2 GiB) copied, 11.2339 s, 112 MB/s
hoge_gmail_com@instance-7:~$ sudo chmod 600 /var/swapfile
hoge_gmail_com@instance-7:~$ sudo mkswap -L swap /var/swapfile
Setting up swapspace version 1, size = 1.2 GiB (1258287104 bytes)
LABEL=swap, UUID=80b8b0ee-3779-4f2d-b9cb-00cccd3f401f
hoge_gmail_com@instance-7:~$ sudo swapon /var/swapfile
hoge_gmail_com@instance-7:~$ cat /proc/swaps
Filename Type Size Used Priority
/var/swapfile file 1228796 0 -2
hoge_gmail_com@instance-7:~$ echo '/var/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab
/var/swapfile swap swap defaults 0 0
hoge_gmail_com@instance-7:~$
```
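To confirm that the swap file is really active before starting the long Python build, the `/proc/swaps` output can be parsed. This is a minimal sketch; `parse_proc_swaps` is a hypothetical helper, and the sample text is copied from the `cat /proc/swaps` output in this article (the Size column is in KiB).

```python
def parse_proc_swaps(text: str) -> list:
    """Parse /proc/swaps content into a list of dicts (sizes are in KiB)."""
    entries = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore malformed or empty lines
        entries.append({
            "filename": fields[0],
            "type": fields[1],
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries


# Sample taken from the session in this article.
SAMPLE = """Filename Type Size Used Priority
/var/swapfile file 1228796 0 -2
"""

for entry in parse_proc_swaps(SAMPLE):
    print(entry["filename"], entry["size_kib"] * 1024, "bytes")
```

On the instance itself, the same function could be fed the result of `open("/proc/swaps").read()`. The byte count computed from the sample matches the size that `mkswap` reported.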
anyenv is installed with git, but git is not available in the initial state, so install it first.
```bash
hoge_gmail_com@instance-7:~$ sudo apt-get install git-all
```
The log scrolls by at a furious pace; it felt like about 5 minutes. It finishes roughly like this.
```bash
Install emacsen-common for emacs
emacsen-common: Handling install of emacsen flavor emacs
Install git for emacs
Setting up git-el (1:2.20.1-2+deb10u3) ...
Install git for emacs
Install git for emacs
Setting up emacs (1:26.1+1-3.2+deb10u1) ...
Setting up git-all (1:2.20.1-2+deb10u3) ...
Processing triggers for libgdk-pixbuf2.0-0:amd64 (2.38.1+dfsg-1) ...
Processing triggers for libc-bin (2.28-10) ...
hoge_gmail_com@instance-7:~$
```
Go back to the steps on the site mentioned above and install anyenv.
```bash
hoge_gmail_com@instance-7:~$ git clone https://github.com/anyenv/anyenv ~/.anyenv
Cloning into '/home/hoge_gmail_com/.anyenv'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 406 (delta 3), reused 4 (delta 2), pack-reused 392
Receiving objects: 100% (406/406), 70.99 KiB | 3.09 MiB/s, done.
Resolving deltas: 100% (179/179), done.
hoge_gmail_com@instance-7:~$
```
I ran the remaining commands exactly as the procedure describes.
I then installed pyenv, also following the procedure. Which version of pyenv did I get...?
```bash
hoge_gmail_com@instance-7:~$ pyenv --version
pyenv 1.2.20-7-gdd62b0d1
```
Finally, install Python 3.8.5 with pyenv (almost no log output, but it takes about 15 minutes), then set the global version to 3.8.5.
```bash
hoge_gmail_com@instance-7:~$ pyenv install 3.8.5
Downloading Python-3.8.5.tar.xz...
-> https://www.python.org/ftp/python/3.8.5/Python-3.8.5.tar.xz
Installing Python-3.8.5...
Installed Python-3.8.5 to /home/hoge_gmail_com/.anyenv/envs/pyenv/versions/3.8.5
hoge_gmail_com@instance-7:~$ pyenv global 3.8.5
hoge_gmail_com@instance-7:~$ python --version
Python 3.8.5
hoge_gmail_com@instance-7:~$
```
Let's clone the Cloud Source Repositories repository right away.
```bash
instance-7:10/01/20 12:24:54 ~ $ gcloud source repos clone gce-cron-test
ERROR: (gcloud.source.repos.clone) PERMISSION_DENIED: Request had insufficient authentication scopes.
If you are in a compute engine VM, it is likely that the specified scopes during VM creation are not enough to run this command.
See https://cloud.google.com/compute/docs/access/service-accounts#accesscopesiam for more information of access scopes.
See https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes for how to update access scopes of the VM.
```
The same gcloud source repos clone that worked in Cloud Shell failed here...
The Cloud API access scope appears to be the problem, so fix it. Stop the VM instance, then go to [VM instance details] > [Edit] and, under "Cloud API access scopes" at the bottom, change "Cloud Source Repositories" from Disabled to Read Only.
You should now be able to work with Cloud Source Repositories from your VM.
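Whether the new scope took effect can also be checked from inside the VM by asking the metadata server which scopes the default service account has. The sketch below rests on assumptions: the scope URLs in `SOURCE_READ_SCOPES` are the ones that, to my understanding, grant at least read access to Cloud Source Repositories, and the function names are made up. `fetch_scopes` only works on a Compute Engine VM, so the demonstration at the bottom uses a hard-coded sample instead of calling it.

```python
import urllib.request

# Metadata server endpoint that lists the access scopes of the default service account.
SCOPES_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes"
)

# Assumption: any of these scopes is enough to read Cloud Source Repositories.
SOURCE_READ_SCOPES = {
    "https://www.googleapis.com/auth/source.read_only",
    "https://www.googleapis.com/auth/source.read_write",
    "https://www.googleapis.com/auth/source.full_control",
    "https://www.googleapis.com/auth/cloud-platform",
}


def can_read_source_repos(scopes_text: str) -> bool:
    """True if the newline-separated scope list contains a suitable scope."""
    return bool(set(scopes_text.split()) & SOURCE_READ_SCOPES)


def fetch_scopes(timeout: float = 2.0) -> str:
    """Fetch the scope list from the metadata server (Compute Engine VMs only)."""
    request = urllib.request.Request(SCOPES_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hard-coded sample of what the scope list might look like after the change.
sample = (
    "https://www.googleapis.com/auth/devstorage.read_only\n"
    "https://www.googleapis.com/auth/source.read_only\n"
)
print(can_read_source_repos(sample))
```

On the instance, `can_read_source_repos(fetch_scopes())` would give the answer for the real scope list.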
```bash
instance-7:10/01/20 12:51:36 ~ $ gcloud source repos clone gce-cron-test
Cloning into '/home/hogehoge_gmail_com/gce-cron-test'...
remote: Total 11 (delta 1), reused 11 (delta 1)
Unpacking objects: 100% (11/11), done.
Project [my-gce-app] repository [gce-cron-test] was cloned to [/home/hogehoge_gmail_com/gce-cron-test].
```
It worked. Check the contents of the gce-cron-test directory.
```bash
instance-7:10/01/20 12:53:20 ~ $ cd gce-cron-test
instance-7:10/01/20 12:53:42 ~/gce-cron-test $ ls -la
total 36
drwxr-xr-x 3 hogehoge_gmail_com hogehoge_gmail_com 4096 Oct 1 12:52 .
drwxr-xr-x 6 hogehoge_gmail_com hogehoge_gmail_com 4096 Oct 1 12:52 ..
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 6148 Oct 1 12:52 .DS_Store
drwxr-xr-x 8 hogehoge_gmail_com hogehoge_gmail_com 4096 Oct 1 12:52 .git
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 146 Oct 1 12:52 cron-test.py
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 2352 Oct 1 12:52 my-web-scraping-app-6293fbee8c53.json
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 2763 Oct 1 12:52 requests-test2.py
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 334 Oct 1 12:52 requirements.txt
instance-7:10/01/20 12:53:47 ~/gce-cron-test $
```
The familiar files are all there, so this is a complete success. Next, check the Python path.
```bash
instance-7:10/01/20 12:58:06 ~ $ which python
/home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python
instance-7:10/01/20 12:59:10 ~ $
```
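cron does not load the login shell's environment, which is why the absolute path matters. One way to see which interpreter a given command actually starts is to ask Python itself. This is a small sketch with a hypothetical helper name; it uses only the standard `sys` module.

```python
import sys


def interpreter_info() -> dict:
    """Report the path and version of the interpreter running this script."""
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


print(interpreter_info())
```

Run through the pyenv shim, this would typically report the real interpreter under `versions/3.8.5/` rather than the shim path itself, since the shim only forwards to it; I have not verified this on the instance described here.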
Now, at last, edit the crontab. The first time, you are asked to choose an editor; I played it safe and chose vim.
```bash
instance-7:10/01/20 13:05:36 ~/gce-cron-test $ crontab -e
no crontab for hogehoge_gmail_com - using an empty one
Select an editor. To change later, run 'select-editor'.
1. /bin/nano <---- easiest
2. /usr/bin/vim.basic
3. /usr/bin/vim.tiny
4. /usr/bin/emacs
Choose 1-4 [1]: 2
crontab: installing new crontab
instance-7:10/01/20 13:06:47 ~/gce-cron-test $
```
The message "crontab: installing new crontab" shows that a new crontab was created.
Check the contents with crontab -l. The procedure is the same as before; only the Python path and the PGM and log directories need adjusting.
```bash
instance-7:10/01/20 13:06:52 ~/gce-cron-test $ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
* * * * * cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/cron-test.py >> /home/hogehoge_gmail_com/gce-cron-test/cron.log 2>&1
instance-7:10/01/20 13:06:58 ~/gce-cron-test $
```
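The crontab entry is long, so it may help to see how it is put together: a five-field schedule, a `cd` into the working directory, the absolute path of the interpreter, the script, and a redirect that appends both stdout and stderr to the log. The helper `build_cron_line` below is hypothetical and simply reassembles the entry from this article.

```python
def build_cron_line(schedule: str, workdir: str, python: str, script: str,
                    log: str = "cron.log") -> str:
    """Assemble a crontab entry in the same shape as the one used in this article."""
    return (
        f"{schedule} cd {workdir}; {python} {workdir}/{script} "
        f">> {workdir}/{log} 2>&1"  # append stdout and stderr to the log file
    )


HOME = "/home/hogehoge_gmail_com"
line = build_cron_line(
    "* * * * *",                                # every minute
    f"{HOME}/gce-cron-test",                    # working directory
    f"{HOME}/.anyenv/envs/pyenv/shims/python",  # absolute interpreter path
    "cron-test.py",
)
print(line)
```

Changing only the schedule and script arguments yields the entry for the scraping PGM.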
Here is the result of running the test PGM, before moving on to scraping.
```bash
instance-7:10/01/20 13:09:49 ~/gce-cron-test $ cat cron.log
2020/10/01 13:07:02 cron works!
2020/10/01 13:08:01 cron works!
2020/10/01 13:09:01 cron works!
2020/10/01 13:10:01 cron works!
```
I was able to confirm that crontab works properly on GCE.
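The contents of cron-test.py are not shown in this article. Judging only from the log lines, a test PGM of this kind could be as small as the following sketch; the function name and structure are my assumptions, not the actual file.

```python
from datetime import datetime


def format_log_line(now: datetime, message: str = "cron works!") -> str:
    """Format a timestamped line in the same style as the cron.log entries."""
    return f"{now.strftime('%Y/%m/%d %H:%M:%S')} {message}"


# cron redirects stdout to cron.log, so printing is enough.
print(format_log_line(datetime.now()))
```

Because the crontab entry appends stdout to the log, the script itself does not need to open any file.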
Install the libraries using requirements.txt.
```bash
instance-7:10/02/20 11:54:02 ~/gce-cron-test $ /home/hogehoge_gmail_com/.anyenv/envs/pyenv/versions/3.8.5/bin/python3.8 -m pip install -r requirements.txt
```
They installed without trouble.
```bash
instance-7:10/02/20 11:57:34 ~/gce-cron-test $ pip list
Package Version
-------------------- ---------
beautifulsoup4 4.9.1
cachetools 4.1.1
certifi 2020.6.20
chardet 3.0.4
google-auth 1.21.0
google-auth-oauthlib 0.4.1
gspread 3.6.0
httplib2 0.18.1
idna 2.10
oauth2client 4.1.3
oauthlib 3.1.0
pip 20.2.3
pyasn1 0.4.8
pyasn1-modules 0.2.8
requests 2.24.0
requests-oauthlib 1.3.0
rsa 4.6
setuptools 47.1.0
six 1.15.0
soupsieve 2.0.1
urllib3 1.25.10
```
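The same check can be done from Python, which is handy because it tests the interpreter that cron will actually use. This sketch relies on `importlib.metadata` (available from Python 3.8); the helper name and the list of packages passed to it are illustrative, with the names picked from the `pip list` output.

```python
from importlib import metadata


def missing_packages(names) -> list:
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# An empty list means every package named here is installed for this interpreter.
print(missing_packages(["requests", "beautifulsoup4", "gspread", "oauth2client"]))
```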
Try running the command that will go into crontab directly.
```bash
instance-7:10/02/20 12:06:15 ~/gce-cron-test $ cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/requests-test2.py
2020/10/02 12:06:44 Finished scraping.
instance-7:10/02/20 12:06:46 ~/gce-cron-test $
```
Success.
### Crontab settings (scraping PGM)
Then edit the crontab.
```bash
instance-7:10/02/20 12:06:59 ~/gce-cron-test $ crontab -e
crontab: installing new crontab
instance-7:10/02/20 12:08:54 ~/gce-cron-test $ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
* * * * * cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/cron-test.py >> /home/hogehoge_gmail_com/gce-cron-test/cron.log 2>&1
*/3 * * * * cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/requests-test2.py >> /home/hogehoge_gmail_com/gce-cron-test/cron.log 2>&1
instance-7:10/02/20 12:11:42 ~/gce-cron-test $
```
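The `*/3` in the minute field means "every minute whose value is a multiple of 3, starting from 0". A tiny sketch makes this concrete; it handles only the `*/step` form of the minute field, and the helper name is made up.

```python
def minutes_for_step(step: int) -> list:
    """Minutes of the hour matched by a cron minute field of the form */step."""
    return list(range(0, 60, step))


print(minutes_for_step(3))
```

The list includes 9, 12, and 15, which lines up with scraping runs logged at 12:09, 12:12, and 12:15.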
It ran fine on the every-3-minutes setting! Automatic execution of Python scraping on GCE is a success!
```bash
instance-7:10/02/20 12:16:38 ~/gce-cron-test $ cat cron.log
2020/10/01 13:04:53 cron works!
2020/10/01 13:07:02 cron works!
2020/10/01 13:08:01 cron works!
2020/10/01 13:09:01 cron works!
2020/10/01 13:10:01 cron works!
2020/10/02 12:09:21 Scraping has ended.
2020/10/02 12:12:21 Scraping has ended.
2020/10/02 12:15:21 Scraping has ended.
instance-7:10/02/20 12:16:48 ~/gce-cron-test $
```
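Rather than eyeballing the timestamps, the interval between runs can be computed from the log. This is a minimal sketch assuming the `YYYY/MM/DD HH:MM:SS message` line format seen in cron.log; the function name is hypothetical and the sample lines are copied from the log in this article.

```python
from datetime import datetime

LOG_FORMAT = "%Y/%m/%d %H:%M:%S"


def run_intervals(log_text: str, marker: str) -> list:
    """Seconds between consecutive log lines that contain `marker`."""
    times = []
    for line in log_text.splitlines():
        if marker in line:
            stamp = " ".join(line.split()[:2])  # date and time are the first two fields
            times.append(datetime.strptime(stamp, LOG_FORMAT))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


SAMPLE = """\
2020/10/01 13:10:01 cron works!
2020/10/02 12:09:21 Scraping has ended.
2020/10/02 12:12:21 Scraping has ended.
2020/10/02 12:15:21 Scraping has ended.
"""

print(run_intervals(SAMPLE, "Scraping has ended."))  # [180.0, 180.0]
```

Each gap is 180 seconds, which is consistent with the `*/3` schedule.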
# Summary
- Crontab works on GCE.
- Scraping runs even on the free-tier performance of a GCE instance.
- A GCE VM instance does not have git in its initial state, so it must be installed before use.
- A GCE VM instance cannot operate on repositories unless the access scope is set.
- Installing Python 3 on a free-tier GCE instance may fail unless the swap area is set up properly first.