Scraping Alexa's web rank with pyQuery

What I wanted to do

I tried scraping with pyQuery. I also found a library called Scrapy, but it bundles a crawler and looked like more trouble than I needed, so I skipped it. BeautifulSoup looks good too, but this time I went with pyQuery.

Installation

$ yum install libxml2-devel libxslt-devel
$ pip install pyquery

pyQuery is built on lxml, which needs libxml2 and libxslt, so install those first. If you don't have pip, install it as well.

Reference (trying the pyQuery sample)

First I tried scraping an earthquake information site using the sample code from [here][Ref1].

pqsample.py


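import pyquery

# Fetch the page and parse it as HTML
query = pyquery.PyQuery("http://www.jma.go.jp/jp/quake/quake_local_index.html", parser='html')
# Print the text of every <tr> under the element with class "infotable"
for tr in query('.infotable')('tr'):
    print(query(tr).text())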

This code loops over the <tr> tags under the element with class="infotable" and prints the text of each one. Checking the page's HTML structure with Chrome's developer tools showed the following.

image

Running python pqsample.py printed the following earthquake information, just as expected. Easy indeed.

Information announcement date and time Occurrence date and time Epicenter Place name Magnitude Maximum seismic intensity
December 03, 2014 14:38 Around 14:32 on March 3, 2014 Northern Nagano Prefecture M1.6 Seismic intensity 1
December 03, 2014 06:03 Around 06:00 on the 3rd Northern Nagano Prefecture M2.0 Seismic intensity 1

Alexa ranking analysis

Now that I knew it worked, I moved on to the site I actually wanted to scrape. Open the target page in Chrome, open the developer tools (Ctrl+Shift+I), click the magnifying glass icon, and then click the element you want to examine. The DOM tree is displayed as shown below. (If you use Firefox, you can check it in the Inspector.)

image

Given this tree structure, the <li> tags can be listed using class="site-listing" as the key. The rank is in the element with class count, and the domain is in the <a> tag under desc-paragraph. I wrote code that outputs these as CSV in a for loop.

alexa.py


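import pyquery

# The top-sites list is split across 20 pages (0-19)
for page in range(20):
    query = pyquery.PyQuery("http://www.alexa.com/topsites/countries;" + str(page) + "/PE", parser='html')
    # Each <li class="site-listing"> holds one ranked site
    for li in query('.site-listing')('li'):
        # Output "rank, domain" as one CSV line
        print(query(li)('.count').text() + ", " + query(li)('.desc-paragraph')('a').text())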

This time I wanted the ranking for Peru, so I put the country code PE at the end of the URL. Swap in any other country code to get that country's page. The code loops over 20 HTML pages. Run it with python alexa.py.

image

The CSV is done. Great success. From here it can be used to build a table in Excel, or as input for connection tests with curl.
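If the script is reused for other countries, the URL pattern and the CSV output can be pulled into small helpers. The following is only a sketch using the standard library: topsites_urls and rows_to_csv are hypothetical names, the header row is my own addition, and the sample rows are placeholders rather than real ranking data.

```python
import csv
import io


def topsites_urls(country_code, pages=20):
    """Build the per-page URLs in the same pattern alexa.py uses."""
    pattern = "http://www.alexa.com/topsites/countries;{page}/{code}"
    return [pattern.format(page=page, code=country_code) for page in range(pages)]


def rows_to_csv(rows):
    """Serialize (rank, domain) pairs as CSV text, quoting fields when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "domain"])  # header row (an addition, not in alexa.py)
    writer.writerows(rows)
    return buffer.getvalue()


urls = topsites_urls("PE")
print(urls[0])
print(len(urls))

# Placeholder rows for illustration only, not real ranking data.
sample_rows = [("1", "example.com"), ("2", "example.org")]
print(rows_to_csv(sample_rows), end="")
```

Using csv.writer instead of joining strings with ", " means a field that happens to contain a comma is quoted automatically, so Excel reads the columns correctly.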

Summary

- With the Chrome + pyQuery combo, it is easy and comfortable to scrape information you would otherwise collect by cutting and pasting.
- The Alexa API is available from AWS, but it does not seem to return the top-sites list, so scraping fills that gap.
- I may write a follow-up on simple connection tests with curl soon.

Reference site

[Scraping with Python (pyquery)][Ref1]

[Ref1]: http://d.hatena.ne.jp/kouichi501t/20130407/1365328955
