This is a memo of tools that can be used when scraping with Python.
The easiest way to access the web in Python is to use requests. You can install it with pip.
For GET and POST, using requests.get and requests.post is generally sufficient.
Installation
$ pip install requests
Please see here for details. http://requests-docs-ja.readthedocs.org/en/latest/
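For example, a simple GET and POST look like the following (a minimal sketch; httpbin.org is used here only as a placeholder test endpoint):

import requests

# GET request with query parameters
res = requests.get('https://httpbin.org/get', params={'q': 'python'})
print(res.status_code)
print(res.text)

# POST request with form data
res = requests.post('https://httpbin.org/post', data={'name': 'spam'})
print(res.json())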
BeautifulSoup4 is a good way to parse HTML.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div><h1 id="test">TEST</h1></div>', 'html.parser')
>>> soup.select_one('div h1#test').text
'TEST'
The text inside a tag is available as soup.text, and attributes can be accessed with soup['id'] (where id is the attribute name).
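Continuing the example above, attribute access looks like this:

>>> tag = soup.select_one('div h1#test')
>>> tag.text
'TEST'
>>> tag['id']
'test'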
Frequently used methods of BeautifulSoup object
- BeautifulSoup.find() -> Search for tags and return the first matching tag
- BeautifulSoup.find_all() -> Search for tags and return a list of matching tags
- BeautifulSoup.find_previous() -> Return the previous tag
- BeautifulSoup.find_next() -> Return the next tag
- BeautifulSoup.find_parent() -> Return the parent tag
- BeautifulSoup.select() -> Search with a CSS selector and return a list of matching tags
- BeautifulSoup.select_one() -> Search with a CSS selector and return the first matching tag
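A rough sketch of a few of these, using a slightly larger snippet of HTML (the markup here is just an example):

>>> soup = BeautifulSoup('<div><h1 id="test">TEST</h1><p>a</p><p>b</p></div>', 'html.parser')
>>> soup.find('p').text
'a'
>>> [p.text for p in soup.find_all('p')]
['a', 'b']
>>> soup.select('div p')[1].text
'b'
>>> soup.find('p').find_parent().name
'div'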
Please see here for details. http://kondou.com/BS4/
CSV is a comma-separated values file format. You can use the csv module. Learn more about the csv module here. http://docs.python.jp/3.4/library/csv.html
import csv

# Write rows to a CSV file (Python 3: open in text mode with newline='')
someiterable = [['name', 'price'], ['spam', 100]]  # any iterable of rows
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
import csv

# Read rows back from the CSV file (Python 3: open in text mode with newline='')
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The JSON format is also commonly used. Use the json module from the standard library.
>>> import json
>>> json.dumps([1, 2, 3, 4])
'[1, 2, 3, 4]'
>>> json.loads('[1, 2, 3, 4]')
[1, 2, 3, 4]
>>> json.dumps({'aho': 1, 'ajo': 2})
'{"aho": 1, "ajo": 2}'
>>> json.loads('{"aho": 1, "ajo": 2}')
{'aho': 1, 'ajo': 2}
- json.dumps() -> Convert an object to a JSON string
- json.loads() -> Convert a JSON string to an object
- json.dump() -> Convert an object to a JSON string and write it to a file
- json.load() -> Read the JSON string from a file and convert it to an object
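A minimal sketch of dump/load with a file (the file name is just an example):

>>> with open('some.json', 'w') as f:
...     json.dump({'aho': 1, 'ajo': 2}, f)
...
>>> with open('some.json') as f:
...     json.load(f)
...
{'aho': 1, 'ajo': 2}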
Please see here for details. http://docs.python.jp/3.4/library/json.html
I have prepared some scraping samples; please refer to them. However, these are ordinary, live sites, so please do not bombard them with requests. Do not just spin a loop of requests at them, not even by mistake (a minimal delay sketch follows the list below).
- Extract tutorial information from PyConJP https://github.com/TakesxiSximada/happy-scraping/tree/master/pycon.jp
- Extract new package information from PyPI https://github.com/TakesxiSximada/happy-scraping/tree/master/pypi.python.org
- Break through Django's Admin site authentication https://github.com/TakesxiSximada/happy-scraping/tree/master/djangoadmin
- User-Agent spoofing https://github.com/TakesxiSximada/happy-scraping/tree/master/fake-useragent
- Extract data dynamically generated by JavaScript https://github.com/TakesxiSximada/happy-scraping/tree/master/dynamic-page
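As mentioned above, be gentle with the target sites. A minimal delay sketch, assuming a placeholder list of URLs:

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for url in urls:
    res = requests.get(url)
    print(res.status_code)
    time.sleep(1)  # wait between requests so we do not hammer the server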
- https://teratail.com/ It might be a good idea to scrape the entries on the top page.
- http://isitchristmas.com/ Christmas judgment (timely)
- https://data.nasa.gov/developer NASA data is available, so it may be interesting to look into.
There are many other sites that look good ...