If the part you want to extract from XML / HTML contains a distinctive tag, BeautifulSoup can search for that tag directly, which is very useful. However, if no tag with such characteristics exists, there is no choice but to follow the tags in order from the root tag down to the target tag. BeautifulSoup doesn't support XPath, so it is a little weak at this kind of traversal.
Therefore, I wrote a function that selects a tag from a simple path.
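To make the difference concrete, here is a small sketch of the two situations. The HTML snippet and the `id="target"` attribute are made up for illustration, and I assume the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration only.
html = """<html><body>
<table><tr><td>a</td><td>b</td></tr><tr><td id="target">c</td><td>d</td></tr></table>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Easy case: the target has a distinguishing attribute, so one search is enough.
by_id = soup.find("td", id="target")

# Hard case: pretend the id is not there and walk down from the root by hand.
table = soup.html.body.find("table", recursive=False)
second_row = table.find_all("tr", recursive=False)[1]
by_walk = second_row.find_all("td", recursive=False)[0]

print(by_walk.get_text())  # c
```

Both lookups end at the same tag, but the second one takes a line of code for every level of the tree.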
**How to specify PATH**
(1) Basic form: list the tag names in order, separated by '/'.
    html/body/table/tr/td
    "The first td under the first tr under the first table under body"
(2) Selecting among sibling tags: append '[n]' to give the position n (counting from 1) of the tag you want among siblings with the same name.
    html/body/table[1]/tr[3]/td
    "The first td under the third tr under the first table under body"
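As a rough illustration of this syntax, the sketch below splits a path into (tag name, position) pairs using only the standard library. The helper name `parse_path` is hypothetical, and treating a step without '[n]' as position 1 is my assumption based on the description above.

```python
import re

# One step is a tag name, optionally followed by a 1-based index in brackets.
STEP = re.compile(r'^(\w+)(?:\[(\d+)\])?$')

def parse_path(path):
    """Split a simple path into (tag_name, position) pairs (hypothetical helper)."""
    steps = []
    for part in path.strip('/').split('/'):
        m = STEP.match(part)
        if m is None:
            raise ValueError('invalid path step: %r' % part)
        name, n = m.group(1), m.group(2)
        # Assumption: a step without '[n]' means the first matching tag.
        steps.append((name, int(n) if n else 1))
    return steps

print(parse_path('html/body/table[1]/tr[3]/td'))
# [('html', 1), ('body', 1), ('table', 1), ('tr', 3), ('td', 1)]
```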
b4_path.py
from bs4 import BeautifulSoup
import re
#<SUBROUTINE>###################################################################
# Function: Find the root node of any tag
################################################################################
def root(self):
    # The BeautifulSoup object itself is the root and is named '[document]'.
    if self.name == u'[document]':
        return self
    else:
        # .parents yields ancestors from nearest to farthest; the last one is the root.
        return list(self.parents)[-1]
BeautifulSoup.root = root
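#<SUBROUTINE>###################################################################
# Function: Extract the tag specified by a simple PATH
#
################################################################################
def path(self, path):
    # Matches a step with a sibling index, e.g. 'tr[3]'.
    re_siblings = re.compile(r'(\w+)\[(\d+)\]')
    # A leading '/' means an absolute path: start from the root.
    if path.startswith('/'):
        node = self.root()
    else:
        node = self
    path = path.strip('/')
    for arrow in path.split('/'):
        # Skip 'tbody' steps: browsers add tbody to tables, but it is often
        # absent from the source HTML.
        if arrow == 'tbody':
            continue
        match = re_siblings.match(arrow)
        if match:
            # 'name[n]': the n-th (1-based) direct child with that name.
            arrow = match.group(1)
            num = int(match.group(2))-1
            node = node.find_all(arrow, recursive=False)[num]
        else:
            # Plain 'name': the first matching tag under the current node.
            node = getattr(node, arrow)
    return node
BeautifulSoup.path = path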
#<TEST>#########################################################################
# Function: Test routine
################################################################################
if __name__ == '__main__':
    soup = BeautifulSoup("""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""", 'html.parser')
    # Prints the first <a> (Elsie) inside the second <p> under body.
    print(soup.path('/html/body/p[2]/a[1]').prettify())
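In the script above, path() is attached to the BeautifulSoup class, so it is called on the soup object. If you also want relative paths starting from an arbitrary tag, one possible variation is a plain function that takes the starting node as an argument. The following is only a sketch under that assumption: the name `follow_path` and the shortened sample HTML are hypothetical, and I assume the `html.parser` backend.

```python
import re
from bs4 import BeautifulSoup

# A step with a sibling index, e.g. 'a[2]'.
STEP = re.compile(r'(\w+)\[(\d+)\]$')

def follow_path(node, path):
    """Hypothetical standalone variant of path(); the start node is passed in."""
    if path.startswith('/'):
        # Absolute path: climb to the topmost ancestor first.
        parents = list(node.parents)
        node = parents[-1] if parents else node
    for step in path.strip('/').split('/'):
        match = STEP.match(step)
        if match:
            # 'name[n]': the n-th (1-based) direct child with that name.
            name, n = match.group(1), int(match.group(2))
            node = node.find_all(name, recursive=False)[n - 1]
        else:
            # Plain 'name': the first matching tag under the current node.
            node = node.find(step)
    return node

# Shortened, made-up sample document.
html = """<html><body>
<p class="title"><b>Title</b></p>
<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>
</body></html>"""
soup = BeautifulSoup(html, 'html.parser')

second_p = follow_path(soup, '/html/body/p[2]')
link = follow_path(second_p, 'a[2]')               # relative to second_p
same = follow_path(link, '/html/body/p[2]/a[2]')   # absolute, from a deep node

print(link['id'])  # link2
```

Passing the node explicitly avoids modifying the library's classes, at the cost of the slightly less compact call syntax.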