BeautifulSoup trick: Locate a Tag by specifying its path

If the part you want to extract from XML / HTML contains a distinctive tag, BeautifulSoup can search for it using that tag as a keyword, which is very convenient. If no such distinctive tag exists, however, you have no choice but to follow the tags one by one from the root tag down to the target tag. BeautifulSoup does not support XPath, so it is a little weak at this kind of traversal.
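To make the contrast concrete, here is a small sketch. The HTML snippet and the variable names are made up for illustration only. The first lookup has an id to search for; the second has nothing distinctive, so it descends one level at a time with find_all(..., recursive=False).

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for this illustration.
html = """
<html><body>
<table id="prices"><tr><td>apple</td><td>100</td></tr></table>
<table><tr><td>pear</td><td>120</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Easy case: the table has a distinguishing id, so one search finds it.
by_keyword = soup.find("table", id="prices")

# Hard case: the second table has nothing to search for, so we walk down
# from the root, picking direct children by position at each level.
body = soup.html.body
second_table = body.find_all("table", recursive=False)[1]
first_row = second_table.find_all("tr", recursive=False)[0]
cell = first_row.find_all("td", recursive=False)[1]

print(by_keyword.td.get_text())  # text of the first cell in the "prices" table
print(cell.get_text())           # text of the second cell in the second table
```

The second lookup is exactly the kind of step-by-step descent that a path string can express in one line.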

So I wrote a function that locates a tag from a simple path.

How to specify the path

(1) Basic form: list the tag names in order, separated by '/'.
  html/body/table/tr/td "The first td under the first tr under the first table under body"

(2) Selecting among sibling tags: use '[n]' to give the position n (counting from 1) at which the tag you want appears among its siblings of the same name.
  html/body/table[1]/tr[3]/td "The first td under the third tr under the first table under body"
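As a rough illustration of this syntax, the sketch below breaks a path into (tag name, index) steps using only the standard library. It reuses the regular expression from the function in this post; parse_path is a hypothetical helper introduced here for explanation and is not part of b4_path.py or bs4.

```python
import re

# Same pattern as in the path() function: a tag name followed by [n].
RE_SIBLING = re.compile(r'(\w+)\[(\d+)\]')

def parse_path(path):
    """Split a simple path into (tag_name, zero_based_index_or_None) steps."""
    steps = []
    for part in path.strip('/').split('/'):
        match = RE_SIBLING.match(part)
        if match:
            # '[n]' counts from 1, Python lists count from 0.
            steps.append((match.group(1), int(match.group(2)) - 1))
        else:
            # No '[n]' given: the step simply names a tag.
            steps.append((part, None))
    return steps

print(parse_path('html/body/table[1]/tr[3]/td'))
# → [('html', None), ('body', None), ('table', 0), ('tr', 2), ('td', None)]
```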

b4_path.py


from bs4 import BeautifulSoup
import re

#<SUBROUTINE>###################################################################
# Function:Find the root tag of any tag
################################################################################
def root(self):
    if self.name == u'[document]':
        return self
    else:
        return [node for node in self.parents][-1]

BeautifulSoup.root = root

#<SUBROUTINE>###################################################################
# Function:Extract the specific tag specified in the simple PATH
# 
################################################################################
def path(self, path):
    re_siblings = re.compile(r'(\w+)\[(\d+)\]')

    if path[0] == '/':
        node = self.root()
    else:
        node = self

    path = path.strip('/')
    for arrow in path.split('/'):
        if arrow == 'tbody':
            continue

        match = re_siblings.match(arrow)
        if match:
            arrow = match.group(1)
            num   = int(match.group(2))-1
            node  = node.find_all(arrow, recursive=False)[num]
        else:
            node  = getattr(node, arrow)
    return node
    
BeautifulSoup.path = path

#<TEST>#########################################################################
# Function:Test routine
################################################################################
if __name__ == '__main__':
    soup = BeautifulSoup("""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
""", "html.parser")

    print(soup.path('/html/body/p[2]/a[1]').prettify())
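One detail worth knowing: for a step without '[n]', the function uses getattr(node, arrow), which in BeautifulSoup is shorthand for node.find(arrow) and therefore searches all descendants, not only direct children. The following is a sketch of a stricter standalone variant, assuming you want every step to match direct children only. find_by_path is a hypothetical name, not part of bs4, and unlike path() above it does not skip 'tbody' or jump to the root for a leading '/'.

```python
import re
from bs4 import BeautifulSoup

# A tag name with an optional [n] suffix.
RE_STEP = re.compile(r'(\w+)(?:\[(\d+)\])?$')

def find_by_path(node, path):
    """Follow a simple path, matching direct children only at each step."""
    for step in path.strip('/').split('/'):
        match = RE_STEP.match(step)
        if match is None:
            raise ValueError('bad path step: %r' % step)
        name = match.group(1)
        index = int(match.group(2) or 1) - 1   # no [n] means the first match
        children = node.find_all(name, recursive=False)
        if not 0 <= index < len(children):
            raise LookupError('no %s[%d] under <%s>' % (name, index + 1, node.name))
        node = children[index]
    return node

# Hypothetical sample markup: one link nested in a div, one directly under body.
soup = BeautifulSoup(
    '<html><body><div><a href="#x">inner</a></div><a href="#y">outer</a></body></html>',
    'html.parser')

print(soup.body.a.get_text())                        # attribute access searches descendants
print(find_by_path(soup, 'html/body/a').get_text())  # direct children of body only
```

Raising LookupError for a missing step is a design choice of this sketch; the original function would instead surface an IndexError or return None, depending on which branch failed.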
