Introduction

JSON has long been the mainstream format for exchanging machine-readable data, but data is still sometimes distributed as XML (for example, data published by older institutions). In natural language processing, too, the dependency parser CaboCha has an option (-f 3) to output its analysis results as XML, and that output can be easier to process than the so-called lattice format.

In the latter case, I had dumped the parse results of a large corpus into XML. When I tried to process that 8 GB XML file on a machine with 64 GB of memory, the memory filled up and the process stalled partway through (it didn't even raise an error). I was a little surprised, because I had gone out of my way to get 64 GB of memory.

The XML in question is a list of many <item> tags hanging under a single <root> tag. In other words, it is a record-style format.

<root>
    <item>...</item>
    <item>...</item>
    ...
    <item>...</item>
</root>

Each item can be processed independently of the other items, so it is enough to look at them one at a time. As many of you know, using an iterator (generator) is memory friendly when that kind of data is huge. The library that handles XML does provide a way to read an XML file as an iterator, but it takes a small trick to make it work.

XML in the Python standard library

The easiest way to work with XML in Python is the standard xml.etree.ElementTree. There is also the well-known BeautifulSoup, but since it is specialized for HTML, it mis-parses parts of the XML I want to handle [^1]. Having been bitten by that, I settled on the standard library. This article describes the precautions to take when parsing XML as an iterator with the standard library's xml package.

Normal usage (put everything in memory)

This is the normal way to use it, without an iterator.

import xml.etree.ElementTree as ET

tree = ET.parse('path/to/xml')

for item in tree.iterfind('item'):
    pass  # do something on item

Here, .iterfind() iterates over the <item> tags in the XML tree. But just before that, ET.parse() has already loaded the entire file, much like file.readlines(). It eats a lot of memory.

Iterating (but still eating memory)

This is what you do when you want to read the file iteratively.

import xml.etree.ElementTree as ET

context = ET.iterparse('path/to/xml')

for event, elem in context:
    if elem.tag == 'item':
        pass  # do something on item

If you change ET.parse() to ET.iterparse(), the XML at the given path is read as an iterator. It is read tag by tag, but context yields a pair of event and elem only when the end of a tag is reached. In that case event == "end" and elem is the element that was just closed.
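To make the default behavior concrete, here is a minimal sketch that runs ET.iterparse() on a tiny in-memory document instead of a real file. The SAMPLE data and the use of io.BytesIO are assumptions made for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the real multi-GB file.
SAMPLE = b"<root><item>a</item><item>b</item></root>"

# With no `events` argument, iterparse reports only "end" events,
# one per closing tag, in document order.
seen = [(event, elem.tag) for event, elem in ET.iterparse(io.BytesIO(SAMPLE))]
print(seen)
```

Each <item> is reported when its closing tag is read, and the last pair is the closing </root>.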

Now you can save memory! Or so you might think, but that is a big mistake. In fact, even if the loop body is nothing but pass, this uses as much memory as the **"normal usage"** above.

**It is an iterator, but context keeps every tag it has read so far.**

Somewhere inside the iterator, a reference called context.root is hidden, and every element that has been parsed stays attached to that root. I didn't know this, because I couldn't find it written in the official documentation. Maybe some people are glad that, unlike an ordinary generator, the elements can be accessed again later. And I can imagine that such a mechanism is needed in order to read and hold the nested structure of XML.
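The retention can be observed on a small scale. The following is a sketch under the assumption of a made-up in-memory document with 1000 items; it does not touch context.root directly, and instead relies on the fact that the last "end" event is for the root element itself.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: 1000 small <item> elements under one <root>.
SAMPLE = b"<root>" + b"<item>x</item>" * 1000 + b"</root>"

last = None
for event, elem in ET.iterparse(io.BytesIO(SAMPLE)):
    last = elem  # the final "end" event is for <root> itself

# Every <item> is still attached to the root after the loop has finished.
retained = len(last)
print(last.tag, retained)
```

Even though the loop did nothing with the items, all of them are still reachable from the root once iteration ends, which is why memory use grows with the file.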

Iterating (without eating memory)

So what should I do? The tip was on the official page of ElementTree from long ago, back when it was a separate library and had not yet been incorporated into the standard library in Python 2.5. I only came to Python with version 3, so I had no idea.

import xml.etree.ElementTree as ET

context = ET.iterparse('path/to/xml', events=('start', 'end'))

_, root = next(context)  # Advance one step to get the root element

for event, elem in context:
    if event == 'end' and elem.tag == 'item':
        # do something on item
        root.clear()  # Empty root once the item has been processed

ET.iterparse() accepts a keyword argument events, and if you include 'start' in it, opening tags are reported as well. The first opening tag is <root>, so keep that element in a variable. The value discarded into _ at this point is the string 'start'.
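As a small check of the event order, the sketch below requests both 'start' and 'end' events on a one-item document. SAMPLE and io.BytesIO are again stand-ins for the real file, chosen only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical one-item sample document.
SAMPLE = b"<root><item>a</item></root>"

context = ET.iterparse(io.BytesIO(SAMPLE), events=("start", "end"))

# The very first event is the opening <root> tag.
first_event, root = next(context)

# The remaining events arrive in document order.
rest = [(event, elem.tag) for event, elem in context]
print(first_event, root.tag, rest)
```

The first pair carries the root element, which is exactly what the next(context) call in the code above captures.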

Once you have hold of root [^2], you can drop the element data from memory by calling .clear() on it each time. I'm happy.
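If the pattern is needed in several places, it could be wrapped in a generator. This is only a sketch: the function name iter_items and its tag parameter are hypothetical, and the sample data is made up so that the example runs without a file.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag="item"):
    """Yield each finished `tag` element, then release it from the root.

    `iter_items` is a hypothetical helper, not part of the standard library.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the opening root tag
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # runs when the caller asks for the next item


# Hypothetical sample: five numbered items.
SAMPLE = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</root>"

texts = [item.text for item in iter_items(io.BytesIO(SAMPLE))]
print(texts)
```

Because root.clear() runs only when the generator is resumed, the caller should finish working with one item before requesting the next one.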

[^1]: If a tag that HTML reserves as a self-closing tag, such as <link />, is used in XML, any text inside it is erased. There was probably a workaround, but I remember it not working.

[^2]: "Taking root" sounds like Android from a long time ago, which is wonderful.

How to save memory when reading huge XML of several GB or more in Python
