It has been a long time since JSON became the mainstream format for exchanging machine-readable data, but data is still sometimes distributed as XML (for example, data published by long-established institutions).
Or, if you do natural language processing, the dependency parser CaboCha, for example, has an option (`-f 3`) that outputs its analysis in XML format, and I think that is sometimes used because the results are easier to process than in the so-called lattice format.
In the latter case, I was dumping the parse results of a large corpus to XML, but when I tried to process the resulting 8 GB XML file on my machine with 64 GB of memory, the memory filled up and the process froze partway through (it did not even throw an error). I was a little taken aback, because I had gone out of my way to give the machine 64 GB.
The XML in question is a list: a number of `<item>` tags hang under the `<root>` tag. You could also call it a record format.
```xml
<root>
  <item>...</item>
  <item>...</item>
  ...
  <item>...</item>
</root>
```
When processing each `item`, the other `item`s are irrelevant, so it is enough to look at them one at a time. Many of you know that using an iterator (generator) is memory friendly when data of this kind is huge. Of course, the library that handles XML also has a way to read an XML file as an iterator, but it takes a little trick.
When working with XML in Python, the standard `xml.etree.ElementTree` is the easy choice. BeautifulSoup is also well known, but since it is specialized for HTML, parts of the XML I wanted to handle were not parsed correctly [^1]. I got bitten by that, so I settled on the standard library.
This article describes what to watch out for when parsing XML as an iterator with the standard library's `xml` package.
First, the usual usage, without an iterator.
```python
import xml.etree.ElementTree as ET

tree = ET.parse('path/to/xml')
for item in tree.iterfind('item'):
    # do something on item
    pass
```
Here `.iterfind()` iterates over the `<item>` tags in the XML tree. But just before that, `ET.parse()` reads the whole file at once, much like `file.readlines()`, and that is what eats so much memory.
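As a rough illustration of that point, here is a minimal sketch using a tiny in-memory document. The five-item XML and the variable names are made up for the example and stand in for the real file. It checks that every `<item>` already exists as a child of the root before the `iterfind()` loop even starts.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; the content is invented for illustration.
xml_bytes = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</root>"

# ET.parse() builds the complete tree in memory right here.
tree = ET.parse(io.BytesIO(xml_bytes))
children_before_loop = len(tree.getroot())

# iterfind() only walks over elements that already exist in that tree.
texts = [item.text for item in tree.iterfind('item')]
```

So the iteration itself is lazy, but the tree it walks over is not.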
Next, the version that reads as an iterator.
```python
import xml.etree.ElementTree as ET

context = ET.iterparse('path/to/xml')
for event, elem in context:
    if elem.tag == 'item':
        # do something on item
        pass
```
If you change `ET.parse()` to `ET.iterparse()`, the XML at the given path is read iterator-style. It is read tag by tag, but `context` yields an `event` and an `elem` only when the end of a tag is reached: `event == "end"`, and `elem` is the element.
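To see which events come back by default, a small sketch like the following can help. The two-item document is an invented stand-in; the point is that only `'end'` events are reported, and that the end of the root tag itself arrives last, which is why the loop checks `elem.tag`.

```python
import io
import xml.etree.ElementTree as ET

# Invented two-item document standing in for the real file.
xml_bytes = b"<root><item>a</item><item>b</item></root>"

# With no `events` argument, only the closing of each tag is reported.
seen = [(event, elem.tag) for event, elem in ET.iterparse(io.BytesIO(xml_bytes))]
# seen == [('end', 'item'), ('end', 'item'), ('end', 'root')]
```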
Now we are saving memory! Or so you might think, but that is a big mistake. In fact, even if `# do something on item` is nothing but `pass`, this uses as much memory as the **"usual usage"**.
**It is an iterator, but `context` keeps every tag it has read so far.**
Somewhere inside the iterator, an attribute called `context.root` is tucked away. I did not know this because I had not found it written in the official documentation. Perhaps some people are glad that, unlike an ordinary generator, the elements can be accessed again later. And I can imagine that some mechanism like this is needed to read and hold the nested structure of XML.
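A small experiment along these lines makes the retention visible. This is a sketch with an invented 1000-item document, and it relies on the `root` attribute that CPython's `iterparse()` iterator exposes once the source has been fully read. After a loop that does nothing at all, the root still holds every item.

```python
import io
import xml.etree.ElementTree as ET

# Invented 1000-item document standing in for the real file.
xml_bytes = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"

context = ET.iterparse(io.BytesIO(xml_bytes))
for event, elem in context:
    pass  # do nothing with the elements

# Once the source is exhausted, the iterator exposes the finished tree.
kept = len(context.root)  # number of <item> children still referenced
```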
So, what should you do? There was a tip on the official page of the library from long ago, when it was still a separate library named `ElementTree`, before it was incorporated into the standard library in Python 2.5. I only started using Python with version 3, so I had no idea.
```python
import xml.etree.ElementTree as ET

context = ET.iterparse('path/to/xml', events=('start', 'end'))
_, root = next(context)  # advance one step to get root
for event, elem in context:
    if event == 'end' and elem.tag == 'item':
        # do something on item
        root.clear()  # empty root when you're done with the item
```
`ET.iterparse()` accepts the keyword argument `events`, and if you include `'start'` in it, opening tags are reported as well. The first opening tag is `<root>`, so save it in a variable. The value discarded into `_` at this point is the string `'start'`.
Once you have taken `root` [^2], you can push the element data out of memory by calling `.clear()` each time. I'm happy.
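One way to package the trick is a small generator. This is only a sketch: `iter_items` is a hypothetical helper name, not part of the library, and it assumes the flat `<root>`/`<item>` layout described earlier. The sample document is invented.

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source, tag='item'):
    """Yield each completed <tag> element, then drop it from the root.

    Hypothetical helper; assumes the items are direct children of the root.
    """
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root tag
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # runs when the consumer asks for the next item

# Invented 1000-item document standing in for the real file.
xml_bytes = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"
texts = [elem.text for elem in iter_items(io.BytesIO(xml_bytes))]
```

Each yielded element should be fully processed before moving on to the next one. Keeping references to old elements would keep them in memory, which brings the original problem back.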
[^1]: If a tag that HTML reserves as a standalone (empty) tag, such as `<link />`, is used in XML, its text is dropped even when there is text inside. There was probably a workaround, but I remember it not working.
[^2]: It sounds like Android from the old days, which is wonderful.