Wednesday, December 4, 2013

Fast XML parsing with XmlReader and LINQ to XML

Using XmlDocument to parse large XML strings can, as we know, spike memory usage, because the entire document is parsed and turned into an in-memory tree of objects. If we want to parse the document using less memory, there are a few alternatives. We could use an XmlReader, but the code can be messy and it’s easy to accidentally read too much (see here). We could use XPath, but that is designed more for searching sections of XML than for parsing an entire document. Lastly, we could use LINQ to XML, which offers the simplicity of XmlDocument along with LINQ queries but by default loads the entire document into memory.

This blog post offered an interesting alternative: combining LINQ to XML with an XmlReader. The hybrid approach seemed to offer the speed of a forward-only XmlReader with the simplicity and functionality of LINQ to XML objects.

The first step was creating a method in a utility class that abstracts away the reader and returns just the matching elements. The secret sauce is the ‘yield return’ statement, which I will explain below.

/// <summary>
/// Given an xml string and target element name return an enumerable for fast lightweight 
/// forward reading through the xml document. 
/// NOTE: This function uses an XmlReader to provide forward access to the xml document. 
/// It is meant for serial single-pass looping over the element collection. Calls to functions 
/// like ToList() will defeat the purpose of this function.
/// </summary>
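// Requires: System.Collections.Generic, System.IO, System.Xml, System.Xml.Linq
public static IEnumerable<XElement> StreamElement(string xmlString, string elementName) {
    using (var reader = XmlReader.Create(new StringReader(xmlString))) {
        // XNode.ReadFrom leaves the reader on the node after the element it just read.
        // If that node is already the next match (adjacent elements with no whitespace
        // between them), ReadToFollowing would skip it, so check reader.Name first.
        while (reader.Name == elementName || reader.ReadToFollowing(elementName))
            yield return (XElement)XNode.ReadFrom(reader);
    }
}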
Say you have a large CD catalog to read in like:
<Catalog>
  <CD>
    <Title>Stop Making Sense</Title>
    <Band>Talking Heads</Band>
    <Year>1984</Year>
  </CD>
  ...
</Catalog>
If you were using an XmlDocument to read that from an XML string and process each element you might have code like:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(catalogXml);
XmlNodeList discs = xmlDoc.GetElementsByTagName("CD");
foreach (XmlElement discElement in discs) {
    //... Process each element
}
You can convert that to using the hybrid LINQ/XmlReader approach like the following:
IEnumerable<XElement> discs = from node in XmlUtils.StreamElement(catalogXml, "CD") select node;
foreach (XElement discElement in discs) {
    //... Process each element
}
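To make the "process each element" placeholder concrete, here is a minimal sketch, assuming every CD has the Title, Band and Year children shown in the sample catalog. The CatalogDemo class, the Describe method and the two-disc catalog are made up for illustration, and StreamElement is repeated only so the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XmlUtils {
    // Streaming helper from earlier in the post, repeated so this sample is self-contained.
    public static IEnumerable<XElement> StreamElement(string xmlString, string elementName) {
        using (var reader = XmlReader.Create(new StringReader(xmlString))) {
            while (reader.Name == elementName || reader.ReadToFollowing(elementName))
                yield return (XElement)XNode.ReadFrom(reader);
        }
    }
}

public static class CatalogDemo {
    // Hypothetical per-element processing: pull out the three child values.
    // The explicit casts on XElement convert the element's text to the target type;
    // the int cast throws if the Year child is missing or not a number.
    public static string Describe(XElement disc) {
        string title = (string)disc.Element("Title");
        string band = (string)disc.Element("Band");
        int year = (int)disc.Element("Year");
        return string.Format("{0} by {1} ({2})", title, band, year);
    }

    public static void Main() {
        // Illustrative data only; a real catalog would be far larger.
        string catalogXml =
            "<Catalog>" +
            "<CD><Title>Stop Making Sense</Title><Band>Talking Heads</Band><Year>1984</Year></CD>" +
            "<CD><Title>Remain in Light</Title><Band>Talking Heads</Band><Year>1980</Year></CD>" +
            "</Catalog>";

        // Only one XElement is materialized at a time inside this loop.
        foreach (XElement discElement in XmlUtils.StreamElement(catalogXml, "CD"))
            Console.WriteLine(Describe(discElement));
    }
}
```

Each XElement is a small, fully navigable subtree, so the usual LINQ to XML calls (Element, Elements, Attribute) work on it even though the surrounding document is never loaded as a whole.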
The one big caveat is that you shouldn’t call anything on the discs collection that has to loop over all of the items to get its answer (e.g. ToList(), Count(), etc.). ToList() buffers every element in memory, and Count() parses the whole document just to produce a number, after which a foreach has to parse it all over again. We are relying on yield to return each element one at a time: we process it and then move on to the next one. This allows the memory associated with individual elements to be garbage collected as we go along rather than held in memory en masse. This approach works best when we have an XML document with a set of homogeneous elements that can be processed in a single forward pass.
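If you do need a count or a filtered subset, one option is to compute it during the single pass. The sketch below assumes the sample catalog's Year child; the SinglePassDemo class, the CountFromDecade method and the three-disc catalog are hypothetical. Where is a deferred LINQ operator, so it filters elements as they stream past instead of buffering them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlUtils {
    // Streaming helper from earlier in the post, repeated so this sample is self-contained.
    public static IEnumerable<XElement> StreamElement(string xmlString, string elementName) {
        using (var reader = XmlReader.Create(new StringReader(xmlString))) {
            while (reader.Name == elementName || reader.ReadToFollowing(elementName))
                yield return (XElement)XNode.ReadFrom(reader);
        }
    }
}

public static class SinglePassDemo {
    // Counts the discs released in [startYear, startYear + 10) in one forward pass.
    public static int CountFromDecade(string catalogXml, int startYear) {
        // Where is lazy: nothing is parsed until the foreach below starts pulling items.
        IEnumerable<XElement> matches = XmlUtils.StreamElement(catalogXml, "CD")
            .Where(cd => {
                int year = (int)cd.Element("Year");
                return year >= startYear && year < startYear + 10;
            });

        int count = 0;
        foreach (XElement cd in matches) {
            // ... process the element here, then tally it in the same pass
            count++;
        }
        return count;
    }

    public static void Main() {
        // Illustrative data only.
        string catalogXml =
            "<Catalog>" +
            "<CD><Title>Stop Making Sense</Title><Band>Talking Heads</Band><Year>1984</Year></CD>" +
            "<CD><Title>Remain in Light</Title><Band>Talking Heads</Band><Year>1980</Year></CD>" +
            "<CD><Title>OK Computer</Title><Band>Radiohead</Band><Year>1997</Year></CD>" +
            "</Catalog>";

        Console.WriteLine(CountFromDecade(catalogXml, 1980)); // prints 2
    }
}
```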

More on yield:
You consume an iterator method by using a foreach statement or LINQ query. Each iteration of the foreach loop calls the iterator method. When a yield return statement is reached in the iterator method, expression is returned, and the current location in code is retained. Execution is restarted from that location the next time that the iterator function is called.
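A tiny iterator, unrelated to XML, makes that hand-off visible. In this sketch the YieldDemo class, the CountTo method and the Log list are invented for illustration. The iterator body runs only as far as the next yield return each time the loop asks for an item, so producer and consumer entries interleave in the log.

```csharp
using System;
using System.Collections.Generic;

public static class YieldDemo {
    // Records the order in which producer and consumer code actually runs.
    public static readonly List<string> Log = new List<string>();

    // Hypothetical iterator: yields 1..max, logging just before each value is handed out.
    public static IEnumerable<int> CountTo(int max) {
        for (int i = 1; i <= max; i++) {
            Log.Add("producing " + i);
            yield return i; // execution pauses here until the caller asks for the next item
        }
    }

    public static void Main() {
        foreach (int n in CountTo(3))
            Log.Add("consuming " + n);

        // prints: producing 1, consuming 1, producing 2, consuming 2, producing 3, consuming 3
        Console.WriteLine(string.Join(", ", Log));
    }
}
```

StreamElement behaves the same way: the XmlReader advances to the next matching element only when the foreach loop requests it.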
One thing to stress when making any performance-related change is that you need to establish baseline performance numbers and then verify that the change actually improves on them. So for each method, record the time and memory use both before and after the change. You can use something like the following to determine the baseline and any performance gains.
Stopwatch stopWatch = Stopwatch.StartNew();
long startMem = GC.GetTotalMemory(false);

// Code to benchmark

stopWatch.Stop();
long endMem = GC.GetTotalMemory(false);
Console.WriteLine("{0} ms", stopWatch.Elapsed.TotalMilliseconds);
Console.WriteLine("{0} bytes", endMem - startMem);
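To run the same measurement against both the XmlDocument version and the streaming version, it may help to wrap the snippet in a small helper. This is only a sketch: the Bench class and Measure method are hypothetical names, and the workload in Main is a stand-in. Treat the memory figure as a rough indicator, since it is a before-and-after difference rather than a peak, and it can even come out negative if a garbage collection runs during the measured code.

```csharp
using System;
using System.Diagnostics;

public static class Bench {
    // Hypothetical helper: runs the action once and prints elapsed time and memory delta.
    public static void Measure(string label, Action action) {
        // Passing true lets the runtime collect garbage first, so leftovers from
        // earlier code are less likely to distort the starting figure.
        long startMem = GC.GetTotalMemory(true);
        Stopwatch stopWatch = Stopwatch.StartNew();

        action(); // code to benchmark

        stopWatch.Stop();
        long endMem = GC.GetTotalMemory(false);
        Console.WriteLine("{0}: {1} ms, {2} bytes",
            label, stopWatch.Elapsed.TotalMilliseconds, endMem - startMem);
    }

    public static void Main() {
        // Stand-in workload; swap in the XmlDocument loop or the StreamElement loop.
        Measure("build a 100,000 character string", () => {
            var text = new string('x', 100000);
            GC.KeepAlive(text);
        });
    }
}
```

Running each variant several times and comparing typical values is usually more trustworthy than a single run, because JIT compilation affects the first call.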
