XML parsers, and the myth of reordering content

XML parsers preserve the order of input elements. Obviously.

There's a story going around which has been living a strange subterranean half-life for years, and just ... won't ... die. It says that XML parsers aren't required to preserve the order of the elements in their input stream when they report them to the application. I know, I know, that sounds neither helpful nor likely, but it is asserted nonetheless.

The argument comes in two halves: (i) there are parsers which don't preserve the order of elements, and (ii) the XML standard doesn't require parsers to do this. There is what appears to be support for both halves of this argument, but both parts collapse under closer scrutiny.

(i) Some parsers mix up the elements

There are some parsers which do not preserve the order of input XML elements, but they are either intended for restricted cases or they are buggy. In the first group, both simplexmlparse in Python and (to pick a random Perl example) XML::Simple document that, because they conveniently represent children in a hash, they have to lose element order.
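To make the hash point concrete, here is a minimal sketch of how a 'convenient' hash representation loses sibling order. The `to_hash` function is a hypothetical simplifier I've made up for illustration; it is not the API of XML::Simple or of any real library. The underlying parser (Python's standard `xml.etree.ElementTree`) reports the children in document order, and it is only the hash layer on top that throws the interleaving away.

```python
import xml.etree.ElementTree as ET

# Three siblings, with a <b> sitting between two <a> elements.
DOC = "<list><a>1</a><b>2</b><a>3</a></list>"


def to_hash(element):
    """Hypothetical simplifier: group the children's text by tag name."""
    grouped = {}
    for child in element:
        grouped.setdefault(child.tag, []).append(child.text)
    return grouped


root = ET.fromstring(DOC)

# What the parser itself reports: children in document order.
document_order = [(child.tag, child.text) for child in root]

# What the hash-based convenience layer keeps.
hashed = to_hash(root)

print(document_order)  # [('a', '1'), ('b', '2'), ('a', '3')]
print(hashed)          # {'a': ['1', '3'], 'b': ['2']}
```

From the hash alone you can no longer tell that the `<b>` came between the two `<a>` elements, which is exactly the restriction such tools warn about.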

I think that what may have happened is that there have been parsers (in Perl?) that implicitly depended on the ordering of hash keys to report parse results to applications. This, obviously, broke, but rather than just fix it, I suspect that someone came over all language-lawyerly (which is never a good look in a Perlmonger, like a goth in legal bands and wig) and declared...

‘(ii) The XML standard doesn’t require it!’

This is a curious argument, because it turns out to be technically true, while being almost entirely meaningless.

It is occasionally asserted that the XML Standard doesn't require that elements are passed through to the application in document order. This is true – the standard doesn’t mandate that.

However, the standard also fails to demand that elements are passed through at all. It does require, in Sect. 2.10, that '[a]n XML processor must always pass all characters in a document that are not markup through to the application' (which, entertainingly enough, is one of the few places in the standard where 'must' and 'application' appear close to each other). But it doesn't require that only the input characters are passed through, nor that those characters are passed through in order.

The upshot, therefore, is that it would appear that a parser could take an input document which looked like:

    <?xml version='1.1'?>
    <el>
      <foo exhortation='ecky thump!'/>
      <content>Hello, World!</content>
    </el>

and pass to the application:

    deHllloorW! Help -- I'm trapped in an XML parser factory.

This would conform to the XML spec, according to these language lawyers. This would not generally be regarded as a useful parser.
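For contrast, here is a sketch of what one real parser does with much the same document. It uses Python's standard `xml.sax` module; the `Recorder` class and its `events` and `text` attributes are names I've invented for the illustration, and I've left out the XML declaration to keep the example independent of which XML versions the underlying parser accepts. This shows the behaviour of one parser, of course, and proves nothing about the spec.

```python
import xml.sax

DOC = b"""<el>
  <foo exhortation='ecky thump!'/>
  <content>Hello, World!</content>
</el>"""


class Recorder(xml.sax.ContentHandler):
    """Records element events and character data as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Character data may arrive in several chunks, so collect the pieces.
        self.text.append(content)


recorder = Recorder()
xml.sax.parseString(DOC, recorder)

for event in recorder.events:
    print(event)
print("".join(recorder.text).strip())  # Hello, World!
```

The events come out as start el, start foo, end foo, start content, end content, end el, which is to say in document order, and the text arrives unscrambled.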

So it's a bad spec?

Well, not really. Standards (I've read a few – but then again, too few to mention) do not exist in a vacuum, but always have at least some sort of anticipated background knowledge. For example, the ISO standard for making tea presumes that you know what tea is, and how to drink it.

Contrary to the spec's Section 1 claim that it is talking about 'the information [an XML processor] must provide to the application', it's actually primarily concerned with XML syntax, rather than a processing model.

The SGML spec doesn't say anything about the order in which parsed elements are reported, either, but I think that anyone reading the SGML or XML specs would be as surprised to have that written down, as a reader of ISO 3103 would be, if they were instructed to ingest the tea orally, rather than through any other orifice.

What is actually specified?

(Note: at this point I know I'm really going on a bit, and it's all becoming slightly nuts, but I'm running out of ways of stating the obvious, here)

The XML Infoset was defined in order to provide the background processing model for all XML standards. In Section 2.2, it states that the information associated with a parsed element includes:

...
4. [children] An ordered list of child information items, in document order. [...]
5. [attributes] An unordered set of attribute information items, one for each of the attributes (specified or defaulted from the DTD) of this element.

The XML spec doesn't say in as many words that it's intended to produce an Infoset, but paragraph 1 of the Infoset spec says:

This specification defines an abstract data set called the XML Information Set (Infoset). Its purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document.

The Infoset therefore doesn't constrain anything, in the sense that it doesn't render a parser non-conformant if it garbles the order of child elements. However, if an XML parser talks about 'children', then it must be presumed to be using the term in the sense defined in the Infoset document, which implies that child elements are reported in document order.
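The Infoset's distinction between ordered [children] and unordered [attributes] shows up fairly directly in ordinary parser APIs. As a small sketch, using Python's standard `xml.etree.ElementTree` (which doesn't claim to be a formal Infoset implementation, so treat this as an illustration rather than a demonstration of conformance): children come back as a sequence, and attributes as a mapping whose equality ignores order. The variable names are mine.

```python
import xml.etree.ElementTree as ET

# Same attributes in a different order, and same children in a different order.
a = ET.fromstring("<el x='1' y='2'><first/><second/></el>")
b = ET.fromstring("<el y='2' x='1'><second/><first/></el>")

# Attributes are exposed as a dict: comparison does not depend on order.
same_attributes = a.attrib == b.attrib

# Children are exposed as a sequence, in document order.
tags_a = [child.tag for child in a]
tags_b = [child.tag for child in b]

print(same_attributes)  # True
print(tags_a)           # ['first', 'second']
print(tags_b)           # ['second', 'first']
```

So the two documents agree on their attributes but not on their children, which is the distinction the Infoset draws.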

The Infoset also informs the XPath and XSLT processing models. Thus if XML parsers really were supposed to have this reordering licence, then these core specifications would be fundamentally incoherent, and the entire XML world would end.

Thus this repeated claim that 'some XML parsers report elements in a different order' is just superstition. To write such a parser would be perverse, and to use it in an application would be foolish.

And finally...

I did find an xml-dev thread about this from 2001. The discussion is slightly complicated by being simultaneously about whether element order has intrinsic semantic significance (which I think was just a misunderstanding on someone's part) and whether a parser must preserve it. The thread ends with a message, in which Tim Bray says:

At 10:44 PM 23/01/01 +0600, Danny Ayers wrote:
If the document order is determined by the XML syntax, then I would have expected this to appear in the rec. Where exactly in the XML spec is document order defined?
It is a well-known hole in the XML spec that it never said that an XML processor is expected to allow software to access elements in the order of their occurrence - and conversely, is not required to do so for attributes. The second edition errata made it clear that attribute order is not significant but did not take the trouble to make the former point more clear. Probably, nobody felt any pressure to do this because every piece of XML software in the universe in fact does this and as far as I know nobody has ever remotely considered not doing so.

Since Tim Bray was one of the XML spec authors, I tend to feel that this should resolve the matter.