HTML/XML whitespace explained

Odd facts about HTML and XML whitespace!

So, you know these three syntaxes:

<em>   [a simple start-tag]

<em thing="foo" guh='bar'>   [a start-tag with attributes]

</em>   [a close-tag]

So there's whitespace in the start-tag when there are attributes. In fact, you must put whitespace before each attribute name, or the tagname and each attribute would just run together.  So, using magenta for the mandatory whitespace:

<em  thing="foo"  guh='bar'>

... and by "whitespace", I mean: some number of spaces (\x20) and/or tabs (\x09) and/or returns (\x0A, \x0D).  Yes, it doesn't have to be just spaces, or just one space.  So, above is where the mandatory whitespace is.
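If you'd like to check the mandatory whitespace for yourself, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree parser as a stand-in for "an XML parser"; the helper name parses_ok is made up for this illustration. It tries a space, a tab, and a line break as the separator, and then tries no separator at all.

```python
import xml.etree.ElementTree as ET

def parses_ok(markup):
    """Return True if the parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# A space, a tab, or a line break each work as the separator
# before an attribute name.
with_space = parses_ok('<em thing="foo" guh=\'bar\'></em>')
with_tab = parses_ok('<em\tthing="foo"\tguh=\'bar\'></em>')
with_newline = parses_ok('<em\nthing="foo"\nguh=\'bar\'></em>')

# No whitespace before the second attribute name: the attributes
# run together and the parser should reject the tag.
run_together = parses_ok('<em thing="foo"guh=\'bar\'></em>')

print(with_space, with_tab, with_newline, run_together)
```

The first three should be accepted and the last one rejected. An HTML parser may be more forgiving than this XML parser, so treat the result as a statement about XML.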

But!  You can put whitespace in other places too.  And here's where, illustrated with orange (optional) blocks:

<em  thing  =  "foo"  guh  =  'bar'  >

And back to the simpler syntaxes:

<em>   =   <em  >

</em>  =  </em  >

So, this is completely valid:

<em
 thing
         = "foo"
 guh
         =  'bar'

 >I like potatoes</em
>

...and it's exactly the same as this:

<em thing="foo" guh='bar'>I like potatoes</em>

Does this shock you!?  Because it is all very extremely true.

And I hope it shows you that you can indent things in all kinds of ways, and you can spread tags over as many lines as you want (instead of thinking that you need to cram an element's start tag and all its attributes all on one line).
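If you don't believe me, here is a small sketch that asks a parser. It assumes Python's standard xml.etree.ElementTree module; the helper name summary is just something I made up for the comparison. It parses the spread-out version and the one-line version and compares what the parser reports for each: element name, attributes, and text.

```python
import xml.etree.ElementTree as ET

# The spread-out version, whitespace and blank line included.
spread_out = """<em
 thing
         = "foo"
 guh
         =  'bar'

 >I like potatoes</em
>"""

# The compact version.
compact = """<em thing="foo" guh='bar'>I like potatoes</em>"""

def summary(markup):
    """Reduce an element to what the parser reports: name, attributes, text."""
    element = ET.fromstring(markup)
    return (element.tag, dict(element.attrib), element.text)

# True when the parser sees no difference between the two spellings.
same = summary(spread_out) == summary(compact)
print(same)
print(summary(spread_out))
```

None of the whitespace inside the tags shows up anywhere in the parsed result, which is the whole point.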


There are some unfortunate surprises, however:

Now, as a programmer, you knew there could be whitespace inside of tags, and if I've shown you more places than you expected, that doesn't bother you, because the parser hasn't been signalling you about any of that whitespace anyway.

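However, whitespace inside quoted attribute values isn't quite preserved: an XML parser turns each literal tab or line break in the value into a space, and for attributes declared as any type other than CDATA it also strips leading and trailing spaces and collapses runs of spaces into one. This is a real problem for you as a coder and as a programmer. The fact that the problem rears its ugly head only once in a while makes it even more of a problem, because it means you can spend years making and parsing XML without ever noticing it.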
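Here is a minimal sketch of the most common way this bites. It assumes Python's standard xml.etree.ElementTree module and no DTD, so the attribute is treated as plain CDATA. It puts a literal tab and a literal line break inside an attribute value, then writes the same two characters as character references (&#9; and &#10;), and looks at what comes back from the parser in each case.

```python
import xml.etree.ElementTree as ET

# A literal tab and a literal line break inside the quoted value.
literal = ET.fromstring('<em thing="one\ttwo\nthree"/>')

# The same two characters written as character references instead.
escaped = ET.fromstring('<em thing="one&#9;two&#10;three"/>')

literal_value = literal.get("thing")
escaped_value = escaped.get("thing")

print(repr(literal_value))  # the literal tab and line break come back as spaces
print(repr(escaped_value))  # the referenced tab and line break survive
```

So if a tab or a line break in an attribute value actually matters to you, the character-reference spelling is the one that should make it through the parser intact.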

TODO: EXPLANATION AND EXAMPLES. 2012-09-13


The rules for whitespace inside tags are in the XML spec, rules 40 ("STag"), 41 ("Attribute"), and 42 ("ETag"), and notably also rules 25 ("Eq") and 3 ("S").

Incidentally, for XML self-closing elements (".../>"), the above rules still hold, with the stipulation that there can't be any space between the "/" and the ">".  In the spec, that's rule 44 ("EmptyElemTag").
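A quick sketch of that stipulation, again assuming Python's standard xml.etree.ElementTree parser and a made-up helper called parses_ok: whitespace before the "/" should be accepted, and whitespace between the "/" and the ">" should not.

```python
import xml.etree.ElementTree as ET

def parses_ok(markup):
    """Return True if the parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# No whitespace at all before "/>".
tight = parses_ok('<em thing="foo"/>')

# Optional whitespace before the "/" is allowed.
space_before_slash = parses_ok('<em thing="foo"  />')

# Whitespace between "/" and ">" breaks the tag.
space_after_slash = parses_ok('<em thing="foo"/ >')

print(tight, space_before_slash, space_after_slash)
```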

The rules for dealing with whitespace inside attribute values are in the XML spec's section 3.3.3, "Attribute-Value Normalization".

And if you really want to consider HTML's whitespace rules as deriving from SGML instead of XML, and if you think that that might make a difference, get the SGML book and have at it.  But be warned, it all looks like THIS!!!

sburke@cpan.org   / Last updated 2012-09-13