Tuesday, January 24, 2006

What's the web made of?

Google has published a neat set of stats, gleaned from the big index (one of the larger collections of HTML documents in existence), on what the web is made of. It's a lot of really, really bad HTML. As weird standards go, HTML has to take the cake.
It takes long enough to roll out a new version of the standard that most browsers just implement new features on their own (and imitate the implementation of new features in other browsers). The standards committees can then base the standards on what works and what doesn't.

Google themselves have added non-standard markup such as the rel="nofollow" value on anchor tags (to tell search engines not to follow or give credit to a link). Microsoft, on the one hand, built a really loose parser for IE6 that handles all the sorts of garbage that are out there. On the other hand, their HTML generation tools murder the standard for no good reason, with extensions and attributes that no browser ever looks at.
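
In markup, nofollow shows up as one token in the rel attribute of an anchor. Here is a minimal sketch, assuming Python's standard html.parser module, of how a crawler might separate the links it should follow from the ones marked nofollow. The names LinkSorter and sort_links are made up for illustration; this is not how Google's spider actually works.

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Hypothetical helper: splits anchor hrefs into follow / nofollow lists."""

    def __init__(self):
        super().__init__()
        self.follow = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel is a space-separated token list; a bare attribute yields None.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.follow.append(href)


def sort_links(html):
    """Return (follow, nofollow) href lists for one HTML string."""
    sorter = LinkSorter()
    sorter.feed(html)
    return sorter.follow, sorter.nofollow
```

Like the loose parsers mentioned above, html.parser does not reject sloppy markup, so the sketch keeps going on pages that would never validate.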

If you had a browser that strictly enforced standards, validating the HTML/XHTML it encountered, the list of websites you could hit might be shorter than the list of opposition political party websites you can click on from Beijing. Still, the standards body seems to have a real purpose and influence, despite the general lack of rule-following behavior.

Some choice quotes from the site:

"The most-used attribute on html elements is xmlns, from misguided people using XHTML but sending it as text/html. They even (just) outnumber the people who specify the lang attribute!"

"A whole slew of people are specifying the xml:lang attribute, which will have absolutely no effect (no HTML processor will look at that attribute; it's an XML attribute). And finally, the fourth most-used attribute on the html element is the dir attribute (used by people who write in languages written right-to-left to make the text render in the right way)."

"One conclusion one can draw from the spread of attributes used on the body element is that authors don't care about what the specifications say. Of these top twenty attributes, nine are completely invalid, and five have been deprecated for nearly eight years, half the lifetime of the Web so far."
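
Counts like the ones quoted above can be approximated on a small scale. The following is only a sketch under stated assumptions: it uses Python's standard html.parser, the names AttributeTally and tally_attributes are hypothetical, and it says nothing about how Google actually produced its numbers. It counts each attribute name once per html or body start tag across a set of pages.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributeTally(HTMLParser):
    """Hypothetical helper: counts attribute names on selected elements."""

    def __init__(self, tags=("html", "body")):
        super().__init__()
        self.counts = {tag: Counter() for tag in tags}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            # Use a set so a duplicated attribute counts once per element.
            self.counts[tag].update({name for name, _ in attrs})


def tally_attributes(pages):
    """Sum attribute usage on html and body elements over many pages."""
    totals = {"html": Counter(), "body": Counter()}
    for page in pages:
        parser = AttributeTally()
        parser.feed(page)
        for tag, counter in parser.counts.items():
            totals[tag] += counter
    return totals
```

Calling totals["html"].most_common(4) on the result would give the kind of "most-used attribute" ranking the quotes describe, for whatever pages you feed in.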