A little more than 2 years later, it seems my estimates in >>51 were quite pessimistic. Seeing the multi-megabyte beasts of browsers today probably bloated my own sense of how much code we really need...
I now have an HTML5 parser + DOM tree construction in under 24KB of binary, and this is before any real attempt at optimising the code. The tokenizer is simpler than what's in the spec (it isn't a state machine) but should accept the same input, and the tree construction follows the spec almost exactly, just ignoring all error detection (what's a parse error? that which by any other name would render just as well...). It's all written in C and built as 32-bit.
Entropy calculations suggest a lower bound for the parser and tree construction somewhere around 12KB. I'm guessing that may be achievable with Asm, with something closer to 16-20KB from just factoring out the duplicated code in C. (The 12KB tokenizer I mentioned above is a dumb state machine as per the spec, so if I rewrote this one in Asm it would likely turn out much smaller.)
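For anyone wanting to try this kind of estimate on their own binary, here is a minimal sketch of one possible entropy calculation. It is an assumption that this resembles the calculation referred to above; the post doesn't say how the figure was obtained. It computes the order-0 (byte frequency) Shannon entropy of a buffer and scales it to a size in bytes. The names `log2_small`, `entropy_bits_per_byte` and `entropy_size_estimate` are made up for illustration. An order-0 figure is only a rough guide, not a hard bound: a model that looks at context can go lower, and hand-written code isn't a compressor anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Base-2 logarithm for x > 0, hand-rolled so the sketch needs no libm.
 * Scales x into [1, 2), then uses ln(x) = 2*atanh((x-1)/(x+1)). */
static double log2_small(double x)
{
    double e = 0.0, y, y2, term, sum = 0.0;
    int k;

    assert(x > 0.0);
    while (x >= 2.0) { x *= 0.5; e += 1.0; }
    while (x < 1.0)  { x *= 2.0; e -= 1.0; }

    y = (x - 1.0) / (x + 1.0);      /* 0 <= y < 1/3, so the series converges fast */
    y2 = y * y;
    term = y;
    for (k = 1; k < 40; k += 2) {   /* y + y^3/3 + y^5/5 + ... */
        sum += term / k;
        term *= y2;
    }
    return e + 2.0 * sum / 0.6931471805599453;  /* divide by ln(2) */
}

/* Order-0 Shannon entropy of a buffer, in bits per byte (0.0 to 8.0). */
static double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t counts[256] = {0};
    double h = 0.0;
    size_t i;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0.0;

    for (i = 0; i < len; i++)
        counts[buf[i]]++;

    for (i = 0; i < 256; i++) {
        if (counts[i]) {
            double p = (double)counts[i] / (double)len;
            h -= p * log2_small(p);
        }
    }
    return h;
}

/* Rough size estimate in bytes: len * H / 8. */
static double entropy_size_estimate(const unsigned char *buf, size_t len)
{
    return (double)len * entropy_bits_per_byte(buf, len) / 8.0;
}
```

To use it, read the code section of the executable into a buffer and pass it to `entropy_size_estimate`. Feeding it the whole file would also count headers and padding, which skews the number.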
Now all we need is a CSS parser/box generator and renderer, plus some miscellaneous UI and other bits, and WE'VE WRITTEN A FUCKING HTML5-COMPLIANT WEB BROWSER IN A <1MB EXECUTABLE!!!