UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq
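
For anyone unsure what an ``overlong form'' is: it is a code point encoded in more bytes than it needs, e.g. '/' (U+002F) written as the two bytes 0xC0 0xAF. Below is a minimal sketch in C of one way to detect it, by comparing the decoded code point against the smallest value that needs that sequence length. This is not the pastebin code; decode2, min_code_point and is_overlong are hypothetical names made up for illustration.

```c
#include <stdint.h>

/* Decode a 2-byte sequence (lead 0b110xxxxx, continuation 0b10xxxxxx).
 * Assumes the caller has already checked the bit patterns of both bytes. */
static uint32_t decode2(uint8_t lead, uint8_t cont)
{
    return ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}

/* Smallest code point that legitimately needs a sequence of `len` bytes. */
static uint32_t min_code_point(int len)
{
    switch (len) {
    case 1:  return 0x00;
    case 2:  return 0x80;
    case 3:  return 0x800;
    case 4:  return 0x10000;
    default: return 0xFFFFFFFFu; /* unsupported length: always an error */
    }
}

/* Returns 1 if code point `cp`, decoded from `len` bytes, is overlong. */
static int is_overlong(uint32_t cp, int len)
{
    return cp < min_code_point(len);
}
```

With this, is_overlong(decode2(0xC0, 0xAF), 2) reports an error, while is_overlong(decode2(0xC3, 0xA9), 2) (U+00E9) does not.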

Name: Anonymous 2012-01-28 21:28

>>42

> Not every case falls neatly into the highest significant zero idea.
Yes, it does.

For ASCII bytes, 0b0xxxxxxx will always have the highest zero at bit 7. For continuation bytes, 0b10xxxxxx will always have the highest zero at bit 6. And so on: the lead bytes 0b110xxxxx, 0b1110xxxx and 0b11110xxx have it at bits 5, 4 and 3 respectively. I believe that (as long as the msz function itself is fast, which it isn't at the moment) using the most significant zero test is faster than masking or a lookup table.
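
To make the idea concrete, here is a rough sketch in C of classifying a byte by its most significant zero. It is not the code from the pastebin; the msz below is a plain portable loop, and classify is a hypothetical helper name. A compiler builtin such as GCC/Clang's __builtin_clz on the inverted byte would probably be faster, but I haven't measured anything.

```c
#include <stdint.h>

/* Position (7..0) of the most significant zero bit of b, or -1 for 0xFF. */
static int msz(uint8_t b)
{
    uint8_t inv = (uint8_t)~b;      /* zeros become ones */
    int pos = 7;

    if (inv == 0)
        return -1;                  /* 0xFF has no zero bit at all */
    while (!(inv & 0x80)) {         /* walk down until the top bit is set */
        inv = (uint8_t)(inv << 1);
        pos--;
    }
    return pos;
}

/* Classify a byte by its most significant zero:
 *   1     ASCII byte (single-byte sequence)
 *   0     continuation byte
 *   2..4  lead byte of a sequence of that many bytes
 *  -1     never valid as a UTF-8 byte pattern (0xF8..0xFF) */
static int classify(uint8_t b)
{
    switch (msz(b)) {
    case 7:  return 1;
    case 6:  return 0;
    case 5:  return 2;
    case 4:  return 3;
    case 3:  return 4;
    default: return -1;
    }
}
```

Note that this only sorts bytes by bit pattern. Lead bytes like 0xC0, 0xC1 and 0xF5..0xF7 pass the msz test but can never start a valid sequence, so a stringent validator still needs separate range and overlong checks.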
