
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 15:29

>>14

Continuing with earlier post.

In UTF-8 you only need to look at one byte of the input to know what follows. Either the byte is plain ASCII (bit 7 is zero), or some simple masking and testing operations classify it into one of several possibilities. Each possibility indicates precisely how many continuation bytes follow, so you can divide the code into cases. A continuation byte can also appear where a lead byte is expected, which is an error. In each correct case you check that there are enough bytes, that each of them really is a continuation byte (10xxxxxx), and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes outside of U+10FFFF can be rejected easily.
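A rough sketch of that lead-byte classification in C, not taken from the pastebin. The name utf8_trail_count is made up for illustration. It folds two of the easy rejections into the byte ranges: 0xC0 and 0xC1 can only start overlong two-byte forms, and 0xF5 and up can only start codes past U+10FFFF.

```c
/* Hypothetical helper (name invented for this sketch).
   Returns how many continuation bytes must follow lead byte b (0..3),
   or -1 if b cannot start a valid sequence. */
static int utf8_trail_count(unsigned char b)
{
    if (b < 0x80) return 0;   /* 0xxxxxxx: plain ASCII */
    if (b < 0xC0) return -1;  /* 10xxxxxx: stray continuation byte */
    if (b < 0xC2) return -1;  /* 0xC0, 0xC1: always an overlong 2-byte form */
    if (b < 0xE0) return 1;   /* 110xxxxx: 2-byte sequence */
    if (b < 0xF0) return 2;   /* 1110xxxx: 3-byte sequence */
    if (b < 0xF5) return 3;   /* 11110xxx up to 0xF4: 4-byte sequence */
    return -1;                /* 0xF5..0xFF: would exceed U+10FFFF */
}
```

This only looks at the first byte. The continuation bytes, the remaining overlong cases (e.g. 0xE0 followed by 0x80) and values just above U+10FFFF (0xF4 followed by 0x90) still have to be checked separately.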

You should also check for the invalid code points U+D800 through U+DFFF (the UTF-16 surrogates).
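Once the code point has been assembled from the bytes, the remaining checks could look something like this. Again only a sketch: utf8_cp_ok and its arguments are invented here, and it assumes you already know the total encoded length in bytes.

```c
/* Hypothetical helper (name invented for this sketch).
   cp  = decoded code point
   len = total number of bytes the sequence used (1..4)
   Returns 1 if cp is acceptable for that length, 0 otherwise. */
static int utf8_cp_ok(unsigned long cp, int len)
{
    /* Smallest code point that legitimately needs len bytes;
       index 0 is unused padding. */
    static const unsigned long min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

    if (len < 1 || len > 4) return 0;
    if (cp < min_cp[len]) return 0;              /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                 /* outside Unicode */
    return 1;
}
```

For example, the bytes ED A0 80 decode to U+D800 with len 3 and get rejected, while ED 9F BF (U+D7FF) passes.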
