
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 18:51

OP here, continued.

>>16

In UTF-8 you can just look at one byte of the input. Either the byte is plain ASCII (bit 7 is zero), or you can classify it into one of several possibilities with some simple masking and testing operations.
That's what I use msz_byte = msz(byte); for. If msz_byte is 7, it's ASCII; if it's 6, it's a continuation byte; if it's 1 to 5, it's a start byte.
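
For anyone who doesn't want to open the pastebin, here is a rough sketch of how an msz() with those return values could work. This is an assumption on my part: it treats msz as "index of the most significant zero bit", which matches the 7 / 6 / 1 to 5 cases above. The classify() helper and the byte_class enum are hypothetical names made up for illustration.

```c
#include <stdint.h>

/* Hypothetical msz(): index (7 down to 0) of the most significant zero bit
   of a byte, or -1 if the byte has no zero bit (0xff). Assumed behaviour,
   not copied from the pastebin. */
static int msz(uint8_t byte) {
    int bit;
    for (bit = 7; bit >= 0; bit--) {
        if (!(byte & (1u << bit)))
            return bit;
    }
    return -1;
}

enum byte_class { BYTE_ASCII, BYTE_CONTINUATION, BYTE_START, BYTE_INVALID };

/* Map the msz value onto the three cases described in the post. */
static enum byte_class classify(uint8_t byte) {
    int m = msz(byte);
    if (m == 7) return BYTE_ASCII;         /* 0xxxxxxx */
    if (m == 6) return BYTE_CONTINUATION;  /* 10xxxxxx */
    if (m >= 1 && m <= 5) return BYTE_START;
    return BYTE_INVALID;                   /* 0xfe and 0xff */
}
```

With this reading, a start byte is followed by 6 - msz_byte continuation bytes, so msz values of 1 and 2 are the old 5- and 6-byte forms that can only encode values above U+10FFFF.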

Each possibility precisely indicates how many continuation bytes follow, so you can divide the code into cases. A continuation byte could also appear, which is an error. In each correct case you can check that there are enough bytes, and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes outside of U+10FFFF can be rejected easily.
I have done all of these things. Sorry if the code is not clear.
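
In case the description is easier to follow as code, here is a minimal sketch of those checks (length, enough bytes, continuation bytes, overlong, above U+10FFFF) for a single sequence. It is not the pastebin code: decode_one() and the min_cp table are hypothetical names, and it uses plain masks instead of msz(). It does not check surrogates or noncharacters; that is left to a separate codepoint test.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..n). Returns the number of bytes
   consumed (1 to 4) and stores the code point in *cp, or returns 0 on any
   error. Hypothetical helper for illustration only. */
static size_t decode_one(const uint8_t *s, size_t n, uint32_t *cp) {
    /* Smallest code point that really needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xe0) == 0xc0) { len = 2; c = s[0] & 0x1f; }
    else if ((s[0] & 0xf0) == 0xe0) { len = 3; c = s[0] & 0x0f; }
    else if ((s[0] & 0xf8) == 0xf0) { len = 4; c = s[0] & 0x07; }
    else return 0;  /* stray continuation, 5/6-byte lead, 0xfe or 0xff */

    if (n < len) return 0;                      /* not enough bytes */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;    /* not a continuation */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min_cp[len]) return 0;              /* overlong form */
    if (c > 0x10ffff) return 0;                 /* outside Unicode */
    *cp = c;
    return len;
}
```

The overlong test is just a comparison against the smallest value each length is allowed to encode, so C0 80 (an overlong NUL) and E0 80 80 both fail it.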

>You should also check for the invalid code points U+DF00 through U+DFFF.
You mean U+D800 to U+DFFF (surrogate pairs)? I have:

#include <stdint.h>

/* Check if a codepoint is valid, returning 1 if so and 0 otherwise. Invalid
   codepoints include those higher than U+10ffff, any codepoint from U+fdd0 to
   U+fdef inclusive, as well as the last two codepoints in every plane, and
   all surrogate pair values (U+d800 to U+dfff inclusive). */

/* static, so that a plain C99 ``inline'' cannot cause a link error. */
static inline int valid(uint32_t cp) {
        return (
                (cp < 0x110000) &&                      /* within Unicode */
                ((cp < 0xfdd0) || (cp > 0xfdef)) &&     /* not U+fdd0..U+fdef */
                ((cp & 0xfffe) != 0xfffe) &&            /* not U+xxfffe/U+xxffff */
                ((cp & 0xfffff800) != 0xd800)           /* not a surrogate */
        );
}
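
A few boundary values make the four conditions easier to check by eye. This is only a sketch: valid() is repeated so the snippet compiles on its own, and valid_selftest() and its table are hypothetical additions, not part of the pastebin.

```c
#include <stddef.h>
#include <stdint.h>

/* Same test as above, repeated so this snippet is self-contained. */
static inline int valid(uint32_t cp) {
        return (
                (cp < 0x110000) &&
                ((cp < 0xfdd0) || (cp > 0xfdef)) &&
                ((cp & 0xfffe) != 0xfffe) &&
                ((cp & 0xfffff800) != 0xd800)
        );
}

/* Hypothetical self-test: returns 1 if every boundary case gives the
   expected answer, 0 otherwise. */
static int valid_selftest(void) {
        static const struct { uint32_t cp; int expected; } cases[] = {
                { 0x0041,   1 },                    /* plain ASCII */
                { 0xd7ff,   1 }, { 0xd800,   0 },   /* start of surrogates */
                { 0xdfff,   0 }, { 0xe000,   1 },   /* end of surrogates */
                { 0xfdcf,   1 }, { 0xfdd0,   0 },   /* noncharacter block */
                { 0xfdef,   0 }, { 0xfdf0,   1 },
                { 0xfffd,   1 }, { 0xfffe,   0 }, { 0xffff, 0 },
                { 0x1fffe,  0 },                    /* end of plane 1 */
                { 0x10fffd, 1 }, { 0x10ffff, 0 },   /* end of plane 16 */
                { 0x110000, 0 },                    /* above Unicode */
        };
        size_t i;
        for (i = 0; i < sizeof cases / sizeof cases[0]; i++) {
                if (valid(cases[i].cp) != cases[i].expected)
                        return 0;
        }
        return 1;
}
```

Note that U+10FFFF itself is rejected, since it is one of the last two codepoints of plane 16.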
