
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq
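For readers who can't open the paste: what follows is NOT the pastebin code, only a rough sketch of what a stringent check of this kind might look like in C. The function name utf8_first_error is made up for illustration, and rejecting surrogates and code points above U+10FFFF is an assumption beyond what the post mentions. Plain ASCII bytes take a one-comparison fast path, and overlong forms are caught by comparing the decoded value against the minimum for its sequence length.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the offset at which the first invalid
 * sequence starts, or len if the whole buffer is valid UTF-8. */
size_t utf8_first_error(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t need;     /* number of continuation bytes expected */
        uint32_t cp;     /* code point being decoded */
        uint32_t min;    /* smallest value allowed for this length */

        if (b < 0x80) {                 /* ASCII fast path */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return i;    /* stray continuation byte or lead byte 0xF8..0xFF */
        }

        if (len - i <= need)
            return i;    /* sequence is cut off by the end of the buffer */

        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;               /* expected a 10xxxxxx byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if (cp < min)
            return i;    /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return i;    /* UTF-16 surrogate (assumed invalid here) */
        if (cp > 0x10FFFF)
            return i;    /* beyond the Unicode range (assumed invalid here) */

        i += need + 1;
    }
    return len;
}
```

To check stdin with something like this, one would read it in large blocks with fread rather than a byte at a time, and carry any sequence cut off at a block boundary over into the next block.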

Name: Anonymous 2012-01-29 2:38

http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbstowcs.html
Set LC_CTYPE to a UTF-8 locale (with setlocale) and call mbstowcs(NULL, s, 0), where s is the NUL-terminated string you want to validate. If it returns (size_t)-1, the string is invalid. Otherwise, it returns the length in wide characters, not counting the terminator.
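
A minimal sketch of that suggestion, under some assumptions: the helper names mb_string_is_valid and use_utf8_ctype are made up, and the locale names tried are only guesses at what is commonly installed, so they may not exist on a given system. How strict the check is (overlong forms, for instance) is up to the C library, and it only works on NUL-terminated strings, not on a raw stream with embedded zero bytes.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: try a few common UTF-8 locale names for LC_CTYPE.
 * The names are assumptions; returns 1 if one was accepted, 0 otherwise. */
int use_utf8_ctype(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: returns 1 if the NUL-terminated string s is valid
 * in the current LC_CTYPE encoding, 0 if mbstowcs reports an error. */
int mb_string_is_valid(const char *s)
{
    /* With a NULL destination, mbstowcs only counts wide characters
     * and returns (size_t)-1 on an invalid multibyte sequence. */
    return mbstowcs(NULL, s, 0) != (size_t)-1;
}
```

Call use_utf8_ctype() once at startup and bail out if it returns 0; without a UTF-8 locale in effect, mb_string_is_valid() checks against whatever encoding the current locale uses instead.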
