
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 20:03

>>39

http://linux.die.net/man/2/read

I see: so if the read was interrupted by a signal, my program would terminate prematurely, while yours would spin on read() until it succeeded or failed with a different error. I was trying to use fread(), though, which works on operating systems that are not necessarily POSIX-compliant, and checking whether a read failed due to a signal interrupt doesn't seem possible with fread(); at least, it isn't specified in the man page.

Name: Anonymous 2012-01-28 21:18

>>37

You may have done those things, but not in the natural way that most C programmers would have done them. You don't need general bit scanning, just some classification based on masks:

  if ((byte & 0x80) == 0) { /* it's ascii */ }
  else if ((byte & 0xC0) == 0x80) { /* cont. byte */ }
  else if ((byte & 0xE0) == 0xC0) { /* 110xxxxx: 2 byte code */
  ... etc
 
Each of the cases can be handled in a dedicated block of code.
Not every case falls neatly into the highest significant zero idea.

Not saying that this is necessarily where your bottleneck is, but the code isn't exactly tight.

You know, you could even have a 256 entry lookup table to classify the byte.

  switch (byte_0_table[byte]) {
  case ASCII: ... break;
  case INVALID: ... break;
  case CODE2: ... break; // two byte code prefix
  }

Name: YOUR_SUPER_NIGGER 2012-01-28 21:27

>>42
That last method looks pretty tight.

Name: Anonymous 2012-01-28 21:28

>>42

>Not every case falls neatly into the highest significant zero idea.
Yes, it does.

For ASCII bytes, 0b0xxxxxxx will always have the highest zero at bit 7. For continuation bytes, 0b10xxxxxx will always have the highest zero at bit 6. And so on. I believe that (as long as the msz function itself is fast, and it isn't at the moment) using the most significant zero test is faster than masking or a lookup table.

Name: Anonymous 2012-01-28 21:31

>>36

Ah see? You're just redirecting errors to /dev/null and feeding in random bytes? You're probably feeding reams of error message crap into /dev/null. Just one bad byte produces a pretty long error message, right? Have you measured the quantity of error message material as a ratio relative to input bytes? If you feed in 10 megs of random bytes, how large is the error output?  You know, one in four randomly chosen bytes is a continuation byte!!! (These occupy 0x80-0xBF).

Have you run the program under "time"? How much is the system time versus user time?

You might want to try profiling it just for fun with gprof or even better, oprofile.

And just as I suspected, your laptop is an Atom device. Those things are slow; just accept it. Think of all the energy you're saving and the battery life you're getting (when not running your program, anyway, haha). These Atom CPUs are probably about as fast as desktop PCs from ten years ago or more.

Name: Anonymous 2012-01-28 21:33

>>44

Actually you are right, sorry.

And, nice dubs.

Name: Anonymous 2012-01-28 23:04

>>1

Also, the compiler might be able to optimize your code better if you used regular looping constructs rather than gotos. Gotos can make certain optimizations difficult.

Name: Anonymous 2012-01-29 2:38

http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbstowcs.html
Use a UTF-8 LC_CTYPE and call mbstowcs(NULL, s, 0) where s is the string you want to validate. If it returns (size_t)-1 you have an invalid string. Otherwise, it will return the length in wide characters.

Name: Anonymous 2012-01-29 4:40

>>48
I do not trust the libc implementation to do it correctly. I intend to write an independent, correct implementation.

Name: Anonymous 2012-01-29 5:03

>>48
Also, there is a difference: that validates a string in memory, while my implementation validates stdin as a stream with O(1) memory consumption.

Name: Anonymous 2012-01-30 4:55

>>48

Yagodda be kidding!!! Calling some fawking C library function written by imbeciles which will do something slightly different on every system it's compiled on.

Name: Anonymous 2012-01-31 5:14

I've incorporated some optimisations [1], and the program's runtime is now reduced by 50-80%. I think the only real optimisation left is to read blocks at a time to reduce getchar() call overhead.

http://pastebin.com/miDqFzww
________________________________
[1] http://codereview.stackexchange.com/a/8485/10422

Name: Anonymous 2012-01-31 5:43

One optimisation I made afterwards was to eliminate the getchar() function call overhead by reading and parsing 4096 bytes at a time. This cut down runtimes by at least another 50%.

http://pastebin.com/DfnVi6vt

Name: Anonymous 2012-01-31 5:58

Updated to fix numerous bugs:

http://pastebin.com/NdWivCez

Name: Anonymous 2012-01-31 6:04

More bugs fixed:

http://pastebin.com/ywcm6LWC
