
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 18:48

OP here. I posted this just before I went to sleep. I'm truly impressed with the SNR in this thread, /prog/.

>>2

>The msz function can be replaced with inline assembly using the bsr (bit-scan reverse) instruction if you're on x86(_64).

>Instead of switching on the state, use an inner loop.

>And there's probably a ton of bit-twiddling optimizations you could do, but I can't be bothered to think about it. (Read "Hacker's Delight" for more info.)
Will do, thanks.

>Also, you've of course profiled this, right?
I'm not sure how to profile a C program, but I'll look it up and find out; thanks.

>>14

>This program is ridiculously complicated for a simple encoding like UTF-8, and that's the main reason why it doesn't perform as well as you think it should.

It is more complex than a minimal implementation would be, but I sacrificed some simplicity for what I think is better speed and cleanliness. Technically I don't even need msz_byte, but finding the most significant zero bit is a 'cool' way to classify a byte (ASCII = 7, continuation = 6, start bytes = 5 down to 1). A naively coded implementation would probably be even slower.

>You say 1.5 megabytes per second on your notebook?

>What CPU? (Atom, or real processor?) What operating system? What version of what compiler did you use, with what options?
delan@pluto ~ $ grep 'model name' /proc/cpuinfo
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
delan@pluto ~ $ cat /etc/gentoo-release; uname -a
Gentoo Base System release 2.0.3
Linux pluto 3.0.6-gentoo #5 SMP Sun Nov 27 17:26:38 WST 2011 x86_64 Intel(R) Atom(TM) CPU N550 @ 1.50GHz GenuineIntel GNU/Linux


I used tcc 0.9.25 and gcc 4.5.3; gcc with -O0, -O1, -O2 and -O3. Inlining doubled the speed, but it's still ``SLOW AS FUCK''.

>Do you repeat the test twice to eliminate caching effects?
Yes.

>How about sequences which are not random bytes? How fast is it on a large file which is correct, or mostly correct UTF-8?

>How fast is it on UTF-8 files that are mostly ASCII? How about ones that are mostly Chinese or other characters that require two or three bytes to encode?

>How about files that contain a lot of non-BMP characters (beyond U+FFFF).
I haven't tried these; I will now, thanks.

>Random bytes are not even the kind of input that a UTF-8 validator will be expected to usually validate. Suppose you write a validator that processes 100 megs per second on the expected kinds of inputs, but 1 meg per second on random bytes, does it matter?
Now that you mention it... no. Nobody's going to feed /dev/urandom into it anyway.

>Wouldn't your validator throw a lot of errors on random bytes, generating a lot of error message output? It does not look as if your program stops on the first error. Do you disable the error output when testing on random data?
No, I haven't; that would be a good idea (error() probably has high function call overhead). Though, I do run the test like this, so...

time ./a.out < testrandom1meg > /dev/null
