
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 15:23

>>1

This program is ridiculously complicated for a simple encoding like UTF-8, and that's the main reason why it doesn't perform as well as you think it should.
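To illustrate how little machinery a strict validator needs, here is a minimal sketch (the function name `utf8_valid` is mine, not from the pastebin): one minimum-value check per sequence catches every overlong form, plus a range check for surrogates and code points past U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
   Rejects overlong forms, UTF-16 surrogates, code points beyond U+10FFFF,
   stray continuation bytes, bad lead bytes and truncated sequences. */
int utf8_valid(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        if (b < 0x80) { i++; continue; }              /* ASCII fast path */

        size_t cont;       /* continuation bytes expected after the lead */
        uint32_t cp, min;  /* code point accumulator, smallest legal value */
        if      ((b & 0xE0) == 0xC0) { cont = 1; cp = b & 0x1F; min = 0x80;    }
        else if ((b & 0xF0) == 0xE0) { cont = 2; cp = b & 0x0F; min = 0x800;   }
        else if ((b & 0xF8) == 0xF0) { cont = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                                /* bad lead byte */

        if (len - i <= cont) return 0;                /* truncated at end */
        for (size_t k = 1; k <= cont; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80) return 0;         /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                       /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
        i += cont + 1;
    }
    return 1;
}
```

The ASCII fast path at the top is also where most of the speed comes from on realistic input, since the majority of bytes in typical text never enter the multi-byte branch at all.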

You say 1.5 megabytes per second on your notebook?

What CPU? (Atom, or real processor?) What operating system? What version of what compiler did you use, with what options?

Do you repeat the test twice to eliminate caching effects?

How about sequences which are not random bytes? How fast is it on a large file which is correct, or mostly correct UTF-8?

How fast is it on UTF-8 files that are mostly ASCII? How about ones that are mostly Chinese or other characters that require two or three bytes to encode?

How about files that contain a lot of non-BMP characters (beyond U+FFFF)?
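Generating those kinds of test files is easy if you have an encoder for single code points. A sketch (`utf8_encode` is my name for it; it assumes the input is already a valid scalar value, i.e. not a surrogate and at most U+10FFFF): write a few megabytes of U+4E00..U+9FFF for a mostly-three-byte file, or U+1F600 and friends for non-BMP input.

```c
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8; returns the byte count (1-4).
   Assumes cp is a valid scalar value (no surrogates, cp <= U+10FFFF). */
int utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                               /* one byte: ASCII */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                              /* two bytes */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                            /* three bytes: rest of BMP */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));         /* four bytes: non-BMP */
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```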

Random bytes are not even the kind of input that a UTF-8 validator will usually be expected to validate. Suppose you write a validator that processes 100 megs per second on the expected kinds of input, but only 1 meg per second on random bytes: does it matter?

Wouldn't your validator throw a lot of errors on random bytes, generating a lot of error message output? It does not look as if your program stops on the first error. Do you disable the error output when testing on random data?
