
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 9:56

The msz function can be replaced with inline assembly using the bsr (bit-scan reverse) instruction if you're on x86(_64).

Instead of switching on the state, use an inner loop.

And there's probably a ton of bit-twiddling optimizations you could do, but I can't be bothered to think about it. (Read "Hacker's delight" for more info.)

Also, you've of course profiled this, right?

Name: Anonymous 2012-01-28 10:12

Before you start dicking around with inline asm, run the loop against an mmap()ed file. It's likely that the getchar() (i.e. reading byte-wise from stdin) is the bottleneck and not some bit manipulations.

Name: Anonymous 2012-01-28 10:23

What >>3 said. If you don't want to use mmap you can just try reading larger chunks of bytes into a memory buffer and see if that helps. Of course, to know what really would help, you should profile this, like >>2 said.

Name: Anonymous 2012-01-28 10:34

>>4
>If you don't want to use mmap you can just try reading larger chunks of bytes into a memory buffer and see if that helps.

Why do you insist on commenting about shit you clearly don't understand? It's almost like you're a retard. You understand that you are clueless, yet you still keep rattling on.

Name: 2 2012-01-28 13:22

>>5
If he really wants it to read from stdin, using mmap may not be an option, though I think getchar() already does some buffering internally. (I hardly know C (but I do know Unix, so I know that ttys and pipes normally can't be mmapped (so I'm not a blabbering idiot (that often)))). But yes, mmap is the way to go (but profile!(!)).

Name: Anonymous 2012-01-28 13:28

WHERE THE FUCK IS OP?????

Name: Anonymous 2012-01-28 13:29

This could help OP.

http://linux.die.net/man/3/fread


#define BUFFER_LENGTH 1024
char buffer[BUFFER_LENGTH];
int ret = read(buffer, sizeof(char), BUFFER_LENGTH, stdin);
if(ret == 0) {
  // end of file reached. Break out.
} else {
  // ret is the number of characters successfully read into buffer.
  check_buffer_for_utf8_errors(utf8_scanner_state, buffer, ret);
}

Name: Anonymous 2012-01-28 13:30

change read to fread. woops

Name: Anonymous 2012-01-28 15:11

>>9
You're a fucking moron who has no possible future as a computer programmer.

Name: Anonymous 2012-01-28 15:14

>>10
Go scrub another midget you piece of shit faggot.

Name: Anonymous 2012-01-28 15:18

>>11
Go make some more uneducated statements you mental midget.

Name: Anonymous 2012-01-28 15:19

>>11
I can also cite a few instances where your code will break. Man you're one stupid human being.

Name: Anonymous 2012-01-28 15:23

>>1

This program is ridiculously complicated for a simple encoding like UTF-8, and that's the main reason why it doesn't perform as well as you think it should.

You say 1.5 megabytes per second on your notebook?

What CPU? (Atom, or real processor?) What operating system? What version of what compiler did you use, with what options?

Do you repeat the test twice to eliminate caching effects?

How about sequences which are not random bytes? How fast is it on a large file which is correct, or mostly correct UTF-8?

How fast is it on UTF-8 files that are mostly ASCII? How about ones that are mostly Chinese or other characters that require two or three bytes to encode?

How about files that contain a lot of non-BMP characters (beyond U+FFFF)?

Random bytes are not even the kind of input that a UTF-8 validator will be expected to usually validate. Suppose you write a validator that processes 100 megs per second on the expected kinds of inputs, but 1 meg per second on random bytes, does it matter?

Wouldn't your validator throw a lot of errors on random bytes, generating a lot of error message output? It does not look as if your program stops on the first error. Do you disable the error output when testing on random data?

Name: Anonymous 2012-01-28 15:28

>>13

that wasn't me.

updated:


#define BUFFER_LENGTH 1024
char buffer[BUFFER_LENGTH];
for(;;) {
  int ret = read(buffer, sizeof(char), BUFFER_LENGTH, stdin);
  if(ret == 0) {
    // end of file reached, or an error occurred.
    // In either case, just break out.
    break;
  } else {
    // ret is the number of characters successfully read into buffer.
    check_buffer_for_utf8_errors(utf8_scanner_state, buffer, ret);
  }
}


mind telling us how the code would break, so that readers of the thread would actually learn something from your responses?

Name: Anonymous 2012-01-28 15:29

>>14

Continuing with earlier post.

In UTF-8 you can just look at one byte of the input at a time. Either the byte is plain ASCII (bit 7 is zero), or you can do some simple masking and testing operations to classify it into one of several possibilities. Each possibility precisely indicates how many continuation bytes follow, so you can divide the code into cases. A continuation byte could also appear on its own, which is an error. In each correct case you can check that there are enough bytes, and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes above U+10FFFF can be rejected easily.

You should also check for the invalid code points U+DF00 through U+DFFF.

Name: Anonymous 2012-01-28 15:30

forgot to change it to fread again. woops.

Name: Anonymous 2012-01-28 15:50

>>2
>x86(_64)
If it ain't MIPS, it's crap.

Name: Anonymous 2012-01-28 16:08

http://gcc.gnu.org/gcc-4.6/changes.html
A new general optimization level, -Ofast, has been introduced. It combines the existing optimization level -O3 with options that can affect standards compliance but result in better optimized code. For example, -Ofast enables -ffast-math.

Name: Anonymous 2012-01-28 16:17

>>18
Fuck your shitty architecture, x86 is king.

Name: kodak_gallery_programmer !!qmiXqQhekkGXVVD 2012-01-28 16:22

>>15
Eh? Can't google the answer this time you idiot? Again, I don't think computer programming is your thing.

Name: Anonymous 2012-01-28 16:30

>>21

are you saying there is behavior of the fread function that is not documented in its man page? and that this behavior is being invoked by the program, under certain cases?

Name: kodak_gallery_programmer !!A5Gh5Y8ySzoMedM 2012-01-28 16:58

>>22
I suck dicks for a living.

Name: Anonymous 2012-01-28 17:05

>>22
You used read, but gave it the arguments for fread.

Name: MIPS 2012-01-28 17:07

>>20

x86 is king at consuming wattage and frying eggs, or performing like a dog in form factors where it doesn't fry eggs.

MIPS-based processors (and others) have better performance at low voltage levels.

Intel x86 cannot compete in the mobile arena where you need long battery life and decent performance.

Name: Anonymous 2012-01-28 17:16

>>25
Mobile arena?  Oh you mean the "fragmented cheap chinese shit for which you can barely find firmware updates, much less source code or (god forbid) specifications" arena?  No thanks, I like my freedom.  If it says "MIPS" or "ARM" or "Tegra" on the package, you can be certain not to ever truly own your device.  Fuck your shit, faggot.

Name: Anonymous 2012-01-28 17:37

>>26
>Jewish CPUs with closed-source microcode and BIOS
>freedom

hahahaohwow.jpg

Name: Anonymous 2012-01-28 17:39

>>24
It was a typo. I forgot to correct it twice. My bad.

Name: Anonymous 2012-01-28 17:44

>>27
Alright, then find me a non-x86 10"-display 6-cell battery netbook which can be entirely driven by open source software under 300 USD and I'll agree that your toy architectures stand a chance.

Name: Anonymous 2012-01-28 17:45

>>26
People don't truly own their iPhones too. And they truly don't care.

Name: Anonymous 2012-01-28 17:52

>>30
That's because they're a bunch of ignorant faggots.

Name: Anonymous 2012-01-28 18:02

>>29

that's affected by business stuff though. That doesn't necessarily measure the capabilities of a processor in a certain application. And something that might not be profitable enough to be developed could still be useful for some people, just not enough to justify producing it.

Name: Anonymous 2012-01-28 18:05

Validate my anal dubs

Name: Anonymous 2012-01-28 18:22

>>29
x86 is bloated extensions to a 16-bit toy architecture designed for Datapoint terminals.

Name: Anonymous 2012-01-28 18:42

Use fgets.

Name: Anonymous 2012-01-28 18:48

OP here. I posted this just before I went to sleep. I'm truly impressed with the SNR in this thread, /prog/.

>>2

>The msz function can be replaced with inline assembly using the bsr (bit-scan reverse) instruction if you're on x86(_64).

>Instead of switching on the state, use an inner loop.

>And there's probably a ton of bit-twiddling optimizations you could do, but I can't be bothered to think about it. (Read "Hacker's delight" for more info.)
Will do, thanks.

>Also, you've of course profiled this, right?
I'm not sure how to profile a C program, but I'll look it up and find out; thanks.

>>14

>This program is ridiculously complicated for a simple encoding like UTF-8, and that's the main reason why it doesn't perform as well as you think it should.

There is more complexity than a minimal code-size implementation, but I sacrificed some simplicity for (what I think, at least) some optimisation and cleanliness. Technically, I don't even need msz_byte, but finding the most significant zero is a 'cool' way to determine the kind of byte it is (ASCII = 7, continuation = 6, start byte = 5, 4, 3, 2, 1). A simply coded implementation would probably be even slower.

>You say 1.5 megabytes per second on your notebook?

>What CPU? (Atom, or real processor?) What operating system? What version of what compiler did you use, with what options?
delan@pluto ~ $ grep 'model name' /proc/cpuinfo
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz
delan@pluto ~ $ cat /etc/gentoo-release; uname -a
Gentoo Base System release 2.0.3
Linux pluto 3.0.6-gentoo #5 SMP Sun Nov 27 17:26:38 WST 2011 x86_64 Intel(R) Atom(TM) CPU N550 @ 1.50GHz GenuineIntel GNU/Linux


I used tcc 0.9.25 and gcc 4.5.3; gcc with -O0, -O1, -O2 and -O3. Inlining doubled the speed, but it's still ``SLOW AS FUCK''.

>Do you repeat the test twice to eliminate caching effects?
Yes.

>How about sequences which are not random bytes? How fast is it on a large file which is correct, or mostly correct UTF-8?

>How fast is it on UTF-8 files that are mostly ASCII? How about ones that are mostly Chinese or other characters that require two or three bytes to encode?

>How about files that contain a lot of non-BMP characters (beyond U+FFFF).
I haven't tried these; I will now, thanks.

>Random bytes are not even the kind of input that a UTF-8 validator will be expected to usually validate. Suppose you write a validator that processes 100 megs per second on the expected kinds of inputs, but 1 meg per second on random bytes, does it matter?
Now that you mention it... no. Nobody's going to feed /dev/urandom into it anyway.

>Wouldn't your validator throw a lot of errors on random bytes, generating a lot of error message output? It does not look as if your program stops on the first error. Do you disable the error output when testing on random data?
No, I haven't. That'd be a good idea (error() probably has a high function call overhead). Though, I do do the testing like this, so...

time ./a.out < testrandom1meg > /dev/null

Name: Anonymous 2012-01-28 18:51

OP here, continued.

>>16

>In UTF-8 you can just look at one byte of the input. It will tell you exactly that either the byte is plain ASCII (bit 7 is zero), or else you can do some simple masking and testing operations to classify that byte into one of several possibilities.
That's what I use msz_byte = msz(byte); for. If msz_byte is 7, then it's ASCII; if it's 6, then it's a continuation; if it's 1 to 5, then it's a start byte.

>Each possibility precisely indicates how many continuation bytes follow, so you can divide the code into cases. A continuation byte could also appear, which is an error. In each correct case you can check that there are enough bytes, and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes above U+10FFFF can be rejected easily.
I have done all of these things. Sorry if the code is not clear.

>You should also check for the invalid code points U+DF00 through U+DFFF.
You mean U+D800 to U+DFFF (surrogate pairs)? I have:

/* Check if a codepoint is valid, returning 1 if so and 0 otherwise. Invalid
   codepoints include those higher than U+10ffff, any codepoint from U+fdd0 to
   U+fdef inclusive, as well as the last two codepoints in every plane, and
   all surrogate pair values (U+d800 to U+dfff inclusive). */
 
inline int valid(uint32_t cp) {
        return (
                (cp < 0x110000) &&
                ((cp < 0xfdd0) || (cp > 0xfdef)) &&
                ((cp & 0xfffe) != 0xfffe) &&
                ((cp & 0xfffff800) != 0xd800)
        );
}

Name: Anonymous 2012-01-28 19:07

This has undefined behavior in it.

Name: YOUR_SUPER_NIGGER 2012-01-28 19:20

>>22
Okay you uneducated stupid shit, compare the following code with yours. Notice the extra cases that I handle.


#include <errno.h>   /* errno, EINTR */
#include <unistd.h>  /* read(), ssize_t */

ssize_t readn(int fd, void *vptr, size_t n) {
    size_t nleft;
    ssize_t nread;
    char *ptr;

    ptr = vptr;
    nleft = n;
    while (nleft > 0) {
        if ((nread = read(fd, ptr, nleft)) < 0) {
            if (errno == EINTR)
                nread = 0;
            else
                return (-1);
        } else if (nread == 0)
            break;

        nleft -= nread;
        ptr += nread;
    }
    return (n - nleft);
}

Name: >>2 2012-01-28 19:36

>>36
You're welcome.

Name: Anonymous 2012-01-28 20:03

>>39

http://linux.die.net/man/2/read

I see, so if the read was interrupted by a signal, then my program would terminate prematurely, while yours would spin on reading until it got it, or failed from a different error. I was trying to use fread though, which can be used on operating systems that are not necessarily POSIX-compliant, and checking whether the read failed due to a signal interrupt doesn't seem to be possible with fread; at least, it isn't specified in the man page.

Name: Anonymous 2012-01-28 21:18

>>37

You may have done those things, but not in the natural way that most C programmers would have done them. You don't need general bit scanning, just some classification based on masks:

  if ((byte & 0x80) == 0) { /* it's ascii */ }
  else if ((byte & 0xC0) == 0x80) { /* cont. byte */ }
  else if ((byte & 0xE0) == 0xC0) { /* 110xxxxx: 2 byte code */ }
  /* ... etc */
 
Each of the cases can be handled in a dedicated block of code.
Not every case falls neatly into the highest significant zero idea.

Not saying that this is necessarily where your bottleneck is, but the code isn't exactly tight.

You know, you could even have a 256-entry lookup table to classify the byte.

  switch (byte_0_table[byte]) {
  case ASCII: ... break;
  case INVALID: ... break;
  case CODE2: ... break; // two byte code prefix
  }

Name: YOUR_SUPER_NIGGER 2012-01-28 21:27

>>42
That last method looks pretty tight.

Name: Anonymous 2012-01-28 21:28

>>42

>Not every case falls neatly into the highest significant zero idea.
Yes, it does.

For ASCII bytes, 0b0xxxxxxx will always have the highest zero at bit 7. For continuation bytes, 0b10xxxxxx will always have the highest zero at bit 6. And so on. I believe that (as long as the msz function itself is fast, and it isn't at the moment) using the most significant zero test is faster than masking or a lookup table.

Name: Anonymous 2012-01-28 21:31

>>36

Ah see? You're just redirecting errors to /dev/null and feeding in random bytes? You're probably feeding reams of error message crap into /dev/null. Just one bad byte produces a pretty long error message, right? Have you measured the quantity of error message material as a ratio relative to input bytes? If you feed in 10 megs of random bytes, how large is the error output?  You know, one in four randomly chosen bytes is a continuation byte!!! (These occupy 0x80-0xBF).

Have you run the program under "time"? How much is the system time versus user time?

You might want to try profiling it just for fun with gprof or even better, oprofile.

And just as I suspected, your laptop is an Atom device. Those things are slow, just accept it. Think of all the energy you're saving and the battery life you're getting (when not running your program, anyway, haha). These Atom CPUs are probably about as fast as desktop PCs from ten years ago or more.

Name: Anonymous 2012-01-28 21:33

>>44

Actually you are right, sorry.

And, nice dubs.

Name: Anonymous 2012-01-28 23:04

>>1

Also, the compiler might be able to optimize your code better if you used regular looping constructs rather than gotos. Gotos can make certain optimizations difficult.

Name: Anonymous 2012-01-29 2:38

http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbstowcs.html
Use a UTF-8 LC_CTYPE and call mbstowcs(NULL, s, 0) where s is the string you want to validate. If it returns (size_t)-1 you have an invalid string. Otherwise, it will return the length in wide characters.

Name: Anonymous 2012-01-29 4:40

>>48
I do not trust the libc implementation to do it correctly. I intend to write an independent, correct implementation.

Name: Anonymous 2012-01-29 5:03

>>48
Also, there is a difference: that validates a string in memory, while my implementation validates stdin as a stream with O(1) memory consumption.

Name: Anonymous 2012-01-30 4:55

>>48

Yagodda be kidding!!! Calling some fawking C library function written by imbeciles which will do something slightly different on every system it's compiled on.

Name: Anonymous 2012-01-31 5:14

I've incorporated some optimisations [1], and the program's runtime is now reduced by 50-80%. I think the only real optimisation left is to read blocks at a time to reduce getchar() call overhead.

http://pastebin.com/miDqFzww
________________________________
[1] http://codereview.stackexchange.com/a/8485/10422

Name: Anonymous 2012-01-31 5:43

One optimisation I made afterwards was to eliminate the getchar() function call overhead by reading and parsing 4096 bytes at a time. This cut down runtimes by at least another 50%.

http://pastebin.com/DfnVi6vt

Name: Anonymous 2012-01-31 5:58

Updated to fix numerous bugs:

http://pastebin.com/NdWivCez

Name: Anonymous 2012-01-31 6:04

More bugs fixed:

http://pastebin.com/ywcm6LWC
