/prog/ - UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 9:56

The msz function can be replaced with inline assembly using the bsr (bit-scan reverse) instruction if you're on x86(_64).

Instead of switching on the state, use an inner loop.

And there's probably a ton of bit-twiddling optimizations you could do, but I can't be bothered to think about it. (Read "Hacker's delight" for more info.)

Also, you've of course profiled this, right?

Name: Anonymous 2012-01-28 10:12

Before you start dicking around with inline-asm, run the loop against a mmap()ed file. It's likely that the getchar() (eg. reading byte-wise from stdin) is the bottle neck and not some bit manipulations.

Name: Anonymous 2012-01-28 10:23

What >>3 said. If you don't want to use mmap you can just try reading larger chunks of bytes into a memory buffer and see if that helps. Of course, to know what really would help, you should profile this, like >>2 said.

Name: Anonymous 2012-01-28 10:34

>>4
>If you don't want to use mmap you can just try reading larger chunks of bytes into a memory buffer and see if that helps.

Why do you insist on commenting about shit you clearly don't understand? It's almost like you're a retard. You understand that you are clueless, yet you still keep rattling on.

Name: 2 2012-01-28 13:22

>>5
If he really wants it to read from stdin, using mmap may not be an option, though I think getchar() already does some buffering internally. (I hardly know C (but I do know Unix, so I know that ttys and pipes normally can't be mmapped (so I'm not a blabbering idiot (that often)))). But yes, mmap is the way to go (but profile!(!)).

Name: Anonymous 2012-01-28 13:28

WHERE THE FUCK IS OP?????

Name: Anonymous 2012-01-28 13:29

this could help op.

http://linux.die.net/man/3/fread



#define BUFFER_LENGTH 1024

char buffer[BUFFER_LENGTH];

int ret = read(buffer, sizeof(char), BUFFER_LENGTH, stdin);

if(ret == 0) {

  // end of file reached. Break out.

} else {

  // ret is the number of characters successfully read into buffer.

  check_buffer_for_utf8_errors(utf8_scanner_state, buffer, ret);

}

Name: Anonymous 2012-01-28 13:30

change read to fread. woops

Name: Anonymous 2012-01-28 15:11

>>9
You're a fucking moron who has no possible future as a computer programmer.

Name: Anonymous 2012-01-28 15:14

>>10
Go scrub another midget you piece of shit faggot.

Name: Anonymous 2012-01-28 15:18

>>11
Go make some more uneducated statements you mental midget.

Name: Anonymous 2012-01-28 15:19

>>11
I can also cite a few instances where your code will break. Man you're one stupid human being.

Name: Anonymous 2012-01-28 15:23

>>1

This program is ridiculously complicated for a simple encoding like UTF-8, and that's the main reason why it doesn't perform as well as you think it should.

You say 1.5 megabytes per second on your notebook?

What CPU? (Atom, or real processor?) What operating system? What version of what compiler did you use, with what options?

Do you repeat the test twice to eliminate caching effects?

How about sequences which are not random bytes? How fast is it on a large file which is correct, or mostly correct UTF-8?

How fast is it on UTF-8 files that are mostly ASCII? How about ones that are mostly Chinese or other characters that require two or three bytes to encode?

How about files that contain a lot of non-BMP characters (beyond U+FFFF).

Random bytes are not even the kind of input that a UTF-8 validator will be expected to usually validate. Suppose you write a validator that processes 100 megs per second on the expected kinds of inputs, but 1 meg per second on random bytes, does it matter?

Wouldn't your validator throw a lot of errors on random bytes, generating a lot of error message output? It does not look as if your program stops on the first error. Do you disable the error output when testing on random data?

Name: Anonymous 2012-01-28 15:28

>>13

that wasn't me.

updated:



#define BUFFER_LENGTH 1024

char buffer[BUFFER_LENGTH];

for(;;) {

  int ret = read(buffer, sizeof(char), BUFFER_LENGTH, stdin);

  if(ret == 0) {

    // end of file reached, or an error occurred.

    // In either case, just break out.

    break;

  } else {

    // ret is the number of characters successfully read into buffer.

    check_buffer_for_utf8_errors(utf8_scanner_state, buffer, ret);

  }

}

mind telling us how the code would break, so that readers of the thread would actually learn something from your responses?

Name: Anonymous 2012-01-28 15:29

>>14

Continuing with earlier post.

In UTF-8 you can just look at one byte of the input. It will tell you exactly that either the byte is plain ASCII (bit 7 is zero), or else you can do some simple masking and testing operations to classify that byte into one of several possibilities. Each possiblity precisely indicates how many continuation bytes follow, so you can divide the code into cases. A continuation byte could also appear, which is an error. In each correct case you can check that there are enough bytes, and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes outside of U+10FFFF can be rejected easily.

You should also check for the invalid code points U+DF00 through U+DFFF.

Name: Anonymous 2012-01-28 15:30

forgot to change it to fread again. woops.

Name: Anonymous 2012-01-28 15:50

>>2
x86(_64)
If it ain't MIPS, it's crap.

Name: Anonymous 2012-01-28 16:08

http://gcc.gnu.org/gcc-4.6/changes.html
A new general optimization level, -Ofast, has been introduced. It combines the existing optimization level -O3 with options that can affect standards compliance but result in better optimized code. For example, -Ofast enables -ffast-math.

Name: Anonymous 2012-01-28 16:17

>>18
Fuck your shitty architecture, x86 is king.

Name: kodak_gallery_programmer !!qmiXqQhekkGXVVD 2012-01-28 16:22

>>15
Eh? Can't google the answer this time you idiot? Again, I don't think computer programming is your thing.

Name: Anonymous 2012-01-28 16:30

>>21

are you saying there is behavior of the fread function that is not documented in its man page? and that this behavior is being invoked by the program, under certain cases?

Name: kodak_gallery_programmer !!A5Gh5Y8ySzoMedM 2012-01-28 16:58

>>22
I suck dicks for a living.

Name: Anonymous 2012-01-28 17:05

>>22
You used read, but gave it the arguments for fread.

Name: MIPS 2012-01-28 17:07

>>20

x86 is king at consuming wattage and frying eggs, or performing like a dog in form factors where it doesn't fry eggs.

MIPS-based processors (and others) have better performance at low voltage levels.

Intel x86 cannot compete in the mobile arena where you need long battery life and decent performance.

Name: Anonymous 2012-01-28 17:16

>>25
Mobile arena? Oh you mean the "fragmented cheap chinese shit for which you can barely find firmware updates, much less source code or (god forbid) specifications" arena? No thanks, I like my freedom. If it says "MIPS" or "ARM" or "Tegra" on the package, you can be certain not to ever truly own your device. Fuck your shit, faggot.

Name: Anonymous 2012-01-28 17:37

>>26

`

>Jewish CPUs with closed-source microcode and BIOS

>freedom

hahahaohwow.jpg

Name: Anonymous 2012-01-28 17:39

>>24
it was a typo. I forgot to correct it twice. mybad.

Name: Anonymous 2012-01-28 17:44

>>27
Alright, then find me a non-x86 10"-display 6-cell battery netbook which can be entirely driven by open source software under 300 USD and I'll agree that your toy architectures stand a chance.

Name: Anonymous 2012-01-28 17:45

>>26
People don't truly own their iPhones too. And they truly don't care.

Name: Anonymous 2012-01-28 17:52

>>30
That's because they're a bunch of ignorant faggots.

Name: Anonymous 2012-01-28 18:02

>>29

that's affected by business stuff though. That doesn't necessarily measure the capabilities of a processor in a certain application. And something that might not be profitable enough to be developed could still be useful for some people, just not enough to justify producing it.

Name: Anonymous 2012-01-28 18:05

Validate my anal dubs

Name: Anonymous 2012-01-28 18:22

>>29
x86 is bloated extensions to a 16-bit toy architecture designed for Datapoint terminals.

Name: Anonymous 2012-01-28 18:42

Use fgets.

Name: Anonymous 2012-01-28 18:48

OP here. I posted this just before I went to sleep. I'm truly impressed with the SNR in this thread, /prog/.

>>2

The msz function can be replaced with inline assembly using the bsr (bit-scan reverse) instruction if you're on x86(_64).

Instead of switching on the state, use an inner loop.

And there's probably a ton of bit-twiddling optimizations you could do, but I can't be bothered to think about it. (Read "Hacker's delight" for more info.)
Will do, thanks.

Also, you've of course profiled this, right?
I'm not sure how to profile a C program, but I'll look it up and find out; thanks.

>>14

>This program is ridiculously complicated for a simple encoding like UTF-8, and that's the main reason why it doesn't perform as well as you think it should.

There is more complexity than a minimal code-size implementation, but I sacrificed some simplicity for (what I think, at least) some optimisation and cleanliness. Technically, I don't even need msz_byte, but finding the most significant zero is a 'cool' way to determine the kind of byte it is (ASCII = 7, continuation = 6, start byte = 5, 4, 3, 2, 1). A simply coded implementation would probably be even slower.

>You say 1.5 megabytes per second on your notebook?

>What CPU? (Atom, or real processor?) What operating system? What version of what compiler did you use, with what options?

delan@pluto ~ $ grep 'model name' /proc/cpuinfo

model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz

model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz

model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz

model name      : Intel(R) Atom(TM) CPU N550   @ 1.50GHz

delan@pluto ~ $ cat /etc/gentoo-release; uname -a

Gentoo Base System release 2.0.3

Linux pluto 3.0.6-gentoo #5 SMP Sun Nov 27 17:26:38 WST 2011 x86_64 Intel(R) Atom(TM) CPU N550 @ 1.50GHz GenuineIntel GNU/Linux

I used tcc 0.9.25 and gcc 4.5.3; gcc with -O0, -O1, -O2 and -O3. Inlining doubled the speed, but it's still ``SLOW AS FUCK''.

>Do you repeat the test twice to eliminate caching effects?
Yes.

>How about sequences which are not random bytes? How fast is it on a large file which is correct, or mostly correct UTF-8?

>How fast is it on UTF-8 files that are mostly ASCII? How about ones that are mostly Chinese or other characters that require two or three bytes to encode?

>How about files that contain a lot of non-BMP characters (beyond U+FFFF).
I haven't tried these; I will now, thanks.

>Random bytes are not even the kind of input that a UTF-8 validator will be expected to usually validate. Suppose you write a validator that processes 100 megs per second on the expected kinds of inputs, but 1 meg per second on random bytes, does it matter?
Now that you mention it... no. Nobody's going to feed /dev/urandom into it anyway.

>Wouldn't your validator throw a lot of errors on random bytes, generating a lot of error message output? It does not look as if your program stops on the first error. Do you disable the error output when testing on random data?
No, I haven't. That'd be a good idea (error() probably has a high function call overhead). Though, I do do the testing like this, so...

time ./a.out < testrandom1meg > /dev/null

Name: Anonymous 2012-01-28 18:51

OP here, continued.

>>16

In UTF-8 you can just look at one byte of the input. It will tell you exactly that either the byte is plain ASCII (bit 7 is zero), or else you can do some simple masking and testing operations to classify that byte into one of several possibilities.
That's what I use msz_byte = msz(byte); for. If msz_byte is 7, then it's ASCII, if it's 6, then it's a continuation, if it's 1 to 5, then it's a start byte.

Each possiblity precisely indicates how many continuation bytes follow, so you can divide the code into cases. A continuation byte could also appear, which is an error. In each correct case you can check that there are enough bytes, and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes outside of U+10FFFF can be rejected easily.
I have done all of these things. Sorry if the code is not clear.

>You should also check for the invalid code points U+DF00 through U+DFFF.
You mean U+D800 to U+DFFF (surrogate pairs)? I have:

/* Check if a codepoint is valid, returning 1 if so and 0 otherwise. Invalid

   codepoints include those higher than U+10ffff, any codepoint from U+fdd0 to

   U+fdef inclusive, as well as the last two codepoints in every plane, and

   all surrogate pair values (U+d800 to U+dfff inclusive). */

 

inline int valid(uint32_t cp) {

        return (

                (cp < 0x110000) &&

                ((cp < 0xfdd0) || (cp > 0xfdef)) &&

                ((cp & 0xfffe) != 0xfffe) &&

((cp & 0xfffff800) != 0xd800)



        );

}

Name: Anonymous 2012-01-28 19:07

This has undefined behavior in it.

Name: YOUR_SUPER_NIGGER 2012-01-28 19:20

>>22
Okay you uneducated stupid shit, compare the following code with yours. Notice the extra cases that I handle.



ssize_t readn(int fd, void *vptr, size_t n) {

    size_t nleft;

    ssize_t nread;

    char *ptr;



    ptr = vptr;

    nleft = n;

    while (nleft > 0) {

        if ((nread = read(fd, ptr, nleft)) < 0) {

            if (errno == EINTR)

                nread = 0;

            else

                return (-1);

        } else if (nread == 0) 

            break;



        nleft -= nread;

        ptr += nread;

    }

    return (n - nleft);

}

Name: >>2 2012-01-28 19:36

>>36
Your welcome.

UTF-8 validator

1 Name: Anonymous 2012-01-28 9:20

2 Name: Anonymous 2012-01-28 9:56

3 Name: Anonymous 2012-01-28 10:12

4 Name: Anonymous 2012-01-28 10:23

5 Name: Anonymous 2012-01-28 10:34

6 Name: 2 2012-01-28 13:22

7 Name: Anonymous 2012-01-28 13:28

8 Name: Anonymous 2012-01-28 13:29

9 Name: Anonymous 2012-01-28 13:30

10 Name: Anonymous 2012-01-28 15:11

11 Name: Anonymous 2012-01-28 15:14

12 Name: Anonymous 2012-01-28 15:18

13 Name: Anonymous 2012-01-28 15:19

14 Name: Anonymous 2012-01-28 15:23

15 Name: Anonymous 2012-01-28 15:28

16 Name: Anonymous 2012-01-28 15:29

17 Name: Anonymous 2012-01-28 15:30

18 Name: Anonymous 2012-01-28 15:50

19 Name: Anonymous 2012-01-28 16:08

20 Name: Anonymous 2012-01-28 16:17

21 Name: kodak_gallery_programmer !!qmiXqQhekkGXVVD 2012-01-28 16:22

22 Name: Anonymous 2012-01-28 16:30

23 Name: kodak_gallery_programmer !!A5Gh5Y8ySzoMedM 2012-01-28 16:58

24 Name: Anonymous 2012-01-28 17:05

25 Name: MIPS 2012-01-28 17:07

26 Name: Anonymous 2012-01-28 17:16

27 Name: Anonymous 2012-01-28 17:37

28 Name: Anonymous 2012-01-28 17:39

29 Name: Anonymous 2012-01-28 17:44

30 Name: Anonymous 2012-01-28 17:45

31 Name: Anonymous 2012-01-28 17:52

32 Name: Anonymous 2012-01-28 18:02

33 Name: Anonymous 2012-01-28 18:05

34 Name: Anonymous 2012-01-28 18:22

35 Name: Anonymous 2012-01-28 18:42

36 Name: Anonymous 2012-01-28 18:48

37 Name: Anonymous 2012-01-28 18:51

38 Name: Anonymous 2012-01-28 19:07

39 Name: YOUR_SUPER_NIGGER 2012-01-28 19:20

40 Name: >>2 2012-01-28 19:36