
UTF-8 validator

Name: Anonymous 2012-01-28 9:20

I wrote a little UTF-8 'validator' that checks stdin for a correct UTF-8 stream, and reports any errors along the way. It is very stringent; it even reports overlong forms as an error condition, in addition to the usual unexpected byte errors and such. One problem is that it is ``SLOW AS FUCK''; it can only check about 1.5 MB/s of random bytes on my netbook. Could you, the experts of optimisation, help me, /prog/?

Note: get rid of the ``inline''s if it fails to compile. I was just being retarded there.

http://pastebin.com/e5RrL6nq

Name: Anonymous 2012-01-28 21:18

>>37

You may have done those things, but not in the natural way that most C programmers would have done them. You don't need general bit scanning, just some classification based on masks:

  if ((byte & 0x80) == 0) { /* 0xxxxxxx: ascii */ }
  else if ((byte & 0xC0) == 0x80) { /* 10xxxxxx: cont. byte */ }
  else if ((byte & 0xE0) == 0xC0) { /* 110xxxxx: 2 byte code */ }
  ... etc
 
Each of the cases can be handled in a dedicated block of code.
Not every case falls neatly into the highest significant zero idea.
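To spell the mask idea out in full, here is a rough sketch of a classifier covering every lead-byte pattern. The enum and function names are made up for illustration and are not from your paste. Note that masks alone don't catch everything: 0xC0 and 0xC1 look like 2-byte prefixes but can only start overlong forms, and 0xF5 to 0xF7 look like 4-byte prefixes but encode values past U+10FFFF, so those still need their own checks.

```c
#include <stdint.h>

/* Hypothetical class names, chosen for this sketch only. */
enum byte_class { BC_ASCII, BC_CONT, BC_LEAD2, BC_LEAD3, BC_LEAD4, BC_INVALID };

/* Classify one byte purely by its high-bit pattern. */
static enum byte_class classify_byte(uint8_t byte)
{
    if ((byte & 0x80) == 0x00) return BC_ASCII;   /* 0xxxxxxx */
    if ((byte & 0xC0) == 0x80) return BC_CONT;    /* 10xxxxxx */
    if ((byte & 0xE0) == 0xC0) return BC_LEAD2;   /* 110xxxxx */
    if ((byte & 0xF0) == 0xE0) return BC_LEAD3;   /* 1110xxxx */
    if ((byte & 0xF8) == 0xF0) return BC_LEAD4;   /* 11110xxx */
    return BC_INVALID;                            /* 11111xxx */
}
```

Each return value would then select the dedicated block of code for that case.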

Not saying that this is necessarily where your bottleneck is, but the code isn't exactly tight.

You know, you could even have a 256-entry lookup table to classify the byte.

  switch (byte_0_table[byte]) {
  case ASCII: ... break;
  case INVALID: ... break;
  case CODE2: ... break; // two byte code prefix
  }
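A minimal sketch of how such a table might be built and used, assuming it is filled once at startup. ASCII, INVALID, CODE2 and byte_0_table are the names from the snippet above; CONT, CODE3, CODE4 and both function names are invented here. The table is also a convenient place to reject 0xC0, 0xC1 and 0xF5 to 0xFF outright, since no valid sequence can start with them.

```c
#include <stdint.h>

/* ASCII, INVALID and CODE2 are from the post; CONT, CODE3 and CODE4
   are hypothetical additions for this sketch. */
enum { ASCII, CONT, CODE2, CODE3, CODE4, INVALID };

static uint8_t byte_0_table[256];

/* Fill the table once, before any input is checked. */
static void init_byte_0_table(void)
{
    for (int b = 0; b < 256; b++) {
        if (b < 0x80)      byte_0_table[b] = ASCII;
        else if (b < 0xC0) byte_0_table[b] = CONT;     /* 10xxxxxx */
        else if (b < 0xC2) byte_0_table[b] = INVALID;  /* C0, C1: only overlong forms */
        else if (b < 0xE0) byte_0_table[b] = CODE2;
        else if (b < 0xF0) byte_0_table[b] = CODE3;
        else if (b < 0xF5) byte_0_table[b] = CODE4;
        else               byte_0_table[b] = INVALID;  /* F5..FF: past U+10FFFF */
    }
}

/* Number of continuation bytes expected after this byte,
   or -1 if the byte cannot start a sequence. */
static int expected_continuations(uint8_t byte)
{
    switch (byte_0_table[byte]) {
    case ASCII: return 0;
    case CODE2: return 1;
    case CODE3: return 2;
    case CODE4: return 3;
    default:    return -1;  /* CONT or INVALID */
    }
}
```

Call init_byte_0_table() once, then the hot loop is one table load and a switch per lead byte. This alone doesn't catch every overlong or out-of-range form (for example 0xE0 followed by a byte below 0xA0), so the second byte of 3- and 4-byte codes still needs a range check.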
