OP here, continued.
>>16
>In UTF-8 you can just look at one byte of the input. It will tell you exactly that either the byte is plain ASCII (bit 7 is zero), or else you can do some simple masking and testing operations to classify that byte into one of several possibilities.
That's what I use msz_byte = msz(byte); for. If msz_byte is 7, then it's ASCII; if it's 6, then it's a continuation byte; if it's 1 to 5, then it's a start byte.
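For anyone following along, here is a minimal sketch of what a helper like msz could look like. It assumes msz returns the index of the most significant zero bit of the byte, which is the reading that matches the values 7, 6 and 1 to 5 above. The body and the -1 return for 0xff are illustrative assumptions, not necessarily the real implementation.

```c
#include <stdint.h>

/* Hypothetical helper: index (7 down to 0) of the most significant zero
   bit of b, or -1 if b is 0xff and has no zero bit at all. */
static int msz(uint8_t b) {
    for (int bit = 7; bit >= 0; bit--) {
        if (!(b & (1u << bit)))
            return bit;
    }
    return -1;
}
```

With this reading, 0x41 gives 7 (ASCII), 0x80 to 0xbf give 6 (continuation), and 0xc0 to 0xfd give 5 down to 1 (start bytes, including the obsolete 5 and 6 byte forms that a decoder would reject later).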
>Each possibility precisely indicates how many continuation bytes follow, so you can divide the code into cases. A continuation byte could also appear, which is an error. In each correct case you can check that there are enough bytes, and that the code is not overlong (i.e. the character being encoded could use a shorter code). The cases that represent codes outside of U+10FFFF can be rejected easily.
I have done all of these things. Sorry if the code is not clear.
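As a rough illustration of the checks listed in that quote (enough bytes, proper continuation bytes, no overlong forms, nothing above U+10FFFF), here is a minimal sketch. The name decode_one, its signature and the min_cp table are made up for this example and are not taken from the code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: reads one code point from s, where n bytes are
   available. Returns the number of bytes consumed, or 0 on any error. */
static size_t decode_one(const uint8_t *s, size_t n, uint32_t *out) {
    /* Smallest code point that genuinely needs 2, 3 or 4 bytes
       (indexes 0 and 1 are unused). */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                       /* plain ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0)      { len = 2; cp = s[0] & 0x1f; }
    else if ((s[0] & 0xf0) == 0xe0) { len = 3; cp = s[0] & 0x0f; }
    else if ((s[0] & 0xf8) == 0xf0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;   /* stray continuation byte, 5/6 byte form, 0xfe or 0xff */

    if (n < len)
        return 0;                            /* not enough bytes */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                        /* expected a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3f);
    }
    if (cp < min_cp[len])
        return 0;                            /* overlong encoding */
    if (cp > 0x10ffff)
        return 0;                            /* beyond U+10FFFF */
    *out = cp;
    return len;
}
```

This sketch deliberately leaves surrogates and noncharacters to a separate check such as valid(), so the byte-level decoding and the code point policy stay apart.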
>You should also check for the invalid code points U+DF00 through U+DFFF.
You mean U+D800 to U+DFFF (the surrogates)? I have:
#include <stdint.h>

/* Check if a codepoint is valid, returning 1 if so and 0 otherwise. Invalid
   codepoints include those higher than U+10ffff, any codepoint from U+fdd0 to
   U+fdef inclusive, as well as the last two codepoints in every plane, and
   all surrogate values (U+d800 to U+dfff inclusive). */
static inline int valid(uint32_t cp) {
    return (
        (cp < 0x110000) &&                  /* within the Unicode range */
        ((cp < 0xfdd0) || (cp > 0xfdef)) && /* not in U+fdd0 to U+fdef */
        ((cp & 0xfffe) != 0xfffe) &&        /* not U+xxfffe or U+xxffff */
        ((cp & 0xfffff800) != 0xd800)       /* not a surrogate */
    );
}
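In case the masks in valid() look opaque, here is a sketch of the same four tests written as explicit range comparisons. The name valid_explicit is made up for this example. The surrogate mask works because U+D800 to U+DFFF are exactly the values whose bits above the low 11 equal 0xd800, and the 0xfffe mask matches any value whose low 16 bits are 0xfffe or 0xffff.

```c
#include <stdint.h>

/* Hypothetical rewrite of valid() using plain ranges instead of masks. */
static int valid_explicit(uint32_t cp) {
    if (cp > 0x10ffff)
        return 0;                       /* beyond the last plane */
    if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;                       /* surrogates */
    if (cp >= 0xfdd0 && cp <= 0xfdef)
        return 0;                       /* noncharacter block */
    if ((cp & 0xffff) >= 0xfffe)
        return 0;                       /* last two codepoints of a plane */
    return 1;
}
```

Both forms should agree on every input; the masked version just folds each range test into a single AND and compare.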