
Fuck this character encoding shit.

Name: Anonymous 2008-12-12 15:26

Let's say I have an 8-bit C string in an arbitrary character encoding. This string represents a local file path. The encoding is arbitrary because the string comes from some .zip file routines, which do not require any particular encoding, so basically everyone can encode his filenames however he goddamn pleases.
Now, another library requires all file names to be encoded in UTF-8 before they are passed to it.

Now how the fuck do I convert from one encoding to UTF-8 if I do not fucking know which base encoding was used? This is driving me crazy.

Name: Anonymous 2008-12-12 19:05

>>13
What, do they not use Unicode?

Anyways, here's a shitty and untested function I wrote to guess between UTF-8 and UTF-16.


#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    //! UTF-8 or ASCII encoding.
    UTF8,
    //! Big-endian UTF-16 encoding.
    UTF16BE,
    //! Little-endian UTF-16 encoding.
    UTF16LE
} GuessedEncoding;

//! Returns true if the bytes have a UTF-8 BOM.
static inline bool hasUTF8BOM(unsigned char *bytes, size_t size)
{
    return size >= 3 && bytes[0] == 0xEF &&
                        bytes[1] == 0xBB &&
                        bytes[2] == 0xBF;
}

//! Returns true if the bytes have a UTF-16 BE BOM.
static inline bool hasUTF16BEBOM(unsigned char *bytes, size_t size)
{
    return size >= 2 && bytes[0] == 0xFE &&
                        bytes[1] == 0xFF;
}

//! Returns true if the bytes have a UTF-16 LE BOM.
static inline bool hasUTF16LEBOM(unsigned char *bytes, size_t size)
{
    return size >= 2 && bytes[0] == 0xFF &&
                        bytes[1] == 0xFE;
}

GuessedEncoding guessEncoding(unsigned char *bytes, size_t size)
{
    const double UTF16_NUL_THRESHOLD = 0.10;
    size_t i, num_nuls = 0;

    /* If there's a BOM, then rejoice! we have no more work to do. */
    if(hasUTF8BOM(bytes, size))
        return UTF8;
    if(hasUTF16BEBOM(bytes, size))
        return UTF16BE;
    if(hasUTF16LEBOM(bytes, size))
        return UTF16LE;

    /* Count the number of NUL bytes in the bytes. */
    for(i = 0; i < size; ++i)
        if(!bytes[i])
            num_nuls++;

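    /* If the fraction of NULs reaches the threshold, then assume it's UTF-16.
     * (NULs divided by size, not the other way around; size >= 2 also
     * rules out dividing by zero.)
     */
    if(size >= 2 && (double)num_nuls / size >= UTF16_NUL_THRESHOLD) {
        /* Find the first UTF-16 pair with a NUL, stepping two bytes at a time. */
        for(i = 0; i + 1 < size && bytes[i] && bytes[i+1]; i += 2);

        /* If the first byte of a two-byte pair is NUL, then it's big-endian.
         * Otherwise, it's little-endian. If no whole pair contains a NUL,
         * fall through to the UTF-8 guess below.
         */
        if(i + 1 < size)
            return !bytes[i] ? UTF16BE : UTF16LE;
    }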

    /* Fuck it, it's probably UTF-8. */
    return UTF8;
}
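
For the actual question (unknown 8-bit encoding in, UTF-8 out), a stricter check than "it's probably UTF-8" is to test whether the bytes are well-formed UTF-8 and only convert when they are not. Below is a rough sketch of that idea. The names isValidUTF8 and latin1ToUTF8 are made up for this example, and falling back to ISO-8859-1 is purely an assumption: the real source encoding could be CP437, Shift-JIS or anything else, and nothing in the bytes will tell you for sure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the buffer is well-formed UTF-8 (plain ASCII included). */
bool isValidUTF8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while(i < n) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t len, k;

        if(c < 0x80) { i++; continue; }                      /* ASCII */
        else if((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;          /* stray continuation or invalid lead byte */

        if(n - i < len)
            return false;           /* sequence is cut off by the end of input */

        for(k = 1; k < len; k++) {
            if((s[i + k] & 0xC0) != 0x80)
                return false;       /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if((len == 2 && cp < 0x80) ||
           (len == 3 && cp < 0x800) ||
           (len == 4 && cp < 0x10000) ||
           (cp >= 0xD800 && cp <= 0xDFFF) ||
           cp > 0x10FFFF)
            return false;

        i += len;
    }
    return true;
}

/* Converts ISO-8859-1 bytes to UTF-8 (assumed fallback, see above).
 * out must have room for at least 2 * n bytes.
 * Returns the number of bytes written; no terminating NUL is added.
 */
size_t latin1ToUTF8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, o = 0;

    for(i = 0; i < n; i++) {
        if(in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            /* U+0080..U+00FF always encode as exactly two bytes. */
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}
```

Usage would be something like: if isValidUTF8() says yes, pass the name through untouched, otherwise run it through the fallback conversion. Text in a legacy 8-bit encoding rarely happens to be valid UTF-8 once it contains non-ASCII characters, so the check is a decent filter, but it is still a guess. If your zip routines expose the entry's general purpose bit flag, bit 11 being set is supposed to mean the name is already UTF-8, which beats guessing when it is available.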
