
Fuck this character encoding shit.

Name: Anonymous 2008-12-12 15:26

Let's say I have an 8-bit C string in an arbitrary character encoding. This string represents a local file path. The encoding is arbitrary because the string comes from some .zip file routines, which do not require one encoding or another, so basically everyone can encode his filenames as he goddamn pleases.
Now, another library requires all file names to be encoded in UTF-8 before they are passed to it.

Now how the fuck do I convert from one encoding to UTF-8 if I do not fucking know which base encoding was used? This is driving me crazy.
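
One common workaround, sketched here in C under stated assumptions: check whether the bytes are already well-formed UTF-8 (which also covers plain ASCII), and only if they are not, convert from a single assumed legacy encoding. The fallback below is ISO-8859-1, which is purely an assumption picked because the conversion is trivial. The names is_valid_utf8, latin1_to_utf8 and path_to_utf8 are made up for this example. This cannot recover the real encoding; it only guarantees the other library gets valid UTF-8. If the zip routines expose the per-entry general purpose flags, bit 11 of the .zip format marks a name as UTF-8, and names without it are nominally CP437, so that flag is worth checking before any guessing.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if s[0..len) is well-formed UTF-8, else 0.
   Rejects overlong forms, surrogates and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;               /* continuation bytes expected */
        unsigned long cp, min;  /* decoded code point, smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation byte or invalid lead */

        if (len - i - 1 < n) return 0;  /* sequence truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}

/* ASSUMPTION: the input is ISO-8859-1, where every byte maps directly to
   the code point of the same value. Returns a malloc'd string or NULL. */
char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    char *out = malloc(2 * len + 1);  /* each byte grows to at most 2 bytes */
    char *p = out;
    if (!out) return NULL;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *p++ = (char)c;
        } else {
            *p++ = (char)(0xC0 | (c >> 6));
            *p++ = (char)(0x80 | (c & 0x3F));
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical helper: copy the path if it is already valid UTF-8,
   otherwise fall back to the assumed legacy encoding. Caller frees. */
char *path_to_utf8(const char *path)
{
    size_t len = strlen(path);
    if (is_valid_utf8((const unsigned char *)path, len)) {
        char *copy = malloc(len + 1);
        if (copy) memcpy(copy, path, len + 1);
        return copy;
    }
    return latin1_to_utf8(path);
}
```

Checking for valid UTF-8 first works reasonably well because text in a legacy 8-bit encoding rarely happens to form valid multi-byte UTF-8 sequences. The fallback is the weak part: swap in whatever encoding is most likely for the archives in question.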

Name: Anonymous 2008-12-12 18:39

There are only two encoding systems still in widespread use: UTF-8 and UTF-16. (I'm lumping ASCII into UTF-8 since ASCII can be read as UTF-8 without problems.) UTF-16 should be easy to detect for many languages because it'll have a lot of NUL bytes, or a lot of non-alphabet bytes. If the text is UTF-16 and has all those NUL bytes, it's easy to tell whether it's big-endian or little-endian: for ASCII-range characters the NUL is the first byte of each pair in big-endian and the second in little-endian.
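
The NUL-byte heuristic can be sketched roughly like this in C. It is a guess, not a detector: it only works for text that is mostly in the ASCII range, and it needs an explicit length, because a NUL-terminated C string would be cut off at the first NUL. The name guess_utf16 and the enum are made up for this example, and the byte order mark check at the top is an extra that the heuristic itself does not require.

```c
#include <stddef.h>

enum utf16_guess { NOT_UTF16 = 0, UTF16_LE = 1, UTF16_BE = 2 };

/* Guess whether buf[0..len) is UTF-16 and in which byte order,
   by counting NUL bytes at even and odd offsets. */
enum utf16_guess guess_utf16(const unsigned char *buf, size_t len)
{
    size_t even_nul = 0, odd_nul = 0;

    /* UTF-16 text is a whole number of 2-byte units. */
    if (len < 2 || len % 2 != 0) return NOT_UTF16;

    /* Extra: a byte order mark settles the question immediately. */
    if (buf[0] == 0xFF && buf[1] == 0xFE) return UTF16_LE;
    if (buf[0] == 0xFE && buf[1] == 0xFF) return UTF16_BE;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0) {
            if (i % 2 == 0) even_nul++;
            else odd_nul++;
        }
    }

    /* ASCII-range characters have a zero high byte. Little-endian stores
       the high byte second (odd offsets), big-endian first (even offsets). */
    if (odd_nul > even_nul) return UTF16_LE;
    if (even_nul > odd_nul) return UTF16_BE;
    return NOT_UTF16;  /* no NULs at all, or an even split: cannot tell */
}
```

For example, the four bytes 'a', 0, 'b', 0 come out as UTF16_LE, and 0, 'a', 0, 'b' as UTF16_BE. Text with no NUL bytes at all is reported as NOT_UTF16, which is wrong for UTF-16 text made up entirely of characters outside the ASCII range.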
