/prog/ - Fuck this character encoding shit.

Name: Anonymous 2008-12-12 15:26

Let's say I have an 8-bit C string in an arbitrary character encoding. The string is a local file path. The encoding is arbitrary because the string comes from some .zip file routines, which do not require any particular encoding, so basically everyone can encode his filenames however he goddamn pleases.
Now, another library requires all file names to be encoded in UTF-8 before they are passed to it.

Now how the fuck do I convert from one encoding to UTF-8 if I do not fucking know which base encoding was used? This is driving me crazy.

Name: Anonymous 2008-12-12 15:27

Let user choose encoding if it's not ascii?

Name: Anonymous 2008-12-12 15:31

>>2 is your best bet if your userbase consists of people who know what `encoding' means. Otherwise, you need to guess. There are lots of libraries available to help you with that.

Name: Anonymous 2008-12-12 15:41

>>1
1. Identify the encoding.
2. Convert to UTF-8 if it's not already UTF-8 or ASCII.
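
A rough sketch of step 2, assuming step 1 already decided the name is ISO-8859-1. That assumption is mine, picked only because Latin-1 maps byte-for-byte onto U+0000..U+00FF and so needs no lookup table. The helper names is_ascii and latin1_to_utf8 are made up for illustration; any other source encoding needs a real conversion library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if every byte is below 0x80, i.e. the string is plain ASCII
 * and therefore already valid UTF-8. */
static int is_ascii(const char *s)
{
    for (; *s; ++s)
        if ((unsigned char)*s >= 0x80)
            return 0;
    return 1;
}

/* Converts a NUL-terminated ISO-8859-1 string to a newly allocated UTF-8
 * string. Returns NULL if allocation fails; the caller frees the result. */
static char *latin1_to_utf8(const char *in)
{
    /* Worst case: every input byte becomes two output bytes. */
    char *out = malloc(2 * strlen(in) + 1);
    char *p = out;

    if (!out)
        return NULL;
    for (; *in; ++in) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            *p++ = (char)c;                    /* ASCII passes through */
        } else {
            *p++ = (char)(0xC0 | (c >> 6));    /* lead byte */
            *p++ = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    *p = '\0';
    return out;
}
```

Call is_ascii first and skip the conversion when it returns 1; otherwise pass the name through latin1_to_utf8 and free the result when done.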

Name: Anonymous 2008-12-12 15:43

>>3
Which libraries? The only one I know of is perl's Encode::Guess, but it doesn't work with 8-bit encodings. I'm not OP so it's all right to HELP ME!

Name: Anonymous 2008-12-12 15:46

>>4
Yes, I did come up with a similar plan before, however I never made it past step 1 since ZIP files do not come with any charset or encoding hints.

>>3
Thank you, I guess I'll look into something like this.
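
On the charset hint: newer revisions of PKWARE's APPNOTE define bit 11 of the general purpose bit flag to mean that the file name and comment are UTF-8. Plenty of archivers never set it, so a clear bit tells you nothing, but a set bit is worth checking before guessing. A minimal sketch, assuming you can get at the raw central directory header bytes (your zip library may not expose them); the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 11 of the general purpose bit flag: names and comments are UTF-8. */
#define ZIP_FLAG_UTF8 0x0800u

/* hdr points at a central directory file header, len is the number of
 * bytes available. Returns true only if the signature matches and
 * bit 11 is set. */
static bool zip_entry_name_is_utf8(const unsigned char *hdr, size_t len)
{
    uint16_t flags;

    if (len < 10)
        return false;
    /* Central directory file header signature: "PK\x01\x02". */
    if (hdr[0] != 0x50 || hdr[1] != 0x4B || hdr[2] != 0x01 || hdr[3] != 0x02)
        return false;
    /* The flags are a little-endian 16-bit field at offset 8. */
    flags = (uint16_t)(hdr[8] | (hdr[9] << 8));
    return (flags & ZIP_FLAG_UTF8) != 0;
}
```

If this returns false you are back to guessing or asking the user.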

Name: Anonymous 2008-12-12 15:57

>>6
Develop your own heuristics to identify the encoding then.

Name: Anonymous 2008-12-12 16:27

import chardet

Name: Anonymous 2008-12-12 18:00

You need to look for a UTF-8 byte order mark.

Name: Anonymous 2008-12-12 18:28

There are many solutions to this problem. And by solutions I mean shitty workarounds.

>>9 is not one of them, since you're very unlikely to find BOMs in filenames (but perhaps UTF-8 in ZIP does use them - I don't know, I'm not an EXPERT on ZIP FILES).

Your best bets are:

1. Just read the character encoding the host OS is using and always assume that. This is what most Windows applications do. This forces users to change said encoding or use tools like AppLocale to troubleshoot encoding problems though.

2. Try to guess. This is the most realistic option.

3. If 2 fails, or if it can return scores and the winner is not very clear, prompt the user giving the best choices with an example of how the filename looks with each. Microsoft Word does this for text files.
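
Options 2 and 3 together could look something like the sketch below: take the scores from whatever detector you use and only accept the winner when it is clearly ahead, otherwise go ask the user. The Candidate struct, the function name, and both thresholds are made up for illustration; real detectors report confidence in their own ways.

```c
#include <stddef.h>

typedef struct {
    const char *name;   /* encoding label, e.g. "Shift_JIS" */
    double score;       /* detector confidence, 0.0 to 1.0 */
} Candidate;

/* Returns the index of the winning candidate, or -1 when the result is
 * too weak or too close to call and the user should be prompted. */
static int pick_or_prompt(const Candidate *c, size_t n,
                          double min_score, double min_margin)
{
    size_t i, best = 0;
    double second = 0.0;

    if (n == 0)
        return -1;
    /* Find the highest-scoring candidate. */
    for (i = 1; i < n; ++i)
        if (c[i].score > c[best].score)
            best = i;
    /* Find the runner-up score. */
    for (i = 0; i < n; ++i)
        if (i != best && c[i].score > second)
            second = c[i].score;
    if (c[best].score < min_score || c[best].score - second < min_margin)
        return -1;
    return (int)best;
}
```

When it returns -1, show the user the filename decoded with each of the top candidates and let them pick.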

Hopefully now you'll understand why ADVANCED ENTERPRISE LANGUAGES such as PYTHON 3000 enforce a universal character encoding to avoid this utter bullshit.

ONE WORD: THE FORCED UNICODE 8-BIT USE. THREAD OVER

Name: Anonymous 2008-12-12 18:39

There are only two encoding systems still in widespread use: UTF-8 and UTF-16. (I'm lumping ASCII into UTF-8 since ASCII can be read as UTF-8 without problems.) UTF-16 should be easy to detect for many languages because it'll have a lot of NUL bytes, or it'll have a lot of non-alphabet bytes. If the text is UTF-16 and has all those NUL bytes, then it'd be easy to tell if it's big-endian or little-endian.

Name: Anonymous 2008-12-12 18:40

>>10
You mean ONE WORD: THE FORCED BREAKING OF READDIR, THREAD OVER

Name: Anonymous 2008-12-12 18:47

>>11
except in Asia.

Name: Anonymous 2008-12-12 19:05

>>13
What, do they not use Unicode?

Anyways, here's a shitty and untested function I wrote to guess between UTF-8 or UTF-16.



#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    //! UTF-8 or ASCII encoding.
    UTF8,
    //! Big-endian UTF-16 encoding.
    UTF16BE,
    //! Little-endian UTF-16 encoding.
    UTF16LE
} GuessedEncoding;

//! Returns true if the bytes have a UTF-8 BOM.
static inline bool hasUTF8BOM(const unsigned char *bytes, size_t size)
{
    return size >= 3 && bytes[0] == 0xEF &&
                        bytes[1] == 0xBB &&
                        bytes[2] == 0xBF;
}

//! Returns true if the bytes have a UTF-16 BE BOM.
static inline bool hasUTF16BEBOM(const unsigned char *bytes, size_t size)
{
    return size >= 2 && bytes[0] == 0xFE &&
                        bytes[1] == 0xFF;
}

//! Returns true if the bytes have a UTF-16 LE BOM.
static inline bool hasUTF16LEBOM(const unsigned char *bytes, size_t size)
{
    return size >= 2 && bytes[0] == 0xFF &&
                        bytes[1] == 0xFE;
}

GuessedEncoding guessEncoding(const unsigned char *bytes, size_t size)
{
    const double UTF16_NUL_THRESHOLD = 0.10;
    size_t i, num_nuls = 0;

    /* If there's a BOM, then rejoice! We have no more work to do. */
    if(hasUTF8BOM(bytes, size))
        return UTF8;
    if(hasUTF16BEBOM(bytes, size))
        return UTF16BE;
    if(hasUTF16LEBOM(bytes, size))
        return UTF16LE;

    /* Count the number of NUL bytes in the bytes. */
    for(i = 0; i < size; ++i)
        if(!bytes[i])
            num_nuls++;

    /* If the fraction of NULs exceeds the threshold, then assume it's UTF-16. */
    if(size > 0 && (double)num_nuls / size >= UTF16_NUL_THRESHOLD) {
        /* Find the first UTF-16 pair with a NUL, staying inside the buffer. */
        for(i = 0; i + 1 < size && bytes[i] && bytes[i+1]; i += 2);

        /* If the first byte of a two-byte pair is NUL, then it's big-endian.
         * Otherwise, it's little-endian.
         */
        if(i + 1 < size)
            return !bytes[i] ? UTF16BE : UTF16LE;
    }

    /* Fuck it, it's probably UTF-8. */
    return UTF8;
}

Name: Anonymous 2008-12-12 19:07

>>14
Anyways, here's a shitty and untested function ...
I wonder if you think any intelligent programmer would read past that part of your sentence.

Name: Anonymous 2008-12-12 19:09

>>14
Correct. Japan still clings to Shift-JIS and Korea uses EUC-KR. I've heard that Japanese and Chinese users are bitter over having to share a single code block in Unicode.

Name: Anonymous 2008-12-12 19:38

>>16
As a matter of culture, Asian trash will use shit encodings to feel like they're worth anything. The same culture that makes them think it's okay for every website to require the installation of some shitty ActiveX plug-in (that doesn't even work in any IE past version 6, forcing them to stay in the past), and that a webpage is better the more flash crap it has and and the more hideous its design is.

ONE WORD: FAILED CULTURE. THREAD OVER

Name: Anonymous 2008-12-12 20:13

>>16
Yeah, Google "Han unification." Amusing drama.

Name: Anonymous 2008-12-13 16:23

>>15
Intelligent programmers on /prog/? WTF are you smoking? (If 420chan were still around, they'd probably want to know that, too.)

Name: Anonymous 2008-12-14 2:19

If 420chan were still around
4/10

Name: Anonymous 2008-12-14 8:47

Fuck this character encoding shit.
~~ANONIX QUALITY~~

Name: Anonymous 2008-12-14 8:57

>>14
/* Fuck it, it's probably UTF-8. */
10/10

Name: Anonymous 2008-12-14 9:33

for(i = 0; i < size + 1 && bytes[i] && bytes[i+1]; i *= 2);
Are you sure?

Name: Anonymous 2008-12-14 11:50

/* fuck it, we'll do it live */

Name: Anonymous 2008-12-14 12:51

>>1

if(!is_valid_utf8(string)){
 /* don't break stuff even if the user feeds us garbage */
 fputs("fuck off!\n", stderr);
 abort();
}
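
The snippet never says where is_valid_utf8 comes from, so here is one possible implementation, a sketch rather than anything battle-tested. It assumes a NUL-terminated string and rejects overlong forms, surrogates, and code points past U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the NUL-terminated string is well-formed UTF-8. */
static bool is_valid_utf8(const char *string)
{
    const unsigned char *s = (const unsigned char *)string;

    while (*s) {
        unsigned long cp;
        size_t extra, i;

        if (*s < 0x80) {                  /* plain ASCII byte */
            ++s;
            continue;
        } else if ((*s & 0xE0) == 0xC0) { /* 2-byte sequence */
            cp = *s & 0x1F; extra = 1;
        } else if ((*s & 0xF0) == 0xE0) { /* 3-byte sequence */
            cp = *s & 0x0F; extra = 2;
        } else if ((*s & 0xF8) == 0xF0) { /* 4-byte sequence */
            cp = *s & 0x07; extra = 3;
        } else {
            return false;                 /* stray continuation or bad lead */
        }
        ++s;
        for (i = 0; i < extra; ++i, ++s) {
            /* A NUL terminator also fails this check, so a truncated
             * sequence never reads past the end of the string. */
            if ((*s & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*s & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;
    }
    return true;
}
```

Keep in mind that passing this check does not prove the name was meant as UTF-8; it only means the bytes happen to be well-formed. Short names in legacy encodings rarely pass by accident, though.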

Name: Anonymous 2010-02-04 5:09

You all forgot Europeans and their ISO-8859-* shit.

Name: Anonymous 2010-02-04 5:44

Just use whatever the system is using if you're on Win32. This sucks if someone used some other encoding when making the archive (Shift-JIS, etc.), but that can be solved simply by using AppLocale. Letting the user choose the encoding is also sane. Guessing is possible, but I doubt it would be reliable, and a misidentified encoding with no way to override it is worse than the other possibilities. Maybe a saner choice is to combine the two approaches: use the system encoding, but allow the user to override it via some option.
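
The combined approach boils down to a precedence rule. A tiny sketch; the function name is hypothetical, and the last-resort "UTF-8" default is my own assumption, not something anyone in the thread proposed. How you obtain the system encoding is up to you: nl_langinfo(CODESET) on POSIX or GetACP() on Win32 are the usual sources.

```c
#include <stddef.h>

/* Picks the encoding label to decode archive names with.
 * Precedence: explicit user override, then the system encoding,
 * then a hard-coded last resort (an assumption of this sketch). */
static const char *choose_encoding(const char *user_override,
                                   const char *system_encoding)
{
    if (user_override && *user_override)
        return user_override;
    if (system_encoding && *system_encoding)
        return system_encoding;
    return "UTF-8";
}
```

The override would typically come from a command-line switch or a config entry, so a user with a Shift-JIS archive on a CP1252 system is not stuck.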

Name: Anonymous 2010-02-04 6:31

>>26
European here. I've been using UTF-8 for I-don't-even-remember-how-long and don't know anyone still using ISO. Troll harder.

Name: Anonymous 2010-02-04 7:55

Why are our buck-toothed slant-eyed yellow friends so fucking バカ? Seriously, cut this shit out, use Unicode like every other goddamn civilized society.

Name: Anonymous 2010-02-04 8:10

>>29
U MENA 馬鹿？
You should look at this graph: http://www.h-online.com/open/news/item/Unicode-dominates-web-918063.html?view=zoom;zoom=1 . If we assume that it is accurate, then it's just a matter of time before non-UTF-8 dies out on the web, and your criticism would be better spent on the US and Western Europe.

Name: Anonymous 2010-02-04 8:16

>>30
HAHAHAHAHAHA
YOU ARE A FUCKING IDIOT
GO LISTEN TO LINKIN PARK, WRITE AN ESSAY, DO HOMEWORK, WATCH FAMILY GUY, OR WHATEVER THE FUCK YOU KIDS DO TODAY
YOU DON'T BELONG HERE.

Name: Anonymous 2010-02-04 10:15

responding to necrobumps

god damn it /prog/, too many imageboard fucks here.

Name: Anonymous 2010-02-04 12:06

>>30
That only covers the web. When it comes to localized applications, some countries keep using their crappy encodings; just look at the Japanese, Chinese and Russians. You have to force the damn locale on each of those applications to be able to use them. In some cases, the application may even check whether the system locale and other settings correspond to its country and, if they don't, either display a message or just silently exit. I'm talking even about apps made in 2010. Xenophobic much?

Name: Anonymous 2010-02-04 15:52

>>32
It's only about 5 weeks old, so let it slide. Necrobumping has happened for a long time, and sometimes led to a renewal of a good conversation. Don't let last year sour your opinion of them.

The real trouble is people like >>31, who can't wait for the weekend to act retarded.
>>33
That is likely to be the case for quite a while, just like companies that use IE6 for their intranet. Best not to think about it.

Name: Anonymous 2010-02-04 16:24

>>34
Actually, I just noticed that it was 2008 not 2009 :( ``sage'' this shit
