Let's say I have an 8-bit C string with an arbitrary character encoding. This string represents a local file path. The encoding is arbitrary since it is obtained by calling some .zip file routines - which do not specifically require one encoding or another, so basically everyone can encode his filenames as it goddamn pleases him.
Now, another library requires that all file names are to be encoded in UTF-8 before they can be passed to it.
Now how the fuck do I convert from one encoding to UTF-8 if I do not fucking know which base encoding was used? This is driving me crazy.
Name:
Anonymous2008-12-12 15:27
Let user choose encoding if it's not ascii?
Name:
Anonymous2008-12-12 15:31
>>2 is your best bet if your userbase consists of people who know what `encoding' means. Otherwise, you need to guess. There are lots of libraries available to help you in that.
>>1
1. Identify the encoding.
2. Convert to UTF-8 if it's not already UTF-8 or ASCII.
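For the "already UTF-8 or ASCII" half of step 2, a rough sketch of a validity check in C follows. The function name `is_valid_utf8` is made up for illustration. Passing the check does not prove the name is UTF-8, it only means the bytes are well-formed; in practice non-ASCII names in legacy 8-bit encodings rarely happen to validate, so it is a useful first filter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer is well-formed UTF-8
 * (pure ASCII counts as well-formed), 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (c < 0x80) { i++; continue; }                 /* ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return 0;     /* stray continuation byte or invalid lead byte */

        if (len - i <= n) return 0;                      /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;     /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) || (n == 3 && cp < 0x10000))
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        if (cp > 0x10FFFF) return 0;
        i += n + 1;
    }
    return 1;
}
```

If this returns 1, pass the name through untouched; if it returns 0, you are back at step 1 and have to guess or ask.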
Name:
Anonymous2008-12-12 15:43
>>3
Which libraries? The only one I know of is Perl's Encode::Guess, but it doesn't work with 8-bit encodings. I'm not OP so it's all right to HELP ME!
Name:
Anonymous2008-12-12 15:46
>>4
Yes, I did come up with a similar plan before, however I never made it past step 1 since ZIP files do not come with any charset or encoding hints.
>>3
Thank you, I guess I'll look into something like this.
Name:
Anonymous2008-12-12 15:57
>>6
Develop your own heuristics to identify the encoding then.
Name:
Anonymous2008-12-12 16:27
import chardet
Name:
Anonymous2008-12-12 18:00
You need to look for a UTF-8 byte order mark.
Name:
Anonymous2008-12-12 18:28
There are many solutions to this problem. And by solutions I mean shitty workarounds.
>>9 is not one of them, since you're very unlikely to find BOMs in filenames (but perhaps UTF-8 in ZIP does use them - I don't know, I'm not an EXPERT on ZIP FILES).
Your best bets are:
1. Just read the character encoding the host OS is using and always assume that. This is what most Windows applications do. This forces users to change said encoding or use tools like AppLocale to troubleshoot encoding problems though.
2. Try to guess. This is the most realistic option.
3. If 2 fails, or if it can return scores and the winner is not very clear, prompt the user giving the best choices with an example of how the filename looks with each. Microsoft Word does this for text files.
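Option 3 could be sketched with POSIX iconv along these lines. The names `to_utf8` and `print_candidates` are invented for this example, and the charset names iconv accepts are implementation-defined (check `iconv -l` on your system), so treat the candidate list as an assumption.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: converts `in` (NUL-terminated, assumed to be in
 * `charset`) to UTF-8 in `out`. Returns 0 on success, -1 if the charset is
 * unknown, the bytes are not valid in that charset, or `out` is too small. */
int to_utf8(const char *charset, const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *src = (char *)in;   /* iconv() wants char **, it does not write to the input */
    char *dst = out;
    size_t src_left, dst_left, rc;

    assert(charset != NULL && in != NULL && out != NULL);
    if (out_size == 0) return -1;
    src_left = strlen(in);
    dst_left = out_size - 1;  /* keep one byte for the terminator */

    cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1) return -1;
    rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return -1;

    *dst = '\0';
    return 0;
}

/* Prints how `name` would look under each candidate charset, so the user
 * can pick the one that is not garbage. Candidates that fail to convert
 * are skipped. */
void print_candidates(const char *name, const char *const *charsets, size_t count)
{
    char preview[1024];

    for (size_t i = 0; i < count; i++)
        if (to_utf8(charsets[i], name, preview, sizeof preview) == 0)
            printf("%zu. %s: %s\n", i + 1, charsets[i], preview);
}
```

Something like `print_candidates(name, (const char *const[]){"SHIFT_JIS", "EUC-KR", "ISO-8859-1"}, 3)` would give the user a numbered list to choose from. Note that single-byte charsets such as ISO-8859-1 accept almost any byte sequence, so a successful conversion says nothing about whether the guess is right.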
Hopefully now you'll understand why ADVANCED ENTERPRISE LANGUAGES such as PYTHON 3000 enforce a universal character encoding to avoid this utter bullshit.
ONE WORD: THE FORCED UNICODE 8-BIT USE. THREAD OVER
Name:
Anonymous2008-12-12 18:39
There are only two encoding systems still in widespread use: UTF-8 and UTF-16. (I'm lumping ASCII into UTF-8 since ASCII can be read as UTF-8 without problems.) UTF-16 should be easy to detect for many languages because it'll have a lot of NUL bytes, or it'll have a lot of non-alphabet bytes. If the text is UTF-16 and has all those NUL bytes, then it'd be easy to tell if it's big-endian or little-endian.
Name:
Anonymous2008-12-12 18:40
>>10
You mean ONE WORD: THE FORCED BREAKING OF READDIR, THREAD OVER
/* If there's a BOM, then rejoice! We have no more work to do. */
if(hasUTF8BOM(bytes, size))
    return UTF8;
if(hasUTF16BEBOM(bytes, size))
    return UTF16BE;
if(hasUTF16LEBOM(bytes, size))
    return UTF16LE;
/* Count the number of NUL bytes in the buffer. */
num_nuls = 0;
for(i = 0; i < size; ++i)
    if(!bytes[i])
        num_nuls++;
/* If the fraction of NUL bytes reaches the threshold, then assume it's UTF-16. */
if(size > 0 && (double)num_nuls / size >= UTF16_NUL_THRESHOLD) {
    /* Find the first two-byte pair that contains a NUL. */
    for(i = 0; i + 1 < size && bytes[i] && bytes[i+1]; i += 2);
    /* If the first byte of that pair is NUL, then it's big-endian.
     * Otherwise, it's little-endian.
     */
    if(i + 1 < size)
        return !bytes[i] ? UTF16BE : UTF16LE;
}
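The snippet above never shows the BOM helpers, so here is one way they might look. The helper names are taken from the snippet; the signatures (an `unsigned char` buffer plus a `size_t` length) and the bodies are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer starts with the given byte sequence. */
static int has_prefix(const unsigned char *bytes, size_t size,
                      const unsigned char *bom, size_t bom_len)
{
    assert(bytes != NULL || size == 0);
    return size >= bom_len && memcmp(bytes, bom, bom_len) == 0;
}

/* UTF-8 BOM: EF BB BF */
int hasUTF8BOM(const unsigned char *bytes, size_t size)
{
    static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
    return has_prefix(bytes, size, bom, sizeof bom);
}

/* UTF-16 big-endian BOM: FE FF */
int hasUTF16BEBOM(const unsigned char *bytes, size_t size)
{
    static const unsigned char bom[] = { 0xFE, 0xFF };
    return has_prefix(bytes, size, bom, sizeof bom);
}

/* UTF-16 little-endian BOM: FF FE */
int hasUTF16LEBOM(const unsigned char *bytes, size_t size)
{
    static const unsigned char bom[] = { 0xFF, 0xFE };
    return has_prefix(bytes, size, bom, sizeof bom);
}
```

One caveat: FF FE is also the start of the UTF-32LE BOM (FF FE 00 00), so a stricter detector would check for that first.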
>>14 Anyways, here's a shitty and untested function ...
I wonder if you think any intelligent programmer would read past that part of your sentence.
Name:
Anonymous2008-12-12 19:09
>>14
Correct. Japs still cling to Shift-JIS, Coreans use EUC-KR, and Chinese are too poor computers. I've heard that Japs and Chinese are bitter over having to share a single codeblock in Unicode.
>>16
As a matter of culture, Asian trash will use shit encodings to feel like they're worth anything. The same culture that makes them think it's okay for every website to require the installation of some shitty ActiveX plug-in (that doesn't even work in any IE past version 6, forcing them to stay in the past), and that a webpage is better the more flash crap it has and and the more hideous its design is.
ONE WORD: FAILED CULTURE. THREAD OVER
Name:
Anonymous2008-12-12 20:13
>>16
Yeah, Google "Han unification." Amusing drama.
Name:
Anonymous2008-12-13 16:23
>>15
Intelligent programmers on /prog/? WTF are you smoking? (If 420chan were still around, they'd probably want to know that, too.)
Just use whatever the system is using if you're on Win32. This actually sucks if someone used some other encoding when making the archives (SJIS, etc.), but that can be solved simply by using AppLocale. Letting the user choose the encoding is also sane. Guessing is possible, but I doubt it would be reliable, and a misidentified encoding with no means to control it is worse than the other possibilities. Maybe a saner choice is to combine the two approaches: use the system encoding, but allow the user to override it via some option.
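The combined approach might look roughly like this on a POSIX system. `pick_charset` is a made-up name, and `nl_langinfo(CODESET)` is the POSIX way to ask for the locale's encoding; on Win32 the analogous call is `GetACP()`, which returns a code page number instead of a name.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns the charset to assume for archive filenames.
 * A non-empty user override (e.g. from a command-line option) wins;
 * otherwise fall back to the encoding of the user's locale. */
const char *pick_charset(const char *user_override)
{
    const char *codeset;

    if (user_override != NULL && user_override[0] != '\0')
        return user_override;

    /* Adopt the user's locale so CODESET reflects the system setting.
     * This changes process-wide state; a real program would normally do
     * it once at startup instead. */
    setlocale(LC_CTYPE, "");
    codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}
```

The returned name can then be handed to whatever conversion routine you use as the source charset, with UTF-8 as the target.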
Name:
Anonymous2010-02-04 6:31
>>26
Eurofag here. Using utf-8 for I-don't-even-remember-how-long, dunno no one still using ISO. Troll harder.
Why are our buck-toothed slant-eyed yellow friends so fucking バカ? Seriously, cut this shit out, use Unicode like every other goddamn civilized society.
>>30
HAHAHAHAHAHA
YOU ARE A FUCKING IDIOT
GO LISTEN TO LINKIN PARK, WRITE AN ESSAY, DO HOMEWORK, WATCH FAMILY GUY, OR WHATEVER THE FUCK YOU KIDS DO TODAY
YOU DON'T BELONG HERE.
>>30
That only says something about the Web. When it comes to localized applications, some countries keep using their crappy encodings; just look at the Japanese, Chinese and Russians. You have to force the damn locale on each of those applications to be able to use them. In some cases, the application may even check whether the system locale and other settings correspond to its country and, if they don't, either display a message or just silently exit. I'm talking even about apps made in 2010. Xenophobic much?
>>32
It's only about 5 weeks old, so let it slide. Necrobumping has happened for a long time, and sometimes led to a renewal of a good conversation. Don't let last year sour your opinion of them.
The real trouble is people like >>31, who can't wait for the weekend to act retarded. >>33
That is likely to be the case for quite a while, just like companies that use IE6 for their intranet. Best not to think about it.