It forces the parser to restart and reflow the page if the initial guess was wrong. And while it works in practice, encoding detection is a guessing game with no guaranteed answer; you're just lucky the common character encodings happen to be ASCII-compatible enough for this faggotry to work. Typical WEB design quality. I chuckle when people ask for the "correct" way to do stuff in the HTML-and-friends cesspool. Just test it in the three or four browsers that matter, or copy the code from a popular page so it's already tested for you.
>>8
What part did you not understand? The encoding is needed to read the content, and the encoding is specified as part of the content. It's a recipe for disaster.
>>9
Did you mean: the encoding is specified as part of the content, and the encoding is needed to read other parts of the content
Name:
Anonymous2009-06-06 12:38
How can you parse a page if you ignore its encoding? There might be some important Arabic or Chinese mention before the character encoding, so if you must parse the document to attempt to find the character encoding, it is still necessary to parse it again from the start once you know its encoding. It is a colossal waste of CPU cycles, which, at the scale of the web, produces as much carbon as a country like Portugal.
>>23
The browser can do that itself. Or it could just use xattrs.
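The chicken-and-egg dance described above can be sketched in a few lines. This is a rough approximation of what browsers actually do (the real HTML prescan also handles BOMs, http-equiv, and a pile of edge cases), and `sniff_charset`/`decode_page` are names made up for illustration:

```python
import re

def sniff_charset(raw: bytes, default: str = "windows-1252") -> str:
    """Prescan the first 1024 bytes for a meta charset declaration.

    Rough sketch only; the real algorithm in browsers is considerably
    hairier (BOM sniffing, http-equiv parsing, confidence levels...).
    """
    head = raw[:1024]
    m = re.search(rb'<meta[^>]+charset=["\']?([a-zA-Z0-9_-]+)', head)
    return m.group(1).decode("ascii") if m else default

def decode_page(raw: bytes) -> str:
    guess = sniff_charset(raw)
    # If a declaration found deeper in the document disagreed with the
    # prescan guess, a browser would throw the parse away and restart --
    # the "colossal waste of CPU cycles" complained about above.
    return raw.decode(guess, errors="replace")

page = b'<html><head><meta charset="utf-8"></head><body>\xc3\xa9</body></html>'
print(decode_page(page))
```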
Name:
Anonymous2009-06-06 16:39
The problem was solved by >>14. >>11 is a retard. And whoever uses any of the encodings pointed out by >>16 other than UTF-8 deserves to be ASSRAPED OUTRAGEOUSLY.
All UTF variants are by definition a strict superset of all other character sets - in fact, UTF goes as far as having multiple representations for many characters.
Name:
Anonymous2009-06-06 19:46
>>28
You're thinking of the Universal Character Set, not UTF-*.
>>27
There is no point to SJIS where UTF-8 can be used.
Name:
Anonymous2009-06-07 7:32
>>33
There's no point to Japanese where English can be used either.
Name:
Anonymous2009-06-07 8:37
>>34
True, but that's not the point. You don't need SJIS to render nip characters. I don't know why they insist on using SJIS. Maybe because they want to feel unique? Who knows.
Name:
Anonymous2009-06-07 9:46
>>35
It's actually the other way around. The world -should- be using SJIS for everything. If your alphabet can't be found in SJIS, then perhaps you're from a country which, tbqh, doesn't really matter much.
Name:
Anonymous2009-06-07 10:37
>>36
SJIS is PIG DISGUSTING Microsoft proprietary encoding. Please to use free and technology above EUC-JP.
>>36
SJIS fails to encode seven of the world's eight most widely spoken languages.
>>38
Actually the world is encoded in UTF-16LE and arguably transmitted in a mix of UTF-8 and local "whatever Windows 9x shipped with" encodings. When your toy OS reaches 85% market share we'll talk.
>>16
it's invalid to use the meta tag to declare the encoding in documents where the bytes up to that point don't read the same as they would in ASCII; the trick only works because every ASCII-compatible encoding interprets that prefix identically.
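The rule in the post above is mechanically checkable: every byte before the declaration has to be plain ASCII so the prefix reads the same under any ASCII-compatible encoding. A minimal sketch (the helper name is hypothetical, and the real spec also constrains where in the head the tag may appear):

```python
def meta_declaration_is_valid(raw: bytes) -> bool:
    """Check the rule from the post above: a <meta> charset declaration
    is only trustworthy if every byte preceding it is plain ASCII, so
    the prefix decodes identically under any ASCII-compatible encoding.
    Sketch only; the HTML spec's actual wording is stricter."""
    idx = raw.lower().find(b"<meta")
    if idx == -1:
        return True  # nothing declared, nothing to violate
    return all(b < 0x80 for b in raw[:idx])

print(meta_declaration_is_valid(b'<html><meta charset="utf-8">'))
print(meta_declaration_is_valid('<html>\u00e9<meta charset="utf-8">'.encode("utf-8")))
```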
Name:
Anonymous2009-06-08 13:06
>>46
Anything that uses backslashes as path dividers can't be Unix or even close.
>>6
You should use both: HTTP headers for when the page is transferred over HTTP, meta tags for when the page is saved and later loaded from disk.
And you should be using UTF-8, always. China and Japan need to get with the fucking program. UTF-16 is only used for Windows bullshite, because NT was designed before UTF-8 existed.
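With the belt-and-braces approach above, the same charset ends up declared twice: once in the HTTP header for network delivery, once in the markup so a saved copy still decodes. A sketch of what that looks like on the wire (the response bytes are assembled by hand purely for illustration):

```python
# Declare UTF-8 twice, per the post above: in the Content-Type header
# (authoritative during HTTP transfer) and in a <meta> tag (survives
# saving the file to disk, where no headers exist).
body = ('<!DOCTYPE html><html><head><meta charset="utf-8">'
        '<title>\u6587\u5b57\u30b3\u30fc\u30c9</title></head>'
        '<body>encodings test</body></html>')
payload = body.encode("utf-8")
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload
)
print(response.decode("utf-8"))
```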
Name:
Anonymous2009-06-08 14:14
>>49
Too bad Windows NT limits the characters you can put in a file name, even when NTFS itself will happily allow any character except '\' and U+0000. The character I'd really like to use is '?', but there's no excuse for barring any of them: those who care about getting their DOS apps to work can stick with DOS filename restrictions, while the rest of us move on, but NO!
FUCKING MICROSOFT!
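One common workaround for the Win32 restriction, sketched below, is swapping the nine reserved characters for their fullwidth lookalikes. The mapping is a folk convention, not any official API, and `sanitize` is a made-up name:

```python
# The Win32 layer (not NTFS itself) rejects these nine characters in
# file names.  Fullwidth forms live at codepoint + 0xFEE0, look almost
# identical, and are perfectly legal -- a common trick for exactly the
# "saving an image whose name has a '?'" problem from the posts above.
WIN32_RESERVED = '<>:"/\\|?*'
FULLWIDTH = {c: chr(ord(c) + 0xFEE0) for c in WIN32_RESERVED}

def sanitize(name: str) -> str:
    return "".join(FULLWIDTH.get(c, c) for c in name)

print(sanitize("is this a cat?.jpg"))  # -> is this a cat？.jpg
```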
Name:
FrozenVoid2009-06-08 14:17
>>51
Ah well, in a case like that, when you have "absolute freedom" you get undeletable directories, files with names starting with '-', and other funny junk the first time your "happily working" program encounters them.
>>51
Thinking too much of backwards compatibility is Microsoft's biggest problem. Though I suppose that's what you have to do if you don't want to lose your market share.
Name:
Anonymous2009-06-08 14:22
>>51
Also, I'd like to be able to use ?s, for example when saving images whose source I'm not sure of, but since I can't use question marks I settle for perhaps.
Name:
Anonymous2009-06-08 14:50
>China and Japan need to get with the fucking program. UTF-16 is only used for Windows bullshite, because NT was designed before UTF-8 existed.
GB 18030 is better than UTF-8.
Name:
Anonymous2009-06-08 16:16
>>55
Wait, China mandates a Unicode-based encoding? I thought the whole reason China/Japan weren't on Unicode was because of Han unification. Why did China find it necessary to create yet another Unicode encoding? I've glanced over the Wikipedia article, but couldn't find a reason.
>>59
Oh, so they've made new code points for the characters they really want but Unicode combined? I guess that makes sense, then.
Name:
Anonymous2009-06-08 18:22
>>59
Also, all characters that are 1 byte in UTF-8 are 1 byte in GB 18030, and GB 18030 has more characters that are 2 bytes than UTF-8 has that are 2 or 3 bytes, and no character in GB 18030 is more than 4 bytes.
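The size claim in >>60 is easy to spot-check with Python's built-in codecs (assuming, as the post does, text made of common hanzi, which sit in GB 18030's two-byte range):

```python
# ASCII stays 1 byte in both encodings; common hanzi are 2 bytes in
# GB 18030 but 3 bytes in UTF-8 -- hence the "one third" saving
# claimed in the reply below.
samples = {"ascii": "hello", "hanzi": "\u6c49\u5b57\u7f16\u7801", "mixed": "GB 18030 \u5f88\u597d"}
for label, text in samples.items():
    u8, gb = text.encode("utf-8"), text.encode("gb18030")
    print(f"{label:5} utf-8={len(u8):2} bytes  gb18030={len(gb):2} bytes")
```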
Name:
Anonymous2009-06-08 18:44
>>60
The only reason to use GB 18030 is to reduce your memory usage for Chinese text (in RAM and on disk) by one third. Can't imagine why anyone would want that, though. It's not like Chinese people can read.
Name:
Anonymous2009-06-08 18:48
The only encoding I use is Latin-1. You can all go fuck yourselfs'.
>>59
It encodes a superset of the BMP, which is 65,536 code points. The total number of code points is over 9000 more like 1.1 million.
Name:
Anonymous2009-06-08 18:56
>>64 This gives a total of 1,587,600 (126×10×126×10) possible 4 byte sequences, which is easily sufficient to cover Unicode's 1,111,998 (17×65536 − 2048 surrogates − 66 noncharacters) assigned and reserved code points.
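The arithmetic in the post above can be checked directly (the byte ranges are GB 18030's documented four-byte structure):

```python
# GB 18030 four-byte sequences use bytes 0x81-0xFE (126 values) in
# positions 1 and 3 and 0x30-0x39 (10 values) in positions 2 and 4.
four_byte_space = 126 * 10 * 126 * 10
# Unicode: 17 planes of 65,536, minus the surrogate range and the
# 66 noncharacters, as counted in the post above.
unicode_scalars = 17 * 65536 - 2048 - 66
print(four_byte_space, unicode_scalars, four_byte_space >= unicode_scalars)
```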
Name:
Anonymous2009-06-08 19:13
Chinese and Japanese are huge wastes of space. How many code points have we ceded to their bloated ideograms?
Learn to use a reasonably sized alphabet and like it, fags.
>>66
Their literacy rates are better than a lot of countries that use 'reasonably sized alphabets' since each ideogram encodes meaning. If you compare Chinese Twitter to English Twitter, they can have tweets with real content[1]. There is something to be said for their ``alphabets''. That said, just shut up and drink the Unicode kool-aid already.
--
1. Disclaimer: This post is in no way, shape or form an endorsement of twitter(tm) or other web 2.0 faggotry
Name:
Anonymous2009-06-08 20:23
>>67
Chinese is cool for that reason, but Japanese seems to have the worst of both worlds: text is longer than English[1], but they still use thousands of characters.
1. My conclusion based on my four years of high-school Japanese.
Name:
Anonymous2009-06-08 20:26
>>42
Actually, while some behemoths use UTF-8 (mostly the big USA megacorps, also 4chan), I still see plenty of pages using local encodings, especially small sites done by non-EXPERT WEB DEVELOPERS.
Also, while I don't visit a lot of [i]weaboo[/i] or Chinese sites, I have yet to find one that uses UTF-8 as opposed to Shift_JIS or Big5 respectively.
>>67
I really dislike when people say "drink the X kool-aid." Just letting you know.
Name:
Anonymous2009-06-08 21:23
>>67
There may be something to be said for the Chinese ``alphabet'', but that something is not that it has a convenient or sane digital representation.
Name:
Anonymous2009-06-08 21:26
We should introduce a moderate number of word glyphs into English. While far Eastern languages clearly take it too far, surely a few hundred glyphs for the most common words could compact our language immensely.
Name:
Anonymous2009-06-08 21:48
>>73
Just take the language as it is and use order 14 prediction by partial matching with a 512MB adaptive context model initialized with a few megabytes of carefully selected text. Use an arithmetic range coder to write the probability choices using a 4096-symbol alphabet (for example each glyph could be a braille-style 3x4 rectangle).
This should give between 10 and 20 characters per symbol for natural English text, depending on how well the context is able to predict it.
Don't expect high acceptance rates though, as it happens to be a bit human-unfriendly.
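A back-of-envelope check of the "10 to 20 characters per symbol" figure above: a 4096-symbol alphabet carries 12 bits per symbol, and strong context models are commonly credited with compressing English to somewhere around 0.6 to 1.2 bits per character (that range is an assumption on my part, not a measurement):

```python
import math

# 4096 output symbols -> log2(4096) = 12 bits carried per glyph.
bits_per_symbol = math.log2(4096)

# Dividing by an assumed per-character entropy gives characters per
# glyph, recovering the 10-20 range claimed above.
for bits_per_char in (0.6, 1.2):
    chars = bits_per_symbol / bits_per_char
    print(f"{bits_per_char} bits/char -> {chars:.0f} chars per symbol")
```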
Name:
Anonymous2009-06-08 21:58
>>73
You /prog/ers already whine and bitch that people use too many symbols and not enough WORDS.
>>51
You can use characters which are deemed illegal in filenames through NT-specific APIs, but don't blame me if it breaks a lot of other applications and creates 'undeletable/inaccessible' files. There are already plenty of applications compiled in ANSI mode which just plain fail at reading files with Unicode filenames (they do work (partially) if you switch your system locale, or use AppLocale, or some other wrapper, so that the converted filename's characters match the locale's charset).
tl;dr: Those limitations are mostly for your own good, but they're not enough to stop various breakages due to bad coding. You can bypass them if you really want to.
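One documented escape hatch in this direction is the \\?\ path prefix, which tells the Win32 layer to skip its normalization pass. The helper below is a made-up convenience; note the prefix lifts the path-length limit and the DOS reserved-name rules (CON, trailing dots, ...) but does not by itself make every reserved character legal:

```python
def nt_verbatim(abs_path: str) -> str:
    r"""Prefix an absolute Windows path with \\?\ so the Win32 layer
    skips its normalization pass.  This lifts the 260-character path
    limit and lets through names DOS rules would reject (trailing
    dots, 'con', 'aux', ...).  Assumes the caller already supplies an
    absolute, backslash-separated path; helper name is made up."""
    if abs_path.startswith("\\\\?\\"):
        return abs_path
    return "\\\\?\\" + abs_path

print(nt_verbatim("C:\\temp\\con"))  # -> \\?\C:\temp\con
```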