/prog/ - smoke Xarn everyday

Name: Anonymous 2010-06-23 2:51

http://cairnarvon.rotahall.org/

Name: Anonymous 2010-06-28 5:13

>>79
it's trivial to determine if a particular byte sequence is not valid ascii (any byte with the 8th bit set), utf8 (any byte with the 8th bit set that isn't part of a valid utf8 multibyte character), or shift_jis (any byte outside 00-7F and A1-DF that isn't part of a valid shift_jis double-byte character). no magic necessary.
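
A rough sketch of that check in Python 3, assuming the standard library codec names ascii, utf-8 and shift_jis. The helper names is_ascii and is_valid are made up for illustration. The first does the 8th-bit test by hand; the second just lets a strict decode fail.

```python
def is_ascii(data: bytes) -> bool:
    # ASCII is 7-bit: any byte with the 8th bit set rules it out.
    return all(byte < 0x80 for byte in data)


def is_valid(data: bytes, encoding: str) -> bool:
    # Hypothetical helper: a strict decode raises on the first bad sequence.
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


snowman = b"\xe2\x98\x83"   # U+2603 encoded as UTF-8
hiragana_a = b"\x82\xa0"    # U+3042 encoded as Shift_JIS

print(is_ascii(b"hello"), is_ascii(snowman))
print(is_valid(snowman, "utf-8"), is_valid(hiragana_a, "utf-8"))
print(is_valid(hiragana_a, "shift_jis"), is_valid(b"\x82", "shift_jis"))
```

Note that this only tells you what the bytes are not. Plenty of byte strings are valid in several encodings at once.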

Name: Anonymous 2010-06-28 5:22

>>71,73,78,81
Morons like this are precisely why Unicode support is so shitty in most high-level languages: on the one hand there's the assumption that character encodings are easy to get right, and on the other there's the belief that you can reasonably hide character encoding details from the programmer.
Both are obvious bullshit, and history has borne this out again and again.

Name: Anonymous 2010-06-28 7:38

>>82
history has Bjarne this out again and again
Fixed that for you.

Name: Anonymous 2010-06-28 8:29

What I'd like a programming language to do is allow me to convert a string between encodings I tell it to, that is: no more "unrecognized literal at position ..." in Python, where I can't even DO anything with the string.
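
Something like the following is probably the closest Python 3 equivalent: a minimal sketch, assuming you know both encodings up front. transcode is a hypothetical helper name, not a standard library function.

```python
def transcode(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    # Hypothetical helper: decode from the encoding you name,
    # then re-encode to the one you want.
    return data.decode(source, errors).encode(target, errors)


sjis = b"\x82\xa0"                           # U+3042 in Shift_JIS
utf8 = transcode(sjis, "shift_jis", "utf-8")
print(utf8)

# With errors="replace", characters the target can't hold become "?"
# instead of raising.
print(transcode("\u00e9".encode("utf-8"), "utf-8", "ascii", errors="replace"))
```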

Name: Anonymous 2010-06-28 10:45

>>84
Configure Python correctly, then.

Alternatively, Ruby has no proper notion of Unicode; the best it offers is converting from one encoding to another. So if you want to keep your head in the sand, go use that.

Name: Anonymous 2010-06-28 11:48

>>84
Most people who complain about Unicode support in language X (where X is anything besides PHP) just haven't read the documentation. You're no exception.

Name: Anonymous 2010-06-28 12:09

>>86
:<
No, I do read the documentation when needed. I think that I was able to fix most of my problems by using codecs with the errors='replace' option, or by wrapping a stream in a stream decoder.
Still, it's really annoying when you just want to write a simple script for something (though I should learn Perl for that), or when (as >>69 said) the problem is caused by a library you're using.
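
For reference, the two fixes mentioned above look roughly like this in Python 3. This is a sketch with made-up sample bytes, assuming the input is supposed to be UTF-8.

```python
import codecs
import io

raw = b"caf\xe9 \xe2\x98\x83!"   # one stray Latin-1 byte, then a UTF-8 snowman

# Fix 1: decode with errors='replace'; the bad byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# Fix 2: wrap a byte stream in a stream decoder with the same error policy.
reader = codecs.getreader("utf-8")(io.BytesIO(raw), errors="replace")
streamed = reader.read()

print(replaced == streamed)
```

Either way the damaged byte is lost for good, so this suits display and logging better than anything that has to round-trip.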

Name: Anonymous 2010-06-28 12:11

Oh, also: Python's glob has the clever behaviour of changing its return type based on the input type: it returns unicode strings when you do glob(u'*'), and byte strings (which are incorrect) when you do glob('*').
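
The same input-type-decides-output-type behaviour is still there in Python 3, only the split is str versus bytes. A small sketch, using a throwaway temp directory and a made-up file name:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so both patterns have something to match.
    open(os.path.join(tmp, "a.txt"), "w").close()

    text_hits = glob.glob(os.path.join(tmp, "*"))                 # str pattern
    byte_hits = glob.glob(os.path.join(os.fsencode(tmp), b"*"))   # bytes pattern

print(type(text_hits[0]), type(byte_hits[0]))
```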

Name: Anonymous 2010-06-28 12:47

>>88
Except file paths that cannot be decoded to unicode are still returned as byte strings.

Name: Anonymous 2010-06-28 13:10

>>87
And many of those libraries come standard.

>>> urllib.quote(u'\N{snowman}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib.py", line 1222, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\u2603'
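
The usual 2.x workaround is to encode to a byte string before quoting. For comparison, here is a sketch of the 3.x version, where the function lives in urllib.parse and encodes str input as UTF-8 by default:

```python
from urllib.parse import quote

snowman = "\N{SNOWMAN}"

# str input is encoded as UTF-8 before percent-escaping.
quoted_text = quote(snowman)

# Pre-encoded bytes are accepted too and give the same result.
quoted_bytes = quote(snowman.encode("utf-8"))

print(quoted_text, quoted_bytes)
```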

Granted, 3.x fixed a lot of the Unicode idiocy, but at the expense of making broken filenames completely invisible and inaccessible, and I'm not sure that was the best tradeoff.
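
As far as I know, this was addressed from 3.1 onward by PEP 383: undecodable bytes are smuggled through str as lone surrogates via the surrogateescape error handler, so broken names stay visible and round-trip. A sketch with a made-up filename, assuming a UTF-8 filesystem encoding:

```python
raw_name = b"caf\xe9.txt"   # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(ascii(as_text), round_trip == raw_name)
```

The catch is that such a string can't be encoded strictly, so printing it or writing it to a UTF-8 file still blows up unless you pick an error handler.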

Name: Anonymous 2010-06-28 15:27

>>90
Weren't they going to have a dual bytestring and unicode interface? And then they were going to add some dangerously magical auto-quoting to the unicode interface as well.

Name: Anonymous 2010-06-28 15:48

>>91
There's a bytes type, which is actually quite useful and sensible -- individual elements are numeric, so b'ABCDE'[1] == 66. Works a lot like char * in C, actually.
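
A quick sketch of that in Python 3 (the variable names are just for illustration):

```python
data = b"ABCDE"

second = data[1]            # indexing gives an int
piece = data[1:2]           # slicing gives bytes
codes = list(data[:3])      # iterating gives ints

# bytes is immutable; bytearray is the mutable counterpart.
buf = bytearray(data)
buf[0] = 0x61               # overwrite 'A' with 'a'

print(second, piece, codes, bytes(buf))
```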

I'm not sure what sort of auto-quoting you're referring to.

Name: Anonymous 2010-12-28 5:02
