
UnicodeEncodeError

Name: Anonymous 2009-03-29 9:19

Hey /prog/rammers, I need some help. I'm using Python 2.5.2 in conjunction with mechanize and html5lib (BeautifulSoup parser). I'm trying to parse some pages from 4chan, which often include strange characters (take a look at the attached file). My Python interpreter now always throws up with, e.g.:

"UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)"

This only happens if the characters are really special: German umlauts (ä, ö, ü) work without problems. Furthermore, html5lib should automatically convert the given HTML source to Unicode.

Do you have any ideas?

Name: Anonymous 2009-03-29 16:35

>>1
As you say, BeautifulSoup gives you Unicode output. However, whenever you actually write that output to a file, standard output, a database or whatever, you first have to encode it to bytes using an encoding of your choice. Normally, this would probably be UTF-8.

So, unless you explicitly encode the Unicode string, Python does this automatically for you whenever an 8-bit character string is required. This does, however, expose a fatal weakness in Python: its default encoding is, for some reason, ASCII. This means that e.g. print u'\uXXXX' will try to convert the string using the ASCII codec before printing it, which fails hard for characters outside the ASCII range.
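To make that concrete, here is a minimal sketch of the failure. The explicit .encode('ascii') call stands in for the conversion Python 2 does behind your back; fails_as_ascii and the sample strings are made up for illustration.

```python
def fails_as_ascii(text):
    """Return True if text cannot be encoded with the ASCII codec.

    Hypothetical helper: the explicit call below stands in for the
    implicit conversion Python 2 performs when an 8-bit string is needed.
    """
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False

# Made-up sample data, not taken from the actual page.
plain = u'hello'                 # only code points below 128
fancy = u'\u65e5\u672c\u8a9e'    # three CJK characters, all above 127

plain_fails = fails_as_ascii(plain)
fancy_fails = fails_as_ascii(fancy)
```

Only the second string trips the codec, which is the same UnicodeEncodeError as in >>1.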

So, to circumvent this problem, you have two options:
1. Encode the string manually before processing it further (string.encode('utf-8'))
2. Set the default encoding to UTF-8 (or whichever codec you desire) by executing this somewhere (the reload is needed because setdefaultencoding is removed from sys at startup):
import sys; reload(sys)
sys.setdefaultencoding('utf-8')
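
Option 1 could look roughly like the sketch below. The sample text and the temporary file are assumptions for illustration; in your script the text would be whatever the parser hands back. codecs.open does the encoding for you on every write, so you cannot forget it.

```python
import codecs
import os
import tempfile

# Made-up sample text; stands in for a string taken from the parsed page.
text = u'\u65e5\u672c\u8a9e \u2603'

# Option 1 by hand: turn the Unicode string into UTF-8 bytes yourself.
encoded = text.encode('utf-8')

# Same idea for files: codecs.open encodes on every write.
fd, path = tempfile.mkstemp()
os.close(fd)
out = codecs.open(path, 'w', encoding='utf-8')
try:
    out.write(text)
finally:
    out.close()

# Read the raw bytes back to compare them with the manual encoding.
raw = open(path, 'rb')
try:
    on_disk = raw.read()
finally:
    raw.close()
os.remove(path)
```

Both ways end up with the same bytes, and decoding them with 'utf-8' gives back the original Unicode string.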


Don't worry about it, it took me ages to actually understand the mechanics behind this.

PS: HIBT by posting this? I do not know, but I hope not.
