/prog/ - UnicodeEncodeError

Name: Anonymous 2009-03-29 9:19

Hey /prog/rammers, I need some help. I'm using Python 2.5.2 with mechanize and html5lib (BeautifulSoup parser). I'm trying to parse some pages from 4chan, which often include strange characters (take a look at the attached file). My Python interpreter now always throws an error such as:

"UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)"

This only happens if the characters are really special: German umlauts (ä, ö, ü) work without problems. Furthermore, html5lib should automatically convert the given HTML source to Unicode.

Do you have any ideas?

Name: Anonymous 2009-03-29 9:22

>>1
The stimulation of respiration through the introduction of tobacco smoke by a rectal tube was first practiced by the North American Indians.[1] In 1745, Richard Mead was among the first Western scholars to recommend tobacco smoke enemas to resuscitate victims of drowning.[2] One of the earliest reports of resuscitation by rectally applied tobacco smoke dates from 1746, when a seemingly drowned woman is reported as being successfully revived after, on the advice of a passing sailor, the stem of the sailor's pipe was inserted into her rectum and air was blown into the pipe's bowl through a piece of perforated paper.[2]

Name: Anonymous 2009-03-29 9:22

File attached:
http://img3.imagebanana.com/img/x9aklfri/screenshot_002.png

Name: Anonymous 2009-03-29 9:41

No one?

Name: Anonymous 2009-03-29 9:49

>>4
This is not /b/, threads don't die after two hours here, so please wait or [u]go back, please[/u].

Name: Anonymous 2009-03-29 11:06

Use http://synthcode.com/scheme/html-parser.scm

Name: Anonymous 2009-03-29 13:20

Ah, that's why -- Python doesn't support Unicode.

Name: Anonymous 2009-03-29 13:47

>>7
Stop talking bullshit, of course it does, troll.

Name: Anonymous 2009-03-29 15:02

[haxus_the_great@haxxboxx:~] ghci

GHCi, version 6.8.2: http://www.haskell.org/ghc/  :hibt for help

Loading package base ... linking ... done.

Prelude> let ħḁχMyÁɳùẋ = True

Prelude> ħḁχMyÁɳùẋ

True

Prelude>

Sure it does.

Name: Anonymous 2009-03-29 16:35

>>1
As you say, BeautifulSoup gives you Unicode output. However, whenever you actually write that output to a file, standard output, a database, or whatever, you first have to encode it to a byte encoding of your choice. Normally, this would probably be UTF-8.

So, unless you explicitly encode the Unicode string, Python does this automatically for you whenever an 8-bit character string is required. This does, however, expose a fatal weakness in Python: its default encoding is, for some reason, ASCII. That means e.g. print u'\uXXXX' will try to convert the string using the ASCII codec before printing it, which fails hard for characters outside the ASCII range.
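To see the failure in isolation, here is a minimal sketch that does by hand what the implicit conversion does, assuming it is roughly equivalent to calling .encode('ascii') yourself. The names sample and try_ascii are made up for illustration, and the snippet is written to run on both Python 2 and 3.

```python
# Sketch: reproduce the implicit ASCII conversion explicitly.
# `sample` and `try_ascii` are illustrative names, not from any library.
sample = u'\u3042\u3044\u3046'   # three hiragana characters, all outside ASCII

def try_ascii(text):
    """Return text encoded as ASCII bytes, or None if the codec rejects it."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        return None

plain = try_ascii(u'hello')      # pure ASCII, encodes fine
failed = try_ascii(sample)       # None: the ordinals are not in range(128)
```

Anything at or above code point 128 makes the ASCII codec raise, which is the same UnicodeEncodeError quoted in >>1.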

So, to circumvent this problem, you have two options:
1. Encode the string manually before processing it further (string.encode('utf-8'))
2. Set the default encoding to UTF-8 (or whichever codec you desire) by executing this somewhere:

import sys; reload(sys)

sys.setdefaultencoding('utf-8')
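Option 1 could look something like the sketch below. The sample text and the variable names are assumptions for illustration; the only real calls are the built-in encode and decode methods.

```python
# Sketch of option 1: encode explicitly at the point where bytes are needed.
# The sample text and variable names are made up for illustration.
text = u'\u00e4\u00f6\u00fc \u3042'        # umlauts plus one hiragana character

encoded = text.encode('utf-8')             # byte string, safe to write out
decoded = encoded.decode('utf-8')          # back to the same Unicode string

# If the target really must be ASCII, an error handler avoids the exception
# at the cost of losing the characters.
lossy = text.encode('ascii', 'replace')
```

Encoding right where the bytes are needed keeps the choice of codec visible. Option 2 changes the behaviour for the whole process, and it only works after reload(sys) because site.py deletes sys.setdefaultencoding at startup.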

Don't worry about it, it took me ages to actually understand the mechanics behind this.

_{PS: HIBT by posting this? I do not know, but I hope not.}

Name: Anonymous 2009-03-29 16:36

>>10
If you have to ask, you have.

Name: Anonymous 2009-03-29 16:37

I never understood why the class was called "BeautifulSoup." Seems like a pretty non-descriptive of Campbells to me.

Name: Anonymous 2009-03-29 21:21

>>12
Because it's Python.

Name: Anonymous 2009-03-30 4:53

>>10
Alternately, fix your fucking locale and stop taking a shit on your .py files.

Name: Anonymous 2009-03-30 4:58

>>12
Because it serves partly as a pretty-printer for tag-soup HTML files.

Name: Anonymous 2009-03-30 10:20

>>10
You have not been, thanks for your answer! (OP)

Name: Anonymous 2010-12-10 1:19

Name: Anonymous 2011-02-02 22:49

Name: Anonymous 2011-02-03 6:56
