You know, I like visitors on my website too, but spamming it all over /prog/ is just sad and pathetic. Don't give me this bullshit that you are not him. It's sad.
>>2
Since nearly all of his posts make it to the front page of r/programming and r/coding nowadays, I don't think Xarn is terribly anxious to get the three or four extra hits being linked to on /prog/ will get him.
Name:
Anonymous 2010-06-25 4:06
mode change 100755 => 100644 progscrape.py
If you just want to scrape world4ch's /prog/, you can run the script directly (./progscrape.py, or python2.5 progscrape.py if you have several versions of Python installed).
HIBT?
This thread is now about progscrape. Is anyone else getting 403 errors when trying to verify tripcodes, or is it just me? Are scrapers banned from using the HTML interface now?
>>38 here, >>40 was my first back to x please, you made my day bro.
Name:
Anonymous 2010-06-26 1:50
>>40
change the useragent string, problem solved.
oh, and now you have to set verify_trips to true, because xarn couldn't figure out how to change the useragent string.
>>49
The server isn't broken. Scrapers are very intentionally blocked from accessing the HTML interface, and it's just common fucking courtesy to respect that.
If you don't understand that, maybe you should stick to the imageboards.
>>57
The HTML interface is working again for me too. I don't know whether the filter was turned off or an exception was added for progscrape specifically or something else was going on.
The latest subject.txt corruption happened when MrVacBob deleted the 327 spam posts yesterday. He also deleted some threads entirely, so the number of posts in your database may no longer agree with the number of posts subject.txt says there should be.
Name:
Anonymous 2010-06-27 4:44
Traceback (most recent call last):
  File "./progscrape.py", line 248, in <module>
    (thread[0], post, p['name'], p['meiru'], p['trip'], p['now'], p['com']))
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
>>68
adding db_conn.text_factory = str after the db_conn = sqlite3.connect(db_name) line seems to have fixed that.
i'd rather not muck around with FIOC's idiotic type system to figure out how to do the "highly recommended" fix instead.
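For comparison, here is a minimal sketch of both routes. It is written for Python 3, not the Python 2.x that produced the traceback (there the workaround is text_factory = str, as the error message itself suggests). The table name `posts`, the column `com`, and the guess that the raw bytes are ISO 8859-1 are assumptions for illustration only, not progscrape's actual schema or data.

```python
import sqlite3

# In-memory database with a made-up one-column table (not progscrape's schema).
db_conn = sqlite3.connect(":memory:")
db_conn.execute("CREATE TABLE posts (com TEXT)")

# Bytes as they might arrive from the network; assumed here to be ISO 8859-1.
raw_comment = b"caf\xe9"

# The "highly recommended" route: decode to a Unicode string before inserting.
db_conn.execute(
    "INSERT INTO posts (com) VALUES (?)",
    (raw_comment.decode("iso-8859-1"),),
)
as_text = db_conn.execute("SELECT com FROM posts").fetchone()[0]

# The text_factory route: ask sqlite3 to hand TEXT columns back as raw bytes.
db_conn.text_factory = bytes
as_bytes = db_conn.execute("SELECT com FROM posts").fetchone()[0]

print(repr(as_text), repr(as_bytes))
```

With the default text_factory the column comes back as a Unicode string. With text_factory = bytes the same column comes back as the UTF-8 bytes SQLite stored.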
>>74
Namely, because web browsers must conform to hundreds, possibly thousands of unique encodings, not just a few that are ``still in widespread use.''
Solution: all strings should be valid XML with an encoding declaration.
Name:
Anonymous 2010-06-28 4:11
>>74
[code]if string is not valid utf8 or ascii
    if string is valid shift_jis
        convert string from shift_jis to utf8
    else convert string from iso 8859-1 to utf8
return string[/code]
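The pseudocode above can be sketched in Python 3 roughly as follows. The function name `to_utf8` is made up for illustration, and "valid" here means whatever Python's own `utf-8` and `shift_jis` codecs accept under strict decoding, which may not match every board's idea of Shift_JIS.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding.

    Order of guesses follows the pseudocode: UTF-8 (which covers ASCII),
    then Shift_JIS, then ISO 8859-1 as the catch-all.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (or plain ASCII): leave untouched
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode("shift_jis").encode("utf-8")
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    for sample in (b"plain", "\u3042".encode("shift_jis"), b"caf\xe9"):
        print(sample, "->", to_utf8(sample))
```

One caveat of this ordering: bytes A1-DF are valid single-byte Shift_JIS (half-width katakana), so some ISO 8859-1 text will be taken for Shift_JIS before the fallback is ever reached.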
>>79
it's trivial to determine if a particular byte sequence is not valid ascii (any bytes with 8th bit set), utf8 (any bytes with 8th bit set that aren't part of a valid utf8 multibyte character), or shift_jis (any bytes other than 00-7F,A1-DF that aren't part of a valid shift_jis double-byte character). no magic necessary.
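A small Python 3 sketch of those checks. `is_ascii` tests the 8th bit directly. `is_valid` (a made-up helper name) leans on a strict decode instead of hand-written byte ranges, so its verdict is that of Python's codecs and may differ from the ranges quoted above at the edges.

```python
def is_ascii(raw: bytes) -> bool:
    # ASCII is exactly the bytes with the 8th bit clear.
    return all(byte < 0x80 for byte in raw)


def is_valid(raw: bytes, encoding: str) -> bool:
    # Strict decoding raises on any byte sequence the codec rejects.
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_ascii(b"abc"), is_ascii(b"\xe9"))
    print(is_valid(b"\xe9", "utf-8"), is_valid(b"\xb1", "shift_jis"))
```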
>>71,73,78,81
Morons like this are precisely why Unicode support is so shitty in most high-level languages: on the one hand there's the assumption that character encodings are easy to get right, and on the other there's the belief that you can reasonably hide character encoding details from the programmer.
Both are obvious bullshit, and history has borne this out again and again.
What I'd like a programming language to do is allow me to convert a string between encodings I tell it to, that is: no more "unrecognized literal at position ..." in Python, where I can't even DO anything with the string.
Alternately, Ruby has no proper notion of Unicode; the best thing it has is converting from one encoding to another. So if you want to keep your head in the sand, go use that.
Name:
Anonymous 2010-06-28 11:48
>>84
Most people who complain about Unicode support in language X (where X is anything besides PHP) just haven't read the documentation. You're no exception.
Name:
Anonymous 2010-06-28 12:09
>>86 :<
No, I do read the documentation when needed. I think that I was able to fix most of my problems by using codecs with the errors='replace' option, or by wrapping a stream in a stream decoder.
Still, it's really annoying when you just want to write a simple script for something (though I should learn Perl for that), or when (as >>69 said) the problem is caused by a library you're using.
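The two fixes mentioned in this post look roughly like this in Python 3. The sample bytes are made up, and `io.BytesIO` stands in for whatever file or socket the real script would read from.

```python
import codecs
import io

raw = b"caf\xe9 ok"  # ISO 8859-1 bytes, not valid UTF-8

# One-shot decode: undecodable bytes become U+FFFD instead of raising.
replaced = raw.decode("utf-8", errors="replace")

# Wrapping a byte stream in a stream decoder with the same error policy.
reader = codecs.getreader("utf-8")(io.BytesIO(raw), errors="replace")
streamed = reader.read()

print(repr(replaced), repr(streamed))
```

Either way the bad byte is silently swapped for a replacement character, which is fine for a throwaway script but loses the original data.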
Name:
Anonymous 2010-06-28 12:11
Oh, also: Python's glob has the clever behaviour of changing the return type based on the input type, that is returning unicode strings when you do glob(u'*'), and ASCII strings (which are incorrect) when you do glob('*').
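The same input-type-decides-output-type behaviour survives in Python 3, where the split is str versus bytes, not unicode versus str. A minimal sketch, using a temporary directory and a made-up file name so it does not depend on the current directory:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one empty file so both globs have something to match.
    open(os.path.join(tmp_dir, "a.txt"), "w").close()

    # str pattern in, str paths out.
    str_results = glob.glob(os.path.join(tmp_dir, "*"))

    # bytes pattern in, bytes paths out.
    bytes_results = glob.glob(os.path.join(os.fsencode(tmp_dir), b"*"))

print(str_results, bytes_results)
```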
>>87
And many of those libraries come standard.
>>> urllib.quote(u'\N{snowman}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib.py", line 1222, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\u2603'
Granted 3.x fixed a lot of the Unicode idiocy, but at the expense of making broken filenames completely invisible and inaccessible, and I'm not sure that was the best tradeoff.
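For reference, a sketch of how both points look in Python 3.1 and later. `urllib.parse.quote` encodes str input as UTF-8 before escaping, and the `surrogateescape` error handler (PEP 383) lets undecodable bytes round-trip through str. The byte string `broken_name` is a made-up example of a filename that is not valid UTF-8.

```python
from urllib.parse import quote

# The snowman from the traceback above, quoted without a KeyError.
snowman = "\N{SNOWMAN}"
quoted = quote(snowman)

# A made-up filename containing a byte that is not valid UTF-8.
broken_name = b"snow\xffman"

# surrogateescape smuggles the bad byte through str as a lone surrogate...
as_str = broken_name.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes.
round_tripped = as_str.encode("utf-8", errors="surrogateescape")

print(quoted, repr(as_str), round_tripped)
```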
>>90
Weren't they going to have a dual bytestring and unicode interface? And then they were going to add some dangerously magical auto-quoting to the unicode interface as well.
>>91
There's a bytes type, which is actually quite useful and sensible -- individual elements are numeric, so b'ABCDE'[1] == 66. Works a lot like char * in C, actually.
I'm not sure what sort of auto-quoting you're referring to.
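A short Python 3 sketch of the bytes behaviour described in this post. The variable names are arbitrary, and `bytearray` is shown only as the mutable counterpart.

```python
data = b"ABCDE"

second = data[1]             # indexing yields an int, not a 1-char string
one_byte_slice = data[1:2]   # slicing still yields bytes
rebuilt = bytes([65, 66, 67])  # build bytes from numeric values

# bytes is immutable; bytearray is the mutable variant, closer to a C buffer.
mutable = bytearray(data)
mutable[0] = 0x61

print(second, one_byte_slice, rebuilt, bytes(mutable))
```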