
My world4ch scraper

Name: Anonymous 2010-04-16 17:27

As I've posted several times in the past[1][2][3], I have written a world4ch scraper in Python. I recently rewrote it, cleaning up the code a lot, and I believe its quality is now good enough to withstand /Prague/'s scrutiny.

Source code here: # http://www.mediafire.com/?nmy04n5ytgz #

Features:
* fairly VROOM VROOM (I tested it today: it archived 11415 threads and 425890 posts in 980.53 seconds; in your face, http://dis.4chan.org/read/prog/1205354504/58)
* has a nice progress bar
* properly parses all Shiitchan fuckups known to me as of now (even the most recent one, http://dis.4chan.org/read/prog/1220718054)
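
For anyone who wants the speed claim as rates rather than totals, here is a quick back-of-the-envelope sketch. The three numbers are the ones quoted in the feature list; the variable names are only for illustration.

```python
# Figures quoted in the feature list above.
threads = 11415
posts = 425890
seconds = 980.53

# Derived rates; nothing here is measured, it is plain division.
posts_per_second = posts / seconds
threads_per_second = threads / seconds
minutes = seconds / 60

print(f"{posts_per_second:.1f} posts/s")
print(f"{threads_per_second:.1f} threads/s")
print(f"{minutes:.1f} minutes total")
```

That works out to a bit over 430 posts per second, or roughly sixteen minutes for the whole board.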

Enjoy.

____________________
References:
1: http://dis.4chan.org/read/prog/1252024842
2: http://dis.4chan.org/read/prog/1255410333/22,24
3: http://dis.4chan.org/read/prog/1205354504/40,43,45

Name: Anonymous 2010-09-03 15:20

>>24
It's a fair bet that >>21 didn't take issue with the way you wrote Python, but with the fact that you used it at all. I would be thoroughly surprised if >>21 actually knew any.

In my opinion, though, the disadvantages it has compared to /prog/scrape are these (in no particular order):

    1. It's unnecessarily OO. /prog/scrape is nicely (or shittily, if you prefer) procedural, which makes it easier to follow.
    2. The code is spread out over several files, and it's not necessarily obvious which one is the main one. It would be nicer if it were just a single small script you could put somewhere.
    3. When it was written, this wasn't an issue, but /prog/scrape is now more customizable through command line switches.
    4. /prog/scrape also has some technical advantages: it uses the JSON interface by default, and accepts gzipped content. The former makes it less error-prone, the latter more bandwidth-friendly.
    5. It uses Postgres instead of sqlite.
    6. It doesn't have a man page or a bash completion script.
    7. It's a zip file on Mediafire instead of a proper repository on Github or Bitbucket or somewhere.

And also perhaps

    8. It's not compatible with existing /prog/scrape databases. Most people who would be interested already have one of those, and it doesn't really matter that it only takes sixteen minutes to scrape all of /prog/; that's sixteen minutes they wouldn't need to spend if they didn't switch.
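
To make point 4 concrete, here is a minimal sketch of what "accepts gzipped content" plus "uses the JSON interface" can look like with nothing but the standard library. This is not /prog/scrape's actual code; the function names (build_request, decode_body, fetch_json) are made up for illustration, and the URL you pass in is whatever JSON endpoint the board exposes, which I am not going to guess here.

```python
import gzip
import json
import urllib.request


def build_request(url):
    """Build a request that tells the server we can handle gzip."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(raw, content_encoding):
    """Gunzip the body if the server compressed it, then parse it as JSON."""
    if content_encoding and content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def fetch_json(url):
    """Fetch url and return the decoded JSON payload (does network I/O)."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

The point of the JSON half is that json.loads either gives you structured data or raises, so there is no hand-rolled HTML parsing to trip over; the gzip half just means fewer bytes on the wire for the same posts.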
