My world4ch scraper

Name: Anonymous 2010-04-16 17:27

As I've posted several times in the past[1][2][3], I have written a world4ch scraper in Python. I rewrote it recently, cleaning up the code a lot, and I believe its quality is now good enough to withstand /Prague/'s scrutiny.

Source code here: http://www.mediafire.com/?nmy04n5ytgz

Features:
* fairly VROOM VROOM (I just tested it today, archived 11415 threads and 425890 posts in 980.53 seconds; in your face, http://dis.4chan.org/read/prog/1205354504/58)
* has a nice progress bar
* properly parses all Shiitchan fuckups known to me so far (even the most recent, http://dis.4chan.org/read/prog/1220718054)
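
For scale, the figures from the test run above work out roughly as follows. This is only arithmetic on the numbers quoted in this post; the variable names are made up for illustration and are not taken from the scraper.

```python
# Figures quoted in the post for one full archive run.
threads_archived = 11415
posts_archived = 425890
elapsed_seconds = 980.53

# Derived throughput; the names here are illustrative only.
posts_per_second = posts_archived / elapsed_seconds
threads_per_second = threads_archived / elapsed_seconds
posts_per_thread = posts_archived / threads_archived

print("%.1f posts/s, %.1f threads/s, %.1f posts per thread"
      % (posts_per_second, threads_per_second, posts_per_thread))
```

That is on the order of 430 posts per second over the whole run.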

Enjoy.

____________________
References:
1: http://dis.4chan.org/read/prog/1252024842
2: http://dis.4chan.org/read/prog/1255410333/22,24
3: http://dis.4chan.org/read/prog/1205354504/40,43,45

Name: Anonymous 2010-09-06 12:36

>>36
Except that it's redundant, requires building a list that is never used, and map() is an iterator in Python 3 so it's a very bad habit to get into regardless. Especially considering that the previous line is already iterating the very same data and it would be highly sensible to combine the two. Not to mention that threads is first an integer and then a list, and it's never used aside from that. It's just awful coding style overall.

Much, much better and saner would be:
scraper = (scrape_json if use_json else scrape_html)
for n in xrange(threads):
    threading.Thread(target=scraper).start()

Unless I have missed some nuance, that's equivalent, more efficient, clearer to read, doesn't mutilate existing variables, and on top of that, it's less code.
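
For anyone reading this on Python 3, here is a self-contained sketch of the same idea. Assumptions, since neither >>36 nor the relevant part of the scraper source is quoted here: scrape_html and scrape_json are stand-in stubs, start_workers is a made-up helper, and the worker count of 4 is arbitrary. Unlike the three-liner above, it keeps the Thread objects, but only so the demo can join them. The second half shows why leaning on map() for side effects is a bad habit: in Python 3 nothing runs until the iterator is consumed.

```python
import threading

results = []                      # filled in by the stub workers below
results_lock = threading.Lock()

def scrape_html():
    # Hypothetical stand-in for the real HTML worker.
    with results_lock:
        results.append("html")

def scrape_json():
    # Hypothetical stand-in for the real JSON worker.
    with results_lock:
        results.append("json")

def start_workers(target, count):
    """Start `count` threads running `target`; return them so they can be joined."""
    workers = []
    for _ in range(count):        # range() here; xrange() only exists in Python 2
        worker = threading.Thread(target=target)
        worker.start()
        workers.append(worker)
    return workers

use_json = False
scraper = scrape_json if use_json else scrape_html
for worker in start_workers(scraper, 4):
    worker.join()

# map() is lazy in Python 3: the side effects do not happen
# until something actually consumes the iterator.
touched = []
lazy = map(touched.append, range(4))
touched_before_consuming = len(touched)
list(lazy)                        # forcing the iterator finally runs the calls
touched_after_consuming = len(touched)
```

If the threads never need joining, the plain loop from the post is enough and the list can go.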
