
My world4ch scraper

Name: Anonymous 2010-04-16 17:27

As I've posted several times in the past[1][2][3], I have written a world4ch scraper in Python. I recently rewrote it, cleaning up the code a lot, and I believe its quality is now good enough to withstand /Prague/'s scrutiny.

Source code here: http://www.mediafire.com/?nmy04n5ytgz

Features:
* fairly VROOM VROOM (I just tested it today, archived 11415 threads and 425890 posts in 980.53 seconds; in your face, http://dis.4chan.org/read/prog/1205354504/58)
* has a nice progress bar
* properly parses all Shiitchan fuckups known to me as of now (even the most recent one, http://dis.4chan.org/read/prog/1220718054)

Enjoy.

____________________
References:
1: http://dis.4chan.org/read/prog/1252024842
2: http://dis.4chan.org/read/prog/1255410333/22,24
3: http://dis.4chan.org/read/prog/1205354504/40,43,45

Name: Anonymous 2010-04-16 17:34

> Python [...] withstand /Prague/'s scrutiny
[b]YOU THOUGHT WRONG BITCH[/u]

Name: Anonymous 2010-04-16 17:36

>>2
GET OUT, ME

Name: Anonymous 2010-04-16 18:01

I got turned off when I heard about the /prog/ress bar

Name: Anonymous 2010-04-16 18:06

>>4
You can always turn it off.
see what I did there?

Name: Anonymous 2010-04-16 18:24

>>5
Or I could just write my own.

Name: Anonymous 2010-04-17 6:55

Doesn't work anyway...

Name: Anonymous 2010-04-17 7:11

http://dis.4chan.org/read/prog/1269637830 related

Which interpreter should I use?

Name: Anonymous 2010-04-17 8:26

>>7
Wait, what doesn't?

Name: Anonymous 2010-04-17 14:57

>>7
OP's world4ch scraper.

Name: Anonymous 2010-04-17 16:35

Ugh, what's this PostgreSQL shit? I just want a scraper.

Name: Anonymous 2010-04-17 16:37

yo, there's no escapin',
when my bot starts scrapin',
i could archive you whole site,
all yo images taken

Name: Anonymous 2010-04-17 17:53

>>11
You can't have a scraper.

Name: Anonymous 2010-04-17 19:52

>>11
                            _,.---""----.,_
                         .-' __.----...---.;
                       .'  .'               `'.
                      /  .'                    '.
                     /  /                  `'",  `\
                   .'_,' ,"`                       `'.
                  /,'                       .-.       \
                .'        .-.              /___\  (  . `,
               /  .-'    /___\             |_  |   \  '. \
              ;| /       |_  |             \_)_/   |    \ \
              /| |   ;  _\_)_/             `   ,  /     |  \
             / | |  /  ___   `  .-~`````~-. ,='   |     |  |
            ;  \ \ /.-'   `-._              |    /|     |  \
           .'``''--' _.-      `\            |  .'  \    |  |;
          /         ' _,        \           |-`|   |    |  ||
         /  /        /_\       /_\ .-~``~-. |  |   |   /   ||
        |  |         \(/       |(|          |  /   |   |   ||
       ,|  |        -~~`       ;`|   .---.  |  |   |   /   ||
    .-/ |  |    /        .-~`~  \|  (     '.|  /   ;--/    /|
   / /  |  \   |   -t-           | -~'-\    \  `--'  `-..-` ;
  |/|   /   \  |`._  \      -~` /       |    \             /
  /\|   | .--'  `\ `._`._      /        |-~` |            /
  \/\   '.____,.-'    `""`-._.'    ~-. /     /       .' .'
     \             .--._.-'`         .'`- _.'       (.-'
      |           (       ,    ',_.-'`"""`          |
      \   \     .-'  '-;---;..--'  /,----.y         |
 jgs  |    '.   `.__,-'    /       |      |         |
      |     |`-.,__,    Y  |       |      |         |
      |  == | =|   |    |  |- ~ -. |      | .- ~ -. |
      |     |  |   | == | =|~ - ~` |      | '~ - ~' |
     /.-.-.-\-.-\  |    |  |       |      |         |
     `""""""`""""`/.-.-.-\-.\-.-.-.-\    /.-.-.-.-.-.\
                  `"""""""`""`-------'   '------------'

Name: Anonymous 2010-04-17 22:17

>>14 I LOVE YOU! I LOVE YOUR POST! I READ IT 5 TIMES! KEEP POSTING!

Name: Anonymous 2010-04-19 5:21

5.29 KB file on Mediafire
Github or GTFO.

Name: plop 2010-09-03 1:08

What is the URL to paste into the create command?
worl4chan.py create .....

Name: Anonymous 2010-09-03 9:35

>>17
Reading is hard.

Just use /prog/scrape.

Name: Anonymous 2010-09-03 12:13

I just use wget lol.

Name: /prog/ Etiquette Advisor !fzcXE63Op. 2010-09-03 14:11

Actual content! Have a bump, and know that you are King of /prog/ for two years.

Name: Anonymous 2010-09-03 14:13

>>20
He fails hard for writing it in shitty ass FIOC
Fuck off ``faggot'' we don't like your kind 'round these parts, son.

Name: Anonymous 2010-09-03 14:46

>>20
I'd love for you to point out where the content in >>17- is.

Name: Anonymous 2010-09-03 14:51

>>21
back to /pr/, please.

As obnoxious as >>20 is, you're worse.

Name: Anonymous 2010-09-03 14:53

>>20
( ・∀・) Thanks bro! And sorry for telling you to fuck off in one of the other threads, you sound like a swell fellow.
( ゚ -゚) Though it looks like /Prague/ didn't really like it. I'd like to hear why, according to >>21, I fail at writing in FIOC.

Name: Anonymous 2010-09-03 15:20

>>24
It's a fair bet that >>21 didn't take issue with the way you wrote Python, but with the fact that you used it at all. I would be thoroughly surprised if >>21 actually knew any.

The disadvantages it has over /prog/scrape, in my opinion, though, are these (in no particular order):

    1. It's unnecessarily OO. /prog/scrape is nicely (or shittily, if you prefer) procedural, which makes it easier to follow.
    2. The code is spread out over several files, and it's not necessarily obvious which the main one is. It would be nicer if it were just a single small script you could put somewhere.
    3. When it was written, this wasn't an issue, but /prog/scrape is now more customizable through command line switches.
    4. /prog/scrape also has some technical advantages: it uses the JSON interface by default, and accepts gzipped content. The former makes it less error-prone, the latter more bandwidth-friendly.
    5. It uses Postgres instead of sqlite.
    6. It doesn't have a man page or a bash completion script.
    7. It's a zip file on Mediafire instead of a proper repository on Github or Bitbucket or somewhere.

And also perhaps

    8. It's not compatible with existing /prog/scrape databases. Most people who would be interested already have one of those, and it doesn't really matter that it only takes sixteen minutes to scrape all of /prog/; that's sixteen minutes they wouldn't need to spend if they didn't switch.
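
To make points 4 and 5 concrete, here is a minimal sketch of the idea: accept a gzipped body, parse it as JSON, and store the result in SQLite. This is not /prog/scrape's actual code. The payload shape (a mapping of post number to an object with a "com" field), the table layout, the thread id 1, and the names decode_body and store_posts are all assumptions made up for illustration. It targets Python 3 and fakes the server response locally instead of touching the network.

```python
import gzip
import json
import sqlite3


def decode_body(raw, content_encoding):
    """Decompress the body if it was sent gzipped, then parse it as JSON."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def store_posts(db, thread_id, posts):
    """Insert posts into an SQLite database (hypothetical schema)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "thread INTEGER, num INTEGER, body TEXT, PRIMARY KEY (thread, num))"
    )
    db.executemany(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
        [(thread_id, int(num), post["com"]) for num, post in posts.items()],
    )
    db.commit()


# Made-up payload standing in for a response that arrived with
# "Content-Encoding: gzip" after the client sent "Accept-Encoding: gzip".
fake_json = json.dumps({"1": {"com": "first"}, "2": {"com": "second"}})
payload = gzip.compress(fake_json.encode("utf-8"))

# ":memory:" keeps the example self-contained; a real scraper would pass a
# file name, which is the whole appeal of SQLite over a Postgres server.
db = sqlite3.connect(":memory:")
store_posts(db, 1, decode_body(payload, "gzip"))  # thread id 1 is a placeholder
count = db.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

The design point is that nothing here needs a running database server: the whole archive is one file (or, in this sketch, one in-memory connection).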

Name: Anonymous 2010-09-03 15:25

You forgot the part where it's not written by a Xarn.

Name: Anonymous 2010-09-03 15:29

>>25,26
( ゚ -゚) Oh, I see.

Name: Anonymous 2010-09-03 15:32

>>26
People like Xarn because he wrote /prog/scrape, not the other way around. If someone forked /prog/scrape and made it multi-threaded, it might very well end up more popular than the main branch.

Name: Anonymous 2010-09-03 15:41

It would also help if it and you had a name. It needs a name so people can talk about it, and you need a name so people can contact you with bug reports and feature requests.

But yeah, the main problem is the PostgreSQL.

Name: Anonymous 2010-09-03 15:47

I, for one, enjoyed the fact that you spent so much time berating Xarn and /prog/scrape, and then ended up producing a scraper that was worse in nearly every way, but just had threading.

Name: Anonymous 2010-09-03 16:06

>>30
(´゜д゜)

Name: Anonymous 2010-09-05 14:35

Name: Anonymous 2010-09-05 17:28

>>32
With sixteen threads, that finishes in 698 seconds (503396 posts in 12929 threads). I was going to test >>1's to compare, but it requires Python 2.6. So add that to the list of crimes in >>25.
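
For a rough comparison with the figures in >>1, the arithmetic can be sketched like this. The numbers are the ones quoted in >>1 and here; the helper name rate is made up. The two runs scraped different snapshots of the board on different days, so treat the ratio as a ballpark, not a benchmark.

```python
def rate(count, seconds):
    """Items processed per second."""
    return count / seconds


# >>1: 425890 posts in 980.53 seconds.
op_posts_per_s = rate(425890, 980.53)

# This run: 503396 posts in 698 seconds, with sixteen threads.
threaded_posts_per_s = rate(503396, 698)

# How many times faster the threaded run was, per post.
speedup = threaded_posts_per_s / op_posts_per_s

print("%.1f posts/s vs %.1f posts/s (%.2fx)"
      % (op_posts_per_s, threaded_posts_per_s, speedup))
```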

Name: Anonymous 2010-09-05 17:35

threads = [threading.Thread(target=lambda: scrape_json() if use_json else scrape_html()) \
           for n in xrange(threads)]
map(lambda n: n.start(), threads)

HIBT?

Name: Anonymous 2010-09-06 8:39

http://github.com/Cairnarvon/progscrape/tree/threaded
I just tried this and it's still slow as fuck.
If I start /prog/scrape, wait until it says "Fetching subject.txt..." and then start the other scraper that I use, the other scraper finishes before /prog/scrape says "Got it."

Name: Anonymous 2010-09-06 11:18

>>34
Nothing wrong with that last line.

Name: Anonymous 2010-09-06 12:36

>>36
Except that it's redundant, requires building a list that is never used, and map() is an iterator in Python 3 so it's a very bad habit to get into regardless. Especially considering that the previous line is already iterating the very same data and it would be highly sensible to combine the two. Not to mention that threads is first an integer and then a list, and it's never used aside from that. It's just awful coding style overall.

Much, much better and saner would be:
scraper = (scrape_json if use_json else scrape_html)
for n in xrange(threads):
    threading.Thread(target=scraper).start()

Unless I have missed some nuance, that's equivalent, more efficient, clearer to read, doesn't mutilate existing variables, and on top of that, it's less code.
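
The map() point is easy to check for yourself. A minimal sketch, assuming Python 3 (the snippets above are Python 2, where map() builds the whole list eagerly); worker, workers, and started are made-up names for illustration:

```python
import threading

started = []


def worker():
    # Record that this thread actually ran.
    started.append(threading.current_thread().name)


workers = [threading.Thread(target=worker) for _ in range(4)]

# In Python 3, map() returns a lazy iterator. Nothing has been called yet,
# so no thread has been started at this point.
lazy = map(lambda t: t.start(), workers)
started_before = len(started)

# A plain loop starts every thread, in Python 2 and 3 alike.
for t in workers:
    t.start()
for t in workers:
    t.join()
started_after = len(started)
```

The lazy iterator is deliberately never consumed here: doing so after the loop would call start() a second time on each thread, which raises RuntimeError.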

Name: Anonymous 2010-09-06 13:18

>>37
> Not to mention that threads is first an integer and then a list
Welcome to dynamic type systems. There's no reason to be afraid of them.

> Unless I have missed some nuance
Yes: reference-counting garbage collection.

Name: Anonymous 2010-09-06 17:06

>>38
It's not dynamic typing that I object to, it's the inevitable confusion over what your variable actually means later on. Best to pick a unique name for each purpose.

Fuck, I don't like Haskell, and that's the best thing it has going for it: you can't be a dickwad about variable names and make unmaintainable crap like this.

Name: Anonymous 2012-03-28 2:15

my farts burn my anus
it hurts
in a good way
