Hey Xarn, apply this patch!

Name: Anonymous 2010-07-27 6:37

--- progscrape.py    2010-07-26 13:42:46.000000000 +0200
+++ /tmp/progscrape.py    2010-07-27 11:00:18.991885078 +0200
@@ -299,8 +299,14 @@
 
 
         if verify_trips and len(tripv) > 0:
+            # We get 403 with long URLs
+            # if too many trips to check, fetch the whole thread
+            tripv_url = read_url + thread[0] + '/'
+            if len(tripv) < 200:
+                tripv_url += ','.join(tripv)
             try:
-                hp = urlopen(read_url + thread[0] + '/' + ','.join(tripv))
+                hp = urlopen(tripv_url)
 
             except:
                 print "Couldn't access HTML interface to verify tripcodes.",\


This will solve the 403 errors.
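For anyone reading along, here is a minimal sketch of the same fallback as a standalone function, written for Python 3 (the patch itself targets Python 2). The function name `build_tripv_url` and the `max_trips` parameter are hypothetical; `read_url` and `tripv` mirror the names in the diff, and `tripv` is assumed to be a list of post-number strings. The cutoff of 200 is the one the patch uses.

```python
def build_tripv_url(read_url, thread_id, tripv, max_trips=200):
    """Return the URL used to verify tripcodes for one thread.

    With fewer than max_trips entries, request only those posts.
    Otherwise request the whole thread, which keeps the URL short.
    """
    url = read_url + thread_id + '/'
    if len(tripv) < max_trips:
        url += ','.join(tripv)
    return url
```

Note that the check counts entries rather than measuring the URL length, exactly as the patch does.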

Name: Anonymous 2010-07-27 6:45

Forgot to point out the offending thread.

http://dis.4chan.org/read/prog/1247978789

Name: Anonymous 2010-07-27 7:55

FORGET MY ANUS

Name: Xarn !Rmk.XarnE2!OR/nEWfAt6nbhpH 2010-07-27 14:03

Alright.

Name: Xarn !Rmk.XarnE2!OR/nEWfAt6nbhpH 2010-07-27 15:22

>>5
That's reasonable. Any other suggestions?

Name: Anonymous 2010-07-27 16:55

Bamp for useful thread.

Name: Anonymous 2010-07-27 17:03

How about using multiple threads?

Name: Anonymous 2010-07-27 17:11

>>8
Threading is pointless in the common use case, and approaching DOSing when scraping an entire board.

Name: Anonymous 2010-07-27 17:33

>>6
Add some blinking lights so that the program makes my boss think I'm doing important work.

Name: Anonymous 2010-07-27 17:59

>>9
How about at least using pipelining then?

Name: Anonymous 2010-07-27 18:05

>>11
That's unpythonic.

Name: Anonymous 2010-07-27 18:15

>>12
Oh you. Quit talking up the idea.

Name: Xarn !Rmk.XarnE2!OR/nEWfAt6nbhpH 2010-07-27 18:52

>>11
I'm not entirely convinced that makes a lot of difference, but alright.

Name: Xarn !Rmk.XarnE2!OR/nEWfAt6nbhpH 2010-07-27 19:07

Actually, I guess that last update isn't strictly pipelining, though a lot of people seem to be calling it that. It's just a persistent connection, which is still an improvement.
AFAIK there's no documented way to do HTTP pipelining using just the Python standard library.
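
To illustrate the distinction, here is a rough sketch of connection reuse with the standard library, in Python 3 (`http.client` is the module called `httplib` in Python 2). It is not the code from progscrape; `fetch_all`, the host, and the paths are hypothetical. Each response is read in full before the next request goes out, which is why this is a persistent connection and not pipelining.

```python
import http.client  # named httplib in Python 2


def fetch_all(conn, paths):
    """Fetch each path in turn over one already-open connection.

    A request is only sent after the previous response has been read,
    so requests are never queued up on the wire (no pipelining).
    """
    bodies = []
    for path in paths:
        conn.request('GET', path)
        resp = conn.getresponse()
        # The response must be drained before the connection is reused.
        bodies.append(resp.read())
    return bodies


# Usage (hypothetical host and paths):
# conn = http.client.HTTPConnection('example.org')
# pages = fetch_all(conn, ['/a', '/b'])
# conn.close()
```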

Name: Anonymous 2010-07-27 19:19

Sage for Xarn

Name: Anonymous 2010-07-27 20:18

>>16
Age for anti-Xarn.

Name: Anonymous 2010-07-27 20:35

> AFAIK there's no documented way to do HTTP pipelining using just the Python standard library.
not in the Python standard library, but PycURL can do it: http://pycurl.sourceforge.net/doc/curlmultiobject.html

it should be trivial to check if PycURL is installed and use libhttp otherwise.

Name: Anonymous 2010-07-27 21:05

>>18
s/libhttp/httplib/
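
The "check if PycURL is installed" part could look roughly like this in Python 3, where `httplib` is named `http.client`. This is only a sketch of the import fallback: `pick_backend` is a hypothetical helper, and nothing here performs a request or touches the PycURL multi interface.

```python
try:
    import pycurl  # third-party, may not be installed
except ImportError:
    pycurl = None

import http.client  # standard library; named httplib in Python 2


def pick_backend():
    """Return the name of the HTTP backend that is available."""
    return 'pycurl' if pycurl is not None else 'http.client'
```

The rest of the program would then branch on the result once, at startup, instead of testing for PycURL at every request.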

Name: Anonymous 2010-07-27 21:29

There's also the issue that adding real pipelining would require restructuring most of the program and not really be that useful.

Name: Anonymous 2010-07-27 21:57

> adding real pipelining would require restructuring most of the program
That's a sure sign that the program should have been structured from the start instead of a mess of spaghetti code.

Name: Anonymous 2010-07-27 22:23

>>21
It fetches a page, does things with it, and then does it again for the next one until all pages have been fetched. The fact that adapting that to fetch all of those pages asynchronously with a priority queue (when --verify-trips is turned on) is less than straightforward isn't a comment on the design of the program. It's just the difference between one algorithm and a completely different one.

Name: Anonymous 2010-07-27 22:40

>>22
> Design of a program in no way relates to algorithms and data structures chosen
Clever, almost.

Name: Anonymous 2010-07-27 23:26

>>23
Reading comprehension isn't your forté. You may want to work on that.

Name: Anonymous 2010-07-28 0:26

>>22
If your program performs a trivially parallelizable task and is not itself trivially parallelizable, you're doing something very wrong.

Name: Anonymous 2010-07-28 0:44

>>25
The issues of parallelisation and HTTP pipelining are orthogonal. It's pretty easy to add multithreading to progscrape, and a few people have, over the years.
Though it would fuck up the progress bar.

Name: Anonymous 2010-07-28 2:24

>>26
> Though it would fuck up the progress bar.
HIBT? It can't possibly be that difficult to make a progress bar work with threads.

Name: Anonymous 2010-07-28 13:52

>>27
As written, the progress bar just goes up one line (using the ANSI console codes Xarn is so fond of) and prints some text. If each thread has its own progress bar that will be a bit of a mess.
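
One way around that, sketched in Python 3 under my own assumptions, is to have every worker thread report to a single shared progress line instead of drawing its own. This is not how progscrape does it: the `Progress` class and `run` function are hypothetical, and the redraw uses a carriage return plus the ANSI "erase line" code instead of moving the cursor up a line.

```python
import sys
import threading
from concurrent.futures import ThreadPoolExecutor


class Progress:
    """One progress line shared by all worker threads."""

    def __init__(self, total, out=sys.stdout):
        self.total = total
        self.done = 0
        self.out = out
        self.lock = threading.Lock()

    def tick(self):
        # The lock makes the counter update and the redraw one atomic step,
        # so two threads can never interleave their output.
        with self.lock:
            self.done += 1
            # '\r' returns to column 0; '\x1b[K' erases to end of line.
            self.out.write('\r\x1b[K%d/%d' % (self.done, self.total))
            self.out.flush()


def run(items, work, workers=4, out=sys.stdout):
    """Apply work() to every item in a thread pool, ticking once per item."""
    progress = Progress(len(items), out)

    def job(item):
        result = work(item)
        progress.tick()
        return result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, items))
```

Results come back in input order because `pool.map` preserves it, even though the ticks arrive in whatever order the threads finish.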

Name: Anonymous 2010-08-03 12:14

MESS MY ANUS

Also, I've found that your tripcode searcher does not allow searching for numeric trips like 9000 or so. Please fix it.

Name: Anonymous 2010-08-03 12:20

>>28
Perhaps Xarn should use ncurses.

Name: Anonymous 2010-08-03 12:56

Name: Anonymous 2010-08-03 13:11

>>31
Pig disgusting abuse of whitespace.

Name: Anonymous 2010-08-03 13:56

>>31
Fuck off hotaru.

Name: Anonymous 2010-08-03 14:03

>>33
Hotaru doesn't use capital letters like that, and will probably be annoyed by the fact that I capitalized ``Hotaru".

Name: Anonymous 2010-08-03 14:05

>>33,34
Don't be mean, this Hotaru is only seven years old.

Name: Anonymous 2010-08-03 14:38

>>35
He wrote a tripcode searcher when he was two years old?

Name: Anonymous 2010-08-03 15:16

>>29
Which tripcode searcher is that? Because the one on Github does just fine.

Name: Anonymous 2010-08-03 15:24

>>31
Hey, that's like a less readable, slower version of Xarn's, except that it's not as easy to distribute over a cluster, suffers from NIH syndrome, and doesn't compile.

Name: Anonymous 2010-08-03 15:27

> slower
> not as easy to distribute over a cluster
> doesn't compile
Obviously false.

> less readable
Completely subjective.

> NIH syndrome
I don't think that means what you think it means.

Name: Anonymous 2010-08-03 15:42

>>39
> Completely subjective.
Hello, FV.
