Auto-downloading of threads?

Name: X-POST 2008-04-13 20:03

Because of the nature of 4chan I constantly have to download threads and use DownThemAll to get the images. This is fucking tedious because:

1 I have to keep coming back to save the page
2 DownThemAll still has to be manually told to start well past the FIRST IMAGE, because re-downloading from the top is just wasting time


I'd like to be able to just one-click a thread and have the script keep downloading that thread until it dies, but intelligently. So it would download only the new posts, not the entire page each time. It would also download only the newest images, and not bother starting from the top.


Any greasemonkey scripts for this?

Name: Anonymous 2008-04-13 20:05

man httrack

Name: Anonymous 2008-04-13 20:05

Good luck with that -- it's NP-complete.

Name: Anonymous 2008-04-13 20:06

you take 4chan too serious

Name: Anonymous 2008-04-13 20:07

>>4
I take grammar even more serious

Name: Anonymous 2008-04-13 20:18

>>1
Unfortunately, the image boards don't have the http://dis.4chan.org/read/{board}/{thread_id}/{post_no+} syntax that the text boards do, so you can't incrementally retrieve like that. But you may be able to use HTTP's If-Modified-Since header to only grab the page if it's been changed (it should return a 304 status code if not, otherwise the usual 200.)
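
A minimal sketch of that conditional GET in Python, assuming the server honors If-Modified-Since. The function names and the User-Agent string are made up for illustration. Note that urllib reports a 304 as an HTTPError, so it is caught and treated as "nothing new".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET that asks the server to skip the body if nothing changed."""
    headers = {"User-Agent": "thread-watcher-sketch"}  # hypothetical UA string
    if last_modified:
        # Reuse the Last-Modified value from the previous response verbatim.
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_modified(url, last_modified=None):
    """Return (body, last_modified); body is None when the server says 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the copy we already have
            return None, last_modified
        raise  # a 404 here means the thread is gone
```

Feed the returned Last-Modified value back in on the next call; a None body means there is nothing to re-parse.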

Name: Anonymous 2008-04-13 21:11

define(S_SQLCONF, 'MySQL connection failure');     //MySQL connection failure

Name: Anonymous 2008-04-13 21:14

>>6
There's also that subjects.txt (or whatever) that lists all the topics and (if I remember correctly) the number of replies.

Name: Anonymous 2008-04-13 21:15

Name: Anonymous 2008-04-13 21:18

Oh wait you guys are talking about the image boards. FUCK.

Name: Anonymous 2008-04-13 21:44

>>6
I'm not sure if the imageboard dynamic pages honor Range request headers (the server answers those with Content-Range), but combined with If-Modified-Since that would be a good way to avoid re-downloading content you already have, since e.g. the OP does not change throughout the lifetime of the thread.

Name: Anonymous 2008-04-13 21:51

>>11
That's a good idea, but it doesn't appear to work (on img.4chan.org, at least)

Name: Anonymous 2008-04-13 22:04

Why don't you just use a bloom filter?
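
For what it's worth, a toy sketch of a Bloom filter in Python for remembering which image names have already been fetched. The class, the default sizes and the salted-SHA-256 hashing are all assumptions picked for illustration. A Bloom filter can give false positives, so now and then it would skip an image that was never downloaded; a plain set of filenames avoids that.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several salted SHA-256 hashes."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Usage would be something like `if name not in seen: download(name); seen.add(name)`, where `download` is whatever fetcher you already have.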

Name: Anonymous 2008-04-13 22:20

>>6
I'm not sure about If-Modified-Since, but doing a HEAD request and checking Last-Modified works fine. Throw in gzip and exponential backoff for non-changing threads, and you're golden as far as resource hogging goes.
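
A rough Python sketch of the HEAD-plus-backoff loop, assuming the thread page keeps sending Last-Modified. The function names and the 30-second base and 30-minute cap are arbitrary choices for illustration, and gzip is left out to keep it short.

```python
import time
import urllib.error
import urllib.request


def head_last_modified(url):
    """Send a HEAD request and return the Last-Modified header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def next_delay(current, changed, base=30.0, cap=1800.0):
    """Exponential backoff: reset to base on a change, else double up to cap."""
    return base if changed else min(current * 2, cap)


def watch(url, on_change, check=head_last_modified, sleep=time.sleep):
    """Poll until the thread 404s, calling on_change(url) on each update."""
    delay, seen = 30.0, None
    while True:
        try:
            stamp = check(url)
        except urllib.error.HTTPError as err:
            if err.code == 404:  # thread died, stop watching
                return
            raise
        changed = stamp != seen
        if changed:
            seen = stamp
            on_change(url)
        delay = next_delay(delay, changed)
        sleep(delay)
```

The `check` and `sleep` parameters are only there so the loop can be exercised without touching the network.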

Name: Anonymous 2008-04-13 22:36

$ telnet img.4chan.org 80
Trying 66.207.165.181...
Connected to img.4chan.org.
Escape character is '^]'.
GET /b/ HTTP/1.0
Host: img.4chan.org
Range: bytes=100-120

HTTP/1.1 206 Partial Content
Server: nginx
Date: Mon, 14 Apr 2008 02:35:52 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 21
Last-Modified: Mon, 14 Apr 2008 02:35:51 GMT
Connection: close
Content-Range: bytes 100-120/61234

 content="noarchive"/Connection closed by foreign host.

Name: Anonymous 2008-04-13 22:46

>>15
$ telnet img.4chan.org 80
Trying 66.207.165.181...
Connected to img.4chan.org.
Escape character is '^]'.
GET /b/res/62666490.html HTTP/1.0
Host: img.4chan.org
Range: bytes=100-120

 content="noarchive"/Connection closed by foreign host.

It works for threads too. I guess it's because the thread pages aren't dynamically generated, but are just updated with each post.
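
A sketch of the same Range trick from Python, asking only for the bytes past what you already saved. The helper names are hypothetical. It also assumes the earlier bytes of the page never change between posts, which nobody in this thread has verified; if the page footer gets rewritten, the offsets will not line up.

```python
import urllib.error
import urllib.request


def build_range_request(url, start):
    """Ask for everything from byte offset `start` to the end of the page."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-"})


def parse_content_range(value):
    """Parse a value such as 'bytes 100-120/61234' into (first, last, total)."""
    _unit, _, rest = value.partition(" ")
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), (None if total == "*" else int(total))


def fetch_tail(url, start):
    """Return the bytes from `start` onward, or b'' if there are none yet."""
    try:
        with urllib.request.urlopen(build_range_request(url, start), timeout=30) as resp:
            body = resp.read()
            # 206 means the Range was honored; 200 means we got the whole page.
            return body if resp.status == 206 else body[start:]
    except urllib.error.HTTPError as err:
        if err.code == 416:  # Range Not Satisfiable: nothing past `start`
            return b""
        raise
```

For the response in >>15, `parse_content_range` gives a first byte of 100, a last byte of 120 and a total of 61234, which matches the Content-Length of 21.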

Name: Anonymous 2008-04-14 1:01

img's 404 page:

<meta name="generator" content="HTML Tidy for Windows (vers 12 April 2005), see www.w3.org" />

...what?

Name: Anonymous 2008-04-14 2:29

>>1
man wget

Name: Anonymous 2008-04-14 21:58

>>15

Well, shit. You can download the exact range of bytes that you need from a webpage? Great.

Name: Anonymous 2008-04-14 22:41

% alias 4get
4get='wget -r -l1 -A jpg,png,gif -I "*/src" -nd -U "lol" -nc'

% alias 4mirror
4mirror='wget -nc -nH --cut-dirs=1 -p -k -r -l1 -A jpg,png,gif,html -I "*/src,*/thumb" -U "lol"'

% cat .wgetrc
robots=off


4get gets all images, doesn't redownload already downloaded images.

4mirror mirrors the thread, doesn't redownload already downloaded images, but you need to delete res/*.html before you run it again or it won't mirror anything.

If downloading the stupid html page takes too much time for you, you need a better connection.

Homework:
- (Easy) Make a script that will call 4mirror periodically. Extra credit: instead of removing res/*.html before calling 4mirror, rename it to *.html.bak. If you get 404'd, restore the file.

- (Medium) Make an equivalent of 4get (try Perl) that will get the images with their original filenames. If two images on the same thread have the same name, you must not overwrite either one. Extra credit: store the Futaba timestamp in metadata when the format allows it.

- (Advanced) Make a script that will do the equivalent of the script above periodically until you get 404'd. Turn it into a Firefox extension/Greasemonkey script.
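
For the "do not overwrite" part of the Medium item, one possible approach sketched in Python rather than Perl: pick a free name by adding a counter. The function name and the `_1`, `_2` suffix scheme are assumptions, not anything 4get does.

```python
import os


def unique_path(directory, original_name):
    """Return a path in `directory` that does not clobber an existing file."""
    stem, ext = os.path.splitext(original_name)
    candidate = os.path.join(directory, original_name)
    counter = 1
    # Keep bumping the counter until the name is free: cat.jpg, cat_1.jpg, ...
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate
```

The check-then-write is not atomic, which is fine for a single script; opening the file with mode "xb" would close that gap if several downloaders share a directory.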

Name: Anonymous 2008-04-14 22:51

rm -rf /

Name: Anonymous 2009-03-06 12:38


Page OF SICP YOU can not find   a MySpace password   until you find   a peice of   open source software   wikipedia is full   of people like   4 said exactly   what I wanted   to learn COMPUTER   SCIENCE not COMPUTER   BI NEEZ BULSHIT   Plz give me   your web adress.

Name: Anonymous 2009-03-06 15:28


Threads And this will haunt him to   the MAX EXTREME.

Name: Anonymous 2009-08-17 0:57

Lain.

Name: Anonymous 2011-05-08 16:17

Well shit.
Where the fuck did >>25 go?

Name: Sgt.Kabukiman 2012-05-22 3:24

All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
 All work and no play makes Jack a dull boy
