Because of the nature of 4chan I constantly have to download threads and use DownThemAll to get the images. This is fucking tedious because
1. I have to keep coming back to save the page
2. DownThemAll has to be manually told to start downloading from further down the page, because starting over from the FIRST IMAGE every time just wastes time
I'd like to be able to just one-click a thread and have the script keep downloading that thread until it dies, but intelligently. So it would download only the new updates, not the entire page each time. It would also download only the newest images, and not bother starting from the top.
>>1
Unfortunately, the image boards don't have the http://dis.4chan.org/read/{board}/{thread_id}/{post_no+} syntax that the text boards do, so you can't incrementally retrieve like that. But you may be able to use HTTP's If-Modified-Since header to only grab the page if it's been changed (it should return a 304 status code if not, otherwise the usual 200.)
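For what it's worth, a conditional GET along those lines might look roughly like this in Python. This is only a sketch: the function names and the User-Agent string are made up for illustration, and whether the image board servers actually honour If-Modified-Since is an assumption that would need checking.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET request that lets the server skip the body if nothing changed."""
    # The User-Agent value is an arbitrary placeholder, not something the site requires.
    request = urllib.request.Request(url, headers={"User-Agent": "thread-watcher-sketch"})
    if last_modified:
        # Echo back the Last-Modified value saved from the previous response.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, last_modified=None):
    """Return (body, last_modified); body is None when the server answers 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: keep the old timestamp, nothing to download.
            return None, last_modified
        raise  # a 404 (thread died) or any other error is left to the caller
```

urllib reports a 304 as an HTTPError, which is why the "not modified" case is handled in the except branch.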
Oh wait you guys are talking about the image boards. FUCK.
Name:
Anonymous2008-04-13 21:44
>>6
I'm not sure if the imageboard dynamic pages accept Range requests (a server that does answers with a Content-Range header), but combined with If-Modified-Since that would be a good way to avoid downloading content that's already been downloaded, since e.g. the OP does not change throughout the lifetime of the thread.
Name:
Anonymous2008-04-13 21:51
>>11
That's a good idea, but it doesn't appear to work (on img.4chan.org, at least)
Name:
Anonymous2008-04-13 22:04
Why don't you just use a bloom filter?
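If anyone wants to take that literally: a bloom filter could remember which image URLs have already been fetched without storing the URLs themselves. A minimal sketch, assuming Python; the class, the sizes and the hashing scheme are arbitrary illustrative choices. It can give false positives (occasionally skipping an image that was never downloaded) but never false negatives.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter for remembering which image URLs were already seen."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Usage would be `if url not in seen: download(url); seen.add(url)`, where `download` is whatever fetch routine you already have. For a single thread a plain set of URLs is simpler and exact.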
Name:
Anonymous2008-04-13 22:20
>>6
I'm not sure about If-Modified-Since, but doing a HEAD request and checking Last-Modified works fine. Throw in gzip and exponential backoff for non-changing threads, and you're golden as far as resource hogging goes.
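The HEAD-plus-backoff idea could be sketched like this in Python. The 30 second base and 30 minute cap are assumed numbers, not anything the site documents, and the function names are made up for illustration.

```python
import urllib.request


def build_head_request(url):
    """HEAD request: the server sends headers such as Last-Modified but no body."""
    return urllib.request.Request(url, method="HEAD")


def next_delay(current, changed, base=30.0, cap=1800.0):
    """Return how many seconds to wait before the next poll."""
    if changed:
        # The thread is active again, so go back to polling quickly.
        return base
    # Nothing new: double the wait, up to the cap.
    return min(current * 2, cap)


def poll_schedule(changes, base=30.0, cap=1800.0):
    """List the delay chosen after each poll, given whether each poll saw a change."""
    delays = []
    delay = base
    for changed in changes:
        delay = next_delay(delay, changed, base, cap)
        delays.append(delay)
    return delays
```

A real loop would send the HEAD request, compare Last-Modified against the stored value to decide `changed`, then sleep for `next_delay(...)` seconds.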
Name:
Anonymous2008-04-13 22:36
$ telnet img.4chan.org 80
Trying 66.207.165.181...
Connected to img.4chan.org.
Escape character is '^]'.
GET /b/ HTTP/1.0
Host: img.4chan.org
Range: bytes=100-120
Well, shit. You can download the exact range of bytes that you need from a webpage? Great.
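The same request can be scripted. A rough Python sketch, with made-up function names: it asks for everything past the bytes already saved and checks whether the server actually honoured the range (206) or ignored it and sent the whole page (200). It assumes the earlier bytes of the page have not changed, which a dynamically generated page does not guarantee.

```python
import urllib.request


def build_range_request(url, offset):
    """Ask for everything from byte `offset` to the end of the resource."""
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=%d-" % offset)
    return request


def merge_response(status, body, cached):
    """Combine a response body with the bytes that are already on disk."""
    if status == 206:
        # Partial Content: the body is only the new tail, so append it.
        return cached + body
    if status == 200:
        # The server ignored the Range header and sent the full page again.
        return body
    raise ValueError("unexpected status %d" % status)
```

If the offset is at or past the end of the page, expect a 416 error instead, which urllib raises as an HTTPError.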
Name:
Anonymous2008-04-14 22:41
% alias 4get
4get='wget -r -l1 -A jpg,png,gif -I "*/src" -nd -U "lol" -nc'
% alias 4mirror
4mirror='wget -nc -nH --cut-dirs=1 -p -k -r -l1 -A jpg,png,gif,html -I "*/src,*/thumb" -U "lol"'
% cat .wgetrc
robots=off
4get gets all images, doesn't redownload already downloaded images.
4mirror mirrors the thread, doesn't redownload already downloaded images, but you need to delete res/*.html before you run it again or it won't mirror anything.
If downloading the stupid html page takes too much time for you, you need a better connection.
Homework:
- (Easy) Make a script that will call 4mirror periodically. Extra credit: instead of removing res/*.html before calling 4mirror, rename it to *.html.bak. If you get 404'd, restore the file.
- (Medium) Make an equivalent of 4get (try Perl) that will get the images with their original filenames. If two images in the same thread have the same name, you must not overwrite either of them. Extra credit: store the Futaba timestamp in metadata when the format allows it.
- (Advanced) Make a script that will do the equivalent of the script above periodically until you get 404'd. Turn it into a Firefox extension/Greasemonkey script.
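One piece of the Medium exercise, saving under the original filename without overwriting anything, might be sketched like this. It's in Python rather than perl, the `_1`, `_2` suffix scheme is just one assumed convention, and pulling the original filename out of the page is not shown.

```python
import os


def unique_path(directory, original_name):
    """Return a path inside `directory` that does not clobber an existing file."""
    # Drop any directory components so a hostile name cannot escape `directory`.
    original_name = os.path.basename(original_name)
    stem, extension = os.path.splitext(original_name)
    candidate = os.path.join(directory, original_name)
    counter = 1
    while os.path.exists(candidate):
        # cat.jpg is taken, so try cat_1.jpg, cat_2.jpg, ...
        candidate = os.path.join(directory, "%s_%d%s" % (stem, counter, extension))
        counter += 1
    return candidate


def save_image(directory, original_name, data):
    """Write `data` under its original name, or a numbered variant if taken."""
    path = unique_path(directory, original_name)
    # Mode "xb" refuses to open an existing file, guarding against a race.
    with open(path, "xb") as handle:
        handle.write(data)
    return path
```

Saving two different images that are both called cat.jpg would leave cat.jpg and cat_1.jpg side by side.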
Name:
Anonymous2008-04-14 22:51
rm -rf /
Name:
Anonymous2009-03-06 12:38
Page OF SICP YOU can not find a MySpace password until you find a peice of open source software wikipedia is full of people like 4 said exactly what I wanted to learn COMPUTER SCIENCE not COMPUTER BI NEEZ BULSHIT Plz give me your web adress.
Name:
Anonymous2009-03-06 15:28
Threads And this will haunt him to the MAX EXTREME.
All work and no play makes Jack a dull boy