>>41
He posted it before. Why don't you look in /prog/scrape :D
Name: Anonymous 2010-03-25 21:47
>>41
You can test it yourself, but a full scrape from scratch took ~16 min, as opposed to ~2 hours with Xarn's scraper (and at that time there were more posts to scrape).
By default it uses 16 worker threads, and it uses the JSON interface (which Xarn might not have been aware of) whenever it can. It also downloads only the missing posts (post selection also works with the JSON interface).
And it has a nice progress bar.
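A rough sketch of the "only the missing posts, 16 workers" idea, for anyone who wants to try it. This is not the actual scraper's code: `missing_posts`, `fetch_missing` and the `fetch_post` callable are made-up names, and the real request code and URLs are deliberately left out.

```python
from concurrent.futures import ThreadPoolExecutor


def missing_posts(have, last_post):
    """Return the post numbers in 1..last_post that are not stored locally."""
    return [n for n in range(1, last_post + 1) if n not in have]


def fetch_missing(have, last_post, fetch_post, workers=16):
    """Fetch only the missing posts, using a pool of worker threads.

    fetch_post is a hypothetical callable (post number -> post data);
    how it talks to the board is an assumption left to the caller.
    """
    wanted = missing_posts(have, last_post)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `wanted`.
        return dict(zip(wanted, pool.map(fetch_post, wanted)))
```

Passing the set of post numbers already on disk as `have` means a re-scrape of an unchanged thread issues no post requests at all.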
>>43
The JSON interface didn't exist when Xarn first wrote /prog/scrape, and the problem with the JSON interface is that there's no way to distinguish a genuine tripcode from a fake one.
Just downloading the missing posts instead of the entire thread is a good idea, though.
Name: Anonymous 2010-03-26 13:38
>>44
> The JSON interface didn't exist when Xarn first wrote /prog/scrape
I heard so too.
> the problem with the JSON interface is that there's no way to distinguish a genuine tripcode from a fake one.
That's why my scraper also downloads the HTML versions of posts with tripcodes to verify them.
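Roughly, the verification step could look like the sketch below. The field names (`num`, `name`), the "!" marker test, and the `fetch_html_trip` callable are all assumptions for illustration; the real JSON layout and HTML parsing may well differ.

```python
def posts_to_verify(json_posts):
    """Return numbers of posts whose name looks like it carries a tripcode.

    Assumption: each post is a dict with "num" and "name" keys, and a
    tripcode (genuine or faked) shows up as "!" somewhere in the name.
    """
    return [p["num"] for p in json_posts if "!" in p.get("name", "")]


def verify_tripcodes(json_posts, fetch_html_trip):
    """Map each suspicious post number to the tripcode seen in its HTML.

    fetch_html_trip is a hypothetical callable (post number -> tripcode
    string, or None when the HTML shows the post has no real tripcode).
    """
    return {num: fetch_html_trip(num) for num in posts_to_verify(json_posts)}
```

Only the posts that look tripcoded cost an extra HTML request; everything else is taken from JSON as-is.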
Name: Anonymous 2010-03-26 14:24
But your scraper doesn't correctly parse the line that progscrape fails on, because there is no correct way to parse it. It's Shiichan's fuckup, and progscrape is fine.
Name: Anonymous 2010-03-26 17:55
There's also the issue that the JSON interface is specific to world4ch, whereas /prog/scrape, name notwithstanding, wants to be a general Shiitchan scraper.
>>45
If you use the JSON interface but also pull the HTML to verify tripcodes, your implementation will be significantly slower than one using just the HTML interface, at least on /prog/ itself.
The only real speed-up you can get is from multi-threading, but as netiquette goes that's barely a step up from DOSing dis.
>>57
Often enough.
One additional problem is the fact that Shiitchan is a clusterfuck of bugs, and threads like http://dis.4chan.org/read/prog/1237515841 exist. Through the HTML interface that's passed over immediately, but through JSON there's a thousand posts there, each of which will trigger pulling a post through the HTML interface.
You can change your strategy specifically to deal with threads like that, but that will make the general case more problematic.
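To put numbers on that, here is a back-of-the-envelope request count. It is a simplified model, assuming the HTML-only scraper needs one request per thread and the JSON scraper needs one JSON request per thread plus one HTML request per post it wants to verify; the function names are invented for this sketch.

```python
def requests_html_only(thread_count):
    """HTML-only scraping: assumed to cost one page request per thread."""
    return thread_count


def requests_json_with_verify(flagged_per_thread):
    """JSON scraping plus HTML verification.

    flagged_per_thread lists, for each thread, how many posts need an
    extra HTML request. Each thread also costs one JSON request.
    """
    return sum(1 + flagged for flagged in flagged_per_thread)
```

With three threads where 0, 3 and 1000 posts need checking, `requests_json_with_verify([0, 3, 1000])` comes to 1006 requests against 3 for `requests_html_only(3)`, so a single broken thread dominates the total.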
If you don't care about verifying tripcodes at all, the JSON interface is slightly faster than the HTML one, but it will still take a few hours.
>>58
Threads like that one probably exist because someone abused that silent bump bug to kill a thread (a bug which is now hopefully fixed), and whoever fixed the thread deleted all the exploit posts but did not remove the final post, number 1000.
>>59
> a bug which is now hopefully fixed
You give the world4ch admins too much credit. Has the newline bug even been fixed yet?
Name: Anonymous 2010-03-28 19:20
Xarn is a Deathclaw being held captive in Navarro. He is the key character in the Deal with the deathclaw quest there. The player can either kill Xarn, free him, or recruit him as an ally in that quest.
In battle, Xarn is a powerful companion. He has 250 HP and can attack up to four times with his claws. However, he tends to rush into battle and cannot heal himself, though the player can heal him. He follows the player throughout Navarro until they leave the area, at which point he leaves. Given the high number of troops in Navarro, Xarn is a powerful ally to have around for the duration of the player's time there.