
imageboards

Name: Anonymous 2008-10-06 16:35

Hey EXPERT PROGRAMMERS
I want to make something that'll archive the imageboards I want. I know the text field is limited to 2000 characters, but what about the name/mail/subject ones? Oh, and how does it work with unicode characters?

Name: Anonymous 2008-10-07 12:02

>>12
> How do you access the database from python?
I use MySQLdb [1]. PyGreSQL works just as well; they all implement the same API [2].
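
A rough sketch of what "the same API" looks like in practice. This uses the stdlib sqlite3 module as a stand-in (my assumption, so it runs without a database server); MySQLdb follows the same connect/cursor/execute/fetch pattern from PEP 249, but its connect() arguments and placeholder style differ. The table and column names are made up for illustration.

```python
import sqlite3

# sqlite3 stands in for MySQLdb here; both follow PEP 249 (DB-API 2.0).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema, for illustration only.
cur.execute("CREATE TABLE posts (post_no INTEGER PRIMARY KEY, name TEXT, body TEXT)")

# Pass values as parameters instead of formatting them into the SQL string.
# sqlite3 uses '?' placeholders; MySQLdb uses '%s'.
cur.execute(
    "INSERT INTO posts (post_no, name, body) VALUES (?, ?, ?)",
    (1, "Anonymous", "Hey EXPERT PROGRAMMERS \u2603"),
)
conn.commit()

cur.execute("SELECT post_no, name, body FROM posts WHERE post_no = ?", (1,))
rows = cur.fetchall()
conn.close()
```

Python strings are Unicode, so non-ASCII text goes in and comes back unchanged as long as the connection and the database use a Unicode encoding such as UTF-8.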

Downloading shit
Download each thread index page (varies by board). From those, you can generate a list of (thread_id, most_recent_post_id) tuples.

Filter out duplicate threads and threads for which you already have all posts, then fetch the thread pages of each of the remaining threads. For each of the thread pages, use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts. Then just dump everything you need to the DB.
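
The filtering and scraping steps above could be sketched roughly like this. The HTML markup, the regex, and the function names are all hypothetical; a real board's markup will differ, and this uses one pattern for every post where you would likely want a separate one for the OP.

```python
import re

# Hypothetical markup; adjust the pattern to the board you are archiving.
POST_RE = re.compile(
    r'<div class="post" id="p(?P<post_no>\d+)">\s*'
    r'<span class="name">(?P<name>.*?)</span>\s*'
    r'<blockquote>(?P<body>.*?)</blockquote>',
    re.DOTALL,
)

def threads_to_fetch(index_entries, last_seen):
    """Drop duplicate threads and threads with no new posts.

    index_entries: iterable of (thread_id, most_recent_post_id) tuples
    last_seen: dict mapping thread_id -> newest post id already archived
    """
    newest = {}
    for thread_id, post_id in index_entries:
        # The same thread can show up on several index pages; keep the highest id.
        newest[thread_id] = max(post_id, newest.get(thread_id, 0))
    return sorted(t for t, p in newest.items() if p > last_seen.get(t, 0))

def parse_posts(html):
    """Return a list of dicts, one per post found in a thread page."""
    return [m.groupdict() for m in POST_RE.finditer(html)]

index_entries = [(100, 105), (200, 230), (100, 105), (300, 301)]
last_seen = {100: 105, 200: 210}
todo = threads_to_fetch(index_entries, last_seen)  # thread 100 is already complete

sample = (
    '<div class="post" id="p230"><span class="name">Anonymous</span>'
    '<blockquote>sage</blockquote></div>'
)
posts = parse_posts(sample)
```

Here todo ends up holding only threads 200 and 300, and posts holds one dict with the post_no, name, and body of the sample post.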

> Oh, and how do you get all posts from a thread?
Depends on how you lay out the database. If you have a table for each board and the unique ID for each post is the post number, then it's as you described. If you throw all the posts from all the boards into a single table, the unique ID for each post is no longer the same as post_no (the number the post had on the imageboard), so it's a bit more complicated: you first have to resolve the unique ID of the thread's first post from the thread number.

SELECT * FROM posts WHERE posts.thread_id = (SELECT post_id FROM posts WHERE post_no=$thread_number LIMIT 1);
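
A small runnable sketch of that lookup, again with sqlite3 as a stand-in. The schema is an assumption: post_id is the archive's own key, post_no is the number from the board, and thread_id holds the post_id of the thread's OP. The thread number is bound as a parameter rather than pasted into the query string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical single-table layout shared by all boards.
cur.execute(
    "CREATE TABLE posts ("
    " post_id INTEGER PRIMARY KEY,"
    " post_no INTEGER,"
    " thread_id INTEGER,"
    " body TEXT)"
)
cur.executemany(
    "INSERT INTO posts (post_id, post_no, thread_id, body) VALUES (?, ?, ?, ?)",
    [
        (1, 5000, 1, "OP of thread 5000"),
        (2, 5003, 1, "reply"),
        (3, 7000, 3, "OP of another thread"),
    ],
)

thread_number = 5000
# Resolve the OP's post_id from its post_no, then fetch every post in that thread.
cur.execute(
    "SELECT post_no, body FROM posts"
    " WHERE thread_id = (SELECT post_id FROM posts WHERE post_no = ? LIMIT 1)"
    " ORDER BY post_no",
    (thread_number,),
)
thread_posts = cur.fetchall()
conn.close()
```

If several boards share the table, post_no alone is not unique, so the subquery would also need a board column in its WHERE clause (not shown here).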

Also, if you want to have search functionality that doesn't suck donkey dicks, read up on fulltext indexes for the database of your choice [3][4].

                              
References:
[1] http://mysql-python.sourceforge.net/MySQLdb.html
[2] http://www.python.org/dev/peps/pep-0249/
[3] http://www.postgresql.org/docs/8.3/static/textsearch-intro.html
[4] http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
