>>12
How do you access the database from python?
I use MySQLdb [1]. PyGreSQL works just as well; they all implement the same API [2].
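Since the point is that every driver exposes the same connect/cursor/execute/fetch calls, here is a rough sketch of what that looks like. Assumptions: sqlite3 from the standard library is used as a stand-in so the snippet runs without a server, and the posts table and the fetch_all helper are made up for illustration. With MySQLdb or PyGreSQL's pgdb module only the connect() call and the placeholder style should need to change.

```python
import sqlite3  # stand-in driver; MySQLdb and pgdb follow the same PEP 249 interface


def fetch_all(conn, sql, params=()):
    """Run a query through the generic DB-API cursor interface and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        cur.close()


# With MySQLdb this line would be a MySQLdb.connect(host=, user=, passwd=, db=) call instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE posts (post_no INTEGER, body TEXT)")  # hypothetical table
# Placeholder style differs per driver: sqlite3 uses "?", MySQLdb uses "%s".
cur.executemany("INSERT INTO posts VALUES (?, ?)", [(1, "first"), (2, "second")])
conn.commit()

rows = fetch_all(
    conn,
    "SELECT post_no, body FROM posts WHERE post_no >= ? ORDER BY post_no",
    (1,),
)
```

Always pass values through the params argument rather than formatting them into the SQL string; post bodies scraped from a board are exactly the kind of input that breaks hand-built queries.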
Downloading shit
Download each thread index page (varies by board). From those, you can generate a list of (thread_id, most_recent_post_id) tuples.
Filter out duplicate threads and threads for which you already have all posts, then fetch the thread pages of each of the remaining threads. For each of the thread pages, use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts. Then just dump everything you need to the DB.
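The filtering and parsing steps could look roughly like this. Everything named here is an assumption for illustration: threads_to_fetch, parse_posts, the known_latest dict (thread_id -> newest post ID already stored), and especially the HTML pattern, since every board's markup is different. For brevity it uses one pattern for all posts; a real scraper would likely need a separate one for the OP as described above.

```python
import re

# Hypothetical markup; the real pattern has to be written against the board's actual HTML.
POST_RE = re.compile(r'<div class="post" id="p(\d+)">(.*?)</div>', re.DOTALL)


def threads_to_fetch(index_entries, known_latest):
    """Return the IDs of threads that have posts not stored yet.

    index_entries: (thread_id, most_recent_post_id) tuples scraped from index pages.
    known_latest:  dict mapping thread_id -> newest post ID already in the DB.
    """
    newest = {}
    for thread_id, last_post in index_entries:
        # A thread can appear on several index pages; keep its highest post ID.
        newest[thread_id] = max(last_post, newest.get(thread_id, last_post))
    # Threads missing from known_latest are new and always get fetched.
    return sorted(t for t, last in newest.items() if known_latest.get(t, -1) < last)


def parse_posts(html):
    """Extract (post_no, body) pairs from one thread page using POST_RE."""
    return [(int(no), body.strip()) for no, body in POST_RE.findall(html)]
```

For example, threads_to_fetch([(100, 105), (100, 107), (200, 200)], {100: 107}) skips thread 100 because post 107 is already stored and returns only thread 200.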
Oh, and how do you get all posts from a thread?
Depends on how you lay out the database. If you have a table for each board and the unique ID for each post is the post number, then it's as you described. If you throw all the posts from all the boards into a single table and the unique ID for each post is not the same as post_no (the number the post had on the imageboard), then it's a bit more complicated, because you first have to resolve the unique identifier from the thread number:
SELECT * FROM posts WHERE thread_id = (SELECT post_id FROM posts WHERE post_no = $thread_number LIMIT 1);
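To try the single-table lookup without setting up a server, here is a small sketch using an in-memory sqlite3 database. The schema (post_id, board, post_no, thread_id, body), the sample rows, and the posts_in_thread helper are assumptions made up for the example; thread_id is taken to hold the post_id of the thread's OP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical layout: post_id is the unique key, post_no the number on the board,
# thread_id the post_id of the OP of the thread the post belongs to.
cur.execute(
    "CREATE TABLE posts ("
    "post_id INTEGER PRIMARY KEY, board TEXT, post_no INTEGER, "
    "thread_id INTEGER, body TEXT)"
)
cur.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", 500, 1, "OP of thread 500"),
        (2, "a", 501, 1, "reply"),
        (3, "a", 600, 3, "OP of thread 600"),
        (4, "a", 502, 1, "another reply"),
    ],
)
conn.commit()


def posts_in_thread(cur, thread_number):
    """Resolve the OP's post_id from its on-board number, then fetch the whole thread."""
    cur.execute(
        "SELECT post_no, body FROM posts "
        "WHERE thread_id = (SELECT post_id FROM posts WHERE post_no = ? LIMIT 1) "
        "ORDER BY post_no",
        (thread_number,),
    )
    return cur.fetchall()


thread_500 = posts_in_thread(cur, 500)
```

With several boards in one table the same post_no can exist on more than one board, so in practice the subquery probably also wants a board condition instead of relying on LIMIT 1 to pick the right row.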
Also, if you want to have search functionality that doesn't suck donkey dicks, read up on fulltext indexes for the database of your choice [3,4].
References:
[1] http://mysql-python.sourceforge.net/MySQLdb.html
[2] http://www.python.org/dev/peps/pep-0249/
[3] http://www.postgresql.org/docs/8.3/static/textsearch-intro.html
[4] http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html