Hey EXPERT PROGRAMMERS
I want to make something that'll archive the imageboards I want. I know the text field is limited to 2000 characters, but what about the name/mail/subject ones? Oh, and how does it work with unicode characters?
The original filename (post_origimg) could really stand to be a TINYTEXT (255 chars) -- there are filenames longer than 128 characters and I get warnings in my logs once a week or so about fields getting truncated.
The important thing about Unicode is that you tell the user's browser that you're using Unicode. You can throw UTF-8 data into a Latin-1 encoded database and not damage the data at all. What will happen is that it'll interpret the UTF-8 as Latin-1 encoded data, which will affect sorting order (but do you give a shit how Chinese characters are ordered? I sure don't). Oh, and multibyte characters will take up multiple bytes, so you may not be able to fit as many of them into the fields as you think. 4scrape uses a UTF-8 character set for the MySQL database because of the later reason.
Realistically, as long as you send the charset="UTF-8" option in the Content-Type header, it doesn't make a shit whether your language/database supports it or not.
How do you access the database from python? I thought of using PostgreSQL and PyGreSQL.
How do you get the threads and update them? I wanted to download all 11 pages once in a while, update my threadlist, and then download all threads separately, but there is probably a much better way to do this.
Oh, and how do you get all posts from a thread? Just something like select * from s_posts where thread_id ==thread_number?
Name:
Anonymous2008-10-07 11:08
From AoRF's (slightly modified) script:
create table if not exists $self->{table} (
num int unsigned primary key not null,
parent int unsigned,
timestamp int unsigned,
preview text,
preview_w smallint unsigned,
preview_h smallint unsigned,
media text,
media_w smallint unsigned,
media_h smallint unsigned,
media_size INT unsigned,
media_hash tinytext,
email tinytext,
name text,
trip tinytext,
title text,
comment text,
index parent_index(parent),
index media_hash_index(media_hash(8)),
index name_index(name(10)),
index trip_index(trip(8)),
fulltext index comment_index(comment)
) engine=myisam;
You'll need at least another field for the full size image (three if you want the dimensions). MyISAM because of the FULLTEXT search. Or you could just put everything except comment_index on an InnoDB table and num + comment_index on a MyISAM one.
Name:
Anonymous2008-10-07 12:02
>>12 How do you access the database from python?
I use MySQLdb1. PyGreSQL works just as well; they all implement the same API2
Downloading shit
Download each thread index page (varies by board). From those, you can generate a list of (thread_id, most_recent_post_id) tuples.
Filter out duplicate threads and threads for which you already have all posts, then fetch the thread pages of each of the remaining threads. For each of the thread pages, use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts. Then just dump everything you need to the DB.
Oh, and how do you get all posts from a thread?
Depends on how you lay out the database. If you have a table for each board and the unique ID for each post is the post number, then it's as you described. If you throw all the posts from all the boards into a single table and the unique ID for each post is not the same as the post_no, the number that the post was on the imageboard, then it's a bit more complicated (because you first have to resolve the unique identifier from the thread number) --
SELECT * FROM posts WHERE post.thread_id = (SELECT post_id FROM posts WHERE post_no=$thread_number LIMIT 1);
Also, if you want to have search functionality that doesn't suck donkey dicks, read up on fulltext indexes for the database of your choice3,4.
I looked at >>13 and got very happy because there was at least another person using lowercase characters in sql. Then I realized it was my code.
Name:
Anonymous2008-10-07 12:27
Anyway, OP, if you're really interested, you should look at some existing imageboard software (wakaba, or that python board noone cares about).
Name:
Anonymous2008-10-07 13:21
>>13 >>14
Thanks for help, guys!
Oh, and I'm doing this 'project' mainly to learn, since I haven't got much experience in doing stuff with python.
use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts
(I think) I know what a regex is, but how do you use it to 'grab' things?
I was concerned by the efficiency of that select statement. If you were to save every post, the db would get very big really quickly. Isn't a select statement iterating through the entire table and just checking whether the data matches the conditions?
Name:
Anonymous2008-10-07 13:34
>>17
This operation is O(1) for databases. If you're going to use table from >>13, here is request to get thread that is confirmed to work fast: select * from $self->{table} where num=? or parent=?
and here's page: select * from
(select num from $self->{table} where parent=0 order by num desc limit ? offset ?) as threads join $self->{table}
on threads.num=$self->{table}.num or threads.num=$self->{table}.parent
order by case parent when 0 then $self->{table}.num else parent end asc,timestamp asc
This is mostly taken from wakaba
Name:
Anonymous2008-10-07 13:56
If I were to select * from posts where id>1000, would that just skip over posts 1..999? Or would it check every post from 1..999 to see if the id is over 1000?
>>15
I think the only thing I added in there was the capcode enum. And I did capitalize ENUM.
>>14
Hey, Taro, do you want to host the /a/rchive and /jp/archive? Compared to 4scrape, the space it takes is negligible. The guy I was freeloading off pulled the plug on the server.
>>22
Yeah, sure. Gimmeh some contact information (or a link to the database dump and script) and I'll throw it onto Barasuishou. If you've got a domain for it, I can set that up too (you don't need to transfer ownership or anything, I'll just hax it with a CNAME DNS entry).
>>28
We're not your personal college. If you're unwilling to learn on your own (at least, to the point where discussion of the topic is interesting for the rest of us) then we're going to be unwilling to teach you.
>>26
Oh no, it's not off-topic at all. I started this thread because of the /a/rchive, so if it'll be online again, I won't have to make something by myself.
>>29
I did that. I guess I'll have to fix my question once again.
>>14 said: use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts
I said: I know I can use them to find all substrings where the RE matches, but is that all?
I wanted to know if there's some fancy method in which I give the regex and the string as arguments and it doesn't return me a list of substrings that match the RE, but rather a list of tuples with values grabbed from the substrings.
I mean, if I got a regex "ab.c" and gave it a string "jf3ab9casabrtfdabecdf" it would return [ 9, 'e' ].
Name:
Anonymous2008-10-07 18:55
>>30
You're using the wrong function to find matches, I suspect.
>>30
I guess I'm still not understanding what you want. What are you getting the [9, 'e'] from? What exactly do you want it to return? It sounds like what you want is grouping, but I'm not sure.