/prog/ - imageboards

Name: Anonymous 2008-10-06 16:35

Hey ~~^EXPERT_PROGRAMMERS~~
I want to make something that'll archive the imageboards I want. I know the text field is limited to 2000 characters, but what about the name/mail/subject ones? Oh, and how does it work with unicode characters?

Name: Anonymous 2008-10-06 17:06

>>1
Terrible!

Name: Anonymous 2008-10-06 17:15

>>2
This thread has ended peacefuller.

Name: Anonymous 2008-10-06 17:58

come on, don't be assholes!

Name: Anonymous 2008-10-06 19:24

>>4
You are out of your league here

Name: Anonymous 2008-10-06 20:11

4chanarchive uses 128 chars for name, 128 chars for email, and 64 chars for subject. Also, The Php handles unicode characters for us.

Name: Anonymous 2008-10-06 21:58

>>1
Here's the schema dump for 4scrape (yes, I realize I'm a faggot for using MySQL, but I prefer their FULLTEXT search over PostGRE's) --

mysql> DESC s_posts;

+--------------+------------------+------+-----+---------+----------------+

| Field        | Type             | Null | Key | Default | Extra          |

+--------------+------------------+------+-----+---------+----------------+

| post_id      | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 

| thread_id    | int(10) unsigned | YES  | MUL | NULL    |                | 

| board_id     | int(10) unsigned | YES  | MUL | NULL    |                | 

| img_id       | int(10) unsigned | YES  | MUL | NULL    |                | 

| post_no      | int(10) unsigned | YES  | MUL | NULL    |                | 

| post_name    | tinytext         | YES  | MUL | NULL    |                | 

| post_email   | varchar(128)     | YES  | MUL | NULL    |                | 

| post_trip    | varchar(64)      | YES  |     | NULL    |                | 

| post_subject | tinytext         | YES  | MUL | NULL    |                | 

| post_date    | datetime         | YES  | MUL | NULL    |                | 

| post_comment | text             | YES  | MUL | NULL    |                | 

| post_img     | varchar(128)     | YES  |     | NULL    |                | 

| post_origimg | varchar(128)     | YES  | MUL | NULL    |                | 

| post_legacy  | int(10) unsigned | YES  |     | NULL    |                | 

+--------------+------------------+------+-----+---------+----------------+

14 rows in set (0.02 sec)



mysql> DESC s_images;

+----------------+---------------------+------+-----+---------+----------------+

| Field          | Type                | Null | Key | Default | Extra          |

+----------------+---------------------+------+-----+---------+----------------+

| img_id         | int(10) unsigned    | NO   | PRI | NULL    | auto_increment | 

| img_hash       | varchar(48)         | YES  | UNI | NULL    |                | 

| img_path       | mediumtext          | YES  |     | NULL    |                | 

| img_thumb      | mediumtext          | YES  |     | NULL    |                | 

| img_legacy     | int(10) unsigned    | YES  |     | 0       |                | 

| img_w          | int(11)             | YES  | MUL | NULL    |                | 

| img_h          | int(11)             | YES  | MUL | NULL    |                | 

| img_aspect     | float               | YES  | MUL | NULL    |                | 

| img_nsfw       | tinyint(3) unsigned | YES  | MUL | 0       |                | 

| img_nsfw_by    | varchar(20)         | YES  |     | NULL    |                | 

| img_animated   | tinyint(1)          | YES  | MUL | 0       |                | 

| img_color      | int(10) unsigned    | YES  | MUL | NULL    |                | 

| img_hits       | int(11)             | YES  | MUL | 0       |                | 

| img_hits_total | int(11)             | YES  | MUL | 0       |                | 

+----------------+---------------------+------+-----+---------+----------------+

14 rows in set (0.00 sec)



mysql> DESC s_boards;

+--------------+------------------+------+-----+---------+----------------+

| Field        | Type             | Null | Key | Default | Extra          |

+--------------+------------------+------+-----+---------+----------------+

| board_id     | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 

| board_sname  | varchar(16)      | YES  |     | NULL    |                | 

| board_name   | varchar(128)     | YES  |     | NULL    |                | 

| board_host   | varchar(32)      | YES  |     | NULL    |                | 

| board_parser | varchar(128)     | YES  |     | NULL    |                | 

| board_sfw    | tinyint(1)       | YES  |     | NULL    |                | 

| board_scrape | tinyint(1)       | YES  |     | NULL    |                | 

| board_view   | tinyint(1)       | YES  |     | NULL    |                | 

+--------------+------------------+------+-----+---------+----------------+

8 rows in set (0.00 sec)

The original filename (post_origimg) could really stand to be a TINYTEXT (255 chars) -- there are filenames longer than 128 characters and I get warnings in my logs once a week or so about fields getting truncated.

The important thing about Unicode is that you tell the user's browser that you're using Unicode. You can throw UTF-8 data into a Latin-1 encoded database and not damage the data at all. What will happen is that it'll interpret the UTF-8 as Latin-1 encoded data, which will affect sorting order (but do you give a shit how Chinese characters are ordered? I sure don't). Oh, and multibyte characters will take up multiple bytes, so you may not be able to fit as many of them into the fields as you think. 4scrape uses a UTF-8 character set for the MySQL database because of the later reason.

Realistically, as long as you send the charset="UTF-8" option in the Content-Type header, it doesn't make a shit whether your language/database supports it or not.

Any other questions?

Name: Anonymous 2008-10-06 22:30

>>7
Taro?

Name: Anonymous 2008-10-06 22:56

>>7
What really makes you a faggot is that you bothered to write "desc" in all caps

Name: Anonymous 2008-10-06 23:00

>>9
You have to yell at mysql to make it work. Maybe alternate caps every once in a while to keep it on its toes.

Name: Anonymous 2008-10-06 23:38

>>10
No amount of yelling fixes the ^data_corruption, though.

Name: Anonymous 2008-10-07 10:43

>>7

Oh god, ^SOMEONEHELPED_ME! Thanks!

How do you access the database from python? I thought of using PostgreSQL and PyGreSQL.
How do you get the threads and update them? I wanted to download all 11 pages once in a while, update my threadlist, and then download all threads separately, but there is probably a much better way to do this.
Oh, and how do you get all posts from a thread? Just something like select * from s_posts where thread_id ==thread_number?

Name: Anonymous 2008-10-07 11:08

From AoRF's (slightly modified) script:

create table if not exists $self->{table} (

    num int unsigned primary key not null,

    parent int unsigned,

    timestamp int unsigned,

    preview text,

    preview_w smallint unsigned,

    preview_h smallint unsigned,

    media text,

    media_w smallint unsigned,

    media_h smallint unsigned,

    media_size INT unsigned,

    media_hash tinytext,



    spoiler bool,

    deleted bool,

    capcode ENUM('N', 'M', 'A') not null default 'N',



    email tinytext,

    name text,

    trip tinytext,

    title text,

    comment text,

    

    index parent_index(parent),

    index media_hash_index(media_hash(8)),

    index name_index(name(10)),

    index trip_index(trip(8)),



    fulltext index comment_index(comment)

) engine=myisam;

You'll need at least another field for the full size image (three if you want the dimensions). MyISAM because of the FULLTEXT search. Or you could just put everything except comment_index on an InnoDB table and num + comment_index on a MyISAM one.

Name: Anonymous 2008-10-07 12:02

>>12
How do you access the database from python?
I use MySQLdb¹. PyGreSQL works just as well; they all implement the same API²

Downloading shit
Download each thread index page (varies by board). From those, you can generate a list of (thread_id, most_recent_post_id) tuples.

Filter out duplicate threads and threads for which you already have all posts, then fetch the thread pages of each of the remaining threads. For each of the thread pages, use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts. Then just dump everything you need to the DB.

Oh, and how do you get all posts from a thread?
Depends on how you lay out the database. If you have a table for each board and the unique ID for each post is the post number, then it's as you described. If you throw all the posts from all the boards into a single table and the unique ID for each post is not the same as the post_no, the number that the post was on the imageboard, then it's a bit more complicated (because you first have to resolve the unique identifier from the thread number) --

SELECT * FROM posts WHERE post.thread_id = (SELECT post_id FROM posts WHERE post_no=$thread_number LIMIT 1);

Also, if you want to have search functionality that doesn't suck donkey dicks, read up on fulltext indexes for the database of your choice^3,4.

^{References:

[1] http://mysql-python.sourceforge.net/MySQLdb.html

[2] http://www.python.org/dev/peps/pep-0249/

[3] http://www.postgresql.org/docs/8.3/static/textsearch-intro.html

[4] http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html}

Name: Anonymous 2008-10-07 12:25

I looked at >>13 and got very happy because there was at least another person using lowercase characters in sql. Then I realized it was my code.

Name: Anonymous 2008-10-07 12:27

Anyway, OP, if you're really interested, you should look at some existing imageboard software (wakaba, or that python board noone cares about).

Name: Anonymous 2008-10-07 13:21

>>13
>>14
Thanks for help, guys!
Oh, and I'm doing this 'project' mainly to learn, since I haven't got much experience in doing stuff with python.

use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts
(I think) I know what a regex is, but how do you use it to 'grab' things?

I was concerned by the efficiency of that select statement. If you were to save every post, the db would get very big really quickly. Isn't a select statement iterating through the entire table and just checking whether the data matches the conditions?

Name: Anonymous 2008-10-07 13:34

>>17
This operation is O(1) for databases. If you're going to use table from >>13, here is request to get thread that is confirmed to work fast:
select * from $self->{table} where num=? or parent=?
and here's page:
select * from (select num from $self->{table} where parent=0 order by num desc limit ? offset ?) as threads join $self->{table} on threads.num=$self->{table}.num or threads.num=$self->{table}.parent order by case parent when 0 then $self->{table}.num else parent end asc,timestamp asc

This is mostly taken from wakaba

Name: Anonymous 2008-10-07 13:56

If I were to select * from posts where id>1000, would that just skip over posts 1..999? Or would it check every post from 1..999 to see if the id is over 1000?

Name: Anonymous 2008-10-07 14:20

>>19
Depends on how you lay out your indexes. Read the fucking manual for your RDBMS.

Name: Anonymous 2008-10-07 14:41

>>20
RMS DB?

Name: Anonymous 2008-10-07 15:20

>>15
I think the only thing I added in there was the capcode enum. And I did capitalize ENUM.

>>14
Hey, Taro, do you want to host the /a/rchive and /jp/archive? Compared to 4scrape, the space it takes is negligible. The guy I was freeloading off pulled the plug on the server.

Name: Anonymous 2008-10-07 15:21

>>18
O(1)
That's great then.

Can somebody explain now how can I use regexes? I know I can use them to find all substrings where the RE matches, but is that all?

Name: Anonymous 2008-10-07 15:31

>>23
Can somebody explain now how can I use regexes?

Sure.

    $text=~m!   <td \s id="(\d+)"[^>]*> \s*

                <input[^>]*><span \s class="replytitle">(?>(.*?)</span>) \s*

                <span \s class="commentpostername">(?:<span [^>]*>)?(?:<a \s href="mailto:([^"]*)"[^>]*>)?([^<]*?)(?:</a>)?(?:</span>)?</span>

                (?: \s* <span \s class="postertrip">(?:<span [^>]*>)?([a-zA-Z0-9\.\+/\!]+)(?:</a>)?(?:</span>)?</span>)?

                (?: \s* <span \s class="commentpostername"><span [^>]*>\#\# \s (.?)[^<]*</span></span>)?

                \s ([^>]*) \s \s* <span[^>]*> \s* 

                (?>.*?</span>) \s*

                (?:

                    <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \s*

                    <span \s class="filesize">File \s :

                    <a \s href="([^"]*/src/\d+\.\w+)"[^>]*>[^<]*</a> \s*

                    \- \s* \((Spoiler \s Image,)?([\d\sGMKB\.]+)\, \s (\d+)x(\d+)(?:, \s* <span \s title="([^"]*)">[^<]*</span>)?\)

                    </span> \s*

                    (?:

                        <br>\s*<a[^>]*><img \s+ src=\S* \s+ border=\S* \s+ align=\S* \s+ (?:width=(\d+) \s height=(\d+))? [^>]*? md5="?([\w\d\=\+\/]+)"? [^>]*? ></a> \s*

                        |

                        <a[^>]*><span \s class="tn_reply"[^>]*>Thumbnail \s unavailable</span></a>

                    )

                    |

                    <br> \s*

                    <img [^>]* alt="File \s deleted\." [^>]* > \s*

                )?

                <blockquote>(?>(.*?)</blockquote>)</td></tr></table>

    !xs or $self->troubles("error parsing post\n------\n$text\n------\n") and return;

And that's how you parse a post using the power of [o][u][b]regexps[b][u][o].

Name: Anonymous 2008-10-07 15:32

>>24
Also the power of BBCode failures.

Name: Anonymous 2008-10-07 15:39

>>22
Yeah, sure. Gimmeh some contact information (or a link to the database dump and script) and I'll throw it onto Barasuishou. If you've got a domain for it, I can set that up too (you don't need to transfer ownership or anything, I'll just hax it with a CNAME DNS entry).

Sage for off topic.

Name: Anonymous 2008-10-07 16:20

>>26
I'm Eksopl, you should have my email on your blog.

I need to split the tar with the /a/ thumbnails, because Dreamhost's Apache doesn't have the support for big files on.

Name: Anonymous 2008-10-07 17:33

>>24
I don't even hope to understand that.
Fixed my question: how to use regexes in python?

Name: Anonymous 2008-10-07 17:37

>>28
We're not your personal college. If you're unwilling to learn on your own (at least, to the point where discussion of the topic is interesting for the rest of us) then we're going to be unwilling to teach you.

http://www.google.com/search?q=python+regex

Name: Anonymous 2008-10-07 18:17

>>26
Oh no, it's not off-topic at all. I started this thread because of the /a/rchive, so if it'll be online again, I won't have to make something by myself.

>>29
I did that. I guess I'll have to fix my question once again.

>>14 said:
use a regex to grab the contents and attributes of the OP, and a different regex to grab the rest of the posts
I said:
I know I can use them to find all substrings where the RE matches, but is that all?

I wanted to know if there's some fancy method in which I give the regex and the string as arguments and it doesn't return me a list of substrings that match the RE, but rather a list of tuples with values grabbed from the substrings.
I mean, if I got a regex "ab.c" and gave it a string "jf3ab9casabrtfdabecdf" it would return [ 9, 'e' ].

Name: Anonymous 2008-10-07 18:55

>>30
You're using the wrong function to find matches, I suspect.

>>> import re

>>> re.compile( "ab.c" ).findall( "jkf3ab9casabrtfdabecdf" )

['ab9c', 'abec']

Name: >>31 2008-10-07 18:58

>>30
I guess I'm still not understanding what you want. What are you getting the [9, 'e'] from? What exactly do you want it to return? It sounds like what you want is grouping, but I'm not sure.

>>> re.compile( 'a(.*?)b(.*?)c' ).findall( 'lakjlbjgcklfjaljgbkjgc' )

[('kjl', 'jg'), ('ljg', 'kjg')]

Name: >>31-32 2008-10-07 19:00

Oh, I see what you're getting at. You need to use groups in your regex. Here's a more succinct example with your strings --

>>> re.compile( "ab(.)c" ).findall( "jkf3ab9casabrtfdabecdf" )

['9', 'e']

Name: Anonymous 2008-10-07 19:04

Does anyone have the /a/rchive script? I'd like to see it, or any other archive script for that matter

>>32
I'm guessing he wanted this:

>>> import re

>>> re.compile("ab(.)c").findall("jf3ab9casabrtfdabecdf")

['9', 'e']

Name: Anonymous 2008-10-07 19:06

>>1-34
back to /tutorialsontrivialshit/, please.

Name: Anonymous 2008-10-07 23:53

Been posted on /jp/ yesterday, since /jp/ is the most appropriate place. Good luck trying to understand that, python user.

http://www.mediafire.com/?jm8jx0jjojy
http://archive.doesntexist.org/archive.7z

Name: Anonymous 2008-10-08 12:08

>>32-33
Yes, that's exactly what I wanted. Sorry for not being able to explain it better.
Thank you!

Name: Anonymous 2008-10-08 14:42

>>36
Thanks, /prog/user

Name: Anonymous 2008-10-08 16:14

>>38
The /a/rchive and /jp/-archive are both up again, anyway.

Name: Anonymous 2009-03-06 12:36

Select statement to get the job done sufficiently well in C anyway allowing for duplicate keys?

Name: Anonymous 2010-12-08 21:18

Don't change these.

Name:		Email:

Entire Thread Thread List

imageboards

1 Name: Anonymous 2008-10-06 16:35

2 Name: Anonymous 2008-10-06 17:06

3 Name: Anonymous 2008-10-06 17:15

4 Name: Anonymous 2008-10-06 17:58

5 Name: Anonymous 2008-10-06 19:24

6 Name: Anonymous 2008-10-06 20:11

7 Name: Anonymous 2008-10-06 21:58

8 Name: Anonymous 2008-10-06 22:30

9 Name: Anonymous 2008-10-06 22:56

10 Name: Anonymous 2008-10-06 23:00

11 Name: Anonymous 2008-10-06 23:38

12 Name: Anonymous 2008-10-07 10:43

13 Name: Anonymous 2008-10-07 11:08

14 Name: Anonymous 2008-10-07 12:02

15 Name: Anonymous 2008-10-07 12:25

16 Name: Anonymous 2008-10-07 12:27

17 Name: Anonymous 2008-10-07 13:21

18 Name: Anonymous 2008-10-07 13:34

19 Name: Anonymous 2008-10-07 13:56

20 Name: Anonymous 2008-10-07 14:20

21 Name: Anonymous 2008-10-07 14:41

22 Name: Anonymous 2008-10-07 15:20

23 Name: Anonymous 2008-10-07 15:21

24 Name: Anonymous 2008-10-07 15:31

25 Name: Anonymous 2008-10-07 15:32

26 Name: Anonymous 2008-10-07 15:39

27 Name: Anonymous 2008-10-07 16:20

28 Name: Anonymous 2008-10-07 17:33

29 Name: Anonymous 2008-10-07 17:37

30 Name: Anonymous 2008-10-07 18:17

31 Name: Anonymous 2008-10-07 18:55

32 Name: >>31 2008-10-07 18:58

33 Name: >>31-32 2008-10-07 19:00

34 Name: Anonymous 2008-10-07 19:04

35 Name: Anonymous 2008-10-07 19:06

36 Name: Anonymous 2008-10-07 23:53

37 Name: Anonymous 2008-10-08 12:08

38 Name: Anonymous 2008-10-08 14:42

39 Name: Anonymous 2008-10-08 16:14

40 Name: Anonymous 2009-03-06 12:36

42 Name: Anonymous 2010-12-08 21:18