
/prog/ stats

Name: Anonymous 2008-05-01 11:58

As of some time two days ago:

5242 threads.
100M data.
8.1M compressed (tar and bz2).

Don't ask me how I collected this data or moot will be pissed off and hax my anus.

Name: Anonymous 2008-05-01 12:14

YUO FUQIN HAX0R!!!!!11111111

Also, sage.

Name: Anonymous 2008-05-01 12:35

/r/ the /prog/ backup

Name: Anonymous 2008-05-01 12:50

Name: Anonymous 2008-05-01 13:02

I got the size down to 7.1M by removing the common headers and footers.  I'll share it after doing some more processing.

Name: Anonymous 2008-05-01 13:03

Oh, and I'll use LZMA compression as well.
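A rough sketch of that repack step — strip the common page chrome, then LZMA the result. The marker strings here are illustrative guesses; the real world4ch templates differ:

```python
import lzma

# Hypothetical page markers; adjust to whatever the actual template uses.
HEADER_END = '<div class="thread">'
FOOTER_START = '<div class="bottomnav">'

def strip_chrome(page):
    """Drop everything before the thread body and after the bottom nav."""
    start = page.find(HEADER_END)
    end = page.find(FOOTER_START)
    if start == -1 or end == -1:
        return page  # unknown layout: keep the page as-is
    return page[start:end]

def pack(pages):
    """Concatenate stripped pages and LZMA-compress the blob."""
    blob = '\x00'.join(strip_chrome(p) for p in pages)
    return lzma.compress(blob.encode('utf-8'))
```

Since the header/footer is identical on every page, stripping it before compressing mostly just saves the compressor the trouble; the win is bigger for small archives.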

Name: Anonymous 2008-05-01 13:11

>>5
You mean that you've just curl'd everything? How smart

Name: Anonymous 2008-05-01 13:14

>>7
You mean wget'd

Name: Anonymous 2008-05-01 13:20

>>8
Still retarded

Name: Anonymous 2008-05-01 13:21

import os
import sqlite3
import urllib2
import datetime
import re
import time
import sys

# TODO: better error handling
#       parse html to determine both plain text and original bbcode

print '*** world4ch archiver ***'
print ''

def geturl(url):
    req = urllib2.Request(url)
    req.add_header("User-Agent", "Mozilla/4.0 (compatible; world4ch archiving robot; anonymous)")
    res = urllib2.urlopen(req)
    return res.read()

def totimestamp(dt):
    return time.mktime(dt.timetuple()) + dt.microsecond/1e6

if not os.path.exists("world4ch_archive.db"):
    print 'creating archive database'
    db = sqlite3.connect("world4ch_archive.db")
    db.execute("""
        create table boards (
            board_name text not null primary key
        )
    """)
    db.execute("""
        create table threads (
            board_name text not null,
            thread_no integer not null,
            subject text not null,
            post_count integer not null,
            highest_post integer not null,
            time_of_last_post integer not null,
            primary key ( board_name, thread_no )
        )
    """)
    db.execute("""
        create table posts (
            board_name text not null,
            thread_no integer not null,
            post_no integer not null,
            date_time text not null,
            name text not null,
            trip text,
            email text,
            id text,
            html text not null,
            bbcode text not null,
            textonly text not null,
            primary key ( board_name, thread_no, post_no )
        )
    """)
    boards = "anime,book,carcom,comp,food,games,img,lang,lounge,music,newnew,newpol,prog,sci,sjis,sports,tech,tele,vip"
    boards = [(board_name,) for board_name in boards.split(',')]

    db.executemany("insert into boards values (?)", boards)
    db.commit()

print 'creating in-memory database'
db = sqlite3.connect(":memory:")
db.text_factory = str
db.execute("""
  create table subject_txt (
    board_name text,
    subject text,
    name text,
    icon text,
    thread_no integer,
    highest_post integer,
    nothing text,
    time_of_last_post integer,
    primary key (board_name, thread_no)
  )
""")

db.execute("""
  create unique index pk_subject_txt on subject_txt ( board_name, thread_no )
""")

print 'attaching archive database'
db.execute(r"attach database 'world4ch_archive.db' as archive")

re_thread = re.compile(r'<h2>(.*?)</h2>.*?<div class="thread">(.*?)<div class="bottomnav">.*?<td class="postfieldleft"><span class="postnum">(.*?)</span></td>',re.DOTALL)
re_post = re.compile(r'<h3><span class="postnum"><a .*?>(.*?)</a>.*?<span class="postername">(.*?)</span>.*?<span class="postertrip">(.*?)</span>.*?<span class="posterdate">(.*?)</span>.*?<span class="id">(.*?)</span>.*?</h3>.*?<blockquote>(.*?)</blockquote>',re.DOTALL)
re_email = re.compile(r'<a href="mailto:(.*?)">(.*?)</a>')

def get_new_posts():

    threads = db.execute("""
      select A.board_name, A.thread_no, B.highest_post+1, B.post_count
      from subject_txt A
      left join archive.threads B on
        A.thread_no = B.thread_no and
        A.board_name = B.board_name
      where
       (B.thread_no is null or
        A.highest_post > B.highest_post) and
            A.highest_post > 0
    """).fetchall()

    for i in xrange(len(threads)):
        board_thread = threads[i]
        board_name = board_thread[0]
        thread_no = board_thread[1]
        url = "http://dis.4chan.org/read/"+board_name+"/"+str(thread_no)
        if board_thread[2] is None:
            from_post = 1
            current_post_count = 0
        else:
            from_post = board_thread[2]
            url += "/"+str(from_post)+"-"
            current_post_count = board_thread[3]
        print board_name, "("+str(i+1)+"/"+str(len(threads))+")", thread_no, "from post", from_post,
        html_page = geturl(url)
        post_search = re_thread.findall(html_page)
        subject = post_search[0][0].strip().replace("&gt;",">").replace("&lt;","<").replace("&quot;",'"').replace("&#39;","'").replace("&amp;","&")
        posts = post_search[0][1].strip()
        highest_post = int(post_search[0][2])-1
        post_count = current_post_count
        post_no = 0
        time_of_last_post = 0
        for post in re_post.findall(posts):
            post_count += 1
            if post_no > 0:
                print '\b'*(len(str(post_no))+2),
            post_no = int(post[0])
            print post_no,
            trip = post[2]
            if trip == '':
                trip = None
            name_email = re_email.match(post[1])
            if name_email is None:
                name = post[1]
                email = None
            else:
                name = name_email.groups()[1]
                email = name_email.groups()[0]
            # NO! this breaks on img/1104652020/21! store as string instead.
            # date_time = int(totimestamp(datetime.datetime.strptime(post[3], "%Y-%m-%d %H:%M")))
            date_time = post[3].strip()
            time_of_last_post = date_time
            id = post[4].strip()
            if id == '':
                id = None
            html = post[5]
            ### HTML PARSING WILL GO HERE ###
            row = (board_name, thread_no, post_no, date_time, name, trip, email, id, html, '', '')
            cc = db.execute('replace into archive.posts values (?,?,?,?,?,?,?,?,?,?,?)', row)
        row = (board_name, thread_no, subject, post_count, highest_post, time_of_last_post)
        cc = db.execute('replace into archive.threads values (?,?,?,?,?,?)', row)
        cc = db.commit()
        print 'highest post is now',highest_post,'~'
        time.sleep(2)


# main board loop
if len(sys.argv) >= 2:
    comparator = "="
    if len(sys.argv) == 3:
        if sys.argv[2]=="-":
            comparator = ">="
    boards = db.execute("select board_name from archive.boards where board_name "+comparator+" ?", (sys.argv[1],)).fetchall()
else:
    boards = db.execute("select board_name from archive.boards").fetchall()

for board in boards:
    board_name = board[0]
    print 'getting subject list for /'+board_name+'/'
    subject_txt = geturl("http://dis.4chan.org/"+board_name+"/subject.txt")
    subject_txt = [tuple(line.rsplit("<>",6)) for line in subject_txt.split("\n") if line != ""]
    db.execute("delete from subject_txt")
    db.executemany("insert into subject_txt values ('"+board_name+"',?,?,?,?,?,?,?)", subject_txt)
    print 'retrieving new posts'
    get_new_posts()
    time.sleep(5)

Name: Anonymous 2008-05-01 13:22

>>8
Yeah, it took at least four hours!

Name: Anonymous 2008-05-01 13:24

>>10
Damned word wrap :-/

Name: Anonymous 2008-05-01 13:26

>>12
It's a bad idea to let random faggots without any knowledge use these tools. Right, they must download and install FIOC, but it's still easier. So, sage

Name: Anonymous 2008-05-01 13:27

>>10
"""nyoro~n"""

Name: Anonymous 2008-05-01 13:29

OK, the archive is done... now what file sharing service should I use?  I feel like using RapidShit to troll you all, but it's probably pretty painful for an uploader as well.

Name: Anonymous 2008-05-01 13:31

>>15
Mediafire.

Name: Anonymous 2008-05-01 13:32

>>16
Estimated Time Remaining: 78 Days 19 Hours 55 Minutes
File Name: prog-20080429.tbz2 (596.33 KB / 7.08 MB)

Name: Anonymous 2008-05-01 13:35

Name: Anonymous 2008-05-01 13:35

Name: Anonymous 2008-05-01 14:13

Now
-excerpt the 3% that's useful
-set up a better forum with ranking, searching, category organization, and filtering out idiot users as we find them
-load 3% onto better forum
-Call Adult Friend Finder
-Post link to new forum

Name: Anonymous 2008-05-01 14:30

Somebody fork /prog/!

Name: Anonymous 2008-05-01 14:45

>>21
Forget it, it's proprietarily licensed.

Name: Anonymous 2008-05-01 14:46

>>19
proggold sussyfan htpc

Name: Anonymous 2008-05-01 15:04

>>23
*clicks*

Name: Anonymous 2008-05-01 15:26

>>18
WTF, how do I use that shit?

Name: Anonymous 2008-05-01 15:39

>>25
The world4ch_archiver.py script is to be run in Python, and the world4ch_archive.db.gz file should be gunzipped and then can be opened with SQLite (http://www.sqlite.org/download.html)

e.g.
$ sqlite3 world4ch_archive.db
SQLite version 3.5.8
Enter ".help" for instructions
sqlite> .tables
boards   posts    threads
sqlite> select max(date_time) from posts where board_name = 'prog';
2008-05-01 15:08
sqlite> select thread_no, post_no from posts where board_name = 'prog' and html like '%forced indentation%' order by date_time limit 1;
1138460471|12
sqlite> select board_name, count(*) from threads group by board_name order by count(*) desc;
lounge|23382
comp|7879
vip|7762
prog|5266
newpol|3662
games|3212
lang|2271
sci|2234
music|1995
anime|1977
tech|1835
newnew|1257
book|1179
img|1155
tele|875
food|640
sports|556
carcom|441
sjis|324
sqlite>

Name: Anonymous 2008-05-01 15:45

>>26
DON'T HELP HIM!!

Name: Anonymous 2008-05-01 15:46

>>23
htpc
Home Theater PC?

Name: Anonymous 2008-05-01 15:49

Hussyfan Home Theater PC?
I am interested

Name: Anonymous 2008-05-01 16:46

hax my anus

Name: Anonymous 2008-05-01 17:02

lounge|23382
comp|7879
vip|7762
prog|5266


Interesting

Name: Anonymous 2008-05-01 18:13

In before everyone Markovs this shit.
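For anyone who actually wants to: a minimal word-level Markov babbler. Feed it the textonly (or html) column from the dump; the function names are just illustrative.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, start, length=20, seed=None):
    """Random-walk the chain from `start` to produce Markov gibberish."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break  # dead end: the last word was never followed by anything
        out.append(rng.choice(successors))
    return ' '.join(out)
```

Keeping duplicate successors in the lists (instead of a set) makes the walk sample them in proportion to how often they actually occurred.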

Name: Anonymous 2008-05-01 18:32

>>31
Ah, but:
lounge|198395
prog|121808
vip|74376
comp|71173
newpol|64634
games|35186
sci|34154
lang|29904
anime|21724
newnew|14798
music|14199
tech|11966
book|10476
img|10195
sjis|8036
food|6375
tele|6191
carcom|3851
sports|3725

Name: Anonymous 2008-05-01 18:36

>>28
HTPC Theatre Personal Computer

Name: Anonymous 2008-05-01 18:51

>>33
What do those numbers mean? *confused*

Name: Anonymous 2008-05-01 18:52

Thread Count

Name: Anonymous 2008-05-01 18:53

>>27
Heh heh. Every time someone says this I think "Don't give him the stick!" from that video

Name: Anonymous 2008-05-01 18:53

Grepping the backup is incredibly fast <3

Name: Anonymous 2008-05-01 19:35

>>35
Post count

Name: Anonymous 2008-05-01 20:03

>>36
The sheets here are much better than mine.

Name: Anonymous 2008-05-01 20:22

>>33
I suspect the LISP threads may have helped somewhat.

Name: Anonymous 2008-05-01 20:41

So, anyone want to have a go at building an EXPERT TEXTBOARD SEARCH ENGINE?

Name: Anonymous 2008-05-01 21:23

>>42
grep should be all anybody ever needs.

Name: Anonymous 2008-05-02 3:04

>>42
EXPERT ()
{
    open http://www.google.com/search\?q=site:dis.4chan.org\ inurl:/read/prog/\ "$*"
}

Name: Anonymous 2008-05-02 4:25

>>44
Google doesn't index every thread and purges links from time to time.
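A local index avoids that. Here's a sketch over the posts table from >>10's schema, using SQLite's FTS5 extension — assumes your SQLite build ships FTS5 (older builds only have FTS3, which needs slightly different SQL):

```python
import sqlite3

def build_index(db):
    """Mirror the posts table (>>10's schema) into an FTS5 virtual table.
    Assumes the SQLite build has the FTS5 extension compiled in."""
    db.execute("""
        create virtual table if not exists posts_fts using fts5(
            board_name, thread_no, post_no, html
        )
    """)
    db.execute("""
        insert into posts_fts
        select board_name, thread_no, post_no, html from posts
    """)
    db.commit()

def search(db, query, limit=10):
    """Return (board_name, thread_no, post_no) hits, best bm25 rank first."""
    return db.execute(
        "select board_name, thread_no, post_no from posts_fts "
        "where posts_fts match ? order by rank limit ?",
        (query, limit)).fetchall()
```

e.g. search(db, 'forced indentation') instead of waiting on Google's crawler.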

Name: Anonymous 2008-05-02 5:18

>>39
Post count
NO EXCEPTIONS for the win.

Name: Anonymous 2008-05-02 12:46

http://www.google.com/search\?q=site:dis.4chan.org\ inurl:/read/prog/\ "$*"
What the fuck, shiichan.

Name: Anonymous 2010-12-28 5:50

Name: Sgt.KabukimanПᎪ 2012-05-23 4:35

뎴졕䄬年縤彺ꎷ〞
