Ultimate *chan monitor

Name: Anonymous 2009-01-20 16:36

To become a truly EXPERT PROGRAMMER, you must learn and appreciate the art of requirements and specifications (which, by meta requirement, change within days of product delivery).

This example case will be designing the ultimate *chan board monitor/web spider.

Requirements:
    ID#01: Toggle monitoring a single thread
    ID#02: Monitor a board for new threads
    ID#03: Minimize bandwidth as much as possible
    ID#04: Allow browsing of the downloaded content immediately
    ID#05: Allow viewing of the board separate from the monitor (but should not duplicate downloads)
    ID#06: Should be able to pick up where it left off after restart or crash
    ID#07: The number of connections should be configurable
    ID#08: Cross-platform
    ID#09: Provide automatic criteria for what threads to start monitoring
    ID#10: Automatically post replies to threads/posts that meet certain criteria
    ID#11: Mass spamming or file uploading
    ID#12: Options to download thread text, thumbnails, and/or full images
    ID#13: Reduce storage/data redundancy
    ID#14: Option to set how often to check for changes
    ID#15: Keep memory use to a minimum

Some implementation details:
   
ID#01: Toggle monitoring a single thread
    * When full board monitoring is enabled, this can tell it to ignore one thread
    * When full board monitoring is enabled, this could add the thread before the board monitor encounters it
ID#02: Monitor a board for new or changed threads
    * While monitoring the entire board, there's no need to download individual threads unless they are seen to have changed (post added, post deleted, image deleted, etc.)
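
A minimal sketch of the change check for ID#02, assuming the board index has already been parsed into a mapping of thread id to post and image counts. `ThreadSummary` and `threads_to_fetch` are names made up for this illustration, not part of any existing tool.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class ThreadSummary(NamedTuple):
    """What the board index reveals about one thread (hypothetical fields)."""
    posts: int
    images: int


def threads_to_fetch(previous: Dict[int, ThreadSummary],
                     current: Dict[int, ThreadSummary],
                     ignored: FrozenSet[int] = frozenset()) -> List[int]:
    """Return ids of threads that are new or whose counts differ from last time."""
    changed = []
    for thread_id, summary in current.items():
        if thread_id in ignored:
            continue  # ID#01: monitoring of a single thread was toggled off
        if previous.get(thread_id) != summary:
            changed.append(thread_id)  # new thread, or a post/image was added or deleted
    return sorted(changed)
```

Counts alone miss the case where one post is deleted and another added between two checks, so a real monitor would probably also compare the last post id.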

ID#03: Minimize bandwidth as much as possible
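    * Always send the stored Last-Modified value back in an If-Modified-Since HTTP header, so an unchanged page costs only a 304 response
    * Use gzip-compressed transfers (Accept-Encoding: gzip) as much as possible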
    * Keep database of all downloaded images: file type, size, dimensions, md5
    * Check the md5, file type, and size before downloading images
    * Use identical software the *chan board uses to generate thumbnails so they can be generated automatically without downloading
    * Alternatively use the thumbnail to do a fuzzy comparison (can only really determine if we don't already have the file for sure)
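
The first two bullets for ID#03 can be sketched with the Python standard library. This is only an illustration under assumptions: the function names are invented here, and whether a given board honours conditional requests or gzip is not something this thread confirms.

```python
import gzip
import urllib.error
import urllib.request
from typing import Optional, Tuple


def build_request(url: str, last_modified: Optional[str]) -> urllib.request.Request:
    """Build a GET that asks for gzip and is conditional on Last-Modified."""
    request = urllib.request.Request(url)
    request.add_header("Accept-Encoding", "gzip")
    if last_modified:
        # A cooperating server answers 304 Not Modified if nothing changed.
        request.add_header("If-Modified-Since", last_modified)
    return request


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip compression if the server applied it."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch_if_changed(url: str, last_modified: Optional[str]
                     ) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new Last-Modified), or (None, old value) on a 304."""
    try:
        with urllib.request.urlopen(build_request(url, last_modified)) as resp:
            body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
            return body, resp.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None, last_modified
        raise
```

The Last-Modified string returned by one call is what gets stored with the thread details and passed back in on the next check.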
   
ID#06: Should be able to pick up where it left off after restart or crash
Persistent storage
    * details about all images ever downloaded
    * queue of pages/files to download
    * queue of downloaded data to process
    * thread details: number of posts, number of images, sticky, last post id, HTTP Last-Modified, marked for delete, banned count
    * post details:
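
One way the persistent storage for ID#06 could look, sketched with SQLite. The table and column names are assumptions derived from the list above, the post-processing queue and post details are left out, and the `state` column is an addition made here so that interrupted downloads can be retried after a crash.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS images (
    md5 TEXT PRIMARY KEY,
    file_type TEXT, size_bytes INTEGER, width INTEGER, height INTEGER
);
CREATE TABLE IF NOT EXISTS download_queue (
    url TEXT PRIMARY KEY,
    priority INTEGER NOT NULL DEFAULT 0,
    state TEXT NOT NULL DEFAULT 'pending'   -- pending | in_progress | done
);
CREATE TABLE IF NOT EXISTS threads (
    thread_id INTEGER PRIMARY KEY,
    posts INTEGER, images INTEGER, sticky INTEGER,
    last_post_id INTEGER, last_modified TEXT,
    marked_for_delete INTEGER, banned_count INTEGER
);
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the database and recover from an unclean shutdown."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    # Anything still 'in_progress' was interrupted; put it back in the queue.
    conn.execute("UPDATE download_queue SET state = 'pending' "
                 "WHERE state = 'in_progress'")
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str, priority: int = 0) -> None:
    """Add a URL unless it is already queued."""
    conn.execute("INSERT OR IGNORE INTO download_queue (url, priority) "
                 "VALUES (?, ?)", (url, priority))
    conn.commit()


def claim_next(conn: sqlite3.Connection) -> Optional[str]:
    """Mark the highest-priority pending URL as in progress and return it."""
    row = conn.execute("SELECT url FROM download_queue WHERE state = 'pending' "
                       "ORDER BY priority DESC, rowid LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE download_queue SET state = 'in_progress' "
                 "WHERE url = ?", (row[0],))
    conn.commit()
    return row[0]
```

Because the queue lives on disk, restarting the monitor just means calling `open_store` again and carrying on with `claim_next`.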
   
ID#13: Reduce storage/data redundancy
Use pixel perfect comparison, and fuzzy comparison of images to determine which are identical or similar.
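
A rough sketch of both comparisons, assuming each image has already been decoded and scaled down to a small grid of grayscale values (0-255) by some imaging library not shown here. The fuzzy part uses an average hash, which is one common choice and not necessarily the one intended above; all function names are invented for the example.

```python
import hashlib
from typing import Sequence

Grid = Sequence[Sequence[int]]


def pixel_digest(pixels: Grid) -> str:
    """Digest of the decoded pixels: equal digests mean pixel-identical grids."""
    header = f"{len(pixels)}x{len(pixels[0])}:".encode()
    flat = bytes(value for row in pixels for value in row)
    return hashlib.sha256(header + flat).hexdigest()


def average_hash(pixels: Grid) -> int:
    """One bit per pixel, set when that pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(a ^ b).count("1")


def looks_similar(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    """Fuzzy match: the average hashes differ in at most max_distance bits."""
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance
```

The digest catches exact duplicates cheaply. The average hash tolerates recompression noise, but it can report false matches, so the threshold would need tuning on real images.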


ID#15: Keep memory use to a minimum
Keep most data in persistent storage.


Given the above, there are 2 possible development platforms:
* A Firefox extension would be ideal
    - It could use pages you've browsed as part of its downloading and processing
    - You can also immediately browse downloaded and processed threads
    - Posting to the board is still available, and can even track the images you upload so it doesn't download them again but instead copies the local copy
* Alternative platform
    - Create an HTTP daemon that serves up all of its content
    - This could also be used by multiple people, either in an intranet, or even the internet
    - Essentially creating a reformatted and optimized mirror of the *chan board
    - Posting to the board may not be available

Areas for further specification:
* Flesh out all the requirements
* Add use cases
* Specify various necessary modules (persistent storage, connection handling, html parsing, etc)

Name: Anonymous 2009-01-20 16:40

NO EXCEPTIONS

Name: Anonymous 2009-01-20 17:09

I could do this in 3 lines of Perl.

Name: Anonymous 2009-01-20 17:23

>>* Use identical software the *chan board uses to generate thumbnails so they can be generated automatically without downloading

Why would you recompress the thumbs to JPG? It will look worse. A better idea would be to give the user a series of templates he can use to autogenerate the page layout he desires for the board (default/custom).

Name: Anonymous 2009-01-20 20:57

Semi-relevant query: I was writing a command-line utility intended to post to a given board and automatically scrape and record replies to the newly created post, primarily for making requests on /r/, which I have a bad habit of forgetting I've made until long after they've 404'd.
Unfortunately, I can't figure out a way to determine what the thread number of the newly created thread is/will be, so I can't figure out how to automatically start monitoring it. Suggestions?

Name: Anonymous 2009-01-20 21:03

>>5
Fork another thread that continually refreshes the main page and scrapes it for the text of the post that you are posting from the main thread. If the helper thread finds it, it records the id.
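
A sketch of that helper thread, under heavy assumptions: how the front page is fetched and how a thread id is pulled out of its markup both depend on the board, so they are passed in as callables. `ThreadIdWatcher` and its parameters are hypothetical names.

```python
import threading
import time
from typing import Callable, Optional


class ThreadIdWatcher(threading.Thread):
    """Polls the front page until the text of our own post shows up."""

    def __init__(self,
                 fetch_front_page: Callable[[], str],
                 find_thread_id: Callable[[str, str], Optional[int]],
                 post_text: str,
                 interval: float = 5.0,
                 max_polls: int = 12) -> None:
        super().__init__(daemon=True)
        self.fetch_front_page = fetch_front_page
        self.find_thread_id = find_thread_id  # board-specific scraper, not shown
        self.post_text = post_text
        self.interval = interval
        self.max_polls = max_polls
        self.thread_id: Optional[int] = None  # filled in once the post is seen

    def run(self) -> None:
        for _ in range(self.max_polls):
            page = self.fetch_front_page()
            found = self.find_thread_id(page, self.post_text)
            if found is not None:
                self.thread_id = found
                return
            time.sleep(self.interval)
```

Start the watcher just before submitting the post, `join()` it afterwards, and read `thread_id`. It stays `None` if the post never appeared within `max_polls` refreshes, for example on a board fast enough to push it off the front page.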

Name: Anonymous 2009-01-20 21:04

Just use /prog/scrape. Xarn quality.

Name: Anonymous 2009-01-20 21:06

OP is 16

Name: Anonymous 2009-01-20 21:17

Cool spec, bro

Name: Anonymous 2009-01-20 21:35

>>5
use noko and save the redirect link

Name: Anonymous 2009-01-20 21:39

Look at the source

Name: 5 2009-01-20 22:16

Thanks for the suggestions.

Name: Anonymous 2009-01-20 23:56

This has already been done, but very messily (anyone remember board.pl?)

Name: Anonymous 2009-01-21 0:32

>>13
I never intended it to be the ``Ultimate *chan monitor'', okay!? I just wanted to spam /b/, nothing more.

Name: Anonymous 2009-01-21 1:01

>>14
Please don't use ``faggot quotes''!

Name: Anonymous 2009-01-21 1:10

>>5
Unless /b/ has changed, the blank page that is shown after you post, before you are redirected to the main page (or to the thread with noko), contains the new thread number in an HTML comment.

ID#15: Keep track of every 'version' of a thread (i.e. before and after posts and images are deleted)
ID#16: Track all post IDs and note if any are missed

Necessary modules:
Persistent storage - One or more databases to hold the download queue, the post-processing queue, and the final archived storage.
Connection handling - A priority pool of connections whose only job is to check the persistent storage for files to download. Once downloaded, updates the download queue and the post-processing queue.
Post processing - Multiple threads monitor the post-process persistent priority queue and process the downloaded threads.
HTML parsing - /b/ used to have malformed HTML that Firefox would save ineffectively (still does?). Once the HTML is well-formed, any XML parsing library is usable (here's your ENTERPRISE portion of the solution).
Thread parsing - XPath is an easy way to pick out the useful fields. Regex for parsing the fields.
Storage - A database can hold all the thread text and properties, and various image properties. This makes it easily referenced when determining if an image has already been downloaded. However, it requires exporting the data to view it, writing a viewer, or perhaps an HTTP daemon.
Output template - XSL may be more than most want to mess with. Perhaps a PHP-style output format.
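
The connection-handling and post-processing modules could be wired together roughly like this. It is a sketch only: in-memory queues stand in for the persistent ones, and `download` and `process` are placeholders supplied by the caller. The `connections` parameter covers ID#07.

```python
import queue
import threading
from typing import Callable, List, Tuple


def run_pipeline(jobs: List[Tuple[int, str]],
                 download: Callable[[str], bytes],
                 process: Callable[[str, bytes], None],
                 connections: int = 2,
                 processors: int = 2) -> None:
    """Download jobs in priority order, then hand the results to processors."""
    download_q: "queue.PriorityQueue" = queue.PriorityQueue()
    process_q: "queue.Queue" = queue.Queue()
    for priority, url in jobs:
        download_q.put((priority, url))  # lower number = fetched sooner

    def downloader() -> None:
        while True:
            try:
                _, url = download_q.get_nowait()
            except queue.Empty:
                return  # download queue drained
            process_q.put((url, download(url)))

    def processor() -> None:
        while True:
            item = process_q.get()
            if item is None:  # sentinel: no more work is coming
                return
            process(*item)

    d_threads = [threading.Thread(target=downloader) for _ in range(connections)]
    p_threads = [threading.Thread(target=processor) for _ in range(processors)]
    for t in d_threads + p_threads:
        t.start()
    for t in d_threads:
        t.join()
    for _ in p_threads:
        process_q.put(None)  # one sentinel per processor thread
    for t in p_threads:
        t.join()
```

A long-running monitor would keep the workers alive and refill the download queue from persistent storage instead of exiting when it runs dry.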

Name: Anonymous 2009-01-21 1:29

ID#17: Configure multiple proxies for mass posting
Can be helpful for dumping image folders (and unfortunately redundant spamming). For dumping manga, some form of synchronization would be necessary so they are posted in order.

Name: Anonymous 2010-11-14 7:19

Name: Anonymous 2011-02-03 7:50

Name: anonymous 2011-05-17 19:50

Name: Anonymous 2011-05-17 23:59

To become a truly EXPERT DICKSUCKER, you must learn and appreciate the art of the shaft and head (which, by meta requirement, change within hours of me getting a hardon).

This example case will be designing the ultimate blow job.

Requirements:
    ID#01: SUCK MY DICK
    ID#02: SUCK MY DICK
    ID#03: SUCK MY DICK
    ID#04: SUCK MY DICK
    ID#05: SUCK MY DICK
    ID#06: SUCK MY DICK
    ID#07: SUCK MY DICK
    ID#08: SUCK MY DICK
    ID#09: SUCK MY DICK
    ID#10: SUCK MY DICK
    ID#11: SUCK MY DICK
    ID#12: SUCK MY DICK
    ID#13: SUCK MY DICK
    ID#14: SUCK MY DICK
    ID#15: SUCK MY DICK
    ID#16: SUCK MY DICK

Some implementation details:

ID#01: SUCK IT REALLY HARD
ID#02: SUCK IT REALLY HARD
ID#03: SUCK IT REALLY HARD
ID#04: SUCK IT REALLY HARD
ID#05: SUCK IT REALLY HARD

Name: Anonymous 2011-05-18 0:54

Good luck writing that, >>1-san. I know you can do it!

When you've grown up in ten to twenty years.

Name: Anonymous 2012-06-14 6:42

>>22
nice doubles
