After a series of tests, I have concluded that the maximum number of characters a /prog/ post can have is 10000. Any more and it rejects the post.
>>3
You can't just read the bytes as chars à la uuencode, or you'll run into a variety of problems. For one, multiple newline characters are condensed by Shiichan into one, resulting in file corruption. In addition, the various null characters would most likely be stripped by the board software.
Let us assume for the time being that our script uses only ASCII values. There are 95 printable ASCII characters. Subtract the space character, which is likely to cause problems, and we are left with 94. Subtract the [] brackets, which can be misinterpreted by Shiichan as HOT STRIPPING BBCOED tags, and we are left with 92 possible values, somewhere between 6 bits per character and 7 bits per character. While we could potentially write a script that encodes information using only these 92 values, converting between the two formats would be a somewhat complicated task. Let us assume then that there are at least 164 Unicode characters printable by the Shiichan software. 164 Unicode values plus the 92 ASCII values gives us exactly 256 values, one for every value a byte can take (0 through 255). This simplifies matters greatly; each byte can be represented by a different character, each guaranteed to display properly on /prog/.
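A minimal sketch of the one-character-per-byte table, in Python. The ASCII half is the 92 characters counted above. The Unicode half is a placeholder: a run of code points starting at U+0100, picked purely for illustration, since nobody has tested which characters Shiichan actually prints. The table is padded to 256 entries so that every byte value gets a character. The names ALPHABET, encode_bytes and decode_chars are made up for this example.

```python
# 95 printable ASCII characters, minus space and the two square brackets = 92.
ASCII_PART = [c for c in map(chr, range(0x20, 0x7F)) if c not in " []"]

# ASSUMPTION: placeholder code points, NOT a tested list of what Shiichan prints.
UNICODE_PART = [chr(cp) for cp in range(0x0100, 0x0100 + 256 - len(ASCII_PART))]

ALPHABET = ASCII_PART + UNICODE_PART              # list index == byte value
REVERSE = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_bytes(data: bytes) -> str:
    """Replace every byte with the character at that index in the table."""
    return "".join(ALPHABET[b] for b in data)


def decode_chars(text: str) -> bytes:
    """Inverse of encode_bytes; raises KeyError on a character outside the table."""
    return bytes(REVERSE[ch] for ch in text)
```

Because newlines, nulls, spaces and brackets never appear in the output, the problems with reading raw bytes as characters do not come up, at least as long as the real character list behaves like the placeholder one.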
Next comes the issue of file sizes. Assuming that we could use every character in a post to encode a file, a 10000-character post could hold 9.765625 kilobytes of data. Obviously, this is not enough to store anything larger than perhaps a small NES ROM. By linking several posts together, however, one could circumvent this limitation. "GJS Jay Sussman Feat. JSB Sebastian Bach - We conjure the spirits of the computer with our spells.mp3", for example, is a 3.46 megabyte file. One could embed this file within /prog/ in only 363 posts using this script.
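The arithmetic behind those two figures, as a quick Python check. It assumes one byte per character, 1024 bytes per kilobyte, 1024 kilobytes per megabyte, and no room lost to tags.

```python
import math

POST_LIMIT = 10_000                  # characters per post, from the tests above
file_size = 3.46 * 1024 * 1024       # the 3.46 megabyte example, in bytes

kilobytes_per_post = POST_LIMIT / 1024
posts_needed = math.ceil(file_size / POST_LIMIT)

print(kilobytes_per_post)   # 9.765625
print(posts_needed)         # 363
```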
Allow me to propose a new standard: PBBCFEF, or The Prague BBCode File Encoding Format (pronounced "Pibkafef"). Included under this standard are tools to convert a file to and from PBBCFEF, a series of EXTENDED BBCODE TAGS designed for linking PBBCFEF strings and differentiating between file formats, and a Greasemonkey script or other browser add-on that can automatically convert a series of PBBCFEF strings into an embedded image, Ogg audio file, or other file available for download. Here's how the new tags are handled:
Files are encoded "in reverse." For example, if you are posting a file that spans 4 posts, the last post will contain the first 9.whatever kilobytes, the 3rd post will contain the second set of kilobytes, and so on. This is designed around the way a message board works. When one posts to /prog/, it is impossible to accurately link to a future post. If the standard worked by posting the first set of information and then linking forward a post, a post made by someone other than the uploader would throw everything off course. Instead, the first post contains no link, only an end-of-file tag, while all subsequent posts link to the previous post in the thread. The last post (and first in the sequence) contains both a beginning-of-file tag and a link to the previous (or next in the sequence) post.
The linked approach offers great flexibility. For example, if one wanted to (for whatever reason) upload a file too large to fit within a single thread, it would be possible to link a file across several threads. After reaching the reply limit, a new thread could be started, whose first post would link to the last post of the first thread. This could be done indefinitely, allowing even files several hundred megabytes in size to be stored on /prog/. Due to the risk of bunging up the board and attracting the ire of moderators, however, I'm planning on leaving this multi-thread feature out.
Both the beginning-of-file tag and the end-of-file tag consist of 3 parts: the declaration of a PBBCFEF sequence, the file format used, and a filename. The information is included in both the first and last posts to get the advantage of each approach: it is convenient to a human browsing /prog/ that he be able to tell what file he is downloading from the first post in a thread, and it is convenient to the program converting the file that it be able to tell what kind of file is being decoded from the first part of the sequence.
A file begins (last post) with [data.fileformat.filename], and ends (first post) with [/data.fileformat.filename]. These tags are located at the beginning of their respective posts. The link tags are simply the number of the previous post within square brackets. It's in a 4-digit format just in case the Shiichan exploit faggots decide to fuck up your thread. The first post (end of file) still ends with [], to signify the end of the string.
An encoded file would look something like this:
Post 1
[/data.pdf.sicp]HOLYFUCK10000CHARACTERSI'MSTRETCHINGTHESCREENLIKEGOATSE[]
Post 2
A JILLION FUCKING CHARACTERS[0001]
Post 3
A bunch more character[0002]
Post 4
second sequence of file[0003]
Last post
[data.pdf.sicp]This is the beginning of the file[0004]
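A rough Python sketch of how a decoder could walk a sequence like the one above: start at the beginning-of-file post, peel off the tag and the link, and follow the link back until it hits the empty []. The function name reassemble and the idea of passing posts as a number-to-text dict are assumptions made for this example, and it assumes file formats contain no dots. The payloads in the usage lines are short placeholders.

```python
import re

BOF = re.compile(r"^\[data\.([^.\]]+)\.([^\]]+)\]")     # [data.fileformat.filename]
EOF = re.compile(r"^\[/data\.([^.\]]+)\.([^\]]+)\]")    # [/data.fileformat.filename]
LINK = re.compile(r"\[(\d{4})?\]$")                     # [0003], or [] in the end-of-file post


def reassemble(posts: dict, start: int) -> tuple:
    """Follow links from the beginning-of-file post back to the end-of-file post.

    Returns (fileformat, filename, payload characters in file order).
    """
    bof = BOF.match(posts[start])
    if bof is None:
        raise ValueError("start post has no beginning-of-file tag")
    pieces, seen, number = [], set(), start
    while True:
        if number in seen:
            raise ValueError("links form a loop")
        seen.add(number)
        body = posts[number]
        link = LINK.search(body)
        if link is None:
            raise ValueError(f"post {number} has no link tag")
        head = BOF.match(body) or EOF.match(body)
        payload_start = head.end() if head else 0
        pieces.append(body[payload_start:link.start()])
        if link.group(1) is None:          # empty [] marks the end of the file
            break
        number = int(link.group(1))
    return bof.group(1), bof.group(2), "".join(pieces)


posts = {
    1: "[/data.pdf.sicp]EEEE[]",
    2: "DDDD[0001]",
    3: "CCCC[0002]",
    4: "BBBB[0003]",
    5: "[data.pdf.sicp]AAAA[0004]",
}
print(reassemble(posts, 5))   # ('pdf', 'sicp', 'AAAABBBBCCCCDDDDEEEE')
```

Since the brackets are excluded from the data alphabet, a tag can never be confused with payload, which is what keeps the parsing this simple.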
Assuming we allocate 32 characters to the end-of-file tag, 31 characters to the beginning-of-file tag, 2 characters to the end-of-file dummy link, and 6 characters to the link tag, we are left with:
9966 bytes of data in the first (end of file) post
9963 bytes of data in the last (beginning of file) post
9994 bytes of data in every post between
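To check the budget against the 10000-character limit, here is a hedged Python sketch that works out the three capacities and splits an already-encoded string into posts in posting order (end of file first). It assumes the posts land on consecutive numbers starting at first_number, which only holds if nobody else posts in between; a real uploader would have to read each post number back from the board. The names capacities and build_posts are invented for this example.

```python
POST_LIMIT = 10_000
EOF_TAG_BUDGET, BOF_TAG_BUDGET = 32, 31   # characters reserved for the two file tags
LINK_LEN, DUMMY_LINK_LEN = 6, 2           # "[0001]" and "[]"


def capacities(limit=POST_LIMIT):
    """Payload characters available in the (first, middle, last) posts."""
    first = limit - EOF_TAG_BUDGET - DUMMY_LINK_LEN
    middle = limit - LINK_LEN
    last = limit - BOF_TAG_BUDGET - LINK_LEN
    return first, middle, last


def build_posts(text, fileformat, filename, limit=POST_LIMIT, first_number=1):
    """Return post bodies in posting order: end of file first, beginning last."""
    bof_tag = f"[data.{fileformat}.{filename}]"
    eof_tag = f"[/data.{fileformat}.{filename}]"
    if len(bof_tag) > BOF_TAG_BUDGET or len(eof_tag) > EOF_TAG_BUDGET:
        raise ValueError("tag longer than its reserved budget")
    first_cap, middle_cap, last_cap = capacities(limit)

    chunks = [text[:last_cap]]            # beginning of the file goes in the last post
    rest = text[last_cap:]
    while len(rest) > first_cap:          # fill middle posts until the tail fits
        chunks.append(rest[:middle_cap])
        rest = rest[middle_cap:]
    chunks.append(rest)                   # end of the file goes in the first post

    chunks.reverse()                      # now in posting order
    posts = [f"{eof_tag}{chunks[0]}[]"]
    for offset, chunk in enumerate(chunks[1:], start=1):
        link = f"[{first_number + offset - 1:04d}]"   # number of the previous post
        tag = bof_tag if offset == len(chunks) - 1 else ""
        posts.append(f"{tag}{chunk}{link}")
    return posts


print(capacities())   # (9966, 9994, 9963)
```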
Some potential problems and their solutions:
Some fucker made a fake link within my sequence and corrupted the entire file - The encoder will automatically generate a unique tripcode for each file sequence. A decoder script will disregard any posts that do not share the beginning-of-file post's tripcode, preventing anyone from impersonating the uploader.
Even if he is uploading something relevant to the topic, I don't want some jerkass eating up a hundred of my thread's posts with his file sequence - The encoder will only convert sequences stored in threads that contain "PBBCFEF" as the first seven characters of their subject line (or something similar). This way, files may only be posted in designated threads - anyone who wishes to link to a file within a thread can simply link to the thread the uploader created. Multiple files can still be uploaded in a single thread; it just needs to be a designated Pibkafef thread.
These lines stretch a whole lot, and I don't want to scroll through several hundred posts to find the beginning-of-file - The browser extension will hide any posts containing [####] by default. Upon detection of a unique end-of-file tag (the first post), it will display the filename and a message stating that the file is incomplete. Upon detection of both the end-of-file tag and its matching beginning-of-file tag, it will state the file has terminated, and begin converting the file. Once converted, a link will be offered. In the case of an embeddable file format (like an image or an Ogg audio file), it could even directly display that file. I don't want to see this place turn into an imageboard, though, so I probably will leave the direct embedding option out.
I don't want to install this extension, and I don't want you faggots stretching the shit out of my board. - Refer to the second problem. Since files can only be contained within a thread that contains "PBBCFEF" in the subject line, you can simply use the 4chan Thread Filter greasemonkey script to filter them out automatically. If you don't have this extension, what the fuck is wrong with you? Go download it.
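The four rules above (tripcode match, designated threads, hiding [####] posts, incomplete/complete status) could be sketched in Python roughly like this. The Post class, its field names and all function names are hypothetical, and the sketch only compares tripcodes as the board shows them; it does not generate or verify them.

```python
import re
from dataclasses import dataclass

SUBJECT_PREFIX = "PBBCFEF"
LINK = re.compile(r"\[\d{4}\]")                  # a [####] link tag
EOF_TAG = re.compile(r"^\[/data\.([^\]]+)\]")    # captures "fileformat.filename"
BOF_TAG = re.compile(r"^\[data\.([^\]]+)\]")


@dataclass
class Post:
    number: int
    trip: str     # tripcode as displayed by the board (hypothetical field)
    body: str


def is_file_thread(subject: str) -> bool:
    """Only threads whose subject starts with PBBCFEF may hold files."""
    return subject.startswith(SUBJECT_PREFIX)


def same_uploader(posts, bof_number):
    """Keep only posts whose tripcode matches the beginning-of-file post."""
    wanted = next(p.trip for p in posts if p.number == bof_number)
    return [p for p in posts if p.trip == wanted]


def should_hide(post) -> bool:
    """Posts carrying a [####] link are hidden by default."""
    return LINK.search(post.body) is not None


def file_status(posts):
    """Map 'fileformat.filename' to 'incomplete' or 'complete' for one thread."""
    status = {}
    for p in posts:
        eof = EOF_TAG.match(p.body)
        if eof:
            status.setdefault(eof.group(1), "incomplete")
    for p in posts:
        bof = BOF_TAG.match(p.body)
        if bof and bof.group(1) in status:
            status[bof.group(1)] = "complete"
    return status
```

Note that the end-of-file post ends in [] rather than [####], so it stays visible and can carry the filename and status message.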
So, to recap, things that need to be worked on before this is finished:
1. A list of 164 Unicode characters that Shiichan can print (92 + 164 = 256, one per byte value)
2. A program that can convert a file into a string of corresponding characters and back
3. A program that can convert a file into a collection of PBBCFEF-compatible posts and automatically upload the posts to a thread on 4chan using HTTP POST
4. A browser extension that can convert a sequence of PBBCFEF strings into a binary file and provide a link.
A potential addition to the standard later down the road would be to use 12 bits per character, allowing 1.5 times as much data to be stored per post. PBBCFEF-12 would, however, take a long time to implement, since I'd have to hunt down 3840 more printable Unicode characters (4096 values in total, minus the 256 that PBBCFEF-8 already needs). I think I'll stick to PBBCFEF-8 for now.
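The PBBCFEF-12 numbers as a quick Python check. It ignores tag overhead and assumes PBBCFEF-8 already has a full 256-character table; starting from only 255 characters, the number still to find would be one higher.

```python
POST_LIMIT = 10_000


def bytes_per_post(bits_per_char, limit=POST_LIMIT):
    """Whole bytes that fit in one post at the given bits per character."""
    return limit * bits_per_char // 8


print(bytes_per_post(8))                        # 10000
print(bytes_per_post(12))                       # 15000
print(bytes_per_post(12) / bytes_per_post(8))   # 1.5
print(2**12 - 2**8)                             # 3840 extra characters to find
```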
Yes, I am going to potentially spend the rest of my vacation working on this. Got a problem with it? If you'd like to make any suggestions or correct math errors, feel free.