I'm working on a new type of compression algorithm and have run into some difficulties.
Basically what I do is this:
Given raw data: D
I compute the MD5 hash: H = MD5(D)
At this point I've compressed by a factor of Size(D) / Size(H) (which for any data larger than 128-bits is significant)
My problem is uncompressing the data.
Currently what I'm doing is as follows: (keep in mind i'm working with bit values here)
G = {0}[trillion]
while (MD5(G) != H) G++;
I know this has the problem of unintentionally hashing the preceding zeros but the main issue I'm running into is runtime.
So far 20-30 bit lengths have run in feasible time but when I tried to run this with a 32-bit guess, it took several minutes to verify.
My MD5 implementation may be inefficient but it's also possible that this algorithm suffers from a runtime of 2^n.
Any tips? Or is this simply an unfeasible compression algorithm?
Name:
Anonymous2011-04-27 11:33
You should compress it using CRC32. It's faster and provides 4 times the compression.
what the problem is? y u forget #include "/usr/local/frozen/void.h"?
Name:
Alfe2011-06-17 6:09
Just in case you meant that honestly or anybody else took it for being a great idea: This won't work. Uncompressing is impossible.
Reasoning: There are 2^N different bitstrings with N bits. To compress these (i.e. in a decrompressible way!) you need to use a special property of the information called redundancy. MD5 does not do this. In fact, there are 2^(N-128) different bitstrings with 2^N bits which result in the same MD5 hash with 128 bits. Your brute force attack will simply find any of these (after a VERY long time). Since MD5 is just a hash algorithm, the bitstrings resulting in the same hash will look very different (that's one of the ideas of a good hash).
In fact, no just-mathematical/logical approach on compressing data can succeed because redundancy is always a domain specific property. For us, it typically means repetition (e.g. of patterns). But that must not always be the case. For compressing the first one billion digits of the number pi the approach of searching for repetitions would not lead to anything. Instead, compressing the digit string to the algorithm computing the digits would be a successful approach here.
>>21
Oh no. No. Please God no.
But you wouldn't have written that whole post just to troll.
You must have really thought…it was…worthwhile. The Redditors are amoung us.( ,v_v, )