/prog/ - Hey Look, This isn't a Language Shitpost

Name: cabbagebot 2011-07-07 0:30

Hello /proggles/. In the midst of all of this "faggotry," I propose a thread about actual programming.

Suppose you have a root directory with a large number of subdirectories, each containing a large number of files. The files range in size from a few kilobytes to several gigabytes. Given an input file, let's devise a quick method for searching the root directory and its subdirectories for copies of the input file.

I've thought of a fairly decent algorithm that I will share if anyone here is interested enough to share their own.

Name: cabbagebot 2011-07-07 3:34

>>15
Of course. What else would I do with so much hard disk space?

>>16
Yes, this is much closer to what I had in mind.

In my solution, after eliminating files of the wrong size, you take the first n bytes of each remaining file (a small n at first, just enough to eliminate files that are very different) and throw each string into your favorite binary search tree, where each node of the tree is a sort of "bucket" holding all of the strings that compare equal.
From here, you see which bucket the first n bytes of the input file fall into, then repeat the strategy using only the files with strings in that bucket. Each successive round increases the number of bytes compared, which speeds things up for very large, identical files.
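
A minimal Python sketch of the strategy above, under a few assumptions. The names find_copies, first_chunk and max_chunk are made up for illustration. Since there is only one input file, only the bucket it lands in matters, so the sketch compares each candidate's bytes directly against the input's bytes instead of building a tree of buckets. It also compares the next unread chunk in each round (doubling the chunk size up to a cap) rather than re-reading the first n bytes, and it skips the input file itself if it happens to live under the root.

```python
import os


def find_copies(input_path, root, first_chunk=4096, max_chunk=1 << 20):
    """Return paths under root whose contents equal those of input_path."""
    size = os.path.getsize(input_path)
    target = os.path.realpath(input_path)

    # Round 0: keep only regular files whose size matches the input file.
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if (os.path.isfile(path)
                        and os.path.getsize(path) == size
                        and os.path.realpath(path) != target):
                    candidates.append(path)
            except OSError:
                continue  # unreadable or vanished file: skip it

    # Later rounds: compare successively larger chunks, dropping any
    # candidate as soon as one chunk differs from the input file's.
    offset, chunk = 0, first_chunk
    with open(input_path, "rb") as ref:
        while candidates and offset < size:
            ref.seek(offset)
            expected = ref.read(chunk)
            if not expected:
                break  # input shrank while we were reading; stop here
            survivors = []
            for path in candidates:
                try:
                    with open(path, "rb") as f:
                        f.seek(offset)
                        if f.read(chunk) == expected:
                            survivors.append(path)
                except OSError:
                    continue
            candidates = survivors
            offset += len(expected)
            chunk = min(chunk * 2, max_chunk)
    return sorted(candidates)
```

Calling find_copies("input.bin", "/some/root") returns a sorted list of matching paths. Most files with different contents drop out after the size check or the first small read, so the multi-gigabyte reads only happen for files that are probably real copies.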

Any improvements, /prog/?
