Hey Look, This isn't a Language Shitpost

Name: cabbagebot 2011-07-07 0:30

Hello /proggles/. In the midst of all of this "faggotry," I propose a thread about actual programming.

Suppose you have a sort of root directory with a large number of subdirectories, all containing a large number of files. The files range in size from a few kilobytes to several gigabytes. Given an input file, let's devise a quick method for searching our root directory and its subdirectories for copies of the input file.

I've thought of a fairly decent algorithm that I will share if anyone here is interested enough to share their own.

Name: cabbagebot 2011-07-07 0:32

Let's keep it in English and pseudocode, I should add.

Name: Anonymous 2011-07-07 0:34

Lisp is crap and for faggots.

Name: Anonymous 2011-07-07 0:45

A brute force method is easy. You could check the one file against each other file sequentially.

Name: cabbagebot 2011-07-07 0:51

>>4
Sure, a modest solution, but I think /prog/ can do better than O(n) full-file comparisons. Considering that the input must be checked against every file in the root directory to be sure of finding every copy, this is noticeably costly, especially if the input is a large list of files to check rather than just one.

Name: Anonymous 2011-07-07 1:26

>>5
I wouldn't know. My NP-senses are tingling.

Name: Anonymous 2011-07-07 1:27

my pseudo-code has lambdas all over.

Name: cabbagebot 2011-07-07 1:27

Actually, yes, let's change the problem so that the input is a list of files that must be checked.

So we have a list of k input files of varying size that must be checked against n files, also of widely varying size, for every copy that exists among those n files. How fast can we go, /prog/?

Name: Anonymous 2011-07-07 1:56

>>5
I don't think you have any idea what you're talking about.

>>8
Fuck you.

#!/bin/bash

# Hash the input once, then check only files of the same size.
hash=$(md5sum "$1" | cut -d ' ' -f 1)
find . -type f -size "$(stat -c %s "$1")c" -print0 |
while IFS= read -r -d '' f; do
    [ "$hash" = "$(md5sum "$f" | cut -d ' ' -f 1)" ] && echo "$f"
done


Now kindly go away.

Name: cabbagebot 2011-07-07 2:21

>>9
Actually, this is slow compared to the solution I had in mind. Using md5 hashing requires a hashing computation to occur on the entirety of all n files. In essence, you are recommending the solution proposed in >>4.

But nice try. As said before, this is a feasible, but not optimal, solution. If I have to read every bit of every file, I'd hardly consider it fast.

Also, English or pseudocode please. It is rare to find somebody who speaks Bash natively.

Name: Anonymous 2011-07-07 2:31

What's that language that is supposedly designed to look like pseudocode but really looks like a pile of shit?

Oh wait, FIOC.

Get out, pythonista.

Name: Anonymous 2011-07-07 2:47

Check file sizes first. That should make the search much faster.

Name: Anonymous 2011-07-07 2:54

>>10
If you actually read the code you would see that it only compares files of the same size.

Name: cabbagebot 2011-07-07 3:07

>>13
I will admit that I missed that, and so >>9's solution is much better than the previously recommended simplistic approach.

In this problem though, the set of n files is enormous. Imagine a situation in which an input file is several gigabytes, and large numbers of the files in the root are exactly the same size.

We can still do better, /prog/.

Name: Anonymous 2011-07-07 3:15

>>14
So, do you mindlessly copy your porn ISO images around because your harddisk still has free space left?

Such a situation is hardly relevant during normal usage.

Name: Anonymous 2011-07-07 3:17

>>14
Read the first n bytes and compare them to the input file's. If they still match, fall back to comparing the full checksums.
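
Sketched in Python for lack of real pseudocode (the function names and the 4096-byte prefix length are my own choices, not anything from the thread; md5 is used because the thread already did):

```python
import hashlib

PREFIX = 4096  # compare this many leading bytes before hashing anything

def same_file(path_a, path_b):
    # Cheap check first: if the leading bytes differ, the files differ.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        if a.read(PREFIX) != b.read(PREFIX):
            return False

    # Prefixes match, so fall back to full checksums to confirm.
    def checksum(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()

    return checksum(path_a) == checksum(path_b)
```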

Name: Anonymous 2011-07-07 3:19

>>14
I shit in your face.

Name: cabbagebot 2011-07-07 3:34

>>15
Of course. What else would I do with so much harddisk space?

>>16
Yes, this is much closer to what I had in mind.

  In my solution, after eliminating files of the wrong size, you take the first n bytes of each remaining file, a small number at first to weed out files that are very different, and throw each string into your favorite binary search tree, where each node of the tree is a sort of "bucket" holding the many strings that fall into it.
  From there, you see which bucket the first n bytes of each comparison file falls into, then repeat the strategy using only the files whose strings landed in the same bucket. Each successive round of searching increases the number of bytes used for comparison, to improve speed for very large, identical files.
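
A rough Python rendering of the bucket idea, since everyone refuses to speak Bash natively (a dict stands in for the binary search tree of buckets, and the names and chunk-growth factor are all invented for this sketch):

```python
import os
from collections import defaultdict

def find_copies(target, candidates, first=64, growth=8):
    """Return candidate paths whose contents equal target, refining
    the candidate set with successively longer byte slices."""
    size = os.path.getsize(target)
    # Round zero: eliminate files of the wrong size.
    group = [p for p in candidates if os.path.getsize(p) == size]
    offset, chunk = 0, first
    while group and offset < size:
        # Bucket the surviving candidates by their next slice of bytes
        # (the dict plays the role of the tree of buckets).
        buckets = defaultdict(list)
        for p in group:
            with open(p, "rb") as f:
                f.seek(offset)
                buckets[f.read(chunk)].append(p)
        # Keep only the bucket the target's own slice falls into.
        with open(target, "rb") as f:
            f.seek(offset)
            group = buckets.get(f.read(chunk), [])
        offset += chunk
        chunk *= growth  # longer slices each round, for huge identical files
    return group
```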

Any improvements, /prog/?

Name: Anonymous 2011-07-07 4:09

>>18
Use a radix tree or a trie, and look up the original bytes.
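
One crude way to read this, with a nested dict of fixed-size byte chunks standing in for the radix tree (entirely a sketch; every name here is invented, and a real radix tree would compress the single-child chains):

```python
def build_trie(paths, chunk=4096):
    """Index files by successive byte chunks in a nested-dict trie."""
    root = {}
    for p in paths:
        node = root
        with open(p, "rb") as f:
            while True:
                piece = f.read(chunk)
                node = node.setdefault(piece, {})
                if not piece:  # empty read marks end-of-file
                    node.setdefault("paths", []).append(p)
                    break
    return root

def lookup(root, path, chunk=4096):
    """Walk the trie with path's chunks; return the files that match."""
    node = root
    with open(path, "rb") as f:
        while True:
            piece = f.read(chunk)
            if piece not in node:
                return []
            node = node[piece]
            if not piece:
                return node.get("paths", [])
```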

Name: Anonymous 2011-07-07 4:10

>>18
I shit in your face.

Name: Anonymous 2011-07-07 4:26

What you do is you sort folder sizes. Then filetypes. Then file sizes. Then you stick your finger in your butthole and wonder why you didn't just get a history degree or something useful.

Name: Anonymous 2011-07-07 4:26

>>20
That's not an improvement over >>17, at least piss in his anus.
