/prog/ - Finding Duplicate File Namesin Many Folders

Name: Anonymous 2009-01-19 18:45

Recently, a bug was found in the software my employer makes in which the same record number was issued to two or more clients, so that there would be different records with the same number on more than one client.

The bug was fixed, but we wanted to find out if any of our customers had been affected by this bug. A normal directory compare wouldn't work, because all the software we found could only compare between two folders; however, we need to compare the folders of multiple clients to see if any of them share file names with any others, which would be an indication that the bug had occurred.

So, I came up with an algorithm and wrote the code to do just that. The C# program I wrote can find 200,000+ duplicated records in 60+ network folders in about two minutes.

Challenge: come up with an algorithm that, given multiple folders, gives a list of file names shared between two or more folders, and for each duplicated file name found, a list of folders in which it is found.

Example:

Folders:
Folder1\ 1.txt 2.txt 3.txt Folder2\ 3.txt 9.txt 100.txt Folder3\ 2.txt 9.txt 55.txt

Results:
2.txt Folder1 Folder3 3.txt Folder1 Folder2 9.txt Folder2 Folder3

Name: Anonymous 2009-01-19 18:52

>>1
Congratulations, you know how to use a map/dictionary/hashtable. I'm real proud of you, son.

Name: Anonymous 2009-01-19 18:52

$ ls -R | uniq

Name: Anonymous 2009-01-19 19:03

>>2
Your not my father

Name: Anonymous 2009-01-19 19:06

>>1
Uh, add filenames into different sets per folder, do an intersect on each set. Bingo. Like, 5 lines of perl.
GTFO

Name: Anonymous 2009-01-19 19:30

>>5
5 lines of Perl is enough for any program.

Name: Anonymous 2009-01-19 20:06

>>5
I don't think you thought your approach through, buddy, because it is, y'know, stupid as fuck. What you should do is create a dictionary using the filenames as keys and lists of directories as values.

Name: Anonymous 2009-01-19 20:11

>>7
sup OP. nobody cares about your 200 line enterprise C#. gtfo

Name: Anonymous 2009-01-19 20:17

>>8
1/10

Name: Anonymous 2009-01-19 20:34

>>1,7,9
it sure is Tyler Durden around here

Name: Anonymous 2009-01-19 20:37

what I would do is:
have a dictionary where keys = filenames and value = list of directories
walk through directories
obtain list for current filename in dictionary
append current directory to list
store list for current filename in dictionary

after:
foreach key,value in dictionary
print key ;; filename
if value.length > 1: print value ;; directories

Name: Anonymous 2009-01-19 20:38

>>11
I'd estimate that to be roughly 3 lines of Haskell, 11 lines of perl, 25 lines of Python, 250 lines of C

Name: Anonymous 2009-01-19 20:39

>>12
250 lines of C
Excluding implementation of dictionary. Easily >1k lines of code

Name: Anonymous 2009-01-19 20:49

>>12
I'd like to see this Haskell implementation

Name: Anonymous 2009-01-19 22:05

>>12
>>3
1 line of bash

Name: Anonymous 2009-01-19 22:10

Filenames, meh, I'd rather take a little longer and use checksums (GNU uniq is better than the limited BSD version).

find . -type f -exec md5sum {} \; | sort | uniq -D -w32

[Damn, I really need to clean out my image collection, seems I have some stuff in triplicate.]

Name: Anonymous 2009-01-19 22:22

>>16
Uh, the whole point is that different records could have the same name, so a checksum wouldn't accomplish the task!

>>7
>>11
Yep, that's basically what I did at first. The final implementation is a bit different, though. I use a case-insensitive SortedDictionary<string, TreeNode>, because I display the results in a TreeView control, and I was told to sort the results in it.

>>8
More like 400 lines of C# now, not including the generated GUI designer code.

>>12
If you use a hash table, it'll be more than 3 lines of Haskell, at least if you want it to be legible, because the included hash table uses IO out the ass.

Name: DADDYUNIQ 2009-01-19 22:32

Namesin
i'll tell you if you payme enough ;)

Name: Anonymous 2009-01-19 23:01

4 lines of Haskell, if you don't count the imports.



import System

import System.Directory

import Data.Map hiding (map, filter)



main = do dirs <- getArgs

          listings <- mapM (fmap (drop 2) . getDirectoryContents) dirs

          mapM print $ filter ((>1) . length . snd) $ assocs $ fromListWith (++) 

                     $ concat $ zipWith zip listings $ map (repeat . (:[])) dirs

Name: Anonymous 2009-01-19 23:29

>>19
3 lines of Haskell, if you only count the imports.

Name: Anonymous 2009-01-19 23:40

And here's a 3-liner, disregarding imports.



import System

import System.Directory

import Data.Map hiding (map, filter)



main = mapM print =<< return . filter ((>1) . length . snd) . assocs . fromListWith (++) . concat 

                  =<< mapM (\d -> fmap (map (flip (,) [d]) . drop 2) $ getDirectoryContents d) 

                  =<< getArgs

Name: Anonymous 2009-01-19 23:51

>>20
Om

Name: Anonymous 2009-01-20 0:02

>>21

import System

import System.Directory

import Data.Map hiding (map, filter)

main = mapM print =<< return . filter ((>1) . length . snd) . assocs . fromListWith (++) . concat =<< mapM (\d -> fmap (map (flip (,) [d]) . drop 2) $ getDirectoryContents d) =<< getArgs

4-liner, including imports

Name: Anonymous 2009-01-20 0:14

(list-dups "C:\Documents and Settings\Christopher\My documents")

One-liner in my Lisp Domain-Specific Language.

Name: Anonymous 2009-01-20 0:29

>>12
And how many lines of Java?

Seriously though, I'm interested in the expressiveness of commonly used languages. Does Python fare that bad? I thought it was similar to Perl in code size (for legible code).

Name: Anonymous 2009-01-20 0:47

>>23
Programming is not about program size but efficiency.

Name: Anonymous 2009-01-20 0:54

>>26
Shorter programs are invariably more efficient to write. The only effective way to reduce program size is to create better abstractions, which just happen to be the best way to write code more efficiently.

Name: Anonymous 2009-01-20 0:54

>>26
LOL !

Name: Anonymous 2009-01-20 2:26

>>5,6,12
One line of Perl.

Name: Anonymous 2009-01-20 2:31

you didn't seriously write 250 lines of code in C# for that, did you?

Name: Anonymous 2009-01-20 2:55

one line of any language.
well, any language except assembly or FIOC.

Name: Anonymous 2009-01-20 3:12

i dunno but this sounds stupifying simple to me, finding unique filenames across MULTIPLE folders. so basically you can just pull up an anonymous list of ALL files in ALL the folders searched and sort it uniquely

like someone already mentioned, this can be done with posix tools like ls and sort in any posix shell

Name: =+==F=R=O=Z=E=N==V=O=I=D==+= !FrOzEn2BUo 2009-01-20 3:15

The advantage of languages which require hundreds of lines of code is that you can optimize and improve every algorithm.
With higher abstractions you can't do nothing but use the common built-in components which cannot be optimized.

_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:03

>>33
You can use a good optimizing compilter, or you could kep 2 versions, one where you simply implemented it concisely and it worked properly, and one where you went and OMGOPTIMIZED it. It's common for some types of soft to have a routine written in a language, then have the same routine written in assembly(where speed is critical) with SIMD instructions, then the application may chose to use the optimized version if the CPU allows it. The original goal is to get everything working and implemented according to your vision/reference/whatever, then to optimize what you can when everything is already in good state. As they say: "premature optimization is the root of all evil."

Now go READ YOUR SICP

Name: Anonymous 2009-01-20 4:08

>>33
Without the ability to create decent abstractions, complex algorithms become prohibitive to implement.

Name: =+==F=R=O=Z=E=N==V=O=I=D==+= !FrOzEn2BUo 2009-01-20 4:16

>>35
You can call your own abstracted functions.

_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:17

>>36
Wat

Name: Anonymous 2009-01-20 4:20

5 lines in BBCode

Name: =+==F=R=O=Z=E=N==V=O=I=D==+= !FrOzEn2BUo 2009-01-20 4:22

>>37
Just use the functions you wrote earlier to solve the same problem. e.g. every script i use might use this one:
function tag(x,y){if(!y){return document.getElementsByTagName(x)}else{return x.getElementsByTagName(y)}};

_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:25

>>35
>>37
YHBT

Finding Duplicate File Namesin Many Folders

1 Name: Anonymous 2009-01-19 18:45

2 Name: Anonymous 2009-01-19 18:52

3 Name: Anonymous 2009-01-19 18:52

4 Name: Anonymous 2009-01-19 19:03

5 Name: Anonymous 2009-01-19 19:06

6 Name: Anonymous 2009-01-19 19:30

7 Name: Anonymous 2009-01-19 20:06

8 Name: Anonymous 2009-01-19 20:11

9 Name: Anonymous 2009-01-19 20:17

10 Name: Anonymous 2009-01-19 20:34

11 Name: Anonymous 2009-01-19 20:37

12 Name: Anonymous 2009-01-19 20:38

13 Name: Anonymous 2009-01-19 20:39

14 Name: Anonymous 2009-01-19 20:49

15 Name: Anonymous 2009-01-19 22:05

16 Name: Anonymous 2009-01-19 22:10

17 Name: Anonymous 2009-01-19 22:22

18 Name: DADDYUNIQ 2009-01-19 22:32

19 Name: Anonymous 2009-01-19 23:01

20 Name: Anonymous 2009-01-19 23:29

21 Name: Anonymous 2009-01-19 23:40

22 Name: Anonymous 2009-01-19 23:51

23 Name: Anonymous 2009-01-20 0:02

24 Name: Anonymous 2009-01-20 0:14

25 Name: Anonymous 2009-01-20 0:29

26 Name: Anonymous 2009-01-20 0:47

27 Name: Anonymous 2009-01-20 0:54

28 Name: Anonymous 2009-01-20 0:54

29 Name: Anonymous 2009-01-20 2:26

30 Name: Anonymous 2009-01-20 2:31

31 Name: Anonymous 2009-01-20 2:55

32 Name: Anonymous 2009-01-20 3:12

33 Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 3:15

34 Name: Anonymous 2009-01-20 4:03

35 Name: Anonymous 2009-01-20 4:08

36 Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 4:16

37 Name: Anonymous 2009-01-20 4:17

38 Name: Anonymous 2009-01-20 4:20

39 Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 4:22

40 Name: Anonymous 2009-01-20 4:25

Name: Anonymous 2009-01-19 18:45

Name: Anonymous 2009-01-19 18:52

Name: Anonymous 2009-01-19 18:52

Name: Anonymous 2009-01-19 19:03

Name: Anonymous 2009-01-19 19:06

Name: Anonymous 2009-01-19 19:30

Name: Anonymous 2009-01-19 20:06

Name: Anonymous 2009-01-19 20:11

Name: Anonymous 2009-01-19 20:17

Name: Anonymous 2009-01-19 20:34

Name: Anonymous 2009-01-19 20:37

Name: Anonymous 2009-01-19 20:38

Name: Anonymous 2009-01-19 20:39

Name: Anonymous 2009-01-19 20:49

Name: Anonymous 2009-01-19 22:05

Name: Anonymous 2009-01-19 22:10

Name: Anonymous 2009-01-19 22:22

Name: DADDYUNIQ 2009-01-19 22:32

Name: Anonymous 2009-01-19 23:01

Name: Anonymous 2009-01-19 23:29

Name: Anonymous 2009-01-19 23:40

Name: Anonymous 2009-01-19 23:51

Name: Anonymous 2009-01-20 0:02

Name: Anonymous 2009-01-20 0:14

Name: Anonymous 2009-01-20 0:29

Name: Anonymous 2009-01-20 0:47

Name: Anonymous 2009-01-20 0:54

Name: Anonymous 2009-01-20 0:54

Name: Anonymous 2009-01-20 2:26

Name: Anonymous 2009-01-20 2:31

Name: Anonymous 2009-01-20 2:55

Name: Anonymous 2009-01-20 3:12

Name: =+==F=R=O=Z=E=N==V=O=I=D==+= !FrOzEn2BUo 2009-01-20 3:15

Name: Anonymous 2009-01-20 4:03

Name: Anonymous 2009-01-20 4:08

Name: =+==F=R=O=Z=E=N==V=O=I=D==+= !FrOzEn2BUo 2009-01-20 4:16

Name: Anonymous 2009-01-20 4:17

Name: Anonymous 2009-01-20 4:20

Name: =+==F=R=O=Z=E=N==V=O=I=D==+= !FrOzEn2BUo 2009-01-20 4:22

Name: Anonymous 2009-01-20 4:25