Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon.

Pages: 1-4041-

Finding Duplicate File Namesin Many Folders

Name: Anonymous 2009-01-19 18:45

Recently, a bug was found in the software my employer makes in which the same record number was issued to two or more clients, so that there would be different records with the same number on more than one client.

The bug was fixed, but we wanted to find out if any of our customers had been affected by this bug. A normal directory compare wouldn't work, because all the software we found could only compare between two folders; however, we need to compare the folders of multiple clients to see if any of them share file names with any others, which would be an indication that the bug had occurred.

So, I came up with an algorithm and wrote the code to do just that. The C# program I wrote can find 200,000+ duplicated records in 60+ network folders in about two minutes.

Challenge: come up with an algorithm that, given multiple folders, gives a list of file names shared between two or more folders, and for each duplicated file name found, a list of folders in which it is found.

Example:

Folders:

Folder1\
        1.txt
        2.txt
        3.txt
Folder2\
        3.txt
        9.txt
        100.txt
Folder3\
        2.txt
        9.txt
        55.txt


Results:

2.txt
      Folder1
      Folder3
3.txt
      Folder1
      Folder2
9.txt
      Folder2
      Folder3

Name: Anonymous 2009-01-19 18:52

>>1
Congratulations, you know how to use a map/dictionary/hashtable. I'm real proud of you, son.

Name: Anonymous 2009-01-19 18:52

$ ls -R | uniq

Name: Anonymous 2009-01-19 19:03

>>2
Your not my father

Name: Anonymous 2009-01-19 19:06

>>1
Uh, add filenames into different sets per folder, do an intersect on each set. Bingo. Like, 5 lines of perl.
GTFO

Name: Anonymous 2009-01-19 19:30

>>5
5 lines of Perl is enough for any program.

Name: Anonymous 2009-01-19 20:06

>>5
I don't think you thought your approach through, buddy, because it is, y'know, stupid as fuck. What you should do is create a dictionary using the filenames as keys and lists of directories as values.

Name: Anonymous 2009-01-19 20:11

>>7
sup OP. nobody cares about your 200 line enterprise C#. gtfo

Name: Anonymous 2009-01-19 20:17

>>8
1/10

Name: Anonymous 2009-01-19 20:34

>>1,7,9
it sure is Tyler Durden around here

Name: Anonymous 2009-01-19 20:37

what I would do is:
have a dictionary where keys = filenames and value = list of directories
walk through directories
obtain list for current filename in dictionary
append current directory to list
store list for current filename in dictionary

after:
foreach key,value in dictionary
print key ;; filename
if value.length > 1: print value ;; directories

Name: Anonymous 2009-01-19 20:38

>>11
I'd estimate that to be roughly 3 lines of Haskell, 11 lines of perl, 25 lines of Python, 250 lines of C

Name: Anonymous 2009-01-19 20:39

>>12
250 lines of C
Excluding implementation of dictionary. Easily >1k lines of code

Name: Anonymous 2009-01-19 20:49

>>12
I'd like to see this Haskell implementation

Name: Anonymous 2009-01-19 22:05

>>12
>>3
1 line of bash

Name: Anonymous 2009-01-19 22:10

Filenames, meh, I'd rather take a little longer and use checksums (GNU uniq is better than the limited BSD version).

find . -type f -exec md5sum {} \; | sort | uniq -D -w32

[Damn, I really need to clean out my image collection, seems I have some stuff in triplicate.]

Name: Anonymous 2009-01-19 22:22

>>16
Uh, the whole point is that different records could have the same name, so a checksum wouldn't accomplish the task!

>>7
>>11
Yep, that's basically what I did at first. The final implementation is a bit different, though. I use a case-insensitive SortedDictionary<string, TreeNode>, because I display the results in a TreeView control, and I was told to sort the results in it.

>>8
More like 400 lines of C# now, not including the generated GUI designer code.

>>12
If you use a hash table, it'll be more than 3 lines of Haskell, at least if you want it to be legible, because the included hash table uses IO out the ass.

Name: DADDYUNIQ 2009-01-19 22:32

Namesin
i'll tell you if you payme enough ;)

Name: Anonymous 2009-01-19 23:01

4 lines of Haskell, if you don't count the imports.

import System
import System.Directory
import Data.Map hiding (map, filter)

main = do dirs <- getArgs
          listings <- mapM (fmap (drop 2) . getDirectoryContents) dirs
          mapM print $ filter ((>1) . length . snd) $ assocs $ fromListWith (++)
                     $ concat $ zipWith zip listings $ map (repeat . (:[])) dirs

Name: Anonymous 2009-01-19 23:29

>>19
3 lines of Haskell, if you only count the imports.

Name: Anonymous 2009-01-19 23:40

And here's a 3-liner, disregarding imports.

import System
import System.Directory
import Data.Map hiding (map, filter)

main = mapM print =<< return . filter ((>1) . length . snd) . assocs . fromListWith (++) . concat
                  =<< mapM (\d -> fmap (map (flip (,) [d]) . drop 2) $ getDirectoryContents d)
                  =<< getArgs

Name: Anonymous 2009-01-19 23:51

>>20
Om

Name: Anonymous 2009-01-20 0:02

>>21
import System
import System.Directory
import Data.Map hiding (map, filter)
main = mapM print =<< return . filter ((>1) . length . snd) . assocs . fromListWith (++) . concat =<< mapM (\d -> fmap (map (flip (,) [d]) . drop 2) $ getDirectoryContents d) =<< getArgs


4-liner, including imports

Name: Anonymous 2009-01-20 0:14

(list-dups "C:\Documents and Settings\Christopher\My documents")

One-liner in my Lisp Domain-Specific Language.

Name: Anonymous 2009-01-20 0:29

>>12
And how many lines of Java?

Seriously though, I'm interested in the expressiveness of commonly used languages. Does Python fare that bad? I thought it was similar to Perl in code size (for legible code).

Name: Anonymous 2009-01-20 0:47

>>23
Programming is not about program size but efficiency.

Name: Anonymous 2009-01-20 0:54

>>26
Shorter programs are invariably more efficient to write. The only effective way to reduce program size is to create better abstractions, which just happen to be the best way to write code more efficiently.

Name: Anonymous 2009-01-20 0:54

>>26
LOL !

Name: Anonymous 2009-01-20 2:26

>>5,6,12
One line of Perl.

Name: Anonymous 2009-01-20 2:31

you didn't seriously write 250 lines of code in C# for that, did you?

Name: Anonymous 2009-01-20 2:55

one line of any language.
well, any language except assembly or FIOC.

Name: Anonymous 2009-01-20 3:12

i dunno but this sounds stupifying simple to me, finding unique filenames across MULTIPLE folders. so basically you can just pull up an anonymous list of ALL files in ALL the folders searched and sort it uniquely

like someone already mentioned, this can be done with posix tools like ls and sort in any posix shell

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 3:15

The advantage of languages which require hundreds of lines of code is that you can optimize and improve every algorithm.
With higher abstractions you can't do nothing but use the common built-in components which cannot be optimized.

_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:03

>>33
You can use a good optimizing compilter, or you could kep 2 versions, one where you simply implemented it concisely and it worked properly, and one where you went and OMGOPTIMIZED it. It's common for some types of soft to have a routine written in a language, then have the same routine written in assembly(where speed is critical) with SIMD instructions, then the application may chose to use the optimized version if the CPU allows it. The original goal is to get everything working and implemented according to your vision/reference/whatever, then to optimize what you can when everything is already in good state. As they say: "premature optimization is the root of all evil."

Now go READ YOUR SICP

Name: Anonymous 2009-01-20 4:08

>>33
Without the ability to create decent abstractions, complex algorithms become prohibitive to implement.

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 4:16

>>35
You can call your own abstracted functions.


_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:17

>>36
Wat

Name: Anonymous 2009-01-20 4:20

5 lines in BBCode

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 4:22

>>37
Just use the functions you wrote earlier to solve the same problem. e.g. every script i use might use this one:
function tag(x,y){if(!y){return document.getElementsByTagName(x)}else{return x.getElementsByTagName(y)}};


_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:25

>>35
>>37
YHBT

Name: Anonymous 2009-01-20 5:03

What difference does it make if an algorithm finishes in a second or a millisecond? Wouldn't the fact that you can write the program in a fraction of the time matter more?

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 5:10

>>41
"What difference does it make if an algorithm finishes in a second or a millisecond?"
What difference would it make if algorithm finishes in 100hours or 6 minutes?

_________________________
orbis terrarum delenda est

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 5:12

1 second=1000milliseconds
100hours=100*60=6000minutes(about 4 days)
6 minutes=6000/1000minutes
The difference between 4 days and 6minutes is obvious,now?
_________________________
orbis terrarum delenda est

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 5:25

Another example would be 100years vs 36.5days


_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 5:41

Enjoy wasting time writing your programs to scale to billions of items and then never actually using it for that

Name: Anonymous 2009-01-20 6:20

>>42
>>43
>>44
Expert unemployed Javascripter, no-one with a meaningful job can spend so much time posting that much shit.

Name: Anonymous 2009-01-20 6:31

recursive md5

Name: Anonymous 2009-01-20 6:43

>>43
Compilers can optimize better then you.

Name: Anonymous 2009-01-20 6:50

>>48
unless you're using gcc

Name: Anonymous 2009-01-20 7:44

>>48
unless you're using yhbt

Name: Anonymous 2009-01-20 9:00

Why would a performance obsessed retard have Javascript as his favorite language?
IAGTBT

Name: Anonymous 2009-01-20 10:04

bump for trollage

Name: Anonymous 2009-01-20 10:04

>>48
[spoiler]That is true, you should use a compiler, and then do further optimizations by hand.

Name: Anonymous 2009-01-20 10:09

>>53
Failure to preview post detected.

Name: Anonymous 2009-01-20 12:40

>>30
Actually, the total program is about 350 lines, but that includes all the GUI shit. The algorithm is about 40 lines, but that includes extra shit to handle creating TreeNodes and filtering to only look at certain files.

Name: Anonymous 2009-01-20 13:14

>>55
shit, C# still sucks. Also why write a GUI for it? Isn't a command line tool enough for your needs?

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 13:26

>>56
are you going to open a command line prompt and type out the command with parameters every time you use the tool?
For single-purpose one-use program its ok,but what about more permanent solution,like a convenient GUI interface?


_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 13:29

>>57
While your main point is correct, >>1's program fits in the category of single or rare use

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 13:39

>>58
However its not confined to the domain of problem of >>1
and can be used with any duplicated files.
_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 13:47

>>59
But how often would it need to be used?

Name: Anonymous 2009-01-20 14:11

>>60
The more important question is, who will use it? I rightly assume that our clients and my manager aren't familiar enough with the command line, so I made a nice GUI. If this were purely a one-time internal tool, I might not care to write a GUI, but this tool will be used by multiple people at multiple times, to ensure that they aren't affected by the bug, and that they won't be in the future if the bug regresses.

Name: Anonymous 2009-01-20 15:22

Write a man page. Thread over.

Name: Anonymous 2009-01-20 15:58

>>57
Leave the command line open all the time. /thread

Name: Anonymous 2009-01-20 19:54

Stop replying to nonexistent posts, you dumbasses.

Name: Anonymous 2011-01-31 20:14

<-- check em dubz

Name: Anonymous 2013-01-19 23:45

/prog/ will be spammed continuously until further notice. we apologize for any inconvenience this may cause.

Don't change these.
Name: Email:
Entire Thread Thread List