Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon.

Pages: 1-4041-

Halp with regex

Name: Anonymous 2008-08-04 3:54

Hey /prog/,
Lets say I have a file with many lines,
each line consists of a very very long random number (string)
Is it possible to write a regular expression to find the longest series of repeated numbers shared by all lines
aka
120123456789948761238746102347106416239847113958719385
217490812741234567890182472103571938547019112413956138756183756183756817356817356
192346112345678978932671658917612746183724681734682173649187245

(I bolded it for readability)

tldr;
how does I finded biggest repeaded series of numbers contained in all lines? (using regex or something one-line'ish)

Name: Anonymous 2008-08-04 4:07

My NP-Complete sense is tingling.

Name: Anonymous 2008-08-04 4:14

The most straight-forward way would be to find the shortest line and then to check every possible substring of it to see if it occurs in every other line.
It's not possible to do in a single regex.

Name: Anonymous 2008-08-04 4:19

>>3
O(n * m^2)

Where n is the number of lines and m is the length of the shortest line.

Name: Anonymous 2008-08-04 4:37

>>4
I said it was straight-forward, not that it was good. The worst-case performance is horrific.

Name: Anonymous 2008-08-04 6:05

>>5
O(rly)

Name: Anonymous 2008-08-04 6:39

>>5
I was thinking that if an algorithm's complexity is greater than O(n) that you cannot do it with regular expressions.  The caveat being that Perl style regular expressions are actually more powerful.

I could be wrong, I only doubt cause I never seen it explicitly said, but it seems obvious from the nature of FSA.

Name: Anonymous 2008-08-04 7:57

>>7
proof or stfu

Name: 7 2008-08-04 9:19

>>8
The proof is trivial: regular expression matching is known to be possible in O(n) with respect to the input length ( http://swtch.com/~rsc/regexp/regexp1.html ). Because you can't run algorithms with complexity greater than O(n) in O(n), you can't use a regular expression to perform such an algorithm.

Name: Anonymous 2008-08-04 9:34

>>1
this can't be done with a single regex unless you knew the longest repeated number in advance.
once you know that then regex isn't even required.

Name: Anonymous 2008-08-04 9:42

>>9
CRUSHING

Name: Anonymous 2008-08-04 10:50

>>9,11
You may want to reread that essay.

Name: Anonymous 2008-08-04 11:18

if you think of the regex as a DFA, NFA, or e-NFA, then its trivially seen to be O(n). of course regexs are much more powerful than those these days with all the lookahead stuff involved, so its obvious that tits gonna take a bit longer than O(n) if you are using the more powerful features of the regex, i.e. you are a regex-jedi.

Name: Anonymous 2008-08-04 15:22

(I bolded it for readability)

Name: Anonymous 2008-08-04 15:46

Here's an OFIOC slow as fuck solution to your problem:

import Control.Monad.Instances
import Data.Ord
import Monad
import List

main =
   do numbers <- fmap lines getContents
      putStrLn
         $ map (head numbers !!) $ map head $ maximumBy (comparing length)
         $ groupSequences $ sortBy (comparing (zipWith (-) =<< tail))
         $ map (map fst) $ filter (\l -> all ((snd (head l) ==) . snd) l)
         $ sequence $ map (zip [0..]) numbers
   where
      groupSequences matchList =
         uncurry (:)
            $ mapAccumL ((.) (uncurry $ flip (,)) . flip splitAt) matchList
            $ liftM2 (:) ((+1) . head) (zipWith (-) =<< tail)
            $ findIndices (uncurry $ (/=) . map (subtract 1))
            $ (zip =<< tail) $ matchList

Name: Anonymous 2008-08-04 15:57

What is the minimum length of the substring? 1 number seems idiotic.
I can write something for you in a non-toy-language.

Name: Anonymous 2008-08-04 16:12

>>16
Minimum is indeed 1. Write away, sir.

Name: Anonymous 2008-08-04 16:17

>>16
Minimum length is more like 50

Name: Anonymous 2008-08-04 16:25

How do i lognest comon sobstring?

Name: Anonymous 2008-08-04 16:26

>>15
Is OFIOC a fancy way of saying that it's obfuscated as fuck?

Also whoa, /prog/ has syntax highlighting now? wowza

Name: Anonymous 2008-08-04 16:29

>>20
Read SICP

Name: Anonymous 2008-08-04 16:55

>>20
Slow and obfuscated is how I roll, son.

Name: Anonymous 2008-08-04 19:24

imconfus.jpg

Name: Anonymous 2008-08-04 20:07

penis_into_penishole.jpg

Name: Anonymous 2008-08-04 21:52

wereDiscussingPenises.jpg

Name: Anonymous 2008-08-05 19:37

so what happens if you let it have error,
I mean like finding a sequence that is almost the same,
but can be one character off,
How would one go about doing that?
aka

123456789
153456789
123456729
123446789

would all be acceptable

Name: Anonymous 2008-08-05 19:58

>>26
It would be exactly the same, with a slightly different match function.  What the fuck, this isn't hard.

Name: Anonymous 2008-08-05 20:10

>>26
If each line's error is in a distinct position of the sequence, it is doable with some changes to >>15.

Name: Anonymous 2008-08-06 0:03

>>15
Holy fuck.
Did you wrote that?

Name: Anonymous 2008-08-06 3:37

use strict;
use List::Util qw(max);
use Benchmark qw(:all :hireswallclock);

my $loopcount=10;
my $linecount=(shift or 4);
my $linelength=(shift or 40000);

my @list=map{join "",map{int rand 10}1..$linelength}1..$linecount;

my $reg=qr/1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)
          |6(?=7)|7(?=8)|8(?=9)|9(?=0)|0(?=1)/x;

sub insert($$$){
    my($ref,$key,$pos)=@_;
   
    return if $ref->{$key} or length $key == 1;
    $ref->{$key}=$pos;
   
    insert($ref,(substr $key,1),$pos+1);
    insert($ref,(substr $key,0,length($key)-1),$pos);
}

sub finder(){
    my @sequences=map{
        my $matches={};
        insert($matches,$&,$-[0]) while /$reg+\d/g;
        $matches;
    } @list;
   
    my($first,@rest)=@sequences;
    my @matches=grep{
        my $key=$_;
        not grep{not $_->{$key}}@rest
    }keys %$first;

    my $len=max map{length $_}@matches;
    grep{$len==length $_}@matches
}

my @matches;
my $time=timeit($loopcount,sub{@matches=finder()});

print "Longest series of repeated numbers in $linecount lines $linelength charactes each:
    ",join " ",@matches,"\n";
print "$loopcount loops of code took:",timestr($time);


---

C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        0123 2345 4567 9012 3456 5678 7890 8901 1234 6789
10 loops of code took:5.12966 wallclock secs ( 5.13 usr +  0.00 sys =  5.13 CPU) @  1.95/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        0123 2345 4567 9012 3456 7890 5678 8901 1234 6789
10 loops of code took:5.1259 wallclock secs ( 5.13 usr +  0.00 sys =  5.13 CPU) @  1.95/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        89012
10 loops of code took:5.16758 wallclock secs ( 5.16 usr +  0.00 sys =  5.16 CPU) @  1.94/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        0123 2345 4567 9012 3456 5678 1234 6789
10 loops of code took:5.1835 wallclock secs ( 5.17 usr +  0.00 sys =  5.17 CPU) @  1.93/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        2345 4567 9012 7890 5678 1234 6789
10 loops of code took:5.03543 wallclock secs ( 5.03 usr +  0.00 sys =  5.03 CPU) @  1.99/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        9012 3456 0123 5678 7890 8901 4567 1234 6789
10 loops of code took:5.27253 wallclock secs ( 5.22 usr +  0.02 sys =  5.24 CPU) @  1.91/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
        9012 3456 0123 7890 5678 8901 2345 4567 1234
10 loops of code took:5.13019 wallclock secs ( 5.13 usr +  0.00 sys =  5.13 CPU) @  1.95/s (n=10)


could someone time >>15-kun's program?

Name: Anonymous 2008-08-06 3:41

use strict;

LOL

Name: Anonymous 2008-08-06 5:02

>>30
I came rainbows

Name: Anonymous 2008-08-06 5:12

>>30
#!/usr/bin/perl -w                                      # camel code
use strict;

                                           $_='ev
                                       al("seek\040D
           ATA,0,                  0;");foreach(1..3)
       {<DATA>;}my               @camel1hump;my$camel;
  my$Camel  ;while(             <DATA>){$_=sprintf("%-6
9s",$_);my@dromedary           1=split(//);if(defined($
_=<DATA>)){@camel1hum        p=split(//);}while(@dromeda
 ry1){my$camel1hump=0      ;my$CAMEL=3;if(defined($_=shif
        t(@dromedary1    ))&&/\S/){$camel1hump+=1<<$CAMEL;}
       $CAMEL--;if(d   efined($_=shift(@dromedary1))&&/\S/){
      $camel1hump+=1  <<$CAMEL;}$CAMEL--;if(defined($_=shift(
     @camel1hump))&&/\S/){$camel1hump+=1<<$CAMEL;}$CAMEL--;if(
     defined($_=shift(@camel1hump))&&/\S/){$camel1hump+=1<<$CAME
     L;;}$camel.=(split(//,"\040..m`{/J\047\134}L^7FX"))[$camel1h
      ump];}$camel.="\n";}@camel1hump=split(/\n/,$camel);foreach(@
      camel1hump){chomp;$Camel=$_;y/LJF7\173\175`\047/\061\062\063\
      064\065\066\067\070/;y/12345678/JL7F\175\173\047`/;$_=reverse;
       print"$_\040$Camel\n";}foreach(@camel1hump){chomp;$Camel=$_;y
        /LJF7\173\175`\047/12345678/;y/12345678/JL7F\175\173\0 47`/;
         $_=reverse;print"\040$_$Camel\n";}';;s/\s*//g;;eval;   eval
           ("seek\040DATA,0,0;");undef$/;$_=<DATA>;s/\s*//g;(   );;s
             ;^.*_;;;map{eval"print\"$_\"";}/.{4}/g; __DATA__   \124
               \1   50\145\040\165\163\145\040\157\1 46\040\1  41\0
                    40\143\141  \155\145\1 54\040\1   51\155\  141
                    \147\145\0  40\151\156 \040\141    \163\16 3\
                     157\143\   151\141\16  4\151\1     57\156
                     \040\167  \151\164\1   50\040\      120\1
                     45\162\   154\040\15    1\163\      040\14
                     1\040\1   64\162\1      41\144       \145\
                     155\14    1\162\       153\04        0\157
                      \146\     040\11     7\047\         122\1
                      45\15      1\154\1  54\171          \040
                      \046\         012\101\16            3\16
                      3\15           7\143\15             1\14
                      1\16            4\145\163           \054
                     \040            \111\156\14         3\056
                    \040\         125\163\145\14         4\040\
                    167\1        51\164\1  50\0         40\160\
                  145\162                              \155\151
                \163\163                                \151\1
              57\156\056

Name: Anonymous 2008-08-06 5:33

>>30
Moar like C:\audrey> amirite

Name: Anonymous 2008-08-06 6:11

>>34
?

Name: Anonymous 2008-08-06 9:47

>>30
could someone time >>15-kun's program?
Possibly, if it actually did what it was supposed to. Which it doesn't.

Name: Anonymous 2008-08-06 11:35

>>29
Yes, I did.

>>30
I assure you >>15 is not worth timing, it's much, much slower than >>30.

>>36
It does do what it's supposed to. Test it all you want if you don't believe me.

Name: Anonymous 2008-08-06 12:59

Wait a minute, can the common series be something like "0472619" or does it have to be [n,n+1,...,m], like "123456789"?

Name: Anonymous 2008-08-06 13:25

>>38
Any common substring.

Name: Anonymous 2008-08-06 14:08

>>39
Well, in that case, >>30 does not do what it's supposed to.

Name: Anonymous 2008-08-06 14:12

>>38
Hello my name is >>15 and I think it has to be [n,n+1,...,m], like "123456789", as bolded in >>1

Name: Anonymous 2008-08-06 14:12

I man my name is >>30, not >>15

Name: Anonymous 2008-08-06 15:14

Name: Anonymous 2008-08-06 15:14

Have fun doing it for n Strings.

Name: Anonymous 2008-08-06 20:25

>>44
I will, along with my anus.

Name: Anonymous 2008-08-06 21:30

Here's the new and improved Haskell code. It only finds substrings with lengths bigger than 2, but it's orders of magnitude faster than >>15.

import Control.Monad.Instances
import Control.Monad
import Control.Arrow
import Data.Array
import Data.List
import Data.Ord
import System
import Random

--benchMain =
main =
   do [lines, lineLength] <- fmap (map read) getArgs
      putStrLn . longestSubNumber
         =<< (replicateM lines . replicateM lineLength $ randomRIO ('0','9'))

realMain =
--main =
   putStrLn . longestSubNumber . lines =<< getContents

longestSubNumber [firstNum, secondNum] =
   maximumBy (comparing length) $ commonSubNumbers firstNum secondNum

longestSubNumber numbers =
   maximumBy (comparing length)
      $ uncurry (foldr (concatMap . commonSubNumbers))
      $ first (liftM2 commonSubNumbers head last) $ splitAt 2 numbers

commonSubNumbers firstNum secondNum =
   nubBy (flip isInfixOf)
      $ map (map (head numbers !!) . map head) $ reverse
      $ sortByArrayOn length bounds $ filter ((>2) . length)
      $ groupSequences
      $ sortByArrayOn ((.) head . zipWith (-) =<< tail) bounds
      $ sortByArrayOn head bounds
      $ concatMap sequence $ filter (all (not . null)) $ transpose
      $ map (elems . accumArray (flip (:)) [] ('0','9') . flip zip [0..]) numbers
   where
      bounds = ((,) =<< negate) $ maximum $ map length numbers
      numbers = [firstNum, secondNum]

groupSequences matchList =
   snd $ mapAccumL ((.) (uncurry $ flip (,)) . flip splitAt) matchList
       $ liftM2 (:) ((+1) . head) (zipWith (-) =<< tail)
       $ findIndices (uncurry $ (/=) . map (subtract 1))
       $ (zip =<< tail) $ matchList ++ [[]]

sortByArrayOn rankElem bounds list =
   concat $ map reverse $ filter (not . null) $ elems
          $ accumArray (flip (:)) [] bounds 
          $ map ((,) =<< rankElem) list

Don't change these.
Name: Email:
Entire Thread Thread List