Halp with regex
1
Name:
Anonymous
2008-08-04 3:54
Hey /prog/,
Lets say I have a file with many lines,
each line consists of a very very long random number (string)
Is it possible to write a regular expression to find the longest series of repeated numbers shared by all lines
aka
120123456789 948761238746102347106416239847113958719385
21749081274123456789 0182472103571938547019112413956138756183756183756817356817356
1923461123456789 78932671658917612746183724681734682173649187245
(I bolded it for readability)
tldr;
how does I finded biggest repeaded series of numbers contained in all lines? (using regex or something one-line'ish)
2
Name:
Anonymous
2008-08-04 4:07
My NP-Complete sense is tingling.
3
Name:
Anonymous
2008-08-04 4:14
The most straight-forward way would be to find the shortest line and then to check every possible substring of it to see if it occurs in every other line.
It's not possible to do in a single regex.
4
Name:
Anonymous
2008-08-04 4:19
>>3
O(n * m^2)
Where n is the number of lines and m is the length of the shortest line.
5
Name:
Anonymous
2008-08-04 4:37
>>4
I said it was straight-forward, not that it was good. The worst-case performance is horrific.
6
Name:
Anonymous
2008-08-04 6:05
7
Name:
Anonymous
2008-08-04 6:39
>>5
I was thinking that if an algorithm's complexity is greater than O(n) that you cannot do it with regular expressions. The caveat being that Perl style regular expressions are actually more powerful.
I could be wrong, I only doubt cause I never seen it explicitly said, but it seems obvious from the nature of FSA.
8
Name:
Anonymous
2008-08-04 7:57
9
Name:
7
2008-08-04 9:19
>>8
The proof is trivial: regular expression matching is known to be possible in O(n) with respect to the input length (
http://swtch.com/~rsc/regexp/regexp1.html ). Because you can't run algorithms with complexity greater than O(n) in O(n), you can't use a regular expression to perform such an algorithm.
10
Name:
Anonymous
2008-08-04 9:34
>>1
this can't be done with a single regex unless you knew the longest repeated number in advance.
once you know that then regex isn't even required.
11
Name:
Anonymous
2008-08-04 9:42
12
Name:
Anonymous
2008-08-04 10:50
>>9,11
You may want to reread that essay.
13
Name:
Anonymous
2008-08-04 11:18
if you think of the regex as a DFA, NFA, or e-NFA, then its trivially seen to be O(n). of course regexs are much more powerful than those these days with all the lookahead stuff involved, so its obvious that tits gonna take a bit longer than O(n) if you are using the more powerful features of the regex, i.e. you are a regex-jedi.
14
Name:
Anonymous
2008-08-04 15:22
(I bolded it for readability)
15
Name:
Anonymous
2008-08-04 15:46
Here's an OFIOC slow as fuck solution to your problem:
import Control.Monad.Instances
import Data.Ord
import Monad
import List
main =
do numbers <- fmap lines getContents
putStrLn
$ map (head numbers !!) $ map head $ maximumBy (comparing length)
$ groupSequences $ sortBy (comparing (zipWith (-) =<< tail))
$ map (map fst) $ filter (\l -> all ((snd (head l) ==) . snd) l)
$ sequence $ map (zip [0..]) numbers
where
groupSequences matchList =
uncurry (:)
$ mapAccumL ((.) (uncurry $ flip (,)) . flip splitAt) matchList
$ liftM2 (:) ((+1) . head) (zipWith (-) =<< tail)
$ findIndices (uncurry $ (/=) . map (subtract 1))
$ (zip =<< tail) $ matchList
16
Name:
Anonymous
2008-08-04 15:57
What is the minimum length of the substring? 1 number seems idiotic.
I can write something for you in a non-toy-language.
17
Name:
Anonymous
2008-08-04 16:12
>>16
Minimum is indeed 1. Write away, sir.
18
Name:
Anonymous
2008-08-04 16:17
>>16
Minimum length is more like 50
19
Name:
Anonymous
2008-08-04 16:25
How do i lognest comon sobstring?
20
Name:
Anonymous
2008-08-04 16:26
>>15
Is OFIOC a fancy way of saying that it's obfuscated as fuck?
Also whoa, /prog/ has syntax highlighting now? wowza
21
Name:
Anonymous
2008-08-04 16:29
22
Name:
Anonymous
2008-08-04 16:55
>>20
Slow and obfuscated is how I roll, son.
23
Name:
Anonymous
2008-08-04 19:24
imconfus.jpg
24
Name:
Anonymous
2008-08-04 20:07
penis_into_penishole.jpg
25
Name:
Anonymous
2008-08-04 21:52
wereDiscussingPenises.jpg
26
Name:
Anonymous
2008-08-05 19:37
so what happens if you let it have error,
I mean like finding a sequence that is almost the same,
but can be one character off,
How would one go about doing that?
aka
123456789
15 3456789
12345672 9
12344 6789
would all be acceptable
27
Name:
Anonymous
2008-08-05 19:58
>>26
It would be exactly the same, with a slightly different match function. What the fuck, this isn't hard.
28
Name:
Anonymous
2008-08-05 20:10
>>26
If each line's error is in a distinct position of the sequence, it is doable with some changes to
>>15 .
29
Name:
Anonymous
2008-08-06 0:03
>>15
Holy fuck.
Did you wrote that?
30
Name:
Anonymous
2008-08-06 3:37
use strict;
use List::Util qw(max);
use Benchmark qw(:all :hireswallclock);
my $loopcount=10;
my $linecount=(shift or 4);
my $linelength=(shift or 40000);
my @list=map{join "",map{int rand 10}1..$linelength}1..$linecount;
my $reg=qr/1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)
|6(?=7)|7(?=8)|8(?=9)|9(?=0)|0(?=1)/x;
sub insert($$$){
my($ref,$key,$pos)=@_;
return if $ref->{$key} or length $key == 1;
$ref->{$key}=$pos;
insert($ref,(substr $key,1),$pos+1);
insert($ref,(substr $key,0,length($key)-1),$pos);
}
sub finder(){
my @sequences=map{
my $matches={};
insert($matches,$&,$-[0]) while /$reg+\d/g;
$matches;
} @list;
my($first,@rest)=@sequences;
my @matches=grep{
my $key=$_;
not grep{not $_->{$key}}@rest
}keys %$first;
my $len=max map{length $_}@matches;
grep{$len==length $_}@matches
}
my @matches;
my $time=timeit($loopcount,sub{@matches=finder()});
print "Longest series of repeated numbers in $linecount lines $linelength charactes each:
",join " ",@matches,"\n";
print "$loopcount loops of code took:",timestr($time);
---
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
0123 2345 4567 9012 3456 5678 7890 8901 1234 6789
10 loops of code took:5.12966 wallclock secs ( 5.13 usr + 0.00 sys = 5.13 CPU) @ 1.95/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
0123 2345 4567 9012 3456 7890 5678 8901 1234 6789
10 loops of code took:5.1259 wallclock secs ( 5.13 usr + 0.00 sys = 5.13 CPU) @ 1.95/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
89012
10 loops of code took:5.16758 wallclock secs ( 5.16 usr + 0.00 sys = 5.16 CPU) @ 1.94/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
0123 2345 4567 9012 3456 5678 1234 6789
10 loops of code took:5.1835 wallclock secs ( 5.17 usr + 0.00 sys = 5.17 CPU) @ 1.93/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
2345 4567 9012 7890 5678 1234 6789
10 loops of code took:5.03543 wallclock secs ( 5.03 usr + 0.00 sys = 5.03 CPU) @ 1.99/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
9012 3456 0123 5678 7890 8901 4567 1234 6789
10 loops of code took:5.27253 wallclock secs ( 5.22 usr + 0.02 sys = 5.24 CPU) @ 1.91/s (n=10)
C:\andrey>perl go.pl 4 40000
Longest series of repeated numbers in 4 lines 40000 charactes each:
9012 3456 0123 7890 5678 8901 2345 4567 1234
10 loops of code took:5.13019 wallclock secs ( 5.13 usr + 0.00 sys = 5.13 CPU) @ 1.95/s (n=10)
could someone time
>>15- kun's program?
31
Name:
Anonymous
2008-08-06 3:41
use strict;
LOL
32
Name:
Anonymous
2008-08-06 5:02
33
Name:
Anonymous
2008-08-06 5:12
>>30
#!/usr/bin/perl -w # camel code
use strict;
$_='ev
al("seek\040D
ATA,0, 0;");foreach(1..3)
{<DATA>;}my @camel1hump;my$camel;
my$Camel ;while( <DATA>){$_=sprintf("%-6
9s",$_);my@dromedary 1=split(//);if(defined($
_=<DATA>)){@camel1hum p=split(//);}while(@dromeda
ry1){my$camel1hump=0 ;my$CAMEL=3;if(defined($_=shif
t(@dromedary1 ))&&/\S/){$camel1hump+=1<<$CAMEL;}
$CAMEL--;if(d efined($_=shift(@dromedary1))&&/\S/){
$camel1hump+=1 <<$CAMEL;}$CAMEL--;if(defined($_=shift(
@camel1hump))&&/\S/){$camel1hump+=1<<$CAMEL;}$CAMEL--;if(
defined($_=shift(@camel1hump))&&/\S/){$camel1hump+=1<<$CAME
L;;}$camel.=(split(//,"\040..m`{/J\047\134}L^7FX"))[$camel1h
ump];}$camel.="\n";}@camel1hump=split(/\n/,$camel);foreach(@
camel1hump){chomp;$Camel=$_;y/LJF7\173\175`\047/\061\062\063\
064\065\066\067\070/;y/12345678/JL7F\175\173\047`/;$_=reverse;
print"$_\040$Camel\n";}foreach(@camel1hump){chomp;$Camel=$_;y
/LJF7\173\175`\047/12345678/;y/12345678/JL7F\175\173\0 47`/;
$_=reverse;print"\040$_$Camel\n";}';;s/\s*//g;;eval; eval
("seek\040DATA,0,0;");undef$/;$_=<DATA>;s/\s*//g;( );;s
;^.*_;;;map{eval"print\"$_\"";}/.{4}/g; __DATA__ \124
\1 50\145\040\165\163\145\040\157\1 46\040\1 41\0
40\143\141 \155\145\1 54\040\1 51\155\ 141
\147\145\0 40\151\156 \040\141 \163\16 3\
157\143\ 151\141\16 4\151\1 57\156
\040\167 \151\164\1 50\040\ 120\1
45\162\ 154\040\15 1\163\ 040\14
1\040\1 64\162\1 41\144 \145\
155\14 1\162\ 153\04 0\157
\146\ 040\11 7\047\ 122\1
45\15 1\154\1 54\171 \040
\046\ 012\101\16 3\16
3\15 7\143\15 1\14
1\16 4\145\163 \054
\040 \111\156\14 3\056
\040\ 125\163\145\14 4\040\
167\1 51\164\1 50\0 40\160\
145\162 \155\151
\163\163 \151\1
57\156\056
34
Name:
Anonymous
2008-08-06 5:33
>>30
Moar like C:\audrey> amirite
35
Name:
Anonymous
2008-08-06 6:11
36
Name:
Anonymous
2008-08-06 9:47
>>30
could someone time >>15- kun's program?
Possibly, if it actually did what it was supposed to. Which it doesn't.
37
Name:
Anonymous
2008-08-06 11:35
>>29
Yes, I did.
>>30
I assure you
>>15 is not worth timing, it's much, much slower than
>>30 .
>>36
It does do what it's supposed to. Test it all you want if you don't believe me.
38
Name:
Anonymous
2008-08-06 12:59
Wait a minute, can the common series be something like "0472619" or does it have to be [n,n+1,...,m] , like "123456789"?
39
Name:
Anonymous
2008-08-06 13:25
>>38
Any common substring.
40
Name:
Anonymous
2008-08-06 14:08
>>39
Well, in that case,
>>30 does not do what it's supposed to.
41
Name:
Anonymous
2008-08-06 14:12
>>38
Hello my name is
>>15 and I think it has to be
[n,n+1,...,m] , like "123456789", as bolded in
>>1
42
Name:
Anonymous
2008-08-06 14:12
I man my name is
>>30, not
>>15
43
Name:
Anonymous
2008-08-06 15:14
44
Name:
Anonymous
2008-08-06 15:14
Have fun doing it for n Strings.
45
Name:
Anonymous
2008-08-06 20:25
>>44
I will, along with my anus.
46
Name:
Anonymous
2008-08-06 21:30
Here's the new and improved Haskell code. It only finds substrings with lengths bigger than 2, but it's orders of magnitude faster than
>>15 .
import Control.Monad.Instances
import Control.Monad
import Control.Arrow
import Data.Array
import Data.List
import Data.Ord
import System
import Random
--benchMain =
main =
do [lines, lineLength] <- fmap (map read) getArgs
putStrLn . longestSubNumber
=<< (replicateM lines . replicateM lineLength $ randomRIO ('0','9'))
realMain =
--main =
putStrLn . longestSubNumber . lines =<< getContents
longestSubNumber [firstNum, secondNum] =
maximumBy (comparing length) $ commonSubNumbers firstNum secondNum
longestSubNumber numbers =
maximumBy (comparing length)
$ uncurry (foldr (concatMap . commonSubNumbers))
$ first (liftM2 commonSubNumbers head last) $ splitAt 2 numbers
commonSubNumbers firstNum secondNum =
nubBy (flip isInfixOf)
$ map (map (head numbers !!) . map head) $ reverse
$ sortByArrayOn length bounds $ filter ((>2) . length)
$ groupSequences
$ sortByArrayOn ((.) head . zipWith (-) =<< tail) bounds
$ sortByArrayOn head bounds
$ concatMap sequence $ filter (all (not . null)) $ transpose
$ map (elems . accumArray (flip (:)) [] ('0','9') . flip zip [0..]) numbers
where
bounds = ((,) =<< negate) $ maximum $ map length numbers
numbers = [firstNum, secondNum]
groupSequences matchList =
snd $ mapAccumL ((.) (uncurry $ flip (,)) . flip splitAt) matchList
$ liftM2 (:) ((+1) . head) (zipWith (-) =<< tail)
$ findIndices (uncurry $ (/=) . map (subtract 1))
$ (zip =<< tail) $ matchList ++ [[]]
sortByArrayOn rankElem bounds list =
concat $ map reverse $ filter (not . null) $ elems
$ accumArray (flip (:)) [] bounds
$ map ((,) =<< rankElem) list