/prog/ - Text file parsing

Name: Anonymous 2005-10-11 17:08

Hi, I need help.

Though I'm looking for a specific program rather than a programming solution, I figure this is the most appropriate board to ask on.

The situation is, I have a huge text file that goes generally like this:

----- cut -----
1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum
7 1240 - Lorem ipsum dolor sit amet
8 1241 - Aenean turpis
9 1242 - consectetuer adipiscing elit
10 1243 - Aenean turpis
11 1244 - Nunc sit amet sapien at libero euismod auctor
----- cut -----

Many lines are randomly repeated throughout the file. What I need to do is remove all duplicates, leaving only one occurrence of each line. The numbers are there, but they are to be ignored (and they're not exactly sequential either); only the text strings are to be compared.

I suppose this task could be achieved by means of a Killer PERL One-liner Of Doom, which should be fine, as I have a Linux distro handy. But I'm rather looking for a Windows-based solution, and one that is usable by a generally programming-ignorant user. I don't mind configuring or writing a *simple* script for an all-purpose text parser, but I'd rather it didn't require me to dive into the deepest depths of regexp syntax.

I hope you get the idea: I'm looking for an app capable of the task that's rather easy to handle on the user side. Any help will be greatly appreciated.

Name: Anonymous 2005-10-11 17:13

Perl is available for Windows.

Name: Anonymous 2005-10-11 23:58

fuckin sed man

Name: Anonymous 2005-10-12 21:44

>>1
Uh... pretty much anything is available for Windows. This includes Perl, which you could use, and FYI, this also includes Python, PHP, awk, sed, core/text/shell/diff/*/utils, GCC, and so on.

Name: Anonymous 2005-10-12 23:52

Pretty simple in Cygwin or Bash shell:

$ sed 's/ /\t/' test.txt | sed 's/ - /\t/' | sort -ut \t -k 3
1    1234 Lorem ipsum dolor sit amet
4    1237 Nunc sit amet sapien at libero euismod auctor
3    1236 Pellentesque pellentesque vehicula velit
2    1235 consectetuer adipiscing elit

The first sed converts the first space to a tab, the second sed converts the ' - ' to a tab, then sort takes the third field using tab as a delimiter and only outputs uniques. You can run it through sed again if you want the idiot delimiters back.

Name: Anonymous 2005-10-13 0:08

lol, actually that seems to get the wrong answer, had to do:
sed 's/ /\t/' test.txt | sed 's/ - /\t/' | sort -uk 3

instead. Something weird about setting a tab delimiter or something.
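
A likely explanation, offered as a sketch rather than a confirmed diagnosis: an unquoted \t is not a tab to a POSIX-style shell, so sort probably received the letter t as its delimiter. The Python below uses shlex (which follows POSIX shell quoting rules) to show what the command line turns into, and a hypothetical helper, key_from_field_3, that only approximates how sort -k 3 picks its key. With "t" as the delimiter, several different texts end up with the same empty key, which would fit the four-line output above. In bash, writing the delimiter as -t $'\t' should pass a real tab.

```python
import shlex

# To a POSIX shell an unquoted backslash just escapes the next character,
# so "\t" reaches sort as a plain "t", not as a tab.
argv = shlex.split(r"sort -ut \t -k 3")
print(argv)  # ['sort', '-ut', 't', '-k', '3']


def key_from_field_3(line, delimiter):
    """Approximate sort's -k 3 key: everything from the third field onward."""
    return delimiter.join(line.split(delimiter)[2:])


# Lines as they look after the two sed steps (tab-separated).
samples = [
    "1\t1234\tLorem ipsum dolor sit amet",
    "5\t1238\tAenean turpis",
    "6\t1239\tUt nec ipsum",
]

for line in samples:
    # First value: key when splitting on the letter "t" (what sort was given).
    # Second value: key when splitting on a real tab (what was intended).
    print(repr(key_from_field_3(line, "t")), repr(key_from_field_3(line, "\t")))
```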

Name: Anonymous 2005-10-20 11:32

I'd use python with pyparsing module.

Name: Anonymous 2005-10-27 10:12

while (<>) {
    if (/\d\s*\d\s*-\s*(.*)/) {
        unless ($mu{$1}) {
            $mu{$1} = 1;
            print "$1\n";
        }
    }
}
100 hours in perl tutorials!

Name: Paul McGuire 2005-11-01 23:45

Here are two different pyparsing implementations. The first filters duplicates by collecting the unique lines into a list (and more nearly matches the Perl submission above); the second transforms the input string by suppressing lines that pass the acceptDuplicateLinesOnly parse action. Granted, these are significantly longer than the Perl routine above, but I think the expression to be matched is a bit more readable.

-- Paul McGuire
Download pyparsing at http://pyparsing.sourceforge.net

data = """1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum
7 1240 - Lorem ipsum dolor sit amet
8 1241 - Aenean turpis
9 1242 - consectetuer adipiscing elit
10 1243 - Aenean turpis
11 1244 - Nunc sit amet sapien at libero euismod auctor
"""
from pyparsing import *

# 1) rip duplicates by building list of unique lines found using parseString
integer = Word(nums)
eol = LineEnd()
lineExpr = integer + integer + "-" + SkipTo(eol).setResultsName("body")

priorLines = set()
uniqueList = []
def selectUniqueLines(strg,loc,tokens):
    if not tokens.body in priorLines:
      priorLines.add( tokens.body )
      currentTextLine = line(loc,strg)
      uniqueList.append( currentTextLine )
lineExpr.setParseAction( selectUniqueLines )

# parse the data, building up our list of unique lines
OneOrMore(lineExpr).parseString(data)

print("\n".join(uniqueList))
print("")

# 2) rip duplicates by suppressing lines we have seen before, returning result using transformString
integer = Word(nums)
eol = LineEnd()
lineExpr = integer + integer + "-" + SkipTo(eol).setResultsName("body") + eol

priorLines = set()
def acceptDuplicateLinesOnly(strg,loc,tokens):
    if not tokens.body in priorLines:
      priorLines.add( tokens.body )
      raise ParseException(strg,loc,"")
lineExpr.setParseAction( acceptDuplicateLinesOnly )

print(lineExpr.suppress().transformString(data))

Both implementations print out:
1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum

Name: Anonymous 2005-11-02 9:35

>>9
Seriously...if you want to parse text, do it with a proper text-centered language instead of using a hack.

Name: Anonymous 2005-11-02 19:49

USE AWK OR PERL STFU OTHERWISE

Name: Anonymous 2005-11-04 1:24

If you have windows... CMD! AAAAHHA!

Name: Anonymous 2005-11-04 4:38

>>12
1. You are a stupid troll
2. In my Windows, I can do that with Perl, PHP, Bash, AWK, or Python, and that's just what I have installed right now.

Name: Anonymous 2005-11-05 18:50

Lol perl,awk,php,bash in windows? It sounds like you are using an inferior OS to do what superior OSes do all time. Have fun with your shitty pipes you ignorant fag.

Name: Anonymous 2005-11-06 10:38

>>14
IF UNIX IS SUPERIOR THEN IT MUST BE JAPANESE

I'm not going to bother replying to a troll claiming either Windows or Unices are superior.

Name: Anonymous 2005-11-08 17:00

>>15
Yeah. If you want to be cool, get two machines, two monitors, a copy of xp and install slackware, and a KVM or use x2vnc. Whoo

I mean, you need both to be cool.

Name: Anonymous 2005-11-08 17:12

>>16

stfb

Name: Anonymous 2005-11-08 22:40

Just tell the computer to count up all the lines and dimension an array with that amount of lines. Then, tell it to put each line of text after "- " and before the newline character per line in the array. Finally, search for any duplicates and remove them. Sort your list any way you like.
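
The steps above could look roughly like this in Python. This is only a sketch: remove_duplicates and sample are made-up names, and it tracks the texts it has already seen in a set instead of dimensioning an array up front, which is a swapped-in technique rather than what the post literally describes. It keeps the first line for each distinct text after "- " and leaves the original order alone.

```python
def remove_duplicates(lines):
    """Keep the first line for each distinct text found after "- "."""
    seen = set()
    kept = []
    for line in lines:
        # partition() splits at the first "- "; a line without it is
        # compared as a whole instead.
        _, separator, text = line.partition("- ")
        key = text.strip() if separator else line.strip()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


sample = """1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum
7 1240 - Lorem ipsum dolor sit amet
8 1241 - Aenean turpis
9 1242 - consectetuer adipiscing elit
10 1243 - Aenean turpis
11 1244 - Nunc sit amet sapien at libero euismod auctor""".splitlines()

for line in remove_duplicates(sample):
    print(line)
```

For a real file, something like remove_duplicates(open("test.txt")) should work the same way, though the lines then keep their trailing newlines.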

Name: Anonymous 2005-11-09 1:54

I dunno, I think you could easily make a programming parser. Just look into Lex. It's short and simple to write a token parser with that.

Name: Anonymous 2005-11-09 17:39

Import into an SQL database, then SELECT DISTINCT text FROM imported_table. Easy.
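
A minimal sketch of that idea using Python's built-in sqlite3 module and an in-memory database. The table and column names come from the query above; the extra line column, the sample rows, and the second query (which keeps the first full line per text) are assumptions added for illustration. Note that SELECT DISTINCT on its own returns only the text, without the numbers, and in no guaranteed order.

```python
import sqlite3

lines = [
    "1 1234 - Lorem ipsum dolor sit amet",
    "5 1238 - Aenean turpis",
    "7 1240 - Lorem ipsum dolor sit amet",
    "8 1241 - Aenean turpis",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE imported_table (line TEXT, text TEXT)")
for line in lines:
    # Everything after the first " - " is the text to compare.
    text = line.split(" - ", 1)[1]
    conn.execute("INSERT INTO imported_table VALUES (?, ?)", (line, text))

# The query from the post: one row per distinct text.
distinct_texts = [
    row[0] for row in conn.execute("SELECT DISTINCT text FROM imported_table")
]

# Variation: keep the first full line (numbers included) for each text.
first_lines = [
    row[0]
    for row in conn.execute(
        "SELECT line FROM imported_table "
        "WHERE rowid IN (SELECT MIN(rowid) FROM imported_table GROUP BY text) "
        "ORDER BY rowid"
    )
]

print("\n".join(first_lines))
```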

Name: Anonymous 2005-11-11 16:44

USE AWK GOD

Name: PHPAdvocate !MARtiNys66 2009-11-01 18:16



function parse_file($huge_text_file)
{
    preg_match_all('/([0-9]+ [0-9]+) - ([^\r\n]+)/', $huge_text_file, $matches);
    $unique_strings = array_unique($matches[2]);
    foreach($unique_strings as $key => $string)
    {
        $parsed_text .= $matches[1] . ' - ' . $string . "\n";
    }
    return $parsed_text;
}

Name: PHPAdvocate !MARtiNys66 2009-11-01 18:17

>>23

Make that $matches[1][$key].

Name: Anonymous 2009-11-01 18:23

>>23
Thanks man, I've been waiting since 2005 for this.

Name: Anonymous 2009-11-01 18:23

MATCH MY ANUS

Name: Anonymous 2010-11-25 23:14

Name: Anonymous 2011-04-20 12:10

[ASCII art]

Name: Anonymous 2011-06-12 0:38

MORE LIKE
TEXT FILE BUTT PLUGGING
AMIRITE LOLLLLLLLLLLLLLLLLLLLLLLLZzz!!11oNE!!1ONE1!

Name: Anonymous 2011-06-12 7:50

>Nunc sit amet sapien at libero euismod auctor
Perfect sig material. What does it mean?

Name: Anonymous 2011-06-12 9:48

perl -ne'print unless $x{/- (.*)/,()}++'

Name: Anonymous 2011-06-12 12:52

>>31
omg i executed that line and it gave my a viruss!!

Name: Anonymous 2011-06-12 13:20

Use [1] with a working perl script posted here.
It's old as fuck but it works.

1: http://tinyperl.sourceforge.net/

Name: Anonymous 2011-06-12 13:21

are you retarded >>33
he asked that question over 5 years ago..

Name: Anonymous 2011-06-12 13:31

>>34 sage field
