Text file parsing

Name: Anonymous 2005-10-11 17:08

Hi, I need help.

Though I'm looking for a specific program rather than a programming solution, I figure this is the most proper board to ask.

The situation is, I have a huge text file that goes generally like this:

----- cut -----
1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum
7 1240 - Lorem ipsum dolor sit amet
8 1241 - Aenean turpis
9 1242 - consectetuer adipiscing elit
10 1243 - Aenean turpis
11 1244 - Nunc sit amet sapien at libero euismod auctor
----- cut -----

Many lines are randomly repeated throughout the file. What I need to do is remove all duplicates, leaving only one occurrence of each line. The numbers are there, but they are to be ignored (and they're not exactly sequential either); only the text strings are to be compared.
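To make the comparison rule concrete, here is a minimal Python sketch of "ignore the numbers, compare only the text". It assumes every line uses " - " as the separator between the numbers and the text, as in the sample above; the helper name text_key is made up for illustration.

```python
def text_key(line):
    """Return the text after the first " - ", ignoring the leading numbers.

    Assumption: " - " separates the numbers from the text, as in the sample.
    A line without that separator yields an empty key.
    """
    _numbers, _sep, text = line.partition(" - ")
    return text.strip()


a = "1 1234 - Lorem ipsum dolor sit amet"
b = "7 1240 - Lorem ipsum dolor sit amet"

# The numbers differ, but the keys are equal, so b counts as a duplicate of a.
print(text_key(a) == text_key(b))
```

Two lines are duplicates exactly when their keys are equal, whatever numbers precede them.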

I suppose this task could be achieved by means of a Killer PERL One-liner Of Doom, which would be fine, as I have a Linux distro handy. But I'm rather looking for a Windows-based solution, and one that is usable by a generally programming-ignorant user. I don't mind configuring or writing a *simple* script for an all-purpose text parser, but I'd rather it didn't require me to dive into the deepest depths of regexp syntax.

I hope you get the idea: I'm looking for an app capable of the task that's reasonably easy to handle on the user side. Any help will be greatly appreciated.

Name: Paul McGuire 2005-11-01 23:45

Here are two different pyparsing implementations. The first filters duplicates by collecting the unique lines into a list (and more nearly matches the Perl submission above); the second transforms the input string by suppressing lines that pass the acceptDuplicateLinesOnly parse action. Granted, these are significantly longer than the Perl routine above, but I think the expression to be matched is a bit more readable.

-- Paul McGuire
Download pyparsing at http://pyparsing.sourceforge.net

data = """1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum
7 1240 - Lorem ipsum dolor sit amet
8 1241 - Aenean turpis
9 1242 - consectetuer adipiscing elit
10 1243 - Aenean turpis
11 1244 - Nunc sit amet sapien at libero euismod auctor
"""
from pyparsing import *

# 1) rip duplicates by building list of unique lines found using parseString
integer = Word(nums)
eol = LineEnd()
lineExpr = integer + integer + "-" + SkipTo(eol).setResultsName("body")

priorLines = set()
uniqueList = []
def selectUniqueLines(strg,loc,tokens):
    # keep the full text line only the first time its body is seen
    if tokens.body not in priorLines:
        priorLines.add( tokens.body )
        currentTextLine = line(loc,strg)
        uniqueList.append( currentTextLine )
lineExpr.setParseAction( selectUniqueLines )

# parse the data, building up our list of unique lines
OneOrMore(lineExpr).parseString(data)

print("\n".join(uniqueList))
print()


# 2) rip duplicates by suppressing lines we have seen before, returning result using transformString
integer = Word(nums)
eol = LineEnd()
lineExpr = integer + integer + "-" + SkipTo(eol).setResultsName("body") + eol

priorLines = set()
def acceptDuplicateLinesOnly(strg,loc,tokens):
    # first sighting of a body: fail the match so transformString leaves the line in place
    if tokens.body not in priorLines:
        priorLines.add( tokens.body )
        raise ParseException(strg,loc,"")
lineExpr.setParseAction( acceptDuplicateLinesOnly )

# lines that still match are duplicates; suppress() removes them from the output
print(lineExpr.suppress().transformString(data))


Both implementations print out:
1 1234 - Lorem ipsum dolor sit amet
2 1235 - consectetuer adipiscing elit
3 1236 - Pellentesque pellentesque vehicula velit
4 1237 - Nunc sit amet sapien at libero euismod auctor
5 1238 - Aenean turpis
6 1239 - Ut nec ipsum
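
For comparison, the same "remember every body seen so far" idea can be sketched with the standard library alone, reading one line at a time so a huge file never has to be held in memory. This is only a sketch under assumptions: the regular expression is a hypothetical mirror of lineExpr (two integers, a dash, then the body), the names LINE_RE and unique_lines are made up for illustration, and lines that do not match the pattern are passed through unchanged.

```python
import re

# Hypothetical pattern mirroring lineExpr: integer, integer, "-", then the body.
LINE_RE = re.compile(r"^\s*\d+\s+\d+\s+-\s+(.*?)\s*$")


def unique_lines(lines):
    """Yield each line whose body has not been seen before, in input order."""
    seen = set()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Not a numbered record (assumption: keep such lines as they are).
            yield line
            continue
        body = match.group(1)
        if body not in seen:
            seen.add(body)
            yield line


sample = [
    "1 1234 - Lorem ipsum dolor sit amet\n",
    "2 1235 - Aenean turpis\n",
    "3 1236 - Lorem ipsum dolor sit amet\n",
]
# Prints the first two lines; the third repeats the body of the first.
print("".join(unique_lines(sample)), end="")
```

Because unique_lines accepts any iterable of lines, an open file object can be passed to it directly and the result written out with writelines.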
