Python: Remove Duplicate Groups Of Lines Of Text

June 08, 2024

I know how to remove duplicate lines and duplicate characters from text, but I'm trying to accomplish something more complicated in python3. I have text files that might or might n

Solution 1:

I came up with the following solution. It might still have some unaccounted-for edge cases, and it might not be the most efficient way to do this, but at least after my preliminary testing, it seems to work.

This repost already fixes some bugs in my originally submitted version.

Any suggestions for improvement are welcome.

# Remove all but the first occurrence of the longest
# duplicated group of lines from a block of text.
# In this utility, a "group" of lines is considered
# to be two or more consecutive lines.
#
# Much of this code has been shamelessly stolen from
# https://programmingpraxis.com/2010/12/14/longest-duplicated-substring/

import sys

from itertools import starmap, takewhile, tee
from operator import eq, truth

# imap and izip no longer exist in python3 itertools.
# These are simply equivalent to map and zip in python3.
try:
    # python2 ...
    from itertools import imap
except ImportError:
    # python3 ...
    imap = map
try:
    # python2 ...
    from itertools import izip
except ImportError:
    # python3 ...
    izip = zip

def remove_longest_dup_line_group(text):
    if not text:
        return ''
    # Unlike in the original code, here we're dealing
    # with groups of whole lines instead of strings
    # (groups of characters). So we split the incoming
    # data into a list of lines, and we then apply the
    # algorithm to these lines, treating a line in the
    # same way that the original algorithm treats an
    # individual character.
    lines = text.split('\n')
    ld = longest_duplicate(lines)
    if not ld:
        return text
    tokens = text.split(ld)
    if len(tokens) < 1:
        # Defensive programming: this shouldn't ever happen,
        # but just in case ...
        return text
    return '{}{}{}'.format(tokens[0], ld, ''.join(tokens[1:]))

def pairwise(iterable):
    # Yield overlapping pairs: (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def prefix(a, b):
    # Count the leading lines that the two suffixes have in common.
    count = sum(takewhile(truth, imap(eq, a, b)))
    if count < 2:
        # Blocks must consist of more than one line.
        return ''
    else:
        return '{}\n'.format('\n'.join(a[:count]))

def longest_duplicate(s):
    suffixes = (s[n:] for n in range(len(s)))
    candidates = list(starmap(prefix, pairwise(sorted(suffixes))))
    # A single-line input yields no suffix pairs, and max() would
    # raise ValueError on an empty sequence.
    if not candidates:
        return ''
    return max(candidates, key=len)

if __name__ == '__main__':
    text = sys.stdin.read()
    if text:
        # Use sys.stdout.write instead of print to
        # avoid adding an extra newline at the end.
        sys.stdout.write(remove_longest_dup_line_group(text))
    sys.exit(0)
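
To make the suffix-sorting idea behind Solution 1 easier to follow, here is a minimal standalone sketch. It is not the author's code: the function name `longest_repeated_run` and the `min_len` parameter are made up for illustration, and a plain loop stands in for the `takewhile`/`starmap` pipeline. It sorts every suffix of the line list so that suffixes starting with the same lines end up next to each other, then counts how many leading lines each neighbouring pair shares.

```python
def longest_repeated_run(lines, min_len=2):
    """Return the longest run of consecutive lines that occurs at least twice.

    Hypothetical helper for illustration; min_len mirrors the
    "two or more lines" rule used in Solution 1.
    """
    # Sorting all suffixes places suffixes with a shared prefix side by side.
    suffixes = sorted(lines[i:] for i in range(len(lines)))
    best = []
    for a, b in zip(suffixes, suffixes[1:]):
        # Count the leading lines the two neighbouring suffixes have in common.
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        if n >= min_len and n > len(best):
            best = a[:n]
    return best


sample = ["a", "b", "c", "x", "a", "b", "c", "y"]
print(longest_repeated_run(sample))  # ['a', 'b', 'c']
```

Like the approach it illustrates, this sketch does not check whether the two occurrences overlap, so three identical lines in a row are reported as a two-line run.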

Solution 2:

Quick and dirty, not tested for edge cases:

#!/usr/bin/env python3
from pathlib import Path

TEXT = '''Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.'''


def remove_duplicate_blocks(lines):
    num_lines = len(lines)

    for idx_start in range(num_lines):
        idx_end = num_lines

        for idx in range(idx_end, -1, -1):
            if idx_start < idx:
                dup_candidate_block = lines[idx_start + 1: idx]
                len_dup_block = len(dup_candidate_block)
                if len_dup_block and len_dup_block < int(num_lines / 2):
                    for scan_idx in range(idx):
                        if ((idx_start + 1) > scan_idx
                                and dup_candidate_block == lines[scan_idx: scan_idx + len_dup_block]):
                            lines[idx_start + 1: idx] = []
                            return remove_duplicate_blocks(lines)
    return lines


if __name__ == '__main__':
    clean_lines = remove_duplicate_blocks(TEXT.split('\n'))
    print('\n'.join(clean_lines))

OUTPUT:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.
And this is even different stuff.
That's all, folks.
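
If the groups of lines in a file are always separated by blank lines, as they are in the sample text above, a much simpler technique may be enough. The sketch below is a different approach from both solutions, not a rewrite of either: it assumes that a "group" is a blank-line-separated block, and the function name `drop_repeated_paragraphs` is hypothetical. It keeps the first occurrence of each block and preserves the blank lines between the blocks that remain.

```python
def drop_repeated_paragraphs(text):
    """Keep only the first occurrence of each blank-line-separated block.

    Assumption: groups of lines are delimited by exactly one blank line.
    """
    seen = set()
    kept = []
    for block in text.split("\n\n"):
        key = block.strip()
        # Skip a non-empty block that has already been emitted once.
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(block)
    return "\n\n".join(kept)


sample = "a\nb\n\nc\n\na\nb\n\nd"
print(drop_repeated_paragraphs(sample))
```

This will not find duplicated groups that sit inside a larger block or that are not set off by blank lines, which is the case the two solutions above are aimed at.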
