Find Regex For Deleting Duplicates

January 04, 2024

I want to find a regex that performs the following matching (notice that there is a line break between the entries!). inputString: 'a0Ew0' 'a0Ew0' 'a0Ew0s' 'a0Ew0s' output: 'a0Ew0' 'a0Ew0s'

Solution 1:

You can convert the list to a set to get rid of duplicates. Note that a set does not preserve the original order of the elements.

See the following: https://repl.it/FFOJ/0

l = set(["a0Ew000001UD2t8EAD", "a0Ew000001UD2t8EAD", "a0Ew000001UD4AFEA1", "a0Ew000001UD4AFEA1"])
print(l)
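Since the question starts from a multi-line string rather than a list, the same idea could be sketched roughly as follows. The variable names are illustrative, and sorted() is used here only to make the printed result deterministic, because a set has no defined order.

```python
# Hypothetical input mirroring the question: one quoted ID per line.
input_string = '"a0Ew0"\n"a0Ew0"\n"a0Ew0s"\n"a0Ew0s"\n'

# splitlines() removes the line breaks; set() removes the duplicates.
unique = set(input_string.splitlines())

# A set is unordered, so sort it to get a stable, printable result.
unique_sorted = sorted(unique)
print(unique_sorted)
```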

Solution 2:

Regex isn't the right tool in this case.

If the duplicate elements are consecutive, you can use a simple list comprehension to achieve this:

lines=""""a0Ew000001UD2t8EAD"
"a0Ew000001UD2t8EAD"
"a0Ew000001UD4AFEA1"
"a0Ew000001UD4AFEA1"
""".splitlines()

filtered = [l for i,l in enumerate(lines) if i==0 or lines[i-1]!=l ]

It keeps an element only if it is the first one (hence the i == 0 test) or if the previous element is different from the current one.

result:

['"a0Ew000001UD2t8EAD"', '"a0Ew000001UD4AFEA1"']
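To make the "consecutive only" behaviour explicit, the comprehension can be wrapped in a small helper. This is just a sketch; the function name drop_consecutive_duplicates is made up for illustration.

```python
def drop_consecutive_duplicates(lines):
    """Collapse each run of identical neighbouring items to a single item."""
    return [l for i, l in enumerate(lines) if i == 0 or lines[i - 1] != l]


# Adjacent duplicates are removed.
adjacent = drop_consecutive_duplicates(["a", "a", "b", "b"])
print(adjacent)

# A duplicate separated by another item is kept, because only the
# immediately preceding element is compared.
separated = drop_consecutive_duplicates(["a", "b", "a"])
print(separated)
```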

Solution 3:

You don't need regex to do that when you can do this:

from collections import OrderedDict

inputString = """"a0Ew0"
"a0Ew0"
"a0Ew0s"
"a0Ew0s"
"""

ls = inputString.splitlines()  # splits the string into a list of lines
print(*OrderedDict.fromkeys(ls))

Output:

"a0Ew0" "a0Ew0s"
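On Python 3.7 and later, a plain dict also keeps insertion order, so the same trick should work without the import. The sketch below assumes such a version; unique_in_order is a hypothetical helper name. Unlike the consecutive-only approaches, it also drops duplicates that are not next to each other.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and (since Python 3.7) keep insertion order.
    return list(dict.fromkeys(items))


input_string = '"a0Ew0"\n"a0Ew0"\n"a0Ew0s"\n"a0Ew0s"\n'
result = unique_in_order(input_string.splitlines())
print(*result)

# Non-adjacent duplicates are removed as well.
scattered = unique_in_order(["a", "b", "a"])
print(scattered)
```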

Solution 4:

If you really, really want to use regex, you could use a negative lookahead (?!...) to check that the current group (".+") is not followed by a line break \n and itself \1 again.

>>> inpt = """"a0Ew000001UD2t8EAD"
"a0Ew000001UD2t8EAD"
"a0Ew000001UD2t8EAD"
"a0Ew000001UD4AFEA1"
"a0Ew000001UD4AFEA1"
"a0Ew000001UD2t8EAD"
"""
>>> import re
>>> re.findall(r'(".+")(?!\n\1)', inpt)
['"a0Ew000001UD2t8EAD"', '"a0Ew000001UD4AFEA1"', '"a0Ew000001UD2t8EAD"']

But instead, I would rather suggest using e.g. itertools.groupby:

>>> import itertools
>>> [key for key, group in itertools.groupby(inpt.splitlines())]
['"a0Ew000001UD2t8EAD"', '"a0Ew000001UD4AFEA1"', '"a0Ew000001UD2t8EAD"']

(Note how I added another copy of the first line to the end of the data set to show that both of those solutions only consider lines to be duplicates if they appear right after each other, with nothing in between. If you also want to remove duplicates with different lines in between, I doubt that there would be a solution using regex.)
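If the goal is to get the cleaned-up text back as a single string rather than a list of matches, a substitution is another option. The following is only a sketch of that alternative, not the approach from the answer above; the function name and sample data are made up. Like the lookahead version, it only collapses duplicates that directly follow each other.

```python
import re


def collapse_adjacent_duplicates(text):
    """Replace each run of identical consecutive lines with one copy."""
    # ^(.*)       capture a whole line (MULTILINE makes ^ and $ per-line)
    # (?:\n\1)+$  followed by one or more exact repeats of that line
    return re.sub(r'^(.*)(?:\n\1)+$', r'\1', text, flags=re.MULTILINE)


sample = '"x"\n"x"\n"y"\n"x"\n'
collapsed = collapse_adjacent_duplicates(sample)
print(collapsed)
```

The trailing "x" survives because it is separated from the first run by "y".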

Solution 5:

Regex (Python):

(\w+)

If the duplicates are consecutive and every value appears exactly twice, you can remove either the even or the odd elements. Removing the even ones:

[0]"a0Ew0ssss" <-- Deleted
[1]"a0Ew0ssss"
[2]"a0Ew0" <-- Deleted
[3]"a0Ew0"
[4]"a0Ew0s" <-- Deleted
[5]"a0Ew0s"

Result:

[0]"a0Ew0ssss"
[1]"a0Ew0"
[2]"a0Ew0s"
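A rough sketch of this idea in Python, assuming every value appears exactly twice in a row (otherwise the slicing would drop or keep the wrong items). Note that (\w+) does not match the quote characters, so the extracted tokens come back without them.

```python
import re

text = '"a0Ew0ssss"\n"a0Ew0ssss"\n"a0Ew0"\n"a0Ew0"\n"a0Ew0s"\n"a0Ew0s"\n'

# Extract every run of word characters; the surrounding quotes are skipped.
tokens = re.findall(r'(\w+)', text)

# Drop the even-indexed elements (0, 2, 4) and keep the odd-indexed ones.
kept = tokens[1::2]
print(kept)
```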
