Python To Remove Duplicates Using Only Some, Not All, Columns

September 18, 2022

I have a tab-delimited input.txt file like this:

A B C
A B D
E F G
E F T
E F K

I want to remove duplicates only when multiple

Solution 1:

You should use itertools.groupby for this. Here I am grouping the data based on the first two columns and then using next() to get the first item from each group.

>>> from itertools import groupby
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x: x.split()[:2]):
...     print(next(g))
...
A    B    C
E    F    G

Simply replace s.splitlines() with the file object if the input comes from a file. Lines read from a file keep their trailing newline, so strip it before printing.
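For the file case, here is a minimal sketch, assuming Python 3 and genuinely tab-separated columns. io.StringIO stands in for a real file object so the snippet runs on its own, and the helper name first_of_each_group is made up for illustration.

```python
import io
from itertools import groupby


def first_of_each_group(lines, ncols=2):
    """Yield the first line of every run of adjacent lines that share
    the same first `ncols` tab-separated columns."""
    stripped = (line.rstrip("\n") for line in lines)
    non_blank = (line for line in stripped if line)
    for _key, group in groupby(non_blank, key=lambda line: line.split("\t")[:ncols]):
        yield next(group)


# io.StringIO stands in for open("input.txt"); any iterable of lines works.
sample = io.StringIO("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\n")
result = list(first_of_each_group(sample))
print(result)
```

Because it is a generator, it never holds more than one group's worth of state, which helps with large files.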

Note that the above solution works only if the data is sorted by the first two columns, because groupby only merges rows with equal keys when they are adjacent. If that's not the case, you'll have to use a set instead.

>>> from operator import itemgetter
>>> ig = itemgetter(0, 1) #Pass any column number you want, note that indexing starts at 0
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K
A    B    F'''     
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
...         
>>> data
['A    B    C', 'E    F    G']
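If you want to reuse the set approach on a file, it can be wrapped in a generator. This is only a sketch under assumptions: the name unique_by_columns and its parameters are hypothetical, and the columns are assumed to be separated by a single tab.

```python
from operator import itemgetter


def unique_by_columns(lines, columns=(0, 1), sep="\t"):
    """Yield each line whose key columns have not been seen before.

    Input order is preserved; the first line for each key wins.
    """
    get_key = itemgetter(*columns)
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # ignore blank lines
            continue
        key = get_key(line.split(sep))
        if key not in seen:
            seen.add(key)
            yield line


# Unsorted input: the duplicates of (A, B) are not adjacent.
rows = ["A\tB\tC", "E\tF\tG", "A\tB\tD", "E\tF\tT", "A\tB\tF"]
kept = list(unique_by_columns(rows))
print(kept)
```

Passing an open file instead of the rows list should work the same way, since the function only iterates over its input.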

Solution 2:

If you have access to a Unix system, sort is a nice utility made for exactly this problem.

sort -u -t$'\t' --key=1,2 filein.txt

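I know this is a Python question, but sometimes Python is not the right tool for the task, and you can always embed a system call in your Python script. Note that the output comes back sorted by the key columns rather than in the original line order.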
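A rough sketch of embedding that call, assuming Python 3.7+ and a sort binary on the PATH. The function name sort_unique is made up, and the portable -k spelling is used instead of --key. The temporary file only exists to make the example self-contained.

```python
import os
import subprocess
import tempfile


def sort_unique(path, first_col=1, last_col=2):
    """Run `sort -u` on a tab-delimited file, comparing only the given
    (1-based) column range, and return the surviving lines."""
    cmd = ["sort", "-u", "-t", "\t", "-k", f"{first_col},{last_col}", path]
    env = dict(os.environ, LC_ALL="C")  # plain byte-order collation
    done = subprocess.run(cmd, capture_output=True, text=True, check=True, env=env)
    return done.stdout.splitlines()


# Write the sample data to a temporary file, then clean it up afterwards.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\n")
try:
    lines = sort_unique(tmp.name)
finally:
    os.remove(tmp.name)
print(lines)
```

Passing the arguments as a list avoids the shell entirely, so the $'\t' quoting from the command line is not needed; the tab is handed to sort directly.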

Solution 3:

You can do it with the code below.

dic = {}
with open('yourfile.txt') as file_:
    for each_line in file_:
        li = each_line.split()
        if len(li) < 3:  # skip blank or incomplete lines
            continue
        # keep only the first row seen for each (column 0, column 1) pair
        if (li[0], li[1]) not in dic:
            dic[(li[0], li[1])] = li[2]

print(dic)

Sorry for the variable names.

Solution 4:

Assuming that you have already read your file and that you have an array named rows (tell me if you need help with that), the following code should work:

entries = set()  # a set, so the original row order is not preserved
keys = set()
for row in rows:
    key = (row[0], row[1])  # Only the first two columns

    if key not in keys:
        keys.add(key)
        entries.add((row[0], row[1], row[2]))

Solution 5:

Please note that I am not an expert, but I still have ideas that may help you.

There is a csv module that is useful for delimited files; you might look there to see if you find something interesting.
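As a sketch of what the csv module could look like here (assuming Python 3; io.StringIO stands in for open("input.txt", newline=""), and combining it with a set of seen keys is my own choice, not something the module does for you):

```python
import csv
import io

# Stand-in for a real tab-delimited file.
source = io.StringIO("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\n")

seen = set()
unique_rows = []
for row in csv.reader(source, delimiter="\t"):
    if len(row) < 2:  # blank lines come back as empty lists
        continue
    key = (row[0], row[1])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# Write the surviving rows back out in the same tab-delimited form.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(unique_rows)
print(out.getvalue(), end="")
```

The reader hands back each line already split into a list, which is the list-of-lists shape discussed next.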

First, I would ask how you are storing the data. In a list?

Something like

[['A', 'B', 'C'],
 ['A', 'B', 'D'],
 ['E', 'F', 'G'], ...]

could be suitable (though maybe not the best choice).

Second, is it possible to go through the whole list?

You can simply take a line and compare it to all the other lines.

I would do this, supposing rows contains the letters:

index_list = []
for i in range(len(rows)):
    if i in index_list:  # row i is itself a duplicate of an earlier row
        continue
    for j in range(i + 1, len(rows)):  # compare only with later rows
        if rows[i][0] == rows[j][0] and rows[i][1] == rows[j][1]:
            index_list.append(j)
# pop from the end so the remaining indexes stay valid
for j in sorted(index_list, reverse=True):
    rows.pop(j)

This is only a sketch, but it gives you the idea. It is the simplest way to perform your task and probably not the most suitable: it will take a while on large inputs, since it performs a quadratic number of comparisons. Edit: pop, not remove.
