
Python To Remove Duplicates Using Only Some, Not All, Columns

I have a tab-delimited input.txt file like this:

A    B    C
A    B    D
E    F    G
E    F    T
E    F    K

These are tab-delimited. I want to remove duplicates only when multiple

Solution 1:

You can use itertools.groupby for this. Here the lines are grouped on the first two columns, and next() takes the first item from each group.

>>> from itertools import groupby
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x: x.split()[:2]):
...     print(next(g))
...
A    B    C
E    F    G

Simply replace s.splitlines() with the file object if the input comes from a file. Lines read from a file keep their trailing newline, so strip it before printing.

Note that the above solution works only if the data is sorted on the first two columns, because groupby only groups adjacent lines. If that's not the case, you'll have to use a set instead:

>>> from operator import itemgetter
>>> ig = itemgetter(0, 1) #Pass any column number you want, note that indexing starts at 0
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K
A    B    F'''     
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
>>> data
['A    B    C', 'E    F    G']
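
The set-based loop above can be wrapped in a small generator so that it works on any iterable of lines, including an open file. This is only a sketch: the name dedupe_lines and its columns parameter are made up for illustration and are not part of the original answer. It assumes every non-blank line has at least as many fields as the largest column index.

```python
from operator import itemgetter


def dedupe_lines(lines, columns=(0, 1)):
    """Yield each line whose key columns have not been seen before.

    `dedupe_lines` and `columns` are hypothetical names for this sketch.
    """
    get_key = itemgetter(*columns)  # column indexes start at 0
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = get_key(fields)
        if key not in seen:
            seen.add(key)
            yield line


sample = ["A\tB\tC", "A\tB\tD", "E\tF\tG", "E\tF\tT", "E\tF\tK", "A\tB\tF"]
result = list(dedupe_lines(sample))
print(result)
```

For the sample data this keeps only the first A B line and the first E F line, in their original order. With a real file you could pass the file object directly, e.g. dedupe_lines(open('input.txt')).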

Solution 2:

If you have access to a Unix system, sort is a nice utility that is made for this kind of problem.

sort -u -t$'\t' --key=1,2 filein.txt

I know this is a Python question, but sometimes Python is not the right tool for the task. And you can always embed a system call in your Python script.
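
Embedding that call in a script might look like the sketch below. The helper name sort_unique is hypothetical, and the code assumes a sort binary is available on PATH and Python 3.7 or later (for capture_output). It uses the short options -t and -k, which are equivalent to the command above.

```python
import subprocess


def sort_unique(path, key="1,2"):
    """Run `sort -u` on `path`, comparing only the given key fields.

    `sort_unique` is a hypothetical helper; it assumes a `sort` binary
    is on PATH (as on most Unix systems).
    """
    result = subprocess.run(
        ["sort", "-u", "-t", "\t", "-k", key, path],  # "\t" is a literal tab
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if sort fails
    )
    return result.stdout.splitlines()
```

Calling sort_unique('filein.txt') returns the surviving lines as a list. Unlike the pure-Python solutions, the output comes back sorted rather than in the original file order.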

Solution 3:

You can do it with the code below.

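lst = []
with open('yourfile.txt') as file_:
    for each_line in file_:
        li = each_line.split()
        if li:  # skip blank lines
            lst.append(li)
dic = {}
for l in lst:
    # keep only the first third-column value seen for each (col 1, col 2) pair
    if (l[0], l[1]) not in dic:
        dic[(l[0], l[1])] = l[2]

print(dic)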

Sorry for the variable names.

Solution 4:

Assuming that you have already read your file and that you have an array named rows (tell me if you need help with that), the following code should work:

entries = set()
keys = set()
for row in rows:
    key = (row[0], row[1])  # Only the first two columns

    if key not in keys:
        keys.add(key)  # remember the key so later duplicates are skipped
        entries.add((row[0], row[1], row[2]))

Solution 5:

Please note that I am not an expert, but I have some ideas that may help you.

There is a csv module for working with CSV files; it is worth a look in case it has something useful for you.
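
As a rough sketch of how the csv module could be used here: csv.reader accepts a delimiter argument, so it can split tab-delimited lines into rows. The function name unique_rows and the n_key_columns parameter are made up for this example, and the sample data stands in for the real file.

```python
import csv
import io


def unique_rows(handle, n_key_columns=2):
    """Return the rows whose first `n_key_columns` fields are new.

    `unique_rows` and `n_key_columns` are hypothetical names.
    """
    seen = set()
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # csv.reader yields [] for blank lines
        key = tuple(row[:n_key_columns])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# An in-memory stand-in for the real tab-delimited file.
sample = io.StringIO("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\n")
rows = unique_rows(sample)
print(rows)
```

With a real file, pass an open handle instead, for example open('input.txt', newline='').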

First, I would ask how you are storing the data. In a list?

something like


Could be suitable. (maybe not the best choice)

Second, is it possible to go through the whole list?

You can simply take a line and compare it to all the other lines.

I would do this, supposing rows is a list of lines, each split into its letters:

index_list = []
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):  # only later rows, so i != j
        if rows[i][0] == rows[j][0] and rows[i][1] == rows[j][1]:
            if j not in index_list:
                index_list.append(j)  # row j duplicates an earlier row
for i in sorted(index_list, reverse=True):  # pop from the end so indexes stay valid
    rows.pop(i)

This is the simplest idea for the task, and not likely the most suitable. It will take a while on large inputs, since it needs a quadratic number of comparisons.
