Python To Remove Duplicates Using Only Some, Not All, Columns
Solution 1:
You should use itertools.groupby for this. Here I am grouping the data based on the first two columns and then using next() to get the first item from each group.
>>> from itertools import groupby
>>> s = '''A B C
A B D
E F G
E F T
E F K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x: x.split()[:2]):
...     print(next(g))
...
A B C
E F G
Simply replace s.splitlines() with a file object if the input is coming from a file.
Note that the above solution works only if the data is sorted by the first two columns (groupby only merges adjacent lines with equal keys). If that's not the case, you'll have to use a set here.
>>> from operator import itemgetter
>>> ig = itemgetter(0, 1) #Pass any column number you want, note that indexing starts at 0
>>> s = '''A B C
A B D
E F G
E F T
E F K
A B F'''
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
...
>>> data
['A B C', 'E F G']
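The set-based loop above can be wrapped in a small generator so that it works on any iterable of lines, including an open file object. This is only a sketch: the name unique_by_columns and its parameters are made up for illustration, and it assumes whitespace-separated columns and Python 3.

```python
import io


def unique_by_columns(lines, columns=(0, 1)):
    """Yield the first line seen for each combination of the given columns.

    unique_by_columns is a hypothetical helper, not a library function.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        key = tuple(fields[i] for i in columns)  # 0-based column indexes
        if key not in seen:
            seen.add(key)
            yield line


# io.StringIO stands in for a real file object here.
sample = io.StringIO("A B C\nA B D\nE F G\nE F T\nE F K\nA B F\n")
result = list(unique_by_columns(sample))
print(result)  # ['A B C', 'E F G']
```

Because it is a generator, only the set of keys is held in memory, not the whole file.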
Solution 2:
If you have access to a Unix system, sort is a nice utility that is made for your problem.
sort -u -t$'\t' --key=1,2 filein.txt
I know this is a Python question, but sometimes Python is not the tool for the task. And you can always embed a system call in your Python script.
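One way to embed that call from Python is the standard subprocess module. The wrapper below is a rough sketch, not a tested recipe for every platform: sort_unique is a hypothetical name, it assumes a sort binary is on the PATH and tab-separated input, and it uses -k, the POSIX short form of --key.

```python
import subprocess


def sort_unique(path, key="1,2", delimiter="\t"):
    """Return the lines of `path`, de-duplicated on the key columns by sort.

    sort_unique is a hypothetical wrapper around the external sort utility.
    """
    completed = subprocess.run(
        ["sort", "-u", "-t", delimiter, "-k", key, path],
        check=True,           # raise CalledProcessError if sort fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
    )
    return completed.stdout.splitlines()
```

Passing the arguments as a list avoids going through a shell, so the tab delimiter needs no special quoting. Note that the output comes back sorted, so the original line order is not preserved.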
Solution 3:
You can do it with the code below.
lst = []
with open('yourfile.txt') as file_:
    for each_line in file_:
        li = each_line.split()
        if len(li) >= 3:  # skip blank or short lines
            lst.append(li)
dic = {}
for l in lst:
    # keep only the first row seen for each (column 1, column 2) pair
    if (l[0], l[1]) not in dic:
        dic[(l[0], l[1])] = l[2]
print(dic)
Sorry for the variable names.
Solution 4:
Assuming that you have already read your data, and that you have an array named rows (tell me if you need help with that), the following code should work:
entries = set()
keys = set()
for row in rows:
    key = (row[0], row[1])  # Only the first two columns
    if key not in keys:
        keys.add(key)
        entries.add((row[0], row[1], row[2]))
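If rows has not been built yet, one possible way to get it is to split each non-empty line of the input. The sketch below assumes whitespace-separated columns; read_rows and first_rows_by_key are illustrative names, not part of the answer. It also collects the results in a list rather than a set, a deliberate change so that the original line order is kept.

```python
import io


def read_rows(lines):
    """Split every non-blank line into a list of column values."""
    return [line.split() for line in lines if line.strip()]


def first_rows_by_key(rows, key_columns=(0, 1)):
    """Keep the first row for each distinct value of the key columns."""
    keys = set()
    entries = []  # a list keeps input order; a set would not
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        if key not in keys:
            keys.add(key)
            entries.append(tuple(row))
    return entries


# io.StringIO stands in for open('yourfile.txt') in this example.
rows = read_rows(io.StringIO("A B C\nA B D\nE F G\nE F T\nE F K\n"))
entries = first_rows_by_key(rows)
print(entries)  # [('A', 'B', 'C'), ('E', 'F', 'G')]
```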
Solution 5:
Please note that I am not an expert, but I still have some ideas that may help you.
There is a csv module that is useful for CSV files; you might look there to see if you find something interesting.
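For what it's worth, here is roughly how the csv module could be combined with a set of already-seen keys. Treat it as a sketch under assumptions: dedupe_csv is a made-up name, the input is assumed to be comma-separated with the key in the first two columns, and io.StringIO objects stand in for real files.

```python
import csv
import io


def dedupe_csv(infile, outfile, key_columns=(0, 1), delimiter=","):
    """Copy CSV rows from infile to outfile, keeping the first row per key."""
    reader = csv.reader(infile, delimiter=delimiter)
    writer = csv.writer(outfile, delimiter=delimiter, lineterminator="\n")
    seen = set()
    for row in reader:
        if not row:
            continue  # csv.reader yields [] for blank lines
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)


source = io.StringIO("A,B,C\nA,B,D\nE,F,G\nE,F,T\n")
target = io.StringIO()
dedupe_csv(source, target)
print(target.getvalue())
```

With real files, the csv documentation recommends opening them with newline='' so the module can handle line endings itself.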
First, I would ask how you are storing the data. In a list?
Something like
[['A', 'B', 'C'],
 ['A', 'B', 'D'],
 ['E', 'F', 'G'], ...]
could be suitable (though maybe not the best choice).
Second, is it possible to go through the whole list?
You can simply take a line and compare it to all the other lines.
I would do this, supposing rows is the list that contains the letters:
index_list = []
for i in range(len(rows)):
    if i in index_list:  # row i is already marked as a duplicate
        continue
    for j in range(i + 1, len(rows)):  # compare only with later rows
        if rows[i][0] == rows[j][0] and rows[i][1] == rows[j][1]:
            index_list.append(j)
for j in sorted(index_list, reverse=True):  # pop from the end so earlier indexes stay valid
    rows.pop(j)
This is the simplest idea for performing your task, and likely not the most suitable one: it will take a while, since it performs a quadratic number of operations. (Edit: use pop, not remove.)