
Efficient Way To Verify That Records Are Unique In Python/pytables

I have a table in PyTables with ~50 million records. The combination of two fields (specifically userID and date) should be unique (i.e. a user should have at most one record per date). What is an efficient way to verify that this constraint holds?

Solution 1:

It seems that indexes in PyTables are limited to single columns.

I would suggest adding a hash column and putting an index on it. Your unique data is defined as the concatenation of other columns in the DB. Separators will ensure that there aren't two different rows that yield the same unique data. The hash column could just be this unique string, but if your data is long you will want to use a hash function. A fast hash function like md5 or sha1 is great for this application.

Compute the hashed data and check if it's in the DB. If so, you know you hit some duplicate data. If not, you can safely add it.
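A minimal sketch of that idea, assuming a hypothetical indexed string column named rowHash that stores the digest (the column name and the make_key helper are illustrations, not from the original post):

import hashlib

def make_key(user_id, date):
    # Join the fields with a separator that cannot appear in the data,
    # then hash so the stored key has a fixed, short length.
    return hashlib.md5('%s|%s' % (user_id, date)).hexdigest()

key = make_key(uid, ts)
# rowHash is assumed to be an indexed StringCol holding the md5 hex digest.
if len(tbl.getWhereList('rowHash == "%s"' % key)) == 0:
    # no existing row with this key, so it is safe to append
    pass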

Solution 2:

So years later I still have the same question, but with the power of indexing and querying this problem is only slightly painful, depending on the size of your table. Using readWhere or getWhereList, I think the problem is approximately O(n).

Here is what I did... 1. I created a table that had two indices; you can use multiple indexes in PyTables:

http://pytables.github.com/usersguide/optimization.html#indexed-searches
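For example, indexing both columns might look roughly like this (the column layout, file name, and LZO filter settings are placeholders, and this uses the older PyTables 2.x camelCase API that the rest of this answer uses):

import tables

class Record(tables.IsDescription):
    userID = tables.StringCol(32)
    date = tables.Int64Col()

h5f = tables.openFile('filename.h5', 'a')
filters = tables.Filters(complevel=1, complib='lzo')
tbl = h5f.createTable('/data', 'data_table', Record,
                      filters=filters, createparents=True)

# Each index covers a single column, but a table can carry several of them.
tbl.cols.userID.createIndex()
tbl.cols.date.createIndex()
h5f.flush()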

Once your table is indexed (I also use LZO compression), you can do the following:

import tables

h5f = tables.openFile('filename.h5')
tbl = h5f.getNode('/data', 'data_table')  # assumes group /data and table data_table

counter = 0
for row in tbl:
    ts = row['date']    # timestamp (ts) or date
    uid = row['userID']
    query = '(date == %d) & (userID == "%s")' % (ts, uid)
    result = tbl.readWhere(query)
    if len(result) > 1:
        # The row matches itself, so more than one hit means a duplicate.
        pass
    counter += 1
    if counter % 1000 == 0:
        print '%d rows processed' % counter

Now the code that I've written here is actually kind of slow. I am sure that there is some PyTables guru out there who could give you a better answer. But here are my thoughts on performance:

If you know that you are starting with clean data (i.e. no duplicates), then all you have to do is query the table once for each key pair before you append it, which means you only have to do:

ts = row['date']    # timestamp (ts) or date
uid = row['userID']
query = '(date == %d) & (userID == "%s")' % (ts, uid)
result = tbl.getWhereList(query)
if len(result) == 0:
    # key pair is not in the table
    # do what you were going to do
    pass
elif len(result) > 1:
    # Do something here, like get a handle to the row and update instead of append.
    pass

If you have lots of time to check for duplicates, you can create a background process that crawls over the directory with your files and searches for duplicates.
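A rough sketch of such a pass, assuming the files live in a single directory and share the layout used above (the path and the choice to keep the seen keys in memory are assumptions):

import glob
import tables

for fname in glob.glob('/path/to/data/*.h5'):
    h5f = tables.openFile(fname)
    tbl = h5f.getNode('/data', 'data_table')
    seen = set()
    duplicates = []
    for row in tbl:
        key = (row['userID'], row['date'])
        if key in seen:
            duplicates.append(key)   # repeated (userID, date) pair
        else:
            seen.add(key)
    h5f.close()
    # report or repair the duplicates found in this file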

I hope that this helps someone else.

Solution 3:

I don't know much about PyTables, but I would try this approach

  1. For each userID, get all (userID, date) pairs
  2. assert len(rows) == len(set(rows)) - this assertion holds if all (userID, date) tuples in the rows list are unique (see the sketch below)
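A minimal sketch of that check, assuming the table handle from the answer above (note that it pulls both key columns into memory, which may be heavy for ~50 million records):

rows = [(r['userID'], r['date']) for r in tbl.iterrows()]
assert len(rows) == len(set(rows)), 'duplicate (userID, date) pairs found'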
