
Modifying Timestamps In Pandas To Make Index Unique

I'm working with financial data, which is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. This is an example of the data:
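For reference, a small frame with duplicated timestamps of the kind described can be built as follows (the index values are inferred from the result shown in Solution 1 below; the price column is just a placeholder):

import pandas as pd

# sample data with duplicated timestamps
# (index values inferred from Solution 1's output; 'price' is a placeholder)
idx = pd.DatetimeIndex(['2016-08-23 00:00:14.161128',
                        '2016-08-23 00:00:14.901180',
                        '2016-08-23 00:00:17.196639',
                        '2016-08-23 00:00:17.664193',
                        '2016-08-23 00:00:17.664193',
                        '2016-08-23 00:00:17.664193',
                        '2016-08-23 00:00:17.664193',
                        '2016-08-23 00:00:26.206108',
                        '2016-08-23 00:00:28.322456',
                        '2016-08-23 00:00:28.322456'], name='datetime')
df = pd.DataFrame({'price': range(10)}, index=idx)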

Solution 1:

Here is a faster NumPy version (but a little less readable), inspired by this SO article. The idea is to use cumsum on the duplicated timestamp values while resetting the cumulative sum each time a np.nan is encountered:

import numpy as np

# get duplicated values as float and replace 0 with NaN
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.nan

missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# print result
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)

DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)

Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the groupby/apply approach from @DYZ (which is more readable) is roughly 50 times slower on the dummy data, at 100 loops, best of 3: 5.3 ms per loop.

Finally, of course, you have to assign the result back to your index:

df.index = result
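If you need this in more than one place, the same logic packages nicely as a helper function (a sketch along the lines of Solution 1; the name make_unique is made up here):

import numpy as np

def make_unique(index):
    """Offset duplicated entries of a DatetimeIndex by 1 ns per repeat."""
    values = index.duplicated(keep=False).astype(float)
    values[values == 0] = np.nan

    missings = np.isnan(values)
    cumsum = np.cumsum(~missings)
    diff = np.diff(np.concatenate(([0.], cumsum[missings])))
    values[missings] = -diff

    # the running count restarts at each non-duplicated position
    return index + np.cumsum(values).astype(np.timedelta64)

df.index = make_unique(df.index)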

Solution 2:

You can group the rows by the index and then add a range of sequential timedeltas to the index of each group. I am not sure if this can be done directly with the index, but you can first convert the index to an ordinary column, apply the operation to the column, and set the column as the index again:

newindex = df.reset_index()\
             .groupby('datetime')['datetime']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
df.index = newindex
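As an aside (not part of the original answer), the same per-group numbering can be produced without apply by using groupby(level=0).cumcount(), which also avoids the reset_index round-trip:

import pandas as pd

# cumcount() numbers the rows inside each duplicate group 0, 1, 2, ...
offsets = pd.to_timedelta(df.groupby(level=0).cumcount().values, unit='ns')
df.index = df.index + offsets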

Solution 3:

Let's start with a vectorized benchmark; since you are dealing with 1M+ rows, this should be a priority:

%timeit do
10000000 loops, best of 3: 20.5 ns per loop

Let's make some test data, since none was provided:

import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='H')
df = pd.DataFrame(dict(time=rng))

Duplicate the timestamps:

df = pd.concat((df, df))
df = df.sort()  # removed in later pandas; use df.sort_values('time')
df

Out [296]:
                 time
0 2011-01-01 00:00:00
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
9 2011-01-01 09:00:00

Find the locations where the difference in time from the previous row is 0 seconds:

mask = (df.time - df.time.shift()) == np.timedelta64(0, 's')

mask
Out [307]:
0    False
0     True
1    False
1     True
2    False
2     True
3    False
3     True
4    False
4     True
5    False
...

Offset these locations: in this case, I chose milliseconds.

df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5))

Out [309]:
                     time
0 2011-01-01 00:00:00.000
0 2011-01-01 00:00:00.005
1 2011-01-01 01:00:00.000
1 2011-01-01 01:00:00.005
2 2011-01-01 02:00:00.000
2 2011-01-01 02:00:00.005
3 2011-01-01 03:00:00.000
3 2011-01-01 03:00:00.005
4 2011-01-01 04:00:00.000
4 2011-01-01 04:00:00.005
5 2011-01-01 05:00:00.000
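As a side note, the apply is not strictly necessary here; on a reasonably recent pandas the offset can be added in a vectorized way (an equivalent variant, not the original answer's code):

# vectorized equivalent of the .apply above
df.loc[mask, 'time'] = df.loc[mask, 'time'] + pd.Timedelta(milliseconds=5)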

EDIT: With runs of consecutive duplicate timestamps (this assumes runs of up to 4):

consect = 4
for i in range(4):
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1
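Whichever variant you use, a quick sanity check (my addition, not from the original answer) confirms that no duplicates remain:

# all timestamps should now be unique
assert df.time.is_unique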
