Modifying Timestamps In Pandas To Make Index Unique
Solution 1:
Here is a faster numpy version (but a little less readable), inspired by this SO article. The idea is to use cumsum on the duplicated timestamp values while resetting the cumulative sum each time a np.nan is encountered:
import numpy as np

# get duplicated values as float and replace 0 with NaN
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.nan
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# print result
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)
DatetimeIndex([ '2016-08-23 00:00:14.161128',
'2016-08-23 00:00:14.901180',
'2016-08-23 00:00:17.196639',
'2016-08-23 00:00:17.664193001',
'2016-08-23 00:00:17.664193002',
'2016-08-23 00:00:17.664193003',
'2016-08-23 00:00:17.664193004',
'2016-08-23 00:00:26.206108',
'2016-08-23 00:00:28.322456001',
'2016-08-23 00:00:28.322456002'],
dtype='datetime64[ns]', name='datetime', freq=None)
Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the (more readable) groupby/apply approach from @DYZ is roughly 50 times slower on the dummy data: 100 loops, best of 3: 5.3 ms per loop.
Of course, you finally have to reset your index:
df.index = result
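Put together as a self-contained sketch (the sample timestamps and the `value` column are made up for illustration), the cumsum-with-reset trick nudges each run of duplicates forward by 1 ns, 2 ns, …:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a duplicated DatetimeIndex
idx = pd.DatetimeIndex(['2016-08-23 00:00:14', '2016-08-23 00:00:17',
                        '2016-08-23 00:00:17', '2016-08-23 00:00:17',
                        '2016-08-23 00:00:26'], name='datetime')
df = pd.DataFrame({'value': range(5)}, index=idx)

# mark every row whose timestamp also occurs elsewhere
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.nan

# cumulative count that restarts after every NaN (i.e. every unique timestamp)
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# shift each run of duplicates by 1 ns, 2 ns, ...
df.index = df.index + np.cumsum(values).astype('timedelta64[ns]')
assert df.index.is_unique
```

Using an explicit `'timedelta64[ns]'` unit avoids relying on numpy's unit inference for the generic `np.timedelta64` dtype.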
Solution 2:
You can group the rows by the index and then add a range of sequential timedeltas to the index of each group. I am not sure if this can be done directly with the index, but you can first convert the index to an ordinary column, apply the operation to the column, and set the column as the index again:
newindex = df.reset_index()\
    .groupby('datetime')['datetime']\
    .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
df.index = newindex
Solution 3:
Let's start with a vectorized benchmark; since you are dealing with 1M+ rows, this should be a priority:
%timeit do
10000000 loops, best of 3: 20.5 ns per loop
Let's make some test data, since none was provided:
rng = pd.date_range('1/1/2011', periods=72, freq='H')
df = pd.DataFrame(dict(time = rng))
Duplicate the timestamps:
df = pd.concat((df, df))
df = df.sort_values('time')
df
Out [296]:
                 time
0 2011-01-01 00:00:00
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
9 2011-01-01 09:00:00
Find the locations where the difference in time from the previous row is 0 seconds:
mask = (df.time - df.time.shift()) == np.timedelta64(0, 's')
mask
Out [307]:
0    False
0     True
1    False
1     True
2    False
2     True
3    False
3     True
4    False
4     True
5    False
...
Offset these locations; in this case I chose milliseconds:
df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5))
Out [309]:
                     time
0 2011-01-01 00:00:00.000
0 2011-01-01 00:00:00.005
1 2011-01-01 01:00:00.000
1 2011-01-01 01:00:00.005
2 2011-01-01 02:00:00.000
2 2011-01-01 02:00:00.005
3 2011-01-01 03:00:00.000
3 2011-01-01 03:00:00.005
4 2011-01-01 04:00:00.000
4 2011-01-01 04:00:00.005
5 2011-01-01 05:00:00.000
...
EDIT: With consecutive timestamps [This assumes 4]
consect = 4
for i in range(4):
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1
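The loop above can be run end-to-end on freshly built test data; the 4-duplicate assumption and the millisecond offsets follow the answer, while the repeated hourly range here is made up for illustration:

```python
import numpy as np
import pandas as pd

# three hourly timestamps, each repeated 4 times consecutively
rng = pd.date_range('1/1/2011', periods=3, freq='h')
df = pd.DataFrame({'time': rng.repeat(4)})

# shrink the shift window each pass so shorter runs of duplicates
# are caught and offset by a distinct number of milliseconds
consect = 4
for i in range(4):
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1

assert not df.time.duplicated().any()
```

Note the resulting timestamps within each group are unique but no longer sorted (the later passes add larger offsets to earlier rows), so a final `df.sort_values('time')` may be wanted.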