Hours, Date, Day Count Calculation
I have a huge dataset of timestamps spanning several days. The timestamps are in UNIX format. The dataset is a log of logins. The code is supposed to group
Solution 1:
Unfortunately I couldn't find an elegant solution.
Here is my attempt:
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID','StartTime','StopTime','GPS1','GPS2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1h'  # 1-hour frequency (older pandas versions also accept '1H')
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
# pd.datetime is gone from current pandas; pd.Timestamp gives the same epoch origin
r['start'] = (r.index - pd.Timestamp('1970-01-01')).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # I've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because the common factor cancels on both sides of the inequality
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    # .ix is gone from current pandas; .loc does the same label-based assignment
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print(r[r.LogCount > 0])
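The overlap test inside the loop can be checked on its own with plain integers. The sketch below is only an illustration: the function names `overlaps_doubled` and `overlaps_direct` and the sample sessions are made up for this example, and the bucket start is the first `start` value from the result. It compares the doubled centre/half-width form used above with the equivalent endpoint comparison.

```python
def overlaps_doubled(start_time, stop_time, bucket_start, interval):
    """Centre/half-width overlap test with every term multiplied by 2."""
    m = stop_time + start_time   # twice the centre of the session
    d = stop_time - start_time   # twice the half-width of the session
    return abs(m - 2 * bucket_start - interval) < d + interval


def overlaps_direct(start_time, stop_time, bucket_start, interval):
    """The same condition written as strict endpoint comparisons."""
    bucket_end = bucket_start + interval
    return start_time < bucket_end and bucket_start < stop_time


interval = 60 * 60 - 1          # one hour minus one second, as in the answer
bucket_start = 1073260800       # 2004-01-05 00:00:00

# Hypothetical sessions as (StartTime, StopTime) pairs
sessions = [
    (bucket_start + 100, bucket_start + 200),     # fully inside the hour
    (bucket_start - 500, bucket_start + 10),      # straddles the start
    (bucket_start + 3500, bucket_start + 4000),   # straddles the end
    (bucket_start - 500, bucket_start - 100),     # entirely before
    (bucket_start + 4000, bucket_start + 5000),   # entirely after
]

doubled = [overlaps_doubled(s, e, bucket_start, interval) for s, e in sessions]
direct = [overlaps_direct(s, e, bucket_start, interval) for s, e in sessions]
print(doubled)
print(doubled == direct)
```

Because the inequality is strict, a session that only touches the boundary of a bucket (for example one that ends exactly on `bucket_start`) is not counted in that bucket.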
PS: the fewer periods you have in the report DF `r`, the faster it will count. So you may want to get rid of rows (times) if you know beforehand that those timeframes won't contain any data (for example weekends, holidays, etc.).
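As a rough sketch of that idea, weekend hours could be dropped from the reporting frame before the loop runs. The one-week date range here is a made-up example, not the range of the real data, and whether weekends are really empty is an assumption you would need to verify first (the result for this dataset does contain a Saturday row).

```python
import pandas as pd

# Hypothetical reporting range: one week of hourly buckets
idx = pd.date_range('2004-01-05', '2004-01-12', freq='h')
r = pd.DataFrame(index=idx)

# Keep Monday to Friday only (dayofweek: Monday=0 ... Sunday=6)
r_weekdays = r[r.index.dayofweek < 5]

print(len(r), len(r_weekdays))
```

The loop would then iterate over `r_weekdays` instead of `r`, so it performs fewer overlap tests against `df`.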
Result:
                          start  LogCount  UniqueIDCount  Day StartTime   EndTime
2004-01-05 00:00:00  1073260800        24             15  Mon  00:00:00  01:00:00
2004-01-05 01:00:00  1073264400         5              5  Mon  01:00:00  02:00:00
2004-01-05 02:00:00  1073268000         3              3  Mon  02:00:00  03:00:00
2004-01-05 03:00:00  1073271600         3              3  Mon  03:00:00  04:00:00
2004-01-05 04:00:00  1073275200         2              2  Mon  04:00:00  05:00:00
2004-01-06 12:00:00  1073390400        22             12  Tue  12:00:00  13:00:00
2004-01-06 13:00:00  1073394000         3              2  Tue  13:00:00  14:00:00
2004-01-06 14:00:00  1073397600         3              2  Tue  14:00:00  15:00:00
2004-01-06 15:00:00  1073401200         3              2  Tue  15:00:00  16:00:00
2004-01-10 16:00:00  1073750400        20             11  Sat  16:00:00  17:00:00
2004-01-14 23:00:00  1074121200       218             69  Wed  23:00:00  00:00:00
2004-01-15 00:00:00  1074124800        12             11  Thu  00:00:00  01:00:00
2004-01-15 01:00:00  1074128400         1              1  Thu  01:00:00  02:00:00