
Count Number Of Rows Between Two Dates By Id In A Pandas Groupby Dataframe

I have the following test DataFrame:

import random
from datetime import timedelta
import pandas as pd
import datetime

# create test range of dates
rng = pd.date_range(datetime.date(2
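(The question body above is cut off. For context, a hypothetical stand-in frame with the column layout the answers below rely on, namely cid, jid, stdt, and enddt, might look like this; the ids and dates are invented and are not the asker's actual test data.)

import pandas as pd

# Hypothetical stand-in frame: only the column names (cid, jid, stdt, enddt)
# are taken from the answers below; the values are made up.
df = pd.DataFrame({
    "cid":   [1, 1, 2],
    "jid":   [100, 101, 200],
    "stdt":  pd.to_datetime(["2015-07-06", "2015-07-11", "2015-07-01"]),
    "enddt": pd.to_datetime(["2015-07-10", "2015-07-14", "2015-07-07"]),
})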

Solution 1:

My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every new "stdt" we see adds +1 to the count; every "enddt" we see adds -1. (It adds the -1 the next day, at least if I'm interpreting "between" the way you are; some days I think we should ban that word as too ambiguous.)
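To make the off-by-one concrete, here is a tiny sketch (with made-up dates, not the asker's data) of how a single stdt/enddt pair becomes two events whose running total is 1 on every day of the inclusive interval:

import pandas as pd

# one hypothetical interval, active 2015-01-06 through 2015-01-08 inclusive
events = pd.Series({pd.Timestamp("2015-01-06"): 1,     # stdt:          +1
                    pd.Timestamp("2015-01-09"): -1})   # enddt + 1 day: -1
daily = events.reindex(pd.date_range("2015-01-05", "2015-01-10"), fill_value=0)
print(daily.cumsum())   # 0, 1, 1, 1, 0, 0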

IOW, if we turn your frame into something like

>>> df.head()
    cid  jid  change       date
0     1  100       1 2015-01-06
1     1  101       1 2015-01-07
21    1  100      -1 2015-01-16
22    1  101      -1 2015-01-17
17    1  117       1 2015-03-01

then what we want is simply the cumulative sum of change (after suitable regrouping). For example, something like

df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
df = df.sort(["cid", "date"])

df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()

new_time = pd.date_range(df.date.min(), df.date.max())

df_parts = []
for cid, group in df.groupby("cid"):
    full_count = group[["date", "count"]].set_index("date")
    full_count = full_count.reindex(new_time)
    full_count = full_count.ffill().fillna(0)
    full_count["cid"] = cid
    df_parts.append(full_count)

df_new = pd.concat(df_parts)

which gives me something like

>>> df_new.head(15)
            count  cid
2015-01-03      0    1
2015-01-04      0    1
2015-01-05      0    1
2015-01-06      1    1
2015-01-07      2    1
2015-01-08      2    1
2015-01-09      2    1
2015-01-10      2    1
2015-01-11      2    1
2015-01-12      2    1
2015-01-13      2    1
2015-01-14      2    1
2015-01-15      2    1
2015-01-16      1    1
2015-01-17      0    1

There may be off-by-one differences with regard to your expectations, and you may have different ideas about how to handle multiple overlapping jids in the same time window (here they would count as 2), but the basic idea of working with the events should prove useful even if you have to tweak the details.
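For instance, if you would rather treat each day as simply covered or not (so overlapping jids for the same cid count once instead of twice), a possible tweak on top of df_new above, sketched here with a made-up column name, is:

# treat any positive count as 1: is at least one jid active for this cid on this day?
df_new["covered"] = (df_new["count"] > 0).astype(int)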

Solution 2:

Here is a solution I came up with (it loops through the Cartesian product of the unique cids and the full date range, computing your counts):

from itertools import product

df_new_date = pd.DataFrame(list(product(df.cid.unique(),
                                         pd.date_range(df.stdt.min(), df.enddt.max()))),
                           columns=['cid', 'newdate'])
df_new_date['cnt'] = df_new_date.apply(
    lambda row: df[(df['cid'] == row['cid']) &
                   (df['stdt'] <= row['newdate']) &
                   (df['enddt'] >= row['newdate'])]['jid'].count(),
    axis=1)

>>> df_new_date.head(20)
    cid    newdate  cnt
0     1 2015-07-01    0
1     1 2015-07-02    0
2     1 2015-07-03    0
3     1 2015-07-04    0
4     1 2015-07-05    0
5     1 2015-07-06    1
6     1 2015-07-07    1
7     1 2015-07-08    1
8     1 2015-07-09    1
9     1 2015-07-10    1
10    1 2015-07-11    2
11    1 2015-07-12    3
12    1 2015-07-13    3
13    1 2015-07-14    2
14    1 2015-07-15    3
15    1 2015-07-16    3
16    1 2015-07-17    3
17    1 2015-07-18    3
18    1 2015-07-19    2
19    1 2015-07-20    1

You could then drop the zeros if you don't want them. I don't think this will be much better than your original solution, however.
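Dropping the zero rows is just a boolean filter on the result above, for example:

# keep only the days on which at least one jid is active for the cid
df_new_date = df_new_date[df_new_date['cnt'] > 0]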

I would also like to suggest the following improvement to the loop in the @DSM solution:

df_parts = []
for cid in df.cid.unique():
    full_count = (df[df.cid == cid][['cid', 'date', 'count']]
                  .set_index("date")
                  .asfreq("D", method='ffill')[['cid', 'count']]
                  .reset_index())
    df_parts.append(full_count[full_count['count'] != 0])

df_new = pd.concat(df_parts)

>>> df_new
         date  cid  count
0  2015-07-06    1      1
1  2015-07-07    1      1
2  2015-07-08    1      1
3  2015-07-09    1      1
4  2015-07-10    1      1
5  2015-07-11    1      2
6  2015-07-12    1      3
7  2015-07-13    1      3
8  2015-07-14    1      2
9  2015-07-15    1      3
10 2015-07-16    1      3
11 2015-07-17    1      3
12 2015-07-18    1      3
13 2015-07-19    1      2
14 2015-07-20    1      1
15 2015-07-21    1      1
16 2015-07-22    1      1
0  2015-07-01    2      1
1  2015-07-02    2      1
2  2015-07-03    2      1
3  2015-07-04    2      1
4  2015-07-05    2      1
5  2015-07-06    2      1
6  2015-07-07    2      2
7  2015-07-08    2      2
8  2015-07-09    2      2
9  2015-07-10    2      3
10 2015-07-11    2      3
11 2015-07-12    2      4
12 2015-07-13    2      4
13 2015-07-14    2      5
14 2015-07-15    2      4
15 2015-07-16    2      4
16 2015-07-17    2      3
17 2015-07-18    2      2
18 2015-07-19    2      2
19 2015-07-20    2      1
20 2015-07-21    2      1

The only real improvement over what @DSM provided is that this avoids creating a groupby object for the loop, and it also restricts each cid to its own range from min stdt to max enddt, with no zero-count rows.
