
How To Do Group By On A Multiindex In Pandas?

Below is my dataframe. I made some transformations to create the category column and dropped the original column it was derived from. Now I need to do a group-by to remove the du

Solution 1:

You can create the index on the existing dataframe. With the subset of data provided, this works for me:

import pandas

df = pandas.DataFrame.from_dict(
    {
     'category': {0:'Love', 1:'Love', 2:'Fashion', 3:'Fashion', 4:'Hair', 5:'Movies', 6:'Movies', 7:'Health', 8:'Health', 9:'Celebs', 10:'Celebs', 11:'Travel', 12:'Weightloss', 13:'Diet', 14:'Bags'}, 
     'impressions': {0:380, 1:374242, 2:197, 3:13363, 4:4, 5:189, 6:60632, 7:269, 8:40189, 9:138, 10:66590, 11:2227, 12:22668, 13:21707, 14:229}, 
     'date': {0:'2013-11-04', 1:'2013-11-04', 2:'2013-11-04', 3:'2013-11-04', 4:'2013-11-04', 5:'2013-11-04', 6:'2013-11-04', 7:'2013-11-04', 8:'2013-11-04', 9:'2013-11-04', 10:'2013-11-04', 11:'2013-11-04', 12:'2013-11-04', 13:'2013-11-04', 14:'2013-11-04'},
     'cpc_cpm_revenue': {0:0.36823, 1:474.81522000000001, 2:0.19434000000000001, 3:18.264220000000002, 4:0.00080000000000000004, 5:0.23613000000000001, 6:81.391139999999993, 7:0.27171000000000001, 8:51.258200000000002, 9:0.11536, 10:83.966859999999997, 11:3.43248, 12:31.695889999999999, 13:28.459320000000002, 14:0.43524000000000002},
     'clicks': {0:0, 1:183, 2:0, 3:9, 4:0, 5:1, 6:20, 7:0, 8:21, 9:0, 10:32, 11:1, 12:12, 13:9, 14:2},
     'size': {0:'300x250', 1:'300x250', 2:'300x250', 3:'300x250', 4:'300x250', 5:'300x250', 6:'300x250', 7:'300x250', 8:'300x250', 9:'300x250', 10:'300x250', 11:'300x250', 12:'300x250', 13:'300x250', 14:'300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0, 1]).sum()

If you're having duplicate index issues with the full dataset, you'll need to clean up the data a bit. Remove the duplicate rows if that's acceptable. If the duplicate rows are valid, then what sets them apart from each other? If you can add that to the dataframe and include it in the index, that's ideal. If not, just create a dummy column that defaults to 1, but can be 2 or 3 or ... N in the case of N duplicates, and then include that field in the index as well.
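One possible way to build that dummy column is to number the rows within each (date, category) pair using groupby with cumcount. This is only a sketch on a made-up four-row sample; the column name dup_id is hypothetical and not something from the question.

```python
import pandas as pd

# Small made-up sample with one repeated (date, category) pair.
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Hair"],
    "impressions": [380, 374242, 197, 4],
})

# cumcount() numbers rows within each group starting at 0,
# so adding 1 gives 1, 2, ... N for N duplicates.
# "dup_id" is a hypothetical name for the dummy column.
df["dup_id"] = df.groupby(["date", "category"]).cumcount() + 1

# Including the dummy column in the index makes every index entry unique.
indexed = df.set_index(["date", "category", "dup_id"])
print(indexed.index.is_unique)
```

With the counter in the index, the two "Love" rows are distinguishable, and you can still aggregate them later by grouping on the first two levels only.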

Alternatively, I'm pretty sure you can skip the index creation and directly groupby with columns:

df.groupby(by=['date', 'category']).sum()

Again, that works on the subset of data that you posted.
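To check that the two approaches agree, here is a small sketch on a made-up sample (not the full data from the question) that compares grouping by columns with grouping by index levels. Only numeric columns are included so that sum() behaves the same across pandas versions.

```python
import pandas as pd

# Made-up sample with only numeric value columns.
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Fashion"],
    "impressions": [380, 374242, 197, 13363],
    "clicks": [0, 183, 0, 9],
})

# Group directly on the columns.
by_columns = df.groupby(by=["date", "category"]).sum()

# Set the index first, then group on both index levels.
by_levels = df.set_index(["date", "category"]).groupby(level=[0, 1]).sum()

print(by_columns.equals(by_levels))
```

Both results end up with a (date, category) MultiIndex and one row per pair, so which form to use is mostly a matter of whether you need the index for something else.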

Solution 2:

I usually need this when I try to unstack a multi-index and it fails because there are duplicate index values.

Here is the simple command that I run to find the problematic items:

df.groupby(level=df.index.names).count()
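As a sketch of how the counts might be used, assuming a made-up sample with one repeated index entry, you can filter for groups that occur more than once. Index.duplicated is shown as an alternative that gives a row-level mask instead of per-group counts.

```python
import pandas as pd

# Made-up sample: the ("2013-11-04", "Love") index entry appears twice.
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Hair"],
    "impressions": [380, 374242, 197, 4],
}).set_index(["date", "category"])

# Count rows per index entry, then keep the entries seen more than once.
counts = df.groupby(level=df.index.names).count()
problem = counts[counts["impressions"] > 1]
print(problem)

# Alternative: a boolean mask marking every row whose index entry is repeated.
mask = df.index.duplicated(keep=False)
print(df[mask])
```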
