
Faster Way To Rank Rows In Subgroups In Pandas Dataframe

I have a pandas data frame that is composed of different subgroups. df = pd.DataFrame({ 'id':[1, 2, 3, 4, 5, 6, 7, 8], 'group':['a', 'a', 'a', 'a', 'b', 'b', 'b',

Solution 1:

rank is cythonized, so it should be very fast. You can pass the same options as df.rank(); here are the docs for rank. As you can see, tie-breaks can be handled in one of five different ways via the method argument.

It's also possible you simply want the .cumcount() of the group (see the sketch after the example below).

In [12]: df.groupby('group')['value'].rank(ascending=False)
Out[12]: 
0    4
1    1
2    3
3    2
4    3
5    2
6    1
7    4
dtype: float64
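To make the tie-breaking options and the cumcount alternative concrete, here is a minimal self-contained sketch. The 'value' column and the second half of the 'group' column are made up, since the asker's frame is truncated above, so treat the numbers as illustrative only:

import pandas as pd

# Hypothetical stand-in for the truncated frame in the question
# (the 'value' numbers are invented, not the asker's data).
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [10, 40, 20, 30, 30, 40, 50, 20],
})

# Rank within each group, largest value first (ties averaged by default).
df['rank_avg'] = df.groupby('group')['value'].rank(ascending=False)

# Tie-breaking alternatives via method=: 'average', 'min', 'max', 'first', 'dense'.
df['rank_dense'] = df.groupby('group')['value'].rank(ascending=False, method='dense')

# cumcount simply numbers the rows 0..n-1 within each group in their current
# order, regardless of 'value'.
df['order_in_group'] = df.groupby('group').cumcount()

print(df)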

Solution 2:

Working with a big DataFrame (13 million rows), the rank method with groupby maxed out my 8 GB of RAM and took a really long time. I found a workaround that is less greedy with memory, which I put here just in case:

df = df.sort_values(['group', 'value'])  # keep each group's rows contiguous and ordered by value
tmp = df.groupby('group').size()         # number of rows per group, in group order
rank = tmp.map(range)                    # a 0..size-1 range per group
rank = [item for sublist in rank for item in sublist]
df['rank'] = rank                        # positional assignment follows the sort order
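As a hedged sanity check of this workaround (using the same made-up frame as the sketch above, not the asker's 13-million-row data), the 0-based ascending ranks it produces should agree with groupby-rank using method='first', minus one:

import pandas as pd

# Same invented frame as in the earlier sketch.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [10, 40, 20, 30, 30, 40, 50, 20],
})

# Memory-lean workaround: sort so each group's rows are contiguous and ordered
# by value, then hand out 0..size-1 per group positionally.
df = df.sort_values(['group', 'value'])
sizes = df.groupby('group').size()
df['rank'] = [i for size in sizes for i in range(size)]
df = df.sort_index()  # restore the original row order

# On this small frame the result should match groupby-rank with method='first' - 1.
expected = df.groupby('group')['value'].rank(method='first').astype(int) - 1
assert (df['rank'] == expected).all()

Note that the workaround's ranks are ascending and 0-based; if you want the descending, 1-based ranks of Solution 1, sort with ascending=False on 'value' and add 1.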
