
Python Pandas: Using Aggregate Vs Apply To Define New Columns

Suppose I have a dataframe like so:

n = 20
dim1 = np.random.randint(1, 3, size=n)
dim2 = np.random.randint(3, 5, size=n)
data1 = np.random.randint(10, 20, size=n)
data2 = np.random…
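The question's setup is cut off above, so here is a minimal, hedged reconstruction of the kind of frame the answer below works with. The grouping keys a and b and the val1/val2 column names are assumptions taken from the answer's own snippets, not from the original question:

import numpy as np
import pandas as pd

# assumed setup, reconstructed from the answer's snippets
np.random.seed(0)
n = 20
df = pd.DataFrame({
    'a': np.random.randint(1, 3, size=n),      # first grouping key
    'b': np.random.randint(3, 5, size=n),      # second grouping key
    'val1': np.random.randint(10, 20, size=n),
    'val2': np.random.randint(20, 30, size=n),
})
g = df.groupby(['a', 'b'])

def h(x):
    # the "aggregation" under discussion: a ratio of two sums
    return x['val1'].sum() / x['val2'].sum()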

Solution 1:

To step back slightly, a faster way to do this particular "aggregation" is to just use sum (it's optimised in Cython) a couple of times:

In [11]: %timeit g.apply(h)
1000 loops, best of 3: 1.79 ms per loop

In [12]: %timeit g['val1'].sum() / g['val2'].sum()
1000 loops, best of 3: 600 µs per loop
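For a concrete check that the two routes agree, here is a sketch using the df, g and h assumed above:

import numpy as np

slow = g.apply(h)                          # one Python call per group
fast = g['val1'].sum() / g['val2'].sum()   # two Cython-backed sums
assert np.allclose(slow, fast)             # same per-group values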

IMO the groupby code is pretty hairy, and I usually lazily "black box" it: peek at what's going on by collecting, in a list, every value it passes to the function:

a = []

def h1(x):
    a.append(x)  # record every object groupby passes in
    return h(x)  # h is the original ratio-of-sums function
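To see the spy in action, run the failing call and then inspect a afterwards. A hedged sketch, assuming the g and h from above and that the failing call was g.aggregate(h), as in the question's title:

a = []
try:
    g.aggregate(h1)            # the aggregate route, which errors out
except Exception as exc:
    print(type(exc).__name__, exc)

for item in a:                 # what groupby actually fed to h1
    print(type(item), getattr(item, 'name', None))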

Warning: sometimes the types of the objects in this list are not consistent (pandas tries a few different strategies before settling on a calculation)... as in this example!

The second aggregation gets stuck applying h to each column of the group separately (which raises an error):

0     10
4     16
8     13
9     17
17    17
19    11
Name: val1, dtype: int64

This is the sub-Series of the val1 column where (a, b) = (1, 3).
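You can reproduce the failure in isolation: h indexes x['val1'], which works on a group's sub-DataFrame but raises a KeyError on a bare Series of values. A sketch with the assumed names from above:

sub = df[(df['a'] == 1) & (df['b'] == 3)]  # one group, as a DataFrame
h(sub)                                     # fine: the columns are present

s = sub['val1']                            # the same group's val1 column
h(s)                                       # KeyError: 'val1'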

This may well be a bug: after this raises, perhaps pandas could try something else (my suspicion is that this is why the first version works; it's special-cased)...

For those interested, the a I get is:

In [21]: a
Out[21]:
[SNDArray([125755456, 131767536,        13,        17,        17,        11]),
 Series([], name: val1, dtype: int64),
 0     10
 4     16
 8     13
 9     17
 17    17
 19    11
 Name: val1, dtype: int64]

I've no idea what the SNDArray is all about...
