How To Select A Subset From A Multi-index Dataframe Based On Conditions From Another Dataframe
I have a dataframe like below: dates 0 numbers letters 0 a 2013-01-01 0.261092 2013-01-02 -1.267770
Solution 1:
You can also use index intersection:
In[39]: lOut[39]:
numbersletters00b11cIn[40]: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
Out[40]:
dates0numbersletters0b2013-01-01-1.515866b2013-01-020.351942b2013-01-03-0.2454631c2013-01-01-0.108325c2013-01-02-0.212350c2013-01-030.021097c2013-01-01-0.419775c2013-01-021.511700c2013-01-030.994332
Or more straightforward and faster solution from @Javier:
In [155]:df.loc[l.set_index(['numbers','letters']).index]Out[155]:dates0numbersletters0b2013-01-01 -1.515866b2013-01-02 0.351942b2013-01-03 -0.2454631c2013-01-01 -0.108325c2013-01-02 -0.212350c2013-01-03 0.021097c2013-01-01 -0.419775c2013-01-02 1.511700c2013-01-03 0.994332
Timing:
for 27.000 rows Multi-Index DF
In [156]: df = pd.concat([df.reset_index()] * 10**3, ignore_index=True).set_index(['numbers','letters'])
In [157]: df.shape
Out[157]: (27000, 2)
In [158]: %%timeit
...: q = l.apply(lambda r: "(numbers == {} and letters == '{}')".format(r.numbers, r.letters),
...: axis=1) \
...: .str.cat(sep=' or ')
...: df.query(q)
...:
10 loops, best of 3: 21.3 ms per loop
In [159]: %%timeit
...: df.loc[l.set_index(['numbers','letters']).index]
...:
10 loops, best of 3: 20.2 ms per loop
In [160]: %%timeit
...: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
...:
10 loops, best of 3: 27.2 ms per loop
for 270.000 rows Multi-Index DF
In [163]: %%timeit
...: q = l.apply(lambda r: "(numbers == {} and letters == '{}')".format(r.numbers, r.letters),
...: axis=1) \
...: .str.cat(sep=' or ')
...: df.query(q)
...:
10 loops, best of 3: 117 ms per loop
In [164]: %%timeit
...: df.loc[l.set_index(['numbers','letters']).index]
...:
1 loop, best of 3: 142 ms per loop
In [165]: %%timeit
...: df.loc[df.index.intersection(l.set_index(['numbers','letters']).index)]
...:
10 loops, best of 3: 185 ms per loop
Conclusion:df.query()
method which uses numexpr
module internaly seems to be faster for bigger DFs
Solution 2:
assuming you have the following DF with the values that you want to get:
In[28]: lOut[28]:
numbersletters00b11c
if you need to select all rows where numbers
is either 0
or 1
and letters
is in ['b','c']
you can use df.query()
method as follows:
In [29]:df.query("numbersin@l.numbersandlettersin@l.letters")Out[29]:dates0numbersletters0b2013-01-01 -1.515866b2013-01-02 0.351942b2013-01-03 -0.245463c2013-01-01 -0.253103c2013-01-02 -0.385411c2013-01-03 -1.7408211c2013-01-01 -0.108325c2013-01-02 -0.212350c2013-01-03 0.021097b2013-01-01 -1.922214b2013-01-02 -1.769003b2013-01-03 -0.594216c2013-01-01 -0.419775c2013-01-02 1.511700c2013-01-03 0.994332
or simply:
df.query("numbers in [0,1] and letters in ['b','c']")
UPDATE: if it must be exact matches, like (0, 'b')
and (1, 'c')
:
In [14]:q=l.apply(lambdar:"(numbers == {} and letters == '{}')".format(r.numbers,r.letters),...:axis=1)\...:.str.cat(sep='or')
...:
In [15]: q
Out[15]: "(numbers == 0 and letters == 'b')or(numbers==1andletters=='c')"In [16]:df.query(q)Out[16]:dates0numbersletters0b2013-01-01 -1.515866b2013-01-02 0.351942b2013-01-03 -0.2454631c2013-01-01 -0.108325c2013-01-02 -0.212350c2013-01-03 0.021097c2013-01-01 -0.419775c2013-01-02 1.511700c2013-01-03 0.994332
Post a Comment for "How To Select A Subset From A Multi-index Dataframe Based On Conditions From Another Dataframe"