Find Rows In A Dataframe That Partially Match Conditions
Given a DataFrame, what's the best way to find the rows that partially match a list of given values? Currently I have rows of given values in a DataFrame (df1), I it
Solution 1:
Not sure if there's a way around looping over at least one DataFrame, but here's one option that might speed things up. It does allow for the accidental comparison of FirstName with LastName, though that can be avoided by adding a unique prefix to the values (like '@' for first name and '&' for last name).
import numpy as np
s1 = [set(x) for x in df1.values]
s2 = [set(x) for x in df2.values]
masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
concat_all = [df2[m] for m in masks]
Output of concat_all:
[  FirstName LastName  Birthday  ResidenceZip
 0      John      Doe  1/1/2000         99999
 1      John      Doe  1/1/2000         99999
 2      John     Doex  1/1/2000         99999,
   FirstName LastName  Birthday  ResidenceZip
 5       Rob        A  9/9/2009         19499]
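The prefix idea mentioned above can be sketched by tagging every value with its column name, so that a FirstName can only ever match a FirstName. This is a minimal sketch rather than part of the original answer: `tagged_sets` and `tagged_matches` are hypothetical names, and the data is copied from the setup frames given at the end of this post.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000',
                                 '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})


def tagged_sets(df):
    # Hypothetical helper: one set of (column, value) pairs per row,
    # so equal values in different columns never count as a match.
    cols = list(df.columns)
    return [set(zip(cols, row)) for row in df.itertuples(index=False, name=None)]


s1 = tagged_sets(df1)
s2 = tagged_sets(df2)

# masks[i, j] is True when df1 row i and df2 row j agree in at least 3 columns.
masks = np.array([[len(x & y) >= 3 for y in s2] for x in s1])
tagged_matches = [df2[m] for m in masks]
```

On this data the result is the same as the untagged version, since no value appears in more than one column; the tagging only matters when, say, a first name can also occur as a last name.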
Timings
def Alollz(df1, df2):
    s1 = [set(x) for x in df1.values]
    s2 = [set(x) for x in df2.values]
    masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
    concat_all = [df2[m] for m in masks]
    return concat_all

def SharpObject(df1, df2):
    # partialMatch is not defined in this post; it is presumably the
    # row-scoring helper from the question.
    concat_all = []
    for i, row in df1.iterrows():
        c = {'ResidenceZip': row['ResidenceZip'], 'FirstName': row['FirstName'],
             'LastName': row['LastName'], 'Birthday': row['Birthday']}
        df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis=1)
        x1 = df2[df2['count'] >= 3]
        concat_all.append(x1)
    return concat_all
%timeit Alollz(df1, df2)
#785 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit SharpObject(df1, df2)
#3.56 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And larger:
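# Doubles both frames 7 times. You should never grow DataFrames like this in a loop.
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.)
for i in range(7):
    df1 = pd.concat([df1, df1])
    df2 = pd.concat([df2, df2])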
%timeit Alollz(df1, df2)
#132 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit SharpObject(df1, df2)
#6.88 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
Using the numpy isin function:
df1_vals = df1.values
df2_vals = df2.values
df1_rows = range(df1_vals.shape[0])
concat_all = [df2[np.add.reduce(np.isin(df2_vals, df1_vals[row]), axis=1) >= 3]
              for row in df1_rows]
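To see what the `np.isin` expression does for a single df1 row, here is a small worked sketch. The arrays `target` and `candidates` are hypothetical stand-ins built from a few rows of the setup data, not names used in the answer itself.

```python
import numpy as np

# One reference row (as df1_vals[row] would be) and three candidate rows.
target = np.array(['John', 'Doe', '1/1/2000', 99999], dtype=object)
candidates = np.array([['John', 'Doex', '1/1/2000', 99999],
                       ['Joha', 'Doex', '1/1/2000', 99999],
                       ['Rob', 'A', '9/9/2009', 19499]], dtype=object)

# hits[j, c] is True when candidates[j, c] appears anywhere in target.
hits = np.isin(candidates, target)

# Number of matching values per candidate row, then the ">= 3" filter.
counts = hits.sum(axis=1)
keep = counts >= 3
```

Note that, like the set-based version, `np.isin` ignores which column a value came from, so the same caveat about FirstName matching LastName applies.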
Here are the dataframes for setup:
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000', '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})
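If both frames share the same columns in the same order, a column-by-column comparison is another way to avoid cross-column matches. The following is a sketch of that variation using NumPy broadcasting, not something from either answer; `eq`, `counts` and `broadcast_matches` are illustrative names, and the intermediate array has len(df1) x len(df2) x n_columns entries, so it may not suit very large inputs.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000',
                                 '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})

a = df1.to_numpy()                 # shape (len(df1), n_columns)
b = df2[df1.columns].to_numpy()    # same column order as df1

# eq[i, j, c] is True when df1 row i and df2 row j agree in column c.
eq = a[:, None, :] == b[None, :, :]

# counts[i, j] is the number of columns in which the two rows agree.
counts = eq.sum(axis=2)

# One filtered copy of df2 per df1 row, keeping rows with at least 3 agreements.
broadcast_matches = [df2[row_mask] for row_mask in counts >= 3]
```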