Find Rows In A Dataframe That Partially Match Conditions
Given a DataFrame, what's the best way to find rows in the DataFrame that partially match a list of given values? Currently I have rows of given values in a DataFrame (df1), I it
Solution 1:
I'm not sure there's a way around looping over at least one DataFrame, but here's one option that might speed things up. Because each row is turned into a set, column positions are ignored, so a FirstName can accidentally match a LastName; that can be avoided by adding a unique prefix to the values (like '@' for first name and '&' for last name).
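import numpy as np

# one set of values per row
s1 = [set(x) for x in df1.values]
s2 = [set(x) for x in df2.values]
# masks[i, j] is True when row i of df1 shares at least 3 values with row j of df2
masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
concat_all = [df2[m] for m in masks]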
Output of concat_all:
[  FirstName LastName  Birthday  ResidenceZip
0      John      Doe  1/1/2000         99999
1      John      Doe  1/1/2000         99999
2      John     Doex  1/1/2000         99999,
   FirstName LastName  Birthday  ResidenceZip
5       Rob        A  9/9/2009         19499]
Timings
def Alollz(df1, df2):
    s1 = [set(x) for x in df1.values]
    s2 = [set(x) for x in df2.values]
    masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
    concat_all = [df2[m] for m in masks]
    return concat_all
def SharpObject(df1, df2):
    # partialMatch is not defined in this post
    concat_all = []
    for i, row in df1.iterrows():
        c = {'ResidenceZip': row['ResidenceZip'], 'FirstName': row['FirstName'],
             'LastName': row['LastName'], 'Birthday': row['Birthday']}
        df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis=1)
        x1 = df2[df2['count'] >= 3]
        concat_all.append(x1)
    return concat_all
%timeit Alollz(df1, df2)
#785 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit SharpObject(df1, df2)
#3.56 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And larger:
# you should never grow dfs like this in a loop
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
for i in range(7):
    df1 = pd.concat([df1, df1])
    df2 = pd.concat([df2, df2])
%timeit Alollz(df1, df2)
#132 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit SharpObject(df1, df2)
#6.88 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
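The prefixing idea mentioned in Solution 1, which keeps a FirstName from matching a LastName, could look like the following minimal sketch. It tags each value with its column name rather than with single characters such as '@' and '&', and it assumes both DataFrames have the same columns. The names tagged_sets, partial_matches and min_matches are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

def tagged_sets(df):
    # One set of (column, value) pairs per row, so equal values that sit
    # in different columns can no longer intersect.
    cols = list(df.columns)
    return [set(zip(cols, row)) for row in df.values]

def partial_matches(df1, df2, min_matches=3):
    # Hypothetical helper: same approach as Solution 1, but column-aware.
    s1 = tagged_sets(df1)
    s2 = tagged_sets(df2)
    # masks[i, j] is True when row i of df1 and row j of df2 agree on
    # at least min_matches columns.
    masks = np.reshape([len(x & y) >= min_matches for x in s1 for y in s2],
                       (len(df1), -1))
    return [df2[m] for m in masks]
```

Called as partial_matches(df1, df2) on the small setup DataFrames from this post, it should select the same rows as the plain set version, since that data contains no value that appears in two different columns.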
Solution 2:
Using the numpy isin function:
import numpy as np

df1_vals = df1.values
df2_vals = df2.values
df1_rows = range(df1_vals.shape[0])
# for each df1 row, keep the df2 rows with at least 3 cells found in that row
concat_all = [df2[np.add.reduce(np.isin(df2_vals, df1_vals[row]), axis=1) >= 3]
              for row in df1_rows]
Here are the dataframes for setup:
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000', '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})
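To see how the isin expression in Solution 2 counts matches, here is a small self-contained sketch that runs it for the first df1 row only, using the setup data above. The intermediate names first_row, hits, match_counts and matched are introduced for illustration; hits.sum(axis=1) is used as an equivalent of np.add.reduce(..., axis=1).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000',
                                 '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})

df2_vals = df2.values
first_row = df1.values[0]   # John, Doe, 1/1/2000, 99999

# hits[j, k] is True when cell (j, k) of df2 equals any value in first_row
hits = np.isin(df2_vals, first_row)

# number of matching cells in each df2 row
match_counts = hits.sum(axis=1)

# df2 rows with at least 3 matching cells
matched = df2[match_counts >= 3]
```

Like the set approach in Solution 1, this tests membership in the whole row, so it does not check that a value was found in the same column.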