
Joining 2 Dataframes On Multiple Columns Pandas

I have two DataFrames and need to join them on two key columns (idA, idB) and compute the sum of their Distance column. Note that the pair (idA, idB) is equivalent to (idB, idA), so the order of the ids within a pair should not matter.

Solution 1:

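First convert the Distance column of both DataFrames to numeric and sort each (idA, idB) pair row-wise so that both orderings map to the same key. Then use set_index to align the frames on the pair and add to sum the matching rows: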

import numpy as np
import pandas as pd

df1['Distance'] = df1['Distance'].astype(float)
df2['Distance'] = df2['Distance'].astype(float)

# if some values cannot be parsed, convert them to NaN instead
#df1['Distance'] = pd.to_numeric(df1['Distance'], errors='coerce')
#df2['Distance'] = pd.to_numeric(df2['Distance'], errors='coerce')

df1[['idA','idB']] = np.sort(df1[['idA','idB']], axis=1)
df2[['idA','idB']] = np.sort(df2[['idA','idB']], axis=1) 

print (df1)
   Distance  idA  idB
0  0.727273    1    1
1  0.827273    2    4
2  0.127273    3    8
3  0.927273    1    2

print (df2)
   Distance  idA  idB
4      0.11    1    2
5      0.10    1    5
6      3.00    2    4
7      0.80    5    7

df3=df1.set_index(['idA','idB']).add(df2.set_index(['idA','idB']),fill_value=0).reset_index()
print (df3)
  idA idB  Distance
0   1   1  0.727273
1   1   2  1.037273
2   1   5  0.100000
3   2   4  3.827273
4   3   8  0.127273
5   5   7  0.800000
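
For readers who want to run the steps above end to end, here is a minimal self-contained sketch. The sample data is reconstructed from the printed frames; which rows were originally stored in reversed (idB, idA) order, and the assumption that Distance starts out as strings, are guesses made for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed frames above. The reversed
# pairs and the string-typed Distance column are assumptions.
df1 = pd.DataFrame({
    "Distance": ["0.727273", "0.827273", "0.127273", "0.927273"],
    "idA": [1, 4, 3, 2],
    "idB": [1, 2, 8, 1],
})
df2 = pd.DataFrame({
    "Distance": ["0.11", "0.10", "3.00", "0.80"],
    "idA": [1, 5, 2, 7],
    "idB": [2, 1, 4, 5],
}, index=[4, 5, 6, 7])

for df in (df1, df2):
    # Unparseable values become NaN instead of raising.
    df["Distance"] = pd.to_numeric(df["Distance"], errors="coerce")
    # Sort each row's pair so (idA, idB) and (idB, idA) share one key.
    df[["idA", "idB"]] = np.sort(df[["idA", "idB"]].to_numpy(), axis=1)

# Align on the pair and add; pairs present in only one frame are kept
# because the missing side is treated as 0.
df3 = (
    df1.set_index(["idA", "idB"])
       .add(df2.set_index(["idA", "idB"]), fill_value=0)
       .reset_index()
)
print(df3)
```

Without fill_value=0, pairs that appear in only one frame would come out as NaN rather than keeping their own distance.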

Another solution stacks both frames with concat and then uses groupby on the sorted pair, aggregating with sum:

df3 = pd.concat([df1, df2]).groupby(['idA','idB'], as_index=False)['Distance'].sum()
print (df3)
   idA  idB  Distance
0    1    1  0.727273
1    1    2  1.037273
2    1    5  0.100000
3    2    4  3.827273
4    3    8  0.127273
5    5    7  0.800000
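
The concat and groupby approach can also be wrapped in a small helper that leaves the input frames untouched and accepts any number of them. This is only a sketch: the function name sum_by_unordered_pair, its parameters, and the toy frames a and b are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def sum_by_unordered_pair(*frames, keys=("idA", "idB"), value="Distance"):
    """Stack frames, sort each key pair row-wise, and sum `value` per pair."""
    keys = list(keys)
    # concat returns a new frame, so the inputs are not modified.
    stacked = pd.concat(frames, ignore_index=True)
    stacked[value] = pd.to_numeric(stacked[value], errors="coerce")
    stacked[keys] = np.sort(stacked[keys].to_numpy(), axis=1)
    return stacked.groupby(keys, as_index=False)[value].sum()

# Toy data (made up for illustration): (1, 2) occurs twice in `a`,
# once in each order, and (2, 4) occurs in both frames.
a = pd.DataFrame({"idA": [1, 4, 2], "idB": [2, 2, 1], "Distance": [1.0, 2.0, 0.5]})
b = pd.DataFrame({"idA": [2, 9], "idB": [4, 3], "Distance": [3.0, 4.0]})

out = sum_by_unordered_pair(a, b)
print(out)
```

Because groupby sums every row that shares a key, this variant also covers a pair that is repeated inside a single frame.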

Solution 2:

df1.Distance=pd.to_numeric(df1.Distance)
df2.Distance=pd.to_numeric(df2.Distance)
df=pd.concat([df1.assign(key=df1.idA+df1.idB),df2.assign(key=df2.idA+df2.idB)]).\
    groupby('key').agg({'Distance':'sum','idA':'first','idB':'first'})
df
Out[672]: 
     Distance  idA  idB
key                    
2    0.727273    1    1
3    1.037273    2    1
6    3.927273    2    4
11   0.127273    3    8
12   0.800000    5    7
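
Note that the output above has a single row for key 6, which combines the pairs (1, 5) and (2, 4). The short sketch below shows why a summed key can merge unrelated pairs; the sorted-tuple key is a hypothetical alternative added here for comparison, not something the original answer uses.

```python
import pandas as pd

pairs = pd.DataFrame({"idA": [1, 2, 4], "idB": [5, 4, 2]})

# Sum-based key, as in the snippet above: different pairs can collide.
sum_key = pairs["idA"] + pairs["idB"]

# Order-insensitive key that still keeps both ids (hypothetical alternative).
pair_key = [
    tuple(sorted(p))
    for p in zip(pairs["idA"].tolist(), pairs["idB"].tolist())
]

print(sum_key.tolist())  # [6, 6, 6]
print(pair_key)          # [(1, 5), (2, 4), (2, 4)]
```

Only the last two rows describe the same unordered pair, yet the summed key treats all three as one group.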

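Updated: sort each pair and group by both id columns, so that different pairs with the same sum are no longer merged.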

df1[['idA','idB']]=np.sort(df1[['idA','idB']].values)
df2[['idA','idB']]=np.sort(df2[['idA','idB']].values)

pd.concat([df1,df2]).groupby(['idA','idB'],as_index=False).Distance.sum()
Out[678]: 
   idA  idB  Distance
0    1    1  0.727273
1    1    2  1.037273
2    1    5  0.100000
3    2    4  3.827273
4    3    8  0.127273
5    5    7  0.800000
