
Python Pandas Multiply Dataframe By Weights That Vary With Category In Vectorized Fashion

My problem is very similar to the one outlined here, except that my main data frame has a category column, as do my weights: df Out[3]: Symbol var_1 var_2 var_

Solution 1:

You can either group by category and then do the dot(), or do the dot() against every category's weights and then select the result for each row's category. The latter is faster and simpler in pandas. Note that in the data I used, the column names match between the data and the weights frames.

Code for dot() and then select:

df['dot'] = df[df_wgt.columns].dot(df_wgt.T).lookup(df.index, df.Category)

Steps performed...

  1. Select the columns to use with df[df_wgt.columns]

    This uses the column labels and ordering from the weight dataframe. This is important because dot() needs the data to be in the correct order.

  2. Perform the dot product against the transposed weights dataframe with .dot(df_wgt.T)

    Transposing the weights puts them in the correct orientation for the .dot(). This does the calculation for all of the weight categories for each row of data. With four categories, that means we do four times as many multiplications as are needed, but it is still likely faster than grouping.

  3. Select the needed dot product with .lookup(df.index, df.Category)

    By using lookup() we can gather the correct result for the category of that row.
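
DataFrame.lookup() was deprecated in pandas 1.2 and removed in pandas 2.0, so on current versions the selection step needs a substitute. The following is a minimal sketch of the same three steps using NumPy integer indexing instead. It assumes every value in Category appears as a row label of df_wgt; the sample frames are abbreviated from the test data, and `products` and `col_pos` are illustrative names.

```python
import numpy as np
import pandas as pd

# Abbreviated sample data (three rows, three categories)
df = pd.DataFrame(
    {
        "var_1": [0.000443, -0.001354, -0.001354],
        "var_2": [0.006928, 0.001545, 0.001545],
        "var_3": [0.000000, 0.000007, 0.000007],
        "var_4": [0.012375, -0.008195, -0.008195],
        "Category": ["A", "C", "B"],
    },
    index=pd.Index([1903, 1905, 1909], name="Index"),
)
df_wgt = pd.DataFrame(
    {
        "var_1": [0.182022, 0.534814, 0.131243],
        "var_2": [0.182022, 0.534814, 0.534814],
        "var_3": [0.131243, 0.534814, 0.131243],
        "var_4": [0.182022, 0.534814, 0.182022],
    },
    index=pd.Index(["A", "B", "C"], name="Category"),
)

# Steps 1 and 2: one dot product per (row, category) pair
products = df[df_wgt.columns].dot(df_wgt.T)

# Step 3: column position of each row's own category (-1 means "not found")
col_pos = products.columns.get_indexer(df["Category"].to_numpy())

# Pick exactly one value per row with paired integer indices
df["dot"] = products.to_numpy()[np.arange(len(df)), col_pos]
print(df)
```

A -1 in `col_pos` would silently select the last column, so it is worth checking `(col_pos >= 0).all()` when the categories are not guaranteed to match.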

Code for select (groupby) and then dot():

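import numpy as np
import pandas as pd

def dot(group):
    # Every row in the group shares one category, so read it from the first row
    category = group['Category'].iloc[0]
    weights = df_wgt.loc[category].values
    # Matrix-vector product: one weighted sum per row of the group
    return pd.Series(
        np.dot(group[df_wgt.columns].values, weights), index=group.index)

# Drop the group keys from the result and realign on the original index
df['dot'] = df.groupby(['Category']).apply(dot) \
    .reset_index().set_index('Index')[0]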
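
A third route, not part of the answer above, avoids both the grouping and the extra dot products: fetch each row's weight vector by its category, multiply element-wise, and sum across the columns. This is only a sketch under the same assumptions (weights indexed by Category, matching column names). The sample data is abbreviated, and `row_weights` and `weighted` are illustrative names.

```python
import pandas as pd

# Abbreviated sample data (three rows, three categories)
df = pd.DataFrame(
    {
        "var_1": [0.000443, -0.001354, -0.001354],
        "var_2": [0.006928, 0.001545, 0.001545],
        "var_3": [0.000000, 0.000007, 0.000007],
        "var_4": [0.012375, -0.008195, -0.008195],
        "Category": ["A", "C", "B"],
    },
    index=pd.Index([1903, 1905, 1909], name="Index"),
)
df_wgt = pd.DataFrame(
    {
        "var_1": [0.182022, 0.534814, 0.131243],
        "var_2": [0.182022, 0.534814, 0.534814],
        "var_3": [0.131243, 0.534814, 0.131243],
        "var_4": [0.182022, 0.534814, 0.182022],
    },
    index=pd.Index(["A", "B", "C"], name="Category"),
)

cols = df_wgt.columns

# One weight row per data row, in the same order as df;
# .loc raises KeyError if a category is missing from df_wgt
row_weights = df_wgt.loc[df["Category"].to_numpy(), cols].to_numpy()

# Element-wise product: the per-variable weighted values
weighted = df[cols] * row_weights

# Summing across the columns gives the same number as the dot product
df["dot"] = weighted.sum(axis=1)
print(df)
```

The intermediate `weighted` frame is useful when the per-variable weighted values are wanted rather than only their sum.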

Test Code:

import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u"""
    Index          var_1      var_2     var_3     var_4    Category
    1903          0.000443  0.006928  0.000000  0.012375      A
    1904         -0.000690 -0.007873  0.000171  0.014824      A
    1905         -0.001354  0.001545  0.000007 -0.008195      C
    1906         -0.001578  0.008796 -0.000164  0.015955      D
    1907         -0.001578  0.008796 -0.000164  0.015955      A
    1909         -0.001354  0.001545  0.000007 -0.008195      B"""),
                 header=1, skiprows=0).set_index(['Index'])

df_wgt = pd.read_fwf(StringIO(u"""
     Category     var_1      var_2     var_3     var_4
        A       0.182022   0.182022  0.131243  0.182022
        B       0.534814   0.534814  0.534814  0.534814
        C       0.131243   0.534814  0.131243  0.182022
        D       0.182022   0.151921  0.151921  0.131243"""),
                 header=1, skiprows=0).set_index(['Category'])

df['dot'] = df[df_wgt.columns].dot(df_wgt.T).lookup(df.index, df.Category)
print(df)

Results:

          var_1     var_2     var_3     var_4 Category       dot
Index
1903   0.000443  0.006928  0.000000  0.012375        A  0.003594
1904  -0.000690 -0.007873  0.000171  0.014824        A  0.001162
1905  -0.001354  0.001545  0.000007 -0.008195        C -0.000842
1906  -0.001578  0.008796 -0.000164  0.015955        D  0.003118
1907  -0.001578  0.008796 -0.000164  0.015955        A  0.004196
1909  -0.001354  0.001545  0.000007 -0.008195        B -0.004277

Solution 2:

Setup

print(df)
Out[655]: 
           var_1     var_2     var_3     var_4 Category
Symbol
1903    0.000443  0.006928  0.000000  0.012375        A
1904   -0.000690 -0.007873  0.000171  0.014824        A
1905   -0.001354  0.001545  0.000007 -0.008195        C
1906   -0.001578  0.008796 -0.000164  0.015955        D
1907   -0.001578  0.008796 -0.000164  0.015955        A
1909   -0.001354  0.001545  0.000007 -0.008195        B

print(w)
Out[656]: 
  Category  var_1_wgt  var_2_wgt  var_3_wgt  var_4_wgt
0        A   0.182022   0.182022   0.131243   0.182022
1        B   0.534814   0.534814   0.534814   0.534814
2        C   0.131243   0.534814   0.131243   0.182022
3        D   0.182022   0.151921   0.151921   0.131243

Solution

import numpy as np

# Convert Category to a numerical encoding (A -> 0, B -> 1, ...)
df['C_Number'] = df.Category.apply(lambda x: ord(x.lower()) - 97)

# Get a dot product for each row with all category weights, then extract the result by the category number.
# The weights are transposed so that rows are variables and columns are categories.
df['new_var'] = ((df.iloc[:, :4].values).dot(w.iloc[:, -4:].values.T))[np.arange(len(df)), df.C_Number.values]

Out[654]:
           var_1     var_2     var_3     var_4 Category  C_Number   new_var
Symbol
1903    0.000443  0.006928  0.000000  0.012375        A         0  0.003594
1904   -0.000690 -0.007873  0.000171  0.014824        A         0  0.001162
1905   -0.001354  0.001545  0.000007 -0.008195        C         2 -0.000842
1906   -0.001578  0.008796 -0.000164  0.015955        D         3  0.003118
1907   -0.001578  0.008796 -0.000164  0.015955        A         0  0.004196
1909   -0.001354  0.001545  0.000007 -0.008195        B         1 -0.004277
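
The ord() encoding only works while the categories are single letters starting at A and the rows of w are in alphabetical order. One possible alternative is to take each category's position from w itself. The sketch below assumes w has a Category column followed by the four weight columns, as in the setup; the sample data is abbreviated, and `codes` and `all_products` are illustrative names.

```python
import numpy as np
import pandas as pd

# Abbreviated sample data (three rows, three categories)
df = pd.DataFrame(
    {
        "var_1": [0.000443, -0.001354, -0.001354],
        "var_2": [0.006928, 0.001545, 0.001545],
        "var_3": [0.000000, 0.000007, 0.000007],
        "var_4": [0.012375, -0.008195, -0.008195],
        "Category": ["A", "C", "B"],
    },
    index=pd.Index([1903, 1905, 1909], name="Symbol"),
)
w = pd.DataFrame(
    {
        "Category": ["A", "B", "C"],
        "var_1_wgt": [0.182022, 0.534814, 0.131243],
        "var_2_wgt": [0.182022, 0.534814, 0.534814],
        "var_3_wgt": [0.131243, 0.534814, 0.131243],
        "var_4_wgt": [0.182022, 0.534814, 0.182022],
    }
)

# Row position of each category within w (-1 marks a category missing from w)
codes = pd.Categorical(df["Category"], categories=w["Category"].to_numpy()).codes

# (rows x vars) @ (vars x categories): column k holds the result for category k
all_products = df.iloc[:, :4].to_numpy() @ w.iloc[:, -4:].to_numpy().T

# Keep only the product that belongs to each row's own category
df["new_var"] = all_products[np.arange(len(df)), codes]
print(df)
```

Because the codes come from the order of rows in w, this keeps working if the categories are renamed or reordered, as long as each one appears in w.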
