
Assigning The Value To A User Depending On The Cluster He Comes From

I have two dataframes: one with the songs each customer prefers, and another with users and their clusters.

DATA 1:

user song
A    11
A    22
B    99
B    11
C    11

Solution 1:

Not easy:

import pandas as pd

#add cluster column and remove duplicates
df = pd.merge(df1, df2, on='user', how='left').drop_duplicates(['user','song'])

def f(x):
    #for each group reshape: rows are users, columns are songs
    x = x.pivot(index='user', columns='song', values='cluster')
    #for each user, list the songs (columns) whose value is NaN
    x = x.apply(lambda row: row.index[row.isnull()].tolist(), axis=1)
    return x

df1 = df.groupby(['cluster']).apply(f).reset_index(level=0, drop=True).sort_index()
print (df1)
user
A    [33, 44]
B    [55, 66]
C        [77]
D    [11, 22]
E    [11, 99]
F        [66]
dtype: object
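
To see why the NaNs mark the unheard songs, here is a minimal sketch of the pivot step for a single cluster. It assumes the cluster 1 rows (users A and D) from the setup data shown in Solution 2; the names `cluster1`, `wide` and `not_heard` are illustrative only and not part of the original answer.

```python
import pandas as pd

# Rows for cluster 1 only (users A and D), as they look after the merge.
cluster1 = pd.DataFrame({
    'user': ['A', 'A', 'D', 'D'],
    'song': [11, 22, 44, 33],
    'cluster': [1, 1, 1, 1],
})

# One row per user, one column per song heard by anyone in the cluster.
# A cell is NaN when that user has no row for that song.
wide = cluster1.pivot(index='user', columns='song', values='cluster')

# For each user, collect the column labels (songs) whose cell is NaN.
not_heard = wide.apply(lambda row: row.index[row.isnull()].tolist(), axis=1)
print(not_heard)
```

Note that this approach only knows about songs that at least one user in the cluster has heard, since the columns come from the merged data rather than from a separate list of cluster songs.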

Similar solution:

import numpy as np

df = pd.merge(df1, df2, on='user', how='left').drop_duplicates(['user','song'])
df1 = (df.groupby(['cluster']).apply(lambda x: x.pivot(index='user', columns='song', values='cluster').isnull())
        .fillna(False)
        .reset_index(level=0, drop=True)
        .sort_index())

#replace each True by value of column
s = np.where(df1, ['{}, '.format(x) for x in df1.columns.astype(str)], '')
#remove empty values
s1 = pd.Series([''.join(x).strip(', ') for x in s], index=df1.index)
print (s1)
user
A    33, 44
B    55, 66
C        77
D    11, 22
E    11, 99
F        66
dtype: object

Solution 2:

Use sets for comparison.

Setup

df1

#    user  song
# 0     A    11
# 1     A    22
# 2     B    99
# 3     B    11
# 4     C    11
# 5     D    44
# 6     C    66
# 7     E    66
# 8     D    33
# 9     E    55
# 10    F    11
# 11    F    77

df2

#   user  cluster
# 0    A        1
# 1    B        2
# 2    C        3
# 3    D        1
# 4    E        2
# 5    F        3

df3

#    cluster             songs
# 0        1  [11, 22, 33, 44]
# 1        2  [11, 99, 66, 55]
# 2        3  [11, 66, 88, 77]
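
For anyone who wants to reproduce the calculation, the three frames can be built roughly as follows. This is only a sketch that copies the values from the printed tables above; how the original poster actually created the frames is not shown.

```python
import pandas as pd

# Songs each user has heard (same rows as the df1 printout).
df1 = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'C', 'D', 'C', 'E', 'D', 'E', 'F', 'F'],
    'song': [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77],
})

# Cluster assigned to each user.
df2 = pd.DataFrame({
    'user': ['A', 'B', 'C', 'D', 'E', 'F'],
    'cluster': [1, 2, 3, 1, 2, 3],
})

# Full song list for each cluster, stored as Python lists.
df3 = pd.DataFrame({
    'cluster': [1, 2, 3],
    'songs': [[11, 22, 33, 44], [11, 99, 66, 55], [11, 66, 88, 77]],
})
```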

Calculation

df = df1.groupby('user')['song'].apply(set)\
        .reset_index().rename(columns={'song': 'heard'})

df['all'] = df['user'].map(df2.set_index('user')['cluster'])\
                      .map(df3.set_index('cluster')['songs'])\
                      .map(set)

df['not heard'] = df.apply(lambda row: row['all'] - row['heard'], axis=1)

Result

  user     heard               all not heard
0    A  {11, 22}  {33, 11, 44, 22}  {33, 44}
1    B  {11, 99}  {99, 66, 11, 55}  {66, 55}
2    C  {66, 11}  {88, 66, 11, 77}  {88, 77}
3    D  {33, 44}  {33, 11, 44, 22}  {11, 22}
4    E  {66, 55}  {99, 66, 11, 55}  {11, 99}
5    F  {11, 77}  {88, 66, 11, 77}  {88, 66}

Extract whichever columns you need; converting a column of sets to lists is straightforward, e.g. df[col] = df[col].map(list).
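
Because sets are unordered, the element order after `map(list)` is not guaranteed. If a stable order matters, one option is to map with `sorted` instead, which also returns a list. The small frame below is a hypothetical stand-in for the result table, used only to keep the sketch self-contained.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the result table.
df = pd.DataFrame({
    'user': ['A', 'B'],
    'not heard': [{33, 44}, {66, 55}],
})

# sorted() returns a list, so this converts and orders in one step.
df['not heard'] = df['not heard'].map(sorted)
print(df['not heard'].tolist())
```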

Explanation

There are 3 steps:

  1. Aggregate each user's heard songs into a set, and convert each cluster's song list to a set.
  2. Perform mappings to put all data in one table.
  3. Add a column which calculates the difference between 2 sets.
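
The three steps above can also be wrapped in a single function. This is a sketch under the assumption that the frames have the column names used in the setup (`user`, `song`, `cluster`, `songs`) and that every user in df1 appears in df2; the function name `unheard_songs` is made up for illustration.

```python
import pandas as pd

def unheard_songs(df1, df2, df3):
    """Return a Series mapping each user to the cluster songs not yet heard."""
    # Step 1: aggregate heard songs per user into sets.
    heard = df1.groupby('user')['song'].apply(set)
    # Step 2: look up each user's cluster, then that cluster's songs as a set.
    user_cluster = df2.set_index('user')['cluster']
    cluster_songs = df3.set_index('cluster')['songs'].map(set)
    all_songs = user_cluster.map(cluster_songs).reindex(heard.index)
    # Step 3: set difference per user.
    return pd.Series(
        [a - h for a, h in zip(all_songs, heard)],
        index=heard.index, name='not heard',
    )

# Same values as the setup tables.
df1 = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'C', 'D', 'C', 'E', 'D', 'E', 'F', 'F'],
    'song': [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77],
})
df2 = pd.DataFrame({'user': list('ABCDEF'), 'cluster': [1, 2, 3, 1, 2, 3]})
df3 = pd.DataFrame({
    'cluster': [1, 2, 3],
    'songs': [[11, 22, 33, 44], [11, 99, 66, 55], [11, 66, 88, 77]],
})

result = unheard_songs(df1, df2, df3)
print(result)
```

Returning a Series keyed by user keeps the helper easy to join back onto other per-user data; the intermediate `heard` and `all` columns from the answer are dropped here for brevity.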
