
Replace Column Values In Spark Dataframe Based On Dictionary Similar To Np.where

My data frame looks like:

    no  city     amount
    1   Kenora   56%
    2   Sudbury  23%
    3   Kenora   71%
    4   Sudbury  4

Solution 1:

The problem is that mapping_expr will return null for any city that is not contained in city_dict. A quick fix is to use coalesce to return the city if the mapping_expr returns a null value:

from itertools import chain
from pyspark.sql.functions import coalesce, create_map, lit

# city_dict holds the replacements; the entries below are inferred
# from the output shown (Kenora and Niagara both map to X)
city_dict = {'Kenora': 'X', 'Niagara': 'X'}

# build a literal map expression from the dictionary
mapping_expr = create_map([lit(x) for x in chain(*city_dict.items())])

# lookup and replace, falling back to the original city on a miss
df1 = df.withColumn('new_city', coalesce(mapping_expr[df['city']], df['city']))
df1.show()
#+---+--------+------+--------+
#| no|    city|amount|new_city|
#+---+--------+------+--------+
#|  1|  Kenora|   56%|       X|
#|  2| Sudbury|   23%| Sudbury|
#|  3|  Kenora|   71%|       X|
#|  4| Sudbury|   41%| Sudbury|
#|  5|  Kenora|   33%|       X|
#|  6| Niagara|   22%|       X|
#|  7|Hamilton|   88%|Hamilton|
#+---+--------+------+--------+

df1.groupBy('new_city').count().show()
#+--------+-----+
#|new_city|count|
#+--------+-----+
#|       X|    4|
#|Hamilton|    1|
#| Sudbury|    2|
#+--------+-----+

The above method will fail, however, if one of the replacement values is null.
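To see why, here is a minimal pure-Python sketch of the coalesce semantics (an illustrative analogue, not PySpark itself):

```python
def coalesce(*values):
    # Mimic SQL COALESCE: return the first non-null argument.
    return next((v for v in values if v is not None), None)

# Suppose one of the replacement values is null (None):
city_dict = {'Kenora': None}

# The mapping lookup returns None, so coalesce falls back to the
# original city -- the intended null replacement is silently lost.
result = coalesce(city_dict.get('Kenora'), 'Kenora')
print(result)  # 'Kenora', not None
```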

In this case, an easier alternative may be to use pyspark.sql.DataFrame.replace():

First use withColumn to create new_city as a copy of the values from the city column.

df.withColumn("new_city", df["city"])\
    .replace(to_replace=list(city_dict.keys()), value=list(city_dict.values()), subset="new_city")\
    .groupBy('new_city').count().show()
#+--------+-----+
#|new_city|count|
#+--------+-----+
#|       X|    4|
#|Hamilton|    1|
#| Sudbury|    2|
#+--------+-----+
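For comparison with the np.where pattern mentioned in the title, here is a sketch of the same replacement in pandas/numpy (column names and sample rows assumed from the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'no': [1, 2, 3, 4],
    'city': ['Kenora', 'Sudbury', 'Kenora', 'Sudbury'],
    'amount': ['56%', '23%', '71%', '41%'],
})
city_dict = {'Kenora': 'X'}

# np.where handles a single condition: keep the original column value
# wherever the condition is False
df['new_city'] = np.where(df['city'] == 'Kenora', 'X', df['city'])

# map + fillna is the closer analogue of mapping_expr + coalesce,
# since it applies an arbitrary replacement dictionary with a fallback
df['new_city2'] = df['city'].map(city_dict).fillna(df['city'])
```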
