Replace Column Values In Spark Dataframe Based On Dictionary Similar To Np.where
My data frame looks like -

 no      city  amount
  1    Kenora     56%
  2   Sudbury     23%
  3    Kenora     71%
  4   Sudbury     41%
  5    Kenora     33%
  6   Niagara     22%
  7  Hamilton     88%
Solution 1:
The problem is that mapping_expr will return null for any city that is not contained in city_dict. A quick fix is to use coalesce to return the city if mapping_expr returns a null value:
from pyspark.sql.functions import coalesce
# look up each city in the mapping; fall back to the original value
df1 = df.withColumn('new_city', coalesce(mapping_expr[df['city']], df['city']))
df1.show()
#+---+--------+------+--------+
#| no|    city|amount|new_city|
#+---+--------+------+--------+
#|  1|  Kenora|   56%|       X|
#|  2| Sudbury|   23%| Sudbury|
#|  3|  Kenora|   71%|       X|
#|  4| Sudbury|   41%| Sudbury|
#|  5|  Kenora|   33%|       X|
#|  6| Niagara|   22%|       X|
#|  7|Hamilton|   88%|Hamilton|
#+---+--------+------+--------+
df1.groupBy('new_city').count().show()
#+--------+-----+
#|new_city|count|
#+--------+-----+
#|       X|    4|
#|Hamilton|    1|
#| Sudbury|    2|
#+--------+-----+
The above method will fail, however, if one of the replacement values is null. In this case, an easier alternative may be to use pyspark.sql.DataFrame.replace():

First use withColumn to create new_city as a copy of the values from the city column.
df.withColumn("new_city", df["city"])\
    .replace(to_replace=list(city_dict.keys()), value=list(city_dict.values()), subset="new_city")\
    .groupBy('new_city').count().show()
#+--------+-----+
#|new_city|count|
#+--------+-----+
#|       X|    4|
#|Hamilton|    1|
#| Sudbury|    2|
#+--------+-----+