Tokenizing And Ranking A String Column Into Multiple Columns In PySpark
I have a PySpark DataFrame with a string column that contains a comma-separated, unsorted list of values (up to 5 values), like this:

+----+----------------------+
|col1|col2
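(The example table above is cut off in the original post; a purely hypothetical input of the same shape, using the category values from the answer below, might look like:)

+----+----------------------+
|col1|col2                  |
+----+----------------------+
|id1 |c3,a1                 |
|id2 |b2,e4,d1              |
+----+----------------------+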
Solution 1:
I found a solution for this. We can use a UDF that reorders the tokens in the column according to which predefined set each one belongs to, so that every category always lands in the same position. Then we split the normalized string and turn each position into its own column.
# Each set defines one category: tokens from set1 go to position 0,
# set2 to position 1, and so on.
set1 = set(['a1', 'a2', 'a3', 'a4', 'a5'])
set2 = set(['b1', 'b2', 'b3', 'b4', 'b5'])
set3 = set(['c1', 'c2', 'c3', 'c4', 'c5'])
set4 = set(['d1', 'd2', 'd3', 'd4', 'd5'])
set5 = set(['e1', 'e2', 'e3', 'e4', 'e5'])

def sortCategories(x):
    # One fixed slot per category; categories missing from the input stay 'unknown'.
    resultArray = ['unknown' for i in range(5)]
    tokens = x.split(',')
    for token in tokens:
        if token in set1:
            resultArray[0] = token
        elif token in set2:
            resultArray[1] = token
        elif token in set3:
            resultArray[2] = token
        elif token in set4:
            resultArray[3] = token
        elif token in set5:
            resultArray[4] = token
    return ','.join(resultArray)
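For example, calling the function directly on a hypothetical input shows the normalization; any category that is absent keeps the 'unknown' placeholder:

print(sortCategories('c3,a1'))
# prints: a1,unknown,c3,unknown,unknown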
from pyspark.sql.functions import udf, split, col
from pyspark.sql.types import StringType

# Register the sorter as a string-returning UDF and normalize col2 in place.
orderUdfString = udf(sortCategories, StringType())
df = df.withColumn('col2', orderUdfString('col2'))

# Split the normalized string and promote each fixed position to its own column.
df = df.withColumn('col_temp', split('col2', ',')) \
    .select([col(c) for c in df.columns] + [col('col_temp')[i].alias('col' + str(i + 1)) for i in range(5)])
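Putting it together, here is a minimal end-to-end sketch, assuming an active SparkSession named spark and hypothetical sample rows; I alias the output columns cat1..cat5 instead of col1..col5 so they don't clash with the existing col1 and col2:

# Hypothetical sample data, for illustration only.
data = [('id1', 'c3,a1'), ('id2', 'b2,e4,d1')]
df = spark.createDataFrame(data, ['col1', 'col2'])

df = df.withColumn('col2', orderUdfString('col2'))
df = df.withColumn('col_temp', split('col2', ',')) \
    .select([col(c) for c in df.columns] + [col('col_temp')[i].alias('cat' + str(i + 1)) for i in range(5)])
df.show(truncate=False)
# id1 -> cat1=a1, cat3=c3, the rest 'unknown'
# id2 -> cat2=b2, cat4=d1, cat5=e4, the rest 'unknown'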