
Tokenizing And Ranking A String Column Into Multiple Columns In Pyspark

I have a PySpark dataframe with a string column containing a comma-separated, unsorted list of up to 5 values, like this:

+----+----------------------+
|col1|col2

Solution 1:

I found a solution for this. We can use a UDF that sorts the list of strings in that column based on membership in predefined sets. Then we split the sorted string and create a separate column for each position.

set1 = set(['a1', 'a2', 'a3', 'a4', 'a5'])
set2 = set(['b1', 'b2', 'b3', 'b4', 'b5'])
set3 = set(['c1', 'c2', 'c3', 'c4', 'c5'])
set4 = set(['d1', 'd2', 'd3', 'd4', 'd5'])
set5 = set(['e1', 'e2', 'e3', 'e4', 'e5'])

def sortCategories(x):
    # One slot per category set; slots with no match stay 'unknown'
    resultArray = ['unknown'] * 5
    tokens = x.split(',')
    for token in tokens:
        if token in set1:
            resultArray[0] = token
        elif token in set2:
            resultArray[1] = token
        elif token in set3:
            resultArray[2] = token
        elif token in set4:
            resultArray[3] = token
        elif token in set5:
            resultArray[4] = token
    return ','.join(resultArray)

from pyspark.sql.functions import udf, split, col
from pyspark.sql.types import StringType

orderUdfString = udf(sortCategories, StringType())
df = df.withColumn('col2', orderUdfString('col2'))
df = df.withColumn('col_temp', split('col2', ',')) \
  .select([col(c) for c in df.columns] + [col('col_temp')[i].alias('col' + str(i + 1)) for i in range(5)])
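Since sortCategories is plain Python, its slotting logic can be sanity-checked outside Spark before wrapping it in a UDF. Here is a minimal, self-contained sketch (the input strings are made-up examples):

```python
# Category sets as defined in the solution above
set1 = {'a1', 'a2', 'a3', 'a4', 'a5'}
set2 = {'b1', 'b2', 'b3', 'b4', 'b5'}
set3 = {'c1', 'c2', 'c3', 'c4', 'c5'}
set4 = {'d1', 'd2', 'd3', 'd4', 'd5'}
set5 = {'e1', 'e2', 'e3', 'e4', 'e5'}

def sortCategories(x):
    # One slot per category set; slots with no match stay 'unknown'
    resultArray = ['unknown'] * 5
    for token in x.split(','):
        if token in set1:
            resultArray[0] = token
        elif token in set2:
            resultArray[1] = token
        elif token in set3:
            resultArray[2] = token
        elif token in set4:
            resultArray[3] = token
        elif token in set5:
            resultArray[4] = token
    return ','.join(resultArray)

# Values land in their category's slot regardless of input order
print(sortCategories('c3,a1,e2'))  # a1,unknown,c3,unknown,e2
print(sortCategories('b2'))        # unknown,b2,unknown,unknown,unknown
```

Because the UDF always returns exactly 5 comma-separated slots, the later split into col1..col5 is safe even when fewer than 5 values are present.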
