Pyspark Converting An Array Of Struct Into String
I have the following dataframe in Pyspark:

+----+-------+-----+
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+

I want to combine each name's subject/score pairs into a single string column, e.g. (math, 90) | (physics, 70).
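For reproducibility, here is a minimal sketch of how the sample data could be recreated (the rows are inferred from the outputs shown in the solutions below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows inferred from the expected output in the answers below
df = spark.createDataFrame(
    [("Tom", "math", 90), ("Tom", "physics", 70), ("Amy", "math", 95)],
    ["name", "subject", "score"],
)
df.show()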
Solution 1:
Per your Update and comment, for Spark 2.4.0+, here is one way to stringify an array of structs with the Spark SQL builtin functions transform and array_join:
>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- score_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- score: integer (nullable = true)

>>> df.show(2,0)
+----+---------------------------+
|name|score_list                 |
+----+---------------------------+
|Tom |[[math, 90], [physics, 70]]|
|Amy |[[math, 95]]               |
+----+---------------------------+

>>> df.selectExpr(
      "name"
    , """
        array_join(
            transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
          , ' | '
        ) AS score_list
      """
  ).show(2,0)
+----+--------------------------+
|name|score_list                |
+----+--------------------------+
|Tom |(math, 90) | (physics, 70)|
|Amy |(math, 95)                |
+----+--------------------------+
Where:
- transform() converts the array of structs into an array of strings: for each array element (the struct x), concat('(', x.subject, ', ', x.score, ')') turns it into a string.
- array_join() then joins all of the array elements (now StringType) with the delimiter ' | ', which produces the final string.
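If you prefer the DataFrame API over selectExpr, a rough equivalent could look like the sketch below. Note this is an assumption-laden variant, not the answer's original code: the Python transform() that accepts a lambda requires Spark 3.1+ (on 2.4 you can keep the SQL snippet inside F.expr instead), and the score column is cast to string explicitly to be safe.

from pyspark.sql import functions as F

# assumes df has columns name and score_list (array<struct<subject,score>>)
result = df.select(
    "name",
    F.array_join(
        F.transform(
            "score_list",
            lambda x: F.concat(
                F.lit("("), x["subject"], F.lit(", "), x["score"].cast("string"), F.lit(")")
            ),
        ),
        " | ",
    ).alias("score_list"),
)
result.show(truncate=False)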
Solution 2:
The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless you can modify the solutions to fit your desired output quite easily.
Just replace the struct with concat_ws. Also use concat to add opening and closing parentheses to get the output you desire.
from pyspark.sql.functions import collect_list, concat, concat_ws, lit

# build a "(subject, score)" string per row, collect them per name,
# then join the collected strings with " | "
df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ",
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)
#+----+--------------------------+
#|name|score_list                |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95)                |
#+----+--------------------------+
Note that since the comma appears in the score_list column, this value will be quoted when you write to csv if you use the default arguments.
For example:
df.coalesce(1).write.csv("test.csv")
Would produce the following output file:
Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
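To check that the quoting round-trips, one option (a sketch, assuming the default reader options and that no header was written) is to read the file back:

# with the default quote character ("), the quoted value is parsed back
# into a single column; column names are assigned manually since no header
# was written
df_back = spark.read.csv("test.csv").toDF("name", "score_list")
df_back.show(truncate=False)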