Spark: How To Correctly Transform Dataframe By Mapinpandas
I'm trying to transform a Spark DataFrame with 10k rows using the latest Spark 3.0.1 function mapInPandas. Expected output: the mapped pandas_function() transforms one row into three, so output t
Solution 1:
Sorry that in my answer to your previous question, the part that used mapInPandas
was incorrect. I think the function below is the correct way to write the pandas function. I made a mistake last time because I thought the iterator argument
was an iterable of rows, but it's actually an iterable of pandas DataFrames.
import pandas as pd

def pandas_function(iterator):
    for df in iterator:  # iterator yields pandas DataFrames, not rows
        yield pd.concat(pd.DataFrame(x) for x in df['content'].map(eval))
(PS Thanks to answer from here.)
Solution 2:
Actually, there is a tool that lets you stop inside a UDF and debug it in VS Code: check out the pyspark_xray library. Its demo app shows how to use pyspark_xray's wrapper_sdf_mapinpandas
function to step into a Pandas UDF that is passed into the mapInPandas
function.
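Even without an extra library, you can often debug this kind of Pandas UDF by calling it directly on an iterator of plain pandas DataFrames, exactly the shape mapInPandas would feed it. This is a generic sketch, not pyspark_xray's mechanism, and the sample data is invented; breakpoints inside the loop then work like in any local Python code.

```python
import pandas as pd

def pandas_function(iterator):
    for df in iterator:
        yield pd.concat(pd.DataFrame(x) for x in df["content"].map(eval))

# Simulate the input mapInPandas would provide: an iterator of DataFrames.
local = pd.DataFrame({"content": ["[{'a': 1}, {'a': 2}]"]})
result = pd.concat(pandas_function(iter([local])))
```

Here `result` is an ordinary pandas DataFrame (two rows, column `a`), so you can inspect intermediate values before running the UDF on the cluster.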