
Spark: How to Correctly Transform a DataFrame with mapInPandas

I'm trying to transform a Spark DataFrame with 10k rows using the mapInPandas function introduced in Spark 3.0.1. Expected output: the mapped pandas_function() transforms one row into three, so the output should contain 30k rows.

Solution 1:

Sorry, the mapInPandas part of my answer to your previous question was incorrect. I think the function below is the correct way to write the pandas function. I made a mistake last time because I thought iter was an iterable of rows, but it's actually an iterable of DataFrames.

import pandas as pd

def pandas_function(iter):
    # iter is an iterator of pandas DataFrames, one per batch of rows
    for df in iter:
        # each value in 'content' evaluates to a list of records;
        # build a DataFrame per input row and concatenate them
        yield pd.concat(pd.DataFrame(x) for x in df['content'].map(eval))

(PS: thanks to the answer from here.)
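To see the one-row-to-three expansion without spinning up Spark, the generator can be exercised directly on a pandas DataFrame, since mapInPandas passes the function exactly such an iterator of DataFrames. This is a minimal sketch: the 'content' value and the column name 'a' are hypothetical stand-ins for the asker's data.

```python
import pandas as pd

def pandas_function(batches):
    # mapInPandas hands the function an iterator of pandas DataFrames
    # (one per batch), not an iterator of rows
    for df in batches:
        yield pd.concat(pd.DataFrame(x) for x in df["content"].map(eval))

# Hypothetical batch: one row whose 'content' column holds a stringified
# list of three records, so one input row becomes three output rows.
batch = pd.DataFrame({"content": ["[{'a': 1}, {'a': 2}, {'a': 3}]"]})
out = list(pandas_function(iter([batch])))
print(len(out[0]))  # 3 output rows from 1 input row
```

On the Spark side this would be invoked as something like sdf.mapInPandas(pandas_function, schema="a long"), where the schema describes the columns of the transformed rows (the column name here is an assumption, not from the original question).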

Solution 2:

Actually, there is a tool that enables you to stop inside a UDF and debug it in VS Code: check out the pyspark_xray library. Its demo app demonstrates how to use pyspark_xray's wrapper_sdf_mapinpandas function to step into a Pandas UDF that is passed into the mapInPandas function.
