
What Is The Right Way To Save\load Models In Spark\pyspark

I'm working with Spark 1.3.0 using PySpark and MLlib, and I need to save and load my models. I use code like this (taken from the official documentation): from pyspark.mllib.recomme…

Solution 1:

One way to save a model (shown in Scala; the approach is probably similar in Python):

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("linReg.model")

The saved model can then be loaded back as:

val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()

See also this related question; for more details see (ref)
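The object-file approach above is just ordinary serialization: write the model object out, read it back, and take the first element. The same round-trip can be sketched in plain Python with pickle (the LinearModel class here is a hypothetical stand-in for a trained model, not a real MLlib class; note that JVM-backed MLlib models in Spark 1.3 do not pickle cleanly, which is exactly the bug the other answers discuss):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: just weights and an intercept.
class LinearModel:
    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.intercept

model = LinearModel([0.5, -1.2], 0.3)

# Persist to disk -- analogous to saveAsObjectFile.
path = os.path.join(tempfile.mkdtemp(), "linReg.model")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back -- analogous to objectFile(...).first().
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict([1.0, 1.0]))  # same prediction as the original model
```

This only works when the model object itself is picklable, which is why the pull request mentioned in Solution 2 mattered.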

Solution 2:

As of this pull request, merged on Mar 28, 2015 (a day after your question was last edited), this issue has been resolved.

You just need to clone/fetch the latest version from GitHub (git clone git://github.com/apache/spark.git -b branch-1.3) then build it (following the instructions in spark/README.md) with $ mvn -DskipTests clean package.

Note: I ran into trouble building Spark because Maven was being wonky. I resolved that issue by using $ update-alternatives --config mvn and selecting the 'path' that had Priority: 150, whatever that means. Explanation here.

Solution 3:

I ran into this too -- it looks like a bug. I have reported it on the Spark JIRA.

Solution 4:

Use a Pipeline in spark.ml to train the model, then use MLWriter and MLReader to save the models and read them back.

from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

pipeTrain.write().overwrite().save(outpath)
model_in = PipelineModel.load(outpath)
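Under the hood, the MLWriter/MLReader pattern persists a model's parameters and metadata to a directory and reconstructs the object on load. A toy sketch of that pattern in plain Python (ToyModel and its save/load methods are hypothetical illustrations, not the real pyspark.ml API):

```python
import json
import os
import tempfile

class ToyModel:
    """Hypothetical model following the save-to-a-path / load-from-a-path pattern."""

    def __init__(self, params):
        self.params = params

    def save(self, path):
        # MLWriter-style: write params as metadata under the given directory.
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "metadata.json"), "w") as f:
            json.dump({"class": "ToyModel", "params": self.params}, f)

    @classmethod
    def load(cls, path):
        # MLReader-style: read the metadata back and rebuild the object.
        with open(os.path.join(path, "metadata.json")) as f:
            meta = json.load(f)
        return cls(meta["params"])

outpath = os.path.join(tempfile.mkdtemp(), "toy.model")
model = ToyModel({"maxIter": 10, "regParam": 0.01})
model.save(outpath)
model_in = ToyModel.load(outpath)
print(model_in.params)
```

Because everything needed to rebuild the model lives under one path, the saved directory can be copied between clusters or to HDFS, which is the same property that makes PipelineModel.load(outpath) work.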
