
Proper Handling Of Spark Broadcast Variables In A Python Class

I've been implementing a model with Spark via a Python class. I had some headaches calling class methods on an RDD defined in the class (see this question for details), but finally got it working.

Solution 1:

Assuming that the variables you use here are simply scalars, there is probably nothing to gain from a performance perspective, and using broadcast variables will make your code less readable. But you can either pass a broadcast variable as an argument to the static method:

class model(object):
    @staticmethod
    def foobar(a_model, mu):
        # Copy the scalar out of the instance so the closure below
        # does not capture self (which holds the SparkContext).
        y = a_model.y
        def _foobar(x):
            return x - mu.value + y
        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        mu = self.sc.broadcast(self.get_mean())
        self.data = self.rdd.map(model.foobar(self, mu))

or initialize it there:

class model(object):
    @staticmethod
    def foobar(a_model):
        # Create the broadcast variable inside the static method instead
        # of passing it in; the closure still captures only scalars.
        mu = a_model.sc.broadcast(a_model.get_mean())
        y = a_model.y
        def _foobar(x):
            return x - mu.value + y
        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        self.data = self.rdd.map(model.foobar(self))

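To check the wiring without a cluster, one can run the second variant against minimal stand-ins. `FakeContext`, `FakeRDD`, and `FakeBroadcast` below are hypothetical stubs that mimic only the `parallelize`, `mean`, `map`, `broadcast`, and `.value` calls used above; they are not Spark APIs:

```python
class FakeBroadcast:
    """Hypothetical stand-in for a Spark broadcast: exposes .value only."""
    def __init__(self, value):
        self.value = value

class FakeRDD:
    """Hypothetical stand-in exposing just the mean() and map() used above."""
    def __init__(self, data):
        self.data = list(data)
    def mean(self):
        return sum(self.data) / len(self.data)
    def map(self, f):
        return FakeRDD(f(x) for x in self.data)

class FakeContext:
    def parallelize(self, data):
        return FakeRDD(data)
    def broadcast(self, value):
        return FakeBroadcast(value)

# The second variant of the class, unchanged except for running on the stubs:
class model(object):
    @staticmethod
    def foobar(a_model):
        mu = a_model.sc.broadcast(a_model.get_mean())
        y = a_model.y
        def _foobar(x):
            return x - mu.value + y
        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        self.data = self.rdd.map(model.foobar(self))

m = model(FakeContext())
m.run_foobar()
# mean([1, 2, 3]) = 2.0 and y = -1, so each x maps to x - 2.0 - 1
print(m.data.data)  # [-2.0, -1.0, 0.0]
```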