Proper Handling Of Spark Broadcast Variables In A Python Class
I've been implementing a model with Spark via a Python class. I had some headaches calling class methods on an RDD defined in the class (see this question for details), but finally got that working.
Solution 1:
Assuming that the variables you use here are simple scalars, there is probably nothing to gain from a performance perspective, and using broadcast variables will make your code less readable. Still, you can either pass a broadcast variable as an argument to the static method:
class model(object):
    @staticmethod
    def foobar(a_model, mu):
        y = a_model.y  # copy the scalar so the closure captures y, not a_model
        def _foobar(x):
            return x - mu.value + y
        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        mu = self.sc.broadcast(self.get_mean())
        self.data = self.rdd.map(model.foobar(self, mu))
or initialize it there:
class model(object):
    @staticmethod
    def foobar(a_model):
        mu = a_model.sc.broadcast(a_model.get_mean())
        y = a_model.y
        def _foobar(x):
            return x - mu.value + y
        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        self.data = self.rdd.map(model.foobar(self))