
Distribute Many Independent, Expensive Operations Over Multiple Cores In Python

Given a large list (1,000+) of completely independent objects that each need to be manipulated through some expensive function (~5 minutes each), what is the best way to distribute the work over multiple cores?

Solution 1:

This sounds like a good use case for a multiprocessing.Pool; depending on exactly what you're doing, it could be as simple as

import multiprocessing

pool = multiprocessing.Pool(num_procs)             # num_procs = number of worker processes
results = pool.map(the_function, list_of_objects)  # blocks until every object is processed
pool.close()

This will pickle each object in the list independently. If that's a problem, there are various ways to work around it (each with its own drawbacks, and I don't know whether any of them work on Windows). Since your computation times are fairly long, the pickling overhead is probably irrelevant.

Since you're running this for 5 minutes × 1,000 items, that's roughly 3.5 days of compute divided by the number of cores, so you probably want to save partial results along the way and print some progress information. The easiest approach is probably to have the function you call save its results to a file or database itself; if that's not practical, you could instead call apply_async in a loop and handle the results as they come in, as in the sketch below.
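Something like this, reusing num_procs, the_function and list_of_objects from the snippet above; save_result() is a hypothetical helper that appends each result to a file as it arrives:

import multiprocessing

def save_result(result):
    # called in the parent process as each item finishes; keep it quick,
    # and append to a file/database so partial progress survives a crash
    with open("results.txt", "a") as f:
        f.write(repr(result) + "\n")

if __name__ == "__main__":
    pool = multiprocessing.Pool(num_procs)
    for obj in list_of_objects:
        pool.apply_async(the_function, args=(obj,), callback=save_result)
    pool.close()
    pool.join()   # wait for every outstanding task to finish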

You could also look into something like joblib to handle this for you; I'm not very familiar with it, but it seems to be aimed at exactly this kind of problem. A rough sketch is below.
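A rough sketch of the joblib version, again reusing the names from the question:

from joblib import Parallel, delayed

# n_jobs controls the number of worker processes; verbose prints
# periodic progress messages as items complete
results = Parallel(n_jobs=num_procs, verbose=10)(
    delayed(the_function)(obj) for obj in list_of_objects
)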

Solution 2:

If you want to run the job on a single computer, use multiprocessing.Pool() as suggested by @Dougal in his answer.

If you would like to get multiple computers working on the problem, Python can do that too. I did a Google search for "python parallel processing" and found this:

Parallel Processing in python

One of the answers recommends "mincemeat", a map/reduce solution in a single 377-line Python source file!

https://github.com/michaelfairley/mincemeatpy

I'll bet, with a little bit of work, you could use multiprocessing.Pool() to spin up a set of mincemeat clients, if you wanted to use multiple cores on multiple computers.

EDIT: I did some more research tonight, and it looks like Celery would be a good choice. Celery will already run multiple workers per machine.

http://www.celeryproject.org/

Celery was recommended here:

https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
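A minimal sketch of what the Celery version might look like, assuming a local Redis broker is already running (the broker URL and the run_expensive task name are placeholders, not anything from the question):

# tasks.py
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",    # placeholder broker URL
             backend="redis://localhost:6379/0")   # store results so .get() works

@app.task
def run_expensive(obj):
    # the ~5-minute computation from the question; arguments must be
    # serializable (JSON by default)
    return the_function(obj)

Start a worker on each machine with something like "celery -A tasks worker", then queue the work and collect results from any machine that can reach the broker:

async_results = [run_expensive.delay(obj) for obj in list_of_objects]
results = [r.get() for r in async_results]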
