
Python How To Download Multiple Files In Parallel Using Multiprocessing.pool

I am trying to download and extract zip files using multiprocessing.Pool. But every time I execute the script, only 3 zips are downloaded and the remaining files are not seen in the

Solution 1:

I've made a few minor tweaks to your function and it works fine. Please note that:

  1. The file ".../movielists_20130821.zip" appears in your list twice, so you're downloading the same thing twice (maybe a typo?).
  2. The files ".../multiview_data_20130124.zip", ".../movielists_20130821.zip" and ".../3sources.zip", when extracted, each yield a new directory. The file ".../bbcsport.zip", though, places its files directly in the destination folder rather than in a subdirectory (see image below). Maybe you missed this check?
  3. I added a try/except block to the download function. Why? Multiprocessing runs the function in separate worker processes. With pool.map, an exception raised in a worker is re-raised in the parent when map returns, so one failed download makes the whole call fail and you lose the results of the rest. If any error occurs in a worker, it is easier to log/handle it there.
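
To illustrate point 3, here is a minimal sketch of a wrapper that makes each worker report its outcome instead of raising. The names run_safely, run_all and parse_positive are made up for this example (parse_positive is only a stand-in for a download that can fail); Pool and partial are the standard-library pieces.

```python
from functools import partial
from multiprocessing import Pool


def run_safely(task, item):
    """Run task(item) and return (item, result, error) instead of raising."""
    try:
        return (item, task(item), None)
    except Exception as exc:  # broad on purpose: any failure is reported
        return (item, None, "{}: {}".format(type(exc).__name__, exc))


def parse_positive(text):
    """Hypothetical stand-in for a download: raises on bad input."""
    value = int(text)
    if value <= 0:
        raise ValueError("not positive: {}".format(value))
    return value


def run_all(task, items, processes=2):
    """Map task over items in a process pool; failures come back as data."""
    with Pool(processes) as pool:
        # partial() of a top-level function is picklable, so it can be sent to workers
        return pool.map(partial(run_safely, task), items)
```

Called under an `if __name__ == "__main__":` guard, `run_all(parse_positive, ["1", "-2", "x"])` should return one tuple per input, in input order, with the error string filled in for the two inputs that fail.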

import os
import zipfile
import requests
from multiprocessing import Pool, cpu_count
from functools import partial
from io import BytesIO


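def download_zip(url, filePath):
    file_name = url.split("/")[-1]
    try:
        response = requests.get(url)
        # Fail on HTTP errors instead of trying to unzip an error page
        response.raise_for_status()
        print(" Downloaded {} ".format(file_name))
        # The context manager closes the archive even if extraction fails
        with zipfile.ZipFile(BytesIO(response.content)) as sourceZip:
            sourceZip.extractall(filePath)
        print(" extracted {}".format(file_name))
    except Exception as e:
        # Handle the error here, in the worker, and say which file failed
        print(" Failed {}: {}".format(file_name, e))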


if __name__ == "__main__":
    filePath = os.path.dirname(os.path.abspath(__file__))
    print("filePath is %s " % filePath)
    # sys.path.append(filePath) # why do you need this?
    urls = ["http://mlg.ucd.ie/files/datasets/multiview_data_20130124.zip",
            "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
            "http://mlg.ucd.ie/files/datasets/bbcsport.zip",
            "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
            "http://mlg.ucd.ie/files/datasets/3sources.zip"]

    print("There are {} CPUs on this machine ".format(cpu_count()))
    pool = Pool(cpu_count())
    download_func = partial(download_zip, filePath = filePath)
    results = pool.map(download_func, urls)
    pool.close()
    pool.join()

[Image: screenshot of the extracted files]
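
If the mixed layout from point 2 is a problem, one possible check is to look at the archive's entry names before extracting. This is only a sketch: has_single_top_level_dir and extract_tidily are hypothetical helpers, not part of zipfile, and the rule "give flat archives a folder named after the zip" is my assumption about what you want.

```python
import os
import zipfile


def has_single_top_level_dir(archive):
    """Return True if every entry sits under one shared top-level folder."""
    roots = set()
    for name in archive.namelist():
        first, sep, _rest = name.partition("/")
        if not sep:
            return False  # this entry sits at the archive root
        roots.add(first)
    return len(roots) == 1


def extract_tidily(archive, dest_dir, archive_name):
    """Extract into dest_dir, adding a folder named after the zip if it has none."""
    if has_single_top_level_dir(archive):
        target = dest_dir
    else:
        # e.g. "bbcsport.zip" -> <dest_dir>/bbcsport
        target = os.path.join(dest_dir, os.path.splitext(archive_name)[0])
    archive.extractall(target)
    return target
```

In the download function this would replace the plain extractall call, e.g. `extract_tidily(sourceZip, filePath, file_name)`.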


Solution 2:

I suggest using multithreading instead, since this task is I/O-bound, like the following:

import os
import requests, zipfile, io
import concurrent.futures

urls = ["http://mlg.ucd.ie/files/datasets/multiview_data_20130124.zip",
   "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
   "http://mlg.ucd.ie/files/datasets/bbcsport.zip",
   "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
   "http://mlg.ucd.ie/files/datasets/3sources.zip"]

# Assumption: extract next to this script, as in Solution 1
filePath = os.path.dirname(os.path.abspath(__file__))

def download_zip(url):
   file_name = url.split("/")[-1]
   response = requests.get(url)
   sourceZip = zipfile.ZipFile(io.BytesIO(response.content))
   print("\n Downloaded {} ".format(file_name))
   sourceZip.extractall(filePath)
   print("extracted {} \n".format(file_name))
   sourceZip.close()

with concurrent.futures.ThreadPoolExecutor() as executor:
   # list() consumes the results, so an exception raised in a thread surfaces here
   list(executor.map(download_zip, urls))
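
A possible variation, if you want to know which URL failed without stopping the others: submit each task separately and collect results as they complete. This is a sketch, not a drop-in replacement. run_in_threads is a made-up helper name, and dropping duplicate items is my own choice (the list above contains one URL twice).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_threads(task, items, max_workers=4):
    """Run task(item) for each distinct item; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # dict.fromkeys() drops duplicate items while keeping their order
        futures = {executor.submit(task, item): item
                   for item in dict.fromkeys(items)}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Keep the exception so the caller can see what failed and why
                errors[item] = exc
    return results, errors
```

With the download function from this answer, `results, errors = run_in_threads(download_zip, urls)` would leave any failed URLs as keys in `errors`.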
