
How To Read A File In Chunks/slices Using Python Multithreading And Without Using Locks

I am trying to read a file using multiple threads. I want to divide the file into chunks so that each thread can act separately on its own chunk, eliminating the need for a lock.

Solution 1:

There is no way to count the lines in a file without reading it (you could mmap it to allow the virtual memory subsystem to page out data under memory pressure, but you still have to read the whole file in to find the newlines). If chunks are defined as lines, you're stuck; the file must be read in one way or another to do it.

If chunks can be fixed size blocks of bytes (which may begin and end in the middle of a line), it's easier, but you need to clarify.
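If fixed-size blocks do fit your problem, here is a minimal sketch of that approach (read_in_chunks and process_chunk are hypothetical names, not from the original answer; each thread opens its own file handle, so nothing is shared and no lock is needed):

import os
import threading

def read_in_chunks(filename, process_chunk, num_threads=4):
    # Give each thread a fixed byte range of the file. Chunks may
    # begin and end mid-line; process_chunk must tolerate that.
    size = os.path.getsize(filename)
    chunk = size // num_threads

    def worker(offset, length):
        with open(filename, 'rb') as f:  # private handle per thread
            f.seek(offset)
            process_chunk(f.read(length))

    threads = []
    for i in range(num_threads):
        offset = i * chunk
        # the last thread also picks up any remainder bytes
        length = (size - offset) if i == num_threads - 1 else chunk
        t = threading.Thread(target=worker, args=(offset, length))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()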

Alternatively, if neighboring lines aren't important to one another, then instead of chunking, hand out lines round-robin or use a producer/consumer approach (where threads pull new data as it becomes available, rather than having it distributed by fiat), so the work is naturally distributed evenly; a sketch follows.
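If you roll the producer/consumer pattern yourself with threads, a minimal sketch might look like this (process_lines and handle_line are hypothetical names; queue.Queue handles its synchronization internally, so no explicit lock appears in your code):

import queue
import threading

def process_lines(filename, handle_line, num_workers=4):
    q = queue.Queue(maxsize=1000)  # bounded, so the reader can't outrun the workers
    sentinel = object()            # unique marker telling each worker to stop

    def worker():
        while True:
            line = q.get()
            if line is sentinel:
                break
            handle_line(line)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()

    with open(filename) as f:
        for line in f:             # producer: the main thread reads the file
            q.put(line)            # workers pull lines as they become available

    for _ in workers:              # one sentinel per worker
        q.put(sentinel)
    for t in workers:
        t.join()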

multiprocessing.Pool (or multiprocessing.dummy.Pool if you must use threads instead of processes) makes this easy. For example:

def somefunctionthatprocessesaline(line):
    ... do stuff with line ...
    return result_of_processing

import multiprocessing

with multiprocessing.Pool() as pool, open(filename) as f:
    results = pool.map(somefunctionthatprocessesaline, f)
... do stuff with results ...

will create a pool of worker processes matching the number of cores you have available, and have the main process feed queues from which each worker pulls lines to process, returning the results in a list for the main process to use. If you want to handle results as they become available (instead of waiting for all of them to be collected in a list, as Pool.map does), you can use Pool.imap or Pool.imap_unordered (depending on whether the results must be handled in the same order the lines appear), like so:

with multiprocessing.Pool() as pool, open(filename) as f:
    for result in pool.imap_unordered(somefunctionthatprocessesaline, f):
        ... do stuff with one result ...
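And since multiprocessing.dummy.Pool was mentioned above for the threaded case, the same loop works with that swapped in (a sketch reusing filename and somefunctionthatprocessesaline from before; with threads, the GIL means this mostly helps when the per-line work is I/O-bound):

import multiprocessing.dummy

with multiprocessing.dummy.Pool() as pool, open(filename) as f:
    for result in pool.imap_unordered(somefunctionthatprocessesaline, f):
        ...  # do stuff with one result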
