How To Read A File In Chunks/Slices Using Python Multithreading And Without Using Locks
Solution 1:
There is no way to count the lines in a file without reading it (you could mmap
it to allow the virtual memory subsystem to page out data under memory pressure, but you still have to read the whole file in to find the newlines). If chunks are defined as lines, you're stuck; the file must be read in one way or another to do it.
If chunks can be fixed-size blocks of bytes (which may begin and end in the middle of a line), it's easier, but you'd need to clarify whether that is acceptable for your processing.
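If fixed-size byte blocks are acceptable, a minimal sketch of that approach might look like the following. Each worker seeks to its own offset in its own file handle, so no locking is needed; the block size, the process_block name, and returning len(data) as a stand-in for real work are all illustrative assumptions, not part of the original answer:

import os
from multiprocessing.dummy import Pool  # thread-backed pool, per the question's multithreading requirement

BLOCK_SIZE = 1024 * 1024  # 1 MiB per chunk; purely illustrative, tune for your workload

def process_block(args):
    # Hypothetical worker: open the file independently and read one block at a given offset,
    # so threads never share a file handle and never need a lock.
    path, offset = args
    with open(path, 'rb') as f:
        f.seek(offset)
        data = f.read(BLOCK_SIZE)
    return len(data)  # stand-in for whatever per-block processing you actually do

def process_file_in_blocks(path):
    size = os.path.getsize(path)
    offsets = [(path, off) for off in range(0, size, BLOCK_SIZE)]
    with Pool() as pool:
        return pool.map(process_block, offsets)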
Alternatively, if neighboring lines aren't important to one another, skip chunking entirely: round-robin the lines or use a producer/consumer approach (where threads pull new data as it becomes available, rather than having work distributed by fiat), so the work is naturally distributed evenly.
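As a rough illustration of the producer/consumer idea using only the standard library: queue.Queue does all of its synchronization internally, so you never take a lock yourself. The worker count, the sentinel convention, and len(line) as placeholder processing are assumptions for the sketch (appending to a shared list also leans on CPython's GIL for safety):

import queue
import threading

def worker(q, results):
    # Pull lines until the producer signals there is no more work.
    while True:
        line = q.get()
        if line is None:           # sentinel: shut this worker down
            q.task_done()
            break
        results.append(len(line))  # stand-in for real per-line processing
        q.task_done()

def process_lines(filename, num_workers=4):
    q = queue.Queue(maxsize=1000)  # bounded, so the reader can't outrun the workers
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    with open(filename) as f:
        for line in f:
            q.put(line)
    for _ in threads:
        q.put(None)                # one sentinel per worker
    for t in threads:
        t.join()
    return results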
multiprocessing.Pool (or multiprocessing.dummy.Pool if you must use threads instead of processes) makes this easy. For example:
import multiprocessing

def somefunctionthatprocessesaline(line):
    # ... do stuff with line ...
    return result_of_processing

with multiprocessing.Pool() as pool, open(filename) as f:
    results = pool.map(somefunctionthatprocessesaline, f)
    # ... do stuff with results ...
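If the work is I/O-bound and you specifically want threads (as the question asks), the same pattern should work unchanged with the thread-backed pool; this is just the example above with the Pool swapped out, reusing the same somefunctionthatprocessesaline and filename:

import multiprocessing.dummy

with multiprocessing.dummy.Pool() as pool, open(filename) as f:
    results = pool.map(somefunctionthatprocessesaline, f)
    # ... do stuff with results ...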
This will create a pool of worker processes matching the number of cores you have available, and have the main process feed queues from which each worker pulls lines to process, returning the results in a list for the main process to use. If you want to handle results from the workers as they become available (instead of waiting for them all to be collected into a list, as Pool.map does), you can use Pool.imap or Pool.imap_unordered (depending on whether the results of processing each line must be handled in the same order the lines appear), like so:
with multiprocessing.Pool() as pool, open(filename) as f:
    for result in pool.imap_unordered(somefunctionthatprocessesaline, f):
        # ... do stuff with one result ...
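Both imap variants also accept a chunksize argument, which batches lines per task dispatch and reduces inter-process overhead. A hedged example using the ordered variant (the value 100 is an arbitrary guess you'd tune for your workload):

with multiprocessing.Pool() as pool, open(filename) as f:
    # chunksize batches lines sent to each worker; results still come back one at a time
    for result in pool.imap(somefunctionthatprocessesaline, f, chunksize=100):
        # ... do stuff with each result, in the original line order ...
        pass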