
Text File Reduction With Randomization In Python

I solved the following problem in bash, but I feel it's quite inefficient and very slow given the size of the files I need to reduce. I was hoping somebody has an idea how to do the same in Python.

Solution 1:

The simple solution would be to sort a file by the key column e.g., sort tab-separated input by the second column:

#!/bin/bash
printf "a\tz\nb\ty\nc\tx" | sort -k 2 -t $'\t'
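
For those three sample rows, the pipeline prints them reordered by the second column:

c	x
b	y
a	z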

Then you are left with a simpler problem: retrieve 25% of lines at random for each unique key, where all lines with equal keys are adjacent, under the constraint that at least one line for each unique key is preserved:

#!/usr/bin/env python
import random
import sys
from itertools import chain, groupby

def choose_random(iterator, fraction, random=random.random):
    """Lazy analog of:

        L = list(iterator)
        k = int(len(L) * fraction + .5) or 1  # get at least one
        result = random.sample(L, k)

    Note: this function doesn't randomize the order of elements;
          that would require keeping the selected elements in memory.
          Also, the number of output elements is not exactly k.
    """
    # always yield at least one item if input is not empty
    item = next(iterator)
    it = (x for x in chain([item], iterator) if random() < fraction)
    for x in chain([next(it, item)], it):
        yield x

def getkey(line):
    return line.split("\t")[1]  # 2nd column

for key, group in groupby(sys.stdin, key=getkey):
    sys.stdout.writelines(choose_random(group, fraction=0.25))

Note: the last line of the input file should end with a newline; otherwise the output is corrupted if that line is chosen.
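
If you can't guarantee a trailing newline, one workaround (not part of the original script) is to normalize each chosen line before writing it, replacing the final loop with something like:

for key, group in groupby(sys.stdin, key=getkey):
    for line in choose_random(group, fraction=0.25):
        # append a newline only if the chosen line lacks one (e.g. the file's last line)
        sys.stdout.write(line if line.endswith("\n") else line + "\n")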

The script accepts input sorted by the key column on stdin and prints the reduced output to stdout. It only needs to keep one line in memory at a time and makes a single pass over the data (O(n)).
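
To run the whole pipeline, sort first and pipe the result into the script (the file names here are placeholders; assume the script above is saved as reduce.py):

sort -k 2 -t $'\t' input.tsv | python reduce.py > reduced.tsv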

Solution 2:

Because your problem is vague, I will give a high-level solution.

  1. Do not read the entire file into memory with fileObj.read() or fileObj.readlines(); rather, iterate through the file with for line in fileObj. Why? This is memory friendly.
  2. Create your own implementation of a Queue based on a list:

    from collections.abc import Iterable
    from queue import Full  # stdlib "queue is full" exception

    class Queue(object):
        def __init__(self, max_size):
            self.queue = []
            self.max_size = max_size
        def __getitem__(self, index):
            if 0 <= index < self.max_size:
                return self.queue[index]
            else:
                raise IndexError
        def __len__(self):
            return len(self.queue)
        def __iter__(self):
            return iter(self.queue)
        def push(self, seq):
            # a sequence of lines extends the queue; a single line is appended
            if isinstance(seq, Iterable) and not isinstance(seq, str):
                if len(self.queue) + len(seq) > self.max_size:
                    raise Full
                self.queue.extend(seq)
            else:
                if len(self.queue) + 1 > self.max_size:
                    raise Full
                self.queue.append(seq)
        def pop(self):
            if self.queue:
                return self.queue.pop(0)
    
  3. Create a dictionary of Queue with a maxsize = 2 * percentage of selected items

Something like

    PCT_SELECTED = 100
    MAXSIZE = 2 * PCT_SELECTED
    KEY_START = 10
    KEY_STOP = 15

    from collections import defaultdict
    # defaultdict needs a zero-argument factory, so wrap Queue in a lambda
    queue_dict = defaultdict(lambda: Queue(MAXSIZE))
  4. Put elements in the Queue in a non-blocking fashion.
  5. If the Queue is full, it will raise the Full exception, in which case you randomly select 50% of the elements from the Queue and discard the rest.

Something like:

import random

with open("your-file") as fin:
    for line in fin:
        key = line[KEY_START:KEY_STOP]
        try:
            queue_dict[key].push(line)
        except Full:
            # keep a random subset of the buffered lines and carry on
            survivors = random.sample(list(queue_dict[key]), PCT_SELECTED)
            queue_dict[key] = Queue(MAXSIZE)
            queue_dict[key].push(survivors)
            queue_dict[key].push(line)
  6. Finally, iterate through the dictionary and trim each Queue randomly:

    queue_dict = {key: random.sample(list(value), min(len(value), PCT_SELECTED))
                  for key, value in queue_dict.items()}
    
  7. Now you can read through the dictionary and write to a file.
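
A minimal sketch of that final write step, assuming the trimmed queue_dict from above and a hypothetical output file name:

    with open("reduced-file", "w") as fout:
        for key, lines in queue_dict.items():
            fout.writelines(lines)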

Solution 3:

For a large number of items, selecting 75% can be done by checking a random number for each one:

import random

with open('input') as f:
    for line in f:
        if random.random() < 0.75:
            print(line, end='')

And if you need to guarantee at least one item from each key (even if it only has two lines):

import random

keys = set()

with open('input') as f:
    for line in f:
        columns = line.split('\t')
        key = columns[0]

        if key not in keys:
            print(line, end='')
            keys.add(key)
            continue

        if random.random() < 0.75:
            print(line, end='')
