Text File Reduction With Randomization In Python
Solution 1:
The simple solution would be to first sort the file by the key column, e.g., to sort tab-separated input by the second column:
    #!/bin/bash
    printf "a\tz\nb\ty\nc\tx\n" | sort -k 2 -t $'\t'
And then solve the simpler problem of keeping 25% of the lines at random for each unique key, where all lines with equal keys are adjacent, with the constraint that at least one line per unique key must be preserved:
    #!/usr/bin/env python
    import random
    import sys
    from itertools import chain, groupby

    def choose_random(iterator, fraction, random=random.random):
        """Lazy analog of:

            L = list(iterator)
            k = int(len(L) * fraction + .5) or 1  # get at least one
            result = random.sample(L, k)

        Note: this function doesn't randomize the order of the elements
        (that would require keeping the selected elements in memory),
        and the number of output elements is not exactly k.
        """
        # always yield at least one item if the input is not empty
        item = next(iterator)
        it = (x for x in chain([item], iterator) if random() < fraction)
        for x in chain([next(it, item)], it):
            yield x

    def getkey(line):
        return line.split("\t")[1]  # 2nd column

    for key, group in groupby(sys.stdin, key=getkey):
        sys.stdout.writelines(choose_random(group, fraction=0.25))
Note: the last line of the input file must end with a newline; otherwise the output is corrupted if that line is chosen.
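If you cannot guarantee the trailing newline, one defensive fix is to normalize each line before writing it (a hypothetical helper, not part of the original script):

```python
def ensure_newline(line):
    # append the missing final newline so a chosen last line
    # does not get glued onto the following output
    return line if line.endswith("\n") else line + "\n"
```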
The script accepts input sorted by the key column on stdin and prints the reduced output to stdout. It only needs to store one line in memory at a time, and it is a single-pass, O(n) algorithm.
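To see the approach end to end without touching stdin, the same choose_random/groupby pipeline can be run over an in-memory list of sorted lines (a minimal sketch; the toy data and fixed seed are made up for illustration):

```python
import random
from itertools import chain, groupby

def choose_random(iterator, fraction, random=random.random):
    # lazily keep each line with probability `fraction`,
    # but always yield at least one line per group
    item = next(iterator)
    it = (x for x in chain([item], iterator) if random() < fraction)
    for x in chain([next(it, item)], it):
        yield x

# toy input, already sorted by the 2nd tab-separated column
lines = ["a\tk1\n", "b\tk1\n", "c\tk1\n", "d\tk2\n"]

random.seed(0)  # fixed seed, for a reproducible run
kept = []
for key, group in groupby(lines, key=lambda line: line.split("\t")[1]):
    kept.extend(choose_random(group, fraction=0.25))
```

Whatever the seed, every unique key survives with at least one line in `kept`.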
Solution 2:
Because your problem is vague, I will give a high-level solution:
- Do not read the entire file into memory with fileObj.read() or fileObj.readlines(); instead, iterate through the file with for line in fileObj. Why? This is memory friendly.
- Create your own implementation of a Queue based on a list:
    from collections.abc import Iterable

    class Full(Exception):
        pass

    class Queue(object):
        def __init__(self, max_size):
            self.queue = []
            self.max_size = max_size

        def __getitem__(self, index):
            if 0 <= index < self.max_size:
                return self.queue[index]
            raise IndexError(index)

        def __iter__(self):
            return iter(self.queue)

        def push(self, seq):
            # strings are iterable too, but a line should count as one item
            if isinstance(seq, Iterable) and not isinstance(seq, str):
                seq = list(seq)
                if len(self.queue) + len(seq) > self.max_size:
                    raise Full
                self.queue.extend(seq)
            else:
                if len(self.queue) + 1 > self.max_size:
                    raise Full
                self.queue.append(seq)

        def pop(self):
            if self.queue:
                return self.queue.pop(0)
Create a dictionary of Queues whose max_size is 2 * the number of items to select per key.
Something like
    PCT_SELECTED = 100
    MAXSIZE = 2 * PCT_SELECTED
    KEY_START = 10
    KEY_STOP = 15

    from collections import defaultdict

    # defaultdict needs a factory callable, not a Queue instance
    queue_dict = defaultdict(lambda: Queue(MAXSIZE))
- Put elements into the Queue in a non-blocking fashion.
- If a Queue is full, push raises the Full exception, in which case you randomly select 50% of the elements from the Queue and discard the rest.
Something like:

    import random

    with open("your-file") as fin:
        for line in fin:
            key = line[KEY_START:KEY_STOP]
            try:
                queue_dict[key].push(line)
            except Full:
                # keep a random subset, discard the rest, then retry the line
                survivors = random.sample(list(queue_dict[key]), PCT_SELECTED)
                queue_dict[key] = Queue(MAXSIZE)
                queue_dict[key].push(survivors)
                queue_dict[key].push(line)
Finally, iterate through the dictionary and trim each Queue randomly:
    queue_dict = {key: random.sample(list(value), min(PCT_SELECTED, len(list(value))))
                  for key, value in queue_dict.items()}
Now you can read through the dictionary and write it to a file.
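The final write-out step might look like this (a sketch; the output filename and the toy dictionary are illustrative, and it assumes each stored line still ends with a newline):

```python
# queue_dict now maps key -> list of sampled lines (toy example)
queue_dict = {"k1": ["a\tk1\n", "b\tk1\n"], "k2": ["c\tk2\n"]}

with open("reduced-file", "w") as fout:
    for key in sorted(queue_dict):        # deterministic output order
        fout.writelines(queue_dict[key])  # lines already end with '\n'
```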
Solution 3:
For a large number of lines, keeping roughly 75% of them can be done by checking a random number for each one.
    import random

    with open('input') as f:
        for line in f:
            if random.random() < 0.75:
                print(line, end='')
And if you need to guarantee at least one line for each key (even if a key only has two lines):
    import random

    keys = set()
    with open('input') as f:
        for line in f:
            columns = line.split('\t')
            key = columns[0]
            if key not in keys:
                print(line, end='')
                keys.add(key)
                continue
            if random.random() < 0.75:
                print(line, end='')
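Wrapped as a generator over any iterable of lines (rather than a file), the per-key guarantee is easy to verify; this is a sketch with illustrative names and toy data:

```python
import random

def reduce_lines(lines, fraction=0.75, key_col=0):
    # always keep the first line seen for each key,
    # then keep later lines with probability `fraction`
    keys = set()
    for line in lines:
        key = line.split("\t")[key_col]
        if key not in keys:
            keys.add(key)
            yield line
        elif random.random() < fraction:
            yield line

data = ["k1\ta\n", "k1\tb\n", "k2\tc\n"]
random.seed(1)  # fixed seed, for a reproducible run
kept = list(reduce_lines(data))
```

Note that because the first line of each key is always kept, the overall fraction retained is slightly above 75%.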