How To Get The Reducer To Emit Only Duplicates
I have a Mapper that is going through lots of data and emitting ID numbers as keys with the value of 1. What I hope to accomplish with the MapReduce job is to get a list of all IDs
Solution 1:
You have made a mistake in a couple of places.
This code:
if last_key and last_key != key:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
should be changed to:
if last_key != key:
    if(tot_cnt > 1):
        sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
You were not checking for tot_cnt > 1.
The last two lines:
if last_key:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
should be changed to:
if last_key and tot_cnt > 1:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
Here again, you were not checking for tot_cnt > 1.
The following modified code works for me:
import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
inData = codecs.getreader('utf-8')(sys.stdin)
(last_key, tot_cnt) = (None, 0)
for line in inData:
    (key, val) = line.strip().split("\t")
    if last_key != key:
        if(tot_cnt > 1):
            sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
        (last_key, tot_cnt) = (key, int(val))
    else:
        (last_key, tot_cnt) = (key, tot_cnt + int(val))
if last_key and tot_cnt > 1:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
I get the following output for your data:
abc 2
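For comparison, the same "emit only keys whose total exceeds 1" logic can be sketched in Python 3 with itertools.groupby, which removes the manual last_key bookkeeping. This is only an illustrative variant, not the code from the question: the function name emit_duplicates and the sample records are hypothetical, and it assumes the input is tab-separated "key<TAB>count" lines already sorted by key, which is how a streaming reducer normally receives them.

```python
from itertools import groupby


def emit_duplicates(lines):
    """Yield (key, total) for every key whose summed count exceeds 1.

    Assumes `lines` are tab-separated "key<TAB>count" records that are
    already sorted by key (hypothetical helper, not from the question).
    """
    # Skip blank lines and split each record into [key, count].
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent equal keys, hence the sorted-input assumption.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(count) for _, count in group)
        if total > 1:
            yield key, total


if __name__ == "__main__":
    # Hypothetical sample input standing in for the mapper's sorted output.
    sample = ["abc\t1\n", "abc\t1\n", "def\t1\n"]
    for key, total in emit_duplicates(sample):
        print("%s\t%d" % (key, total))
```

Passing sys.stdin instead of the sample list would turn this into a streaming reducer. Keys that occur only once never reach the output, because the total > 1 check runs once per group after all of that key's values have been summed.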