Showing posts with the label Hadoop

Hadoop: How To Include Third-Party Library In Python MapReduce

I am writing a MapReduce job in Python and want to use some third-party libraries like chardet. I know tha…

Loading A defaultdict In Hadoop Using pickle And sys.stdin

I posted a similar question about an hour ago, but have since deleted it after realising I was aski…

Hadoop Streaming: Where Are Application Logs?

My question is similar to: hadoop streaming: how to see application logs? (The link in the answer …

Hadoop-streaming: Reduce Task In Pending State Says "no Room For Reduce Task."

My map task completes successfully and I can see the application logs, but the reducer stays in pending…

Why Am I Getting These Strange Connection Errors When Reading Or Writing To Hadoop File System With A Python Script?

I wrote a Python script to read from and write to a Hadoop file system with IP hdfs_ip. It takes 3 argumen…

List All Files In HDFS Python Without Pydoop

I have a Hadoop cluster running on CentOS 6.5. I am currently using Python 2.6. For unrelated reaso…

How To Get The Reducer To Emit Only Duplicates

I have a Mapper that is going through lots of data and emitting ID numbers as keys with the value o…

Read A Distributed Tab Delimited CSV

Inspired by this question, I wrote some code to store an RDD (which was read from a Parquet file)…