Df.topandas() - Failed To Locate The Winutils Binary In The Hadoop Binary Path

June 09, 2023

I am processing a huge text file with PySpark in PyCharm. This is what I am trying to do: spark_home = os.environ.get('SPARK_HOME', None) os.environ['SPARK_HOME'] = 'C:\spark-2.3.0-

Solution 1:

You need to change your code as follows:

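import os
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Point PySpark at the local Spark installation (raw string keeps the backslash literal)
spark_home = os.environ.get('SPARK_HOME', None)
os.environ["SPARK_HOME"] = r"C:\spark-2.3.0-bin-hadoop2.7"

conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# The input data is read with Spark; the keyword list is read with pandas
ip = spark.read.format("csv").option("inferSchema","true").option("header","true").load(r"some other file.csv")
# error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there instead
kw = pd.read_csv(r"some file.csv",encoding='ISO-8859-1',index_col=False,error_bad_lines=False)

# Drop every row whose Content matches a keyword (case-insensitive)
for i in range(len(kw)):
    rx = '(?i)'+kw.Keywords[i]
    ip = ip.where(~ip['Content'].rlike(rx))

# toPandas() is itself an action and returns a pandas DataFrame; no collect() is needed
op = ip.toPandas()
op.to_csv(r'something.csv',encoding='utf-8')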

toPandas() is itself an action in PySpark: it collects every row to the driver and returns a pandas DataFrame, so it must not be followed by collect() (a pandas DataFrame has no collect() method). This should not be done for large datasets, however, because toPandas() moves all of the data to the driver, which might crash if the dataset is too big to fit into driver memory.
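
The keyword loop in the code above adds one where() filter per keyword. As a rough sketch only, the same keywords could instead be joined into a single case-insensitive pattern and passed to rlike() once. The helper name build_exclusion_pattern is hypothetical (it is not part of PySpark), and like the loop above it assumes every keyword is already a valid regular expression.

```python
def build_exclusion_pattern(keywords):
    """Join keywords into one case-insensitive alternation pattern."""
    # Skip blank entries so an empty alternative cannot match every row
    cleaned = [str(k).strip() for k in keywords if str(k).strip()]
    if not cleaned:
        raise ValueError("no keywords given")
    # (?i) turns on case-insensitive matching; (?:...) groups each keyword
    return "(?i)" + "|".join("(?:" + k + ")" for k in cleaned)


# Example keywords are made up for illustration
pattern = build_exclusion_pattern(["spam", "Free Offer"])
print(pattern)
```

With the DataFrame from the code above, this would be applied as ip = ip.where(~ip['Content'].rlike(pattern)). Whether it is actually faster than the loop depends on the data and on the number of keywords.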

As for this line: ip.write.csv('file.csv'), I believe it should be changed to ip.write.csv('file:///home/your-user-name/file.csv') to save the file on the local Linux filesystem,

ip.write.option("header", "true").csv("file:///C:/out.csv") to save the file on the local Windows filesystem (if you are running Spark and Hadoop on Windows),

or ip.write.csv('hdfs:///user/your-user/file.csv') to save the file to HDFS.
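
To make the file:/// form less error-prone, here is a minimal sketch of a helper that turns an absolute local path into that kind of URI. The name local_file_uri is hypothetical, and the helper is deliberately simple: it only normalizes slashes and does not percent-encode spaces or other special characters.

```python
def local_file_uri(path):
    """Turn an absolute local path into a file:// URI (no percent-encoding)."""
    # Windows paths use backslashes; URIs always use forward slashes
    normalized = path.replace("\\", "/")
    # A drive-letter path such as C:/out.csv needs a leading slash in the URI
    if not normalized.startswith("/"):
        normalized = "/" + normalized
    return "file://" + normalized


print(local_file_uri(r"C:\out.csv"))                     # file:///C:/out.csv
print(local_file_uri("/home/your-user-name/file.csv"))   # file:///home/your-user-name/file.csv
```

The result can then be passed to the writer, for example ip.write.csv(local_file_uri(r"C:\out.csv")).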

Do tell me if this solution worked out for you.

UPDATE

Follow this link and download the winutils.exe file: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin

Create a folder named hadoop on your C drive and another folder called bin inside the hadoop folder. Place the winutils.exe you downloaded earlier into this bin directory. Then edit the system variables and add the variable HADOOP_HOME to the list, pointing it at the hadoop folder (C:\hadoop). To open that dialog, just type "Edit the system environment variables" in your Windows search. Once this is done you won't get the winutils/hadoop error from Spark.
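
If you would rather check the setup from Python than from the system dialog, something like the following sketch could work. The function names find_winutils and configure_hadoop_home are hypothetical, C:\hadoop is the folder described above, and I am assuming the variable is set before the SparkContext is created, so that the JVM Spark launches inherits it.

```python
import os


def find_winutils(hadoop_home):
    """Return the path to winutils.exe under hadoop_home, or None if missing."""
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None


def configure_hadoop_home(hadoop_home, environ=None):
    """Set HADOOP_HOME only if winutils.exe is really where Hadoop expects it."""
    environ = os.environ if environ is None else environ
    exe = find_winutils(hadoop_home)
    if exe is None:
        raise FileNotFoundError(
            "winutils.exe not found under " + os.path.join(hadoop_home, "bin")
        )
    environ["HADOOP_HOME"] = hadoop_home
    return exe


if __name__ == "__main__":
    try:
        print("Using", configure_hadoop_home(r"C:\hadoop"))
    except FileNotFoundError as err:
        # Reported instead of raised so the check can run on any machine
        print(err)
```

Setting the variable this way only affects the current Python process and its children; the system-wide setting described above is still the more permanent fix.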
