ValueError: Import Data Via Chunks Into Pandas.csv_reader()

I have a large gzip file which I would like to import into a pandas dataframe. Unfortunately, the file has an uneven number of columns. The data has roughly this format: .... Col_

Solution 1:

You could also try this:

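import pandas as pd
# pandas >= 1.3: use on_bad_lines='skip' instead (error_bad_lines was removed in 2.0)
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python', error_bad_lines=False):
    print(chunk)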

error_bad_lines=False would skip the bad lines though, so those rows are lost. I will see if a better alternative can be found.
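One possible alternative, as a minimal sketch: in pandas 1.4 and later, on_bad_lines also accepts a callable when engine='python', so the rejected lines can be collected instead of silently dropped. The helper name read_keeping_bad_lines and the sample data are made up for illustration; only rows with too many fields are treated as bad here.

```python
import io
import pandas as pd

def read_keeping_bad_lines(source, sep="\t", chunksize=10**5):
    """Read `source` in chunks; return (DataFrame of good rows, list of bad rows)."""
    bad_rows = []

    def remember(fields):
        # The python engine calls this with the already-split fields of a bad line.
        bad_rows.append(fields)
        return None  # returning None tells pandas to skip the line

    chunks = pd.read_csv(source, sep=sep, chunksize=chunksize,
                         engine="python", on_bad_lines=remember)
    good = pd.concat(chunks, ignore_index=True)
    return good, bad_rows

# Hypothetical sample: the third line has one field too many.
sample = "a\tb\n1\t2\n3\t4\t5\n6\t7\n"
good, bad = read_keeping_bad_lines(io.StringIO(sample), chunksize=2)
print(good)
print(bad)
```

A gzip path can be passed as source in the same way, since read_csv infers the compression from the file extension by default.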

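EDIT: In order to keep track of the lines that error_bad_lines would skip, we can catch the parser error, record the offending line number together with the expected and actual field counts, and retry with that line skipped until the file loads. The recorded lines can then be added back to the dataframe.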

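import pandas as pd

line     = []   # zero-based indices of the lines that failed to parse
expected = []   # number of fields the parser expected
saw      = []   # number of fields it actually found
cont     = True

while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        message   = str(e)   # e.message only exists in Python 2
        errortype = message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            # e.g. "Error tokenizing data. C error: Expected 3 fields in line 5, saw 4"
            cerror = message.split(':')[1].strip().replace(',', '')
            nums   = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            # Any other error would repeat forever, so report it and stop
            cerror = 'Unknown'
            print('Unknown Error - 222')
            raise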
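The loop above only records which lines were skipped. A minimal sketch of fetching those rows again, using only the standard library, might look like this. The helper name fetch_skipped_rows is hypothetical, and it assumes the recorded indices are zero-based positions in the file and that no quoted field spans several physical lines.

```python
import csv
import io

def fetch_skipped_rows(handle, skipped, sep=","):
    """Return {line_index: fields} for the zero-based line indices in `skipped`."""
    wanted = set(skipped)
    rows = {}
    # csv.reader yields one row per line as long as no quoted field contains a newline.
    for index, fields in enumerate(csv.reader(handle, delimiter=sep)):
        if index in wanted:
            rows[index] = fields
    return rows

# Hypothetical sample: line index 2 has one field too many.
sample = "a,b\n1,2\n3,4,5\n6,7\n"
rows = fetch_skipped_rows(io.StringIO(sample), [2])
print(rows)
```

How the recovered rows are then reconciled with the dataframe (truncating the extra fields, or widening the frame with extra columns) is a choice the answer leaves open.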
