Skip to content Skip to sidebar Skip to footer

Pickling Pandas Dataframe Multiplies By 5 The File Size

I am reading a 800 Mb CSV file with pandas.read_csv, and then use the original Python pickle.dump(datfarame) to save it. The result is a 4 Gb pkl file, so the CSV size is multiplie

Solution 1:

Not sure why you think pickling compresses the data size, pickling creates a string version of your python object so that it can be loaded back as a python object:

In [388]:

import sys
import os
df = pd.DataFrame({'a':np.arange(5)})
df.to_pickle(r'c:\data\df.pkl')
print(sys.getsizeof(df))
statinfo = os.stat(r'c:\data\df.pkl')
print(statinfo.st_size)
withopen(r'c:\data\df.pkl', 'rb') as f:
    print(f.read())
56700b'\x80\x04\x95\xb1\x02\x00\x00\x00\x00\x00\x00\x8c\x11pandas.core.frame\x94\x8c\tDataFrame\x94\x93\x94)}\x94\x92\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)}\x94\x92\x94(]\x94(\x8c\x11pandas.core.index\x94\x8c\n_new_Index\x94\x93\x94h\x0b\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94\x8c\x01a\x94at\x94b\x8c\x04name\x94Nu\x86\x94R\x94h\rh\x0b\x8c\nInt64Index\x94\x93\x94}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x05\x85\x94h\x1f\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C(\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x94t\x94bh(Nu\x86\x94R\x94e]\x94h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01K\x05\x86\x94h\x1f\x8c\x02i4\x94K\x00K\x01\x87\x94R\x94(K\x03h5NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x14\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x94t\x94ba]\x94h\rh\x0f}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01\x85\x94h"\x89]\x94h&at\x94bh(Nu\x86\x94R\x94a}\x94\x8c\x060.14.1\x94}\x94(\x8c\x06blocks\x94]\x94}\x94(\x8c\x06values\x94h>\x8c\x08mgr_locs\x94\x8c\x08builtins\x94\x8c\x05slice\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94ua\x8c\x04axes\x94h\nust\x94bb.'

The method to_csv does support compression as a kwarg, 'gzip' and 'bz2':

In [390]:

df.to_csv(r'c:\data\df.zip', compression='bz2')
statinfo = os.stat(r'c:\data\df.zip')
print(statinfo.st_size)
29

Solution 2:

It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that rather than loading the CSV file to RAM, as Kathirmani suggested. You will see the speedup in loading time that you expect due simply to the fact that you are not filling up 800 Mb worth of RAM every time you load your script.

File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.

Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.

You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:

from pandas.io import sql
importsqlite3conn= sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"

results_df = sql.read_frame(query, con=conn)
...

Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.

Solution 3:

Dont load 800MB file to memory. It will increase your loading time. Pickle objects too takes more time to load. Instead store the csv file as a sqlite3 (which comes along with python) table. And then query the table every time depending upon your need.

Solution 4:

You can also use panda's pickle methods which should compress your data.

Save a dataframe:

df.to_pickle(filename)

Load it:

df = pd.read_pickle(filename)

Post a Comment for "Pickling Pandas Dataframe Multiplies By 5 The File Size"